CN111767243A - Data processing method, related device and computer readable medium

Publication number: CN111767243A
Application number: CN202010521841.2A
Authority: CN (China)
Prior art keywords: convolution, data block, processing, block, core
Inventor: not disclosed (不公告发明人)
Current and original assignee: Shanghai Cambricon Information Technology Co Ltd
Other languages: Chinese (zh)
Legal status: Pending

Classifications

    • G06F 15/17 - Interprocessor communication using an input/output type connection, e.g. channel, I/O port
    • G06F 17/15 - Correlation function computation including computation of convolution operations
    • G06F 9/5027 - Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals


Abstract

An embodiment of the invention discloses a computing device comprising a processor, a memory and a bus. The processor is connected to the memory through the bus; the memory stores instructions, and the processor calls the instructions stored in the memory to execute a specific data processing method, so as to improve data processing performance and efficiency.

Description

Data processing method, related device and computer readable medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, a related device, and a computer-readable medium.
Background
A convolution (conv) operation is a multiply-accumulate calculation over weights (weight) and input data (input): the processor obtains the result by repeatedly computing products of the weights and the input data and accumulating the products. Because the register resources of the processor are limited, the weights and the input data need to be loaded repeatedly during the convolution operation, which brings a large amount of input/output (IO) accesses, causes IO bottlenecks, and affects the computing efficiency of the processor.
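As an illustration only (not part of the patent), a minimal one-dimensional sketch of convolution as repeated multiply-accumulate over weights and input data:

```python
def conv1d_valid(x, w):
    # Each output element is a multiply-accumulate of the weight window
    # against the corresponding window of the input data.
    return [sum(w[k] * x[i + k] for k in range(len(w)))
            for i in range(len(x) - len(w) + 1)]

print(conv1d_valid([1, 2, 3, 4], [1, 0, -1]))  # [-2, -2]
```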
Disclosure of Invention
An embodiment of the invention provides a data processing method, which can alleviate the problems in the prior art of heavy IO traffic, IO bottlenecks and reduced computing efficiency.
In a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
acquiring a convolution data block;
distributing the convolution data block to m processing cores to obtain the convolution weights of the m processing cores, where each core's convolution weights are a part of the convolution data block and m is a positive integer;

and moving the convolution data block in the depth direction over the whole image data block, determining the input data of the m processing cores, and performing a convolution operation between each processing core's input data and its corresponding convolution weights, so as to invoke the m processing cores to implement the convolution operation between the image data block and the convolution data block and obtain a convolution result block.
In a second aspect, an embodiment of the present invention provides a computing apparatus, which includes a unit or a module for executing the method of the first aspect.
In a third aspect, an embodiment of the present invention provides a computing chip, where a computing cluster including m processing cores is deployed in the computing chip, and the computing chip is configured to execute the method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides another computing device, including a processor, a memory, and a bus, where the processor and the memory are connected through the bus, the memory is used to store instructions, and the processor is used to call the instructions stored in the memory, so as to execute the method of the first aspect.
In a fifth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the method of the first aspect.
By implementing the embodiments of the invention, the convolution data block can be distributed to the m processing cores in advance, and the convolution operation is then carried out by moving the image data block against the convolution weights of the m processing cores, thereby implementing the convolution operation between the image data block and the convolution data block. This alleviates the problems in the prior art of heavy IO traffic, IO bottlenecks and reduced computing efficiency.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a computing chip according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating splitting of a convolutional data block according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of loading a convolutional data block according to an embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating a mapping relationship between an image data block and a convolution result block according to an embodiment of the present invention.
Fig. 6 to fig. 13 are specific schematic diagrams illustrating implementation of convolution operations in several processing cores according to embodiments of the present invention.
Fig. 14 is a schematic structural diagram of a computing apparatus according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
Fig. 16 is a schematic diagram illustrating a principle of convolution operation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The dimension calculation of a convolution operation is an important problem in defining the structure of a neural network. When building a neural network with a deep learning framework such as PyTorch or TensorFlow, the input and output dimensions of each layer must be calculated accurately, otherwise errors easily occur; the related dimension calculation is described in detail here.
First, consider what a convolution operation involves: one must define the size of the convolution kernel, the padding length of the input image, and the stride of the convolution operation. Taking a two-dimensional input as an example, a schematic diagram of an operation with multiple convolution kernels is shown in fig. 16. The input data in this example is three-dimensional, with the third dimension being the number of channels. Two convolution kernels (filters) scan the input to produce two two-dimensional images, which form the two channels of the output data. (One convolution kernel applied to the three-dimensional input yields one image; to scan two-dimensional data with multiple channels, the convolution kernel must itself be three-dimensional, with the same number of channels as the input data.) The specific calculations are described below.
Assume that the input data size is: w x h
Where w is the width and h is the height. The convolution kernel size is f × f
The padding length is p (padding) and the stride is s (stride); after the convolution operation, the output data size is:

(⌊(w - f + 2p)/s⌋ + 1) × (⌊(h - f + 2p)/s⌋ + 1)
if the input data is three-dimensional data, namely: w x h x c
Where w is the width, h is the height, c is the number of channels (for RGB image input this value is typically 3, in text processing, typically the number of different embedding models).
The convolution kernel at this time is also typically a three-dimensional convolution kernel with channels, of size: f × f × c
Note that, in general, the number of channels c of the convolution kernel and the number of channels of the input data are identical. The output after this convolution is therefore still two-dimensional, with size:

(⌊(w - f + 2p)/s⌋ + 1) × (⌊(h - f + 2p)/s⌋ + 1)
The dimensions here are rounded down to prevent non-integer results. If an output with channels is desired, multiple convolution kernels are used, and the final output data dimensions are:

(⌊(w - f + 2p)/s⌋ + 1) × (⌊(h - f + 2p)/s⌋ + 1) × c'
where c' is the number of convolution kernels.
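The dimension formulas above can be sketched as follows (our illustration, not part of the patent; the function name and example values are ours):

```python
import math

def conv_output_dims(w, h, f, p, s, c_out=1):
    """Output size after convolution: floor((d - f + 2p) / s) + 1 per
    spatial dimension, times the number of convolution kernels c'."""
    w_out = math.floor((w - f + 2 * p) / s) + 1
    h_out = math.floor((h - f + 2 * p) / s) + 1
    return (w_out, h_out, c_out)

# 32 x 32 input, 3 x 3 kernel, padding 1, stride 1, 16 kernels -> (32, 32, 16)
print(conv_output_dims(32, 32, f=3, p=1, s=1, c_out=16))
# 28 x 28 input, 5 x 5 kernel, no padding, stride 2 -> (12, 12, 1)
print(conv_output_dims(28, 28, f=5, p=0, s=2))
```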
When the computing chip executes a conv2D instruction, the convolution kernel moves in the 2-D (spatial) directions over the whole input data and does not move in the depth direction; the depth of the convolution kernel equals the depth of the whole input data, and accumulation is performed along the depth direction.
When the computing chip executes a conv3D instruction, the convolution kernel also moves in the depth direction over the whole input data; on that basis, it moves in the 2-D directions and performs accumulation along the depth direction.
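The difference in depth behaviour can be observed with PyTorch (our illustration; the patent does not reference PyTorch): conv2d consumes the kernel's whole channel/depth axis at one position, while conv3d also slides along it.

```python
import torch
import torch.nn.functional as F

# conv2d: the kernel depth equals the input channel count and is summed out.
x2 = torch.randn(1, 3, 8, 8)              # (N, C, H, W)
w2 = torch.randn(16, 3, 3, 3)             # (C_out, C, kH, kW)
print(F.conv2d(x2, w2, padding=1).shape)  # torch.Size([1, 16, 8, 8])

# conv3d: the kernel additionally moves along the depth dimension D.
x3 = torch.randn(1, 1, 8, 8, 8)           # (N, C, D, H, W)
w3 = torch.randn(16, 1, 3, 3, 3)          # (C_out, C, kD, kH, kW)
print(F.conv3d(x3, w3, padding=1).shape)  # torch.Size([1, 16, 8, 8, 8])
```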
At present, convolution (conv) operations repeatedly load large amounts of weights and input data, which causes a bandwidth bottleneck on the IO bus; the large number of IO accesses in turn brings a power consumption problem and also reduces the computing efficiency of the processor. In particular, a 3-D convolution (conv3D) needs to invoke the convolution operation repeatedly, so the IO bottleneck, power consumption and computing efficiency problems are further amplified. To solve the above problems, the present application proposes a data processing method and a related framework to which the method is applicable.
Fig. 1 is a schematic structural diagram of a computing chip according to an embodiment of the present disclosure. One or more computing clusters are deployed in the computing chip 100 shown in fig. 1; the present invention is illustrated below with one cluster as an example, but is not limited thereto. Each computing cluster includes m processing cores 102 (IPU cores) and a storage space 104 shared by the m processing cores 102, which may also be referred to as shared memory (SM) 104. m is a positive integer configured by the system.
Each processing core 102 supports independent operation, and the m processing cores 102 share one storage space 104, in which data can be shared or temporarily stored. The m processing cores 102 may support parallel or serial operation. In parallel operation, the m processing cores 102 can compute m different pieces of data simultaneously. However, because the m processing cores 102 run simultaneously, each processing core 102 needs to input its data independently, so each processing core 102 must exchange data independently with the double data rate (DDR) memory or the storage space 104. If the input data required by each processing core 102 is stored in the DDR, the IO bandwidth from the DDR to the cluster is limited and the IO speed becomes a bottleneck. If the input data or intermediate data required by each processing core 102 is stored in the storage space 104, each processing core 102 can use only 1/m of the storage space 104, whose size is usually small and cannot meet the storage requirement of the intermediate data of a conv3D convolution. If the storage space 104 were large enough, the storage requirement of the conv3D operation could be satisfied; in practice, however, it is not, so running the m processing cores in parallel is less applicable.
In serial operation, the m processing cores 102 can communicate with each other through the storage space 104. That is, the m processing cores 102 inside each cluster can form an inter-core pipeline within the cluster, the output of each processing core 102 being the input of the next processing core 102. The data flow of the m processing cores 102 is shown by the arrows in the figure. Taking the cluster as the unit, where the conventional convolution procedure needs to transmit m copies of the input data, only 1 copy now needs to be transmitted, which greatly reduces the bandwidth overhead between the DDR and the cluster and helps relieve the IO bottleneck, the heavy IO traffic and similar problems.
The storage space 104 acts as a relay and maximally exploits its shared nature; it effectively opens up the connections among the m processing cores 102, so that no processing core 102 has to wait for the data it needs to compute, which helps improve the computing efficiency and the computing performance of each processing core 102.
It should be noted that the m processing cores inside a cluster of the present invention run serially as a pipeline: the current processing core needs the operation data output by the previous processing core. Nevertheless, each processing core, after finishing one operation, can load the corresponding input data and enter the next operation, as the schedule sketched below illustrates.
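A toy view of such a pipeline schedule (a sketch under our own timing assumptions, not the patent's instruction-level design): core i starts one time slice after core i-1, and once the pipeline is filled all m cores are busy at the same time.

```python
def pipeline_schedule(m, n_blocks):
    # At slice t, core i works on the block that core i-1 handed over
    # through the shared storage space in the previous slice.
    for t in range(1, n_blocks + m):
        busy = [(i, t - i + 1) for i in range(1, m + 1)
                if 1 <= t - i + 1 <= n_blocks]
        print(f"slice {t}: " +
              ", ".join(f"core_{i} -> block {b}" for i, b in busy))

pipeline_schedule(m=4, n_blocks=3)
# slice 1: core_1 -> block 1
# slice 2: core_1 -> block 2, core_2 -> block 1
# ...
# slice 6: core_4 -> block 3
```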
In practical applications, the computing chip may be deployed in a computing device, where the computing device includes but is not limited to a smart phone (such as an Android phone, an IOS phone, and the like), a personal computer, a tablet computer, a palmtop computer, a Mobile Internet device (MID, Mobile Internet Devices), or a wearable smart device, and the embodiment of the present invention is not limited thereto.
Fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present invention. The method is applied to the computing chip shown in fig. 1, and the method shown in fig. 2 comprises the following implementation steps:
s201, countingThe computation chip obtains the convolution data block. The size of the convolution data block is A1×B1×C1。A1For the height of the convolved data block, B1For the width of the convolved data block, C1Is the depth of the convolved data block.
S202, distributing the convolution data block to the m processing cores to obtain the convolution weights of the m processing cores, where each core's convolution weights are a part of the convolution data block and m is a positive integer.
The computing chip may divide the convolution data block into m parts in advance to obtain m portions of convolution weights, and correspondingly deploy or distribute them to the corresponding processing cores. During the convolution operation, each processing core can then perform convolution with the convolution weights deployed on it. The splitting method is not limited by the present invention; for example, the block may be split according to a certain proportion or split evenly.
As a possible implementation, the computing chip distributes the convolution data block evenly over the m processing cores to obtain the convolution weights of the m processing cores. The convolution weights of the m processing cores together form the convolution data block; the number of convolution weights of each of the first m - 1 processing cores is ⌈(A1 × B1 × C1)/m⌉, and the number of convolution weights of the last processing core is A1 × B1 × C1 - (m - 1) · ⌈(A1 × B1 × C1)/m⌉.
For example, take a 3 × 3 × 3 convolution data block (also referred to as a convolution kernel) and m = 4, with the 4 processing cores denoted processing core 1 to processing core 4. The convolution kernel includes 27 convolution weights, and the computing chip can distribute them evenly over the 4 processing cores. Fig. 3 shows a possible splitting of the convolution data block. As shown in fig. 3, each small cube represents one convolution weight; for ease of viewing and understanding, the small cubes in three-dimensional space are abstracted as small squares in a two-dimensional plane, as illustrated. Each small square (rectangle) likewise represents one convolution weight.

In the process of assigning convolution weights, processing core 1 is assigned 7 convolution weights, specifically the 7 positions labeled 1 in the figure. Processing core 2 is assigned 7 convolution weights, the 7 positions labeled 2. Processing core 3 is assigned 7 convolution weights, the 7 positions labeled 3. Processing core 4 is assigned 6 convolution weights, the 6 positions labeled 4. The labels 1, 2, 3 and 4 in the figure only distinguish the weights assigned to the 4 different processing cores and do not represent the value of each convolution weight.
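A sketch of this even split in load order (our reading of fig. 3; the exact assignment rule is an assumption). It reproduces the 7/7/7/6 split of the 27 weights and the depth layers that each core's weights touch, which is the layer count p used later:

```python
import math

def split_weights(A1, B1, C1, m):
    # Assign the A1*B1*C1 weights to m cores in load order (layer by
    # layer along the depth); report, per core, how many weights it
    # holds and which depth layers they occupy.
    total = A1 * B1 * C1
    per_core = math.ceil(total / m)        # the first m-1 cores
    plan = []
    for i in range(m):
        start = i * per_core
        end = min(start + per_core, total)
        layers = sorted({idx // (A1 * B1) + 1 for idx in range(start, end)})
        plan.append((end - start, layers))
    return plan

# 3 x 3 x 3 kernel over 4 cores:
# [(7, [1]), (7, [1, 2]), (7, [2, 3]), (6, [3])]
print(split_weights(3, 3, 3, 4))
```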
In the present invention, Ai denotes height, Bi denotes width and Ci denotes depth, with i a positive integer. A data block Ai × Bi × Ci, viewed along different directions, can be seen as composed of multiple two-dimensional plane data blocks stacked along a given viewing direction. For example, taking the viewing direction as the depth direction, the data block Ai × Bi × Ci consists of Ci plane data blocks Ai × Bi. Similarly, when the computing chip loads a data block, it may load the plane data blocks one after another along the viewing direction.
For example, taking a 2 × 3 × 3 data block, fig. 4 shows a specific diagram of data block loading. As shown in fig. 4, the block contains 3 plane data blocks of 2 × 3 along the depth direction, specifically depth = 1, 2 and 3 as shown. When the computing chip loads the data, it may first load the plane data block of the depth = 1 layer. Within a plane data block, the data may be loaded along the height direction in chunks of fixed width; the gray rectangle illustrates the order in which the data are loaded. After the depth = 1 plane data block is loaded, the plane data block of each following layer is loaded in turn along the depth direction by the same principle.
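Our reading of fig. 4 as code (the traversal order is an assumption drawn from the figure description): one depth layer at a time, and within each plane data block fixed-width rows along the height.

```python
import numpy as np

A, B, C = 2, 3, 3                              # height, width, depth
block = np.arange(A * B * C).reshape(C, A, B)  # C plane data blocks of A x B

for d in range(C):        # one depth layer at a time
    for h in range(A):    # within a plane: fixed-width rows along the height
        print(f"depth={d + 1}, height={h + 1}: {block[d, h, :].tolist()}")
# depth=1, height=1: [0, 1, 2]
# depth=1, height=2: [3, 4, 5]
# depth=2, height=1: [6, 7, 8] ...
```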
S203, moving the convolution data block in the depth direction over the whole image data block, determining the input data of the m processing cores, and performing a convolution operation between each processing core's input data and its corresponding convolution weights, so as to invoke the m processing cores to implement the convolution operation between the image data block and the convolution data block and obtain a convolution result block. The image data block and the convolution result block are both of size A2 × B2 × C2, where A2 denotes the height, B2 the width and C2 the depth.
The computing chip obtains the convolution result block by moving the image data block along the depth direction and performing convolution operations with the respective convolution weights of the m processing cores. For each processing core, all the plane data blocks contained in the image data block can be moved along the depth direction and convolved with that processing core's convolution weights to obtain intermediate output blocks. Optionally, each processing core caches its computed intermediate convolution block in the storage space 104 for use by the next processing core, i.e. to participate in the next processing core's computation.
In one embodiment, the computing chip determines, from the convolution weights of the m processing cores, the number of layers p of the convolution data block that each processing core's convolution weights occupy in the depth direction. Then, according to the layer counts p of the m processing cores, the C2 plane data blocks A2 × B2 contained in the image data block are moved and loaded in turn along the depth direction and convolved with the corresponding convolution weights of the m processing cores to obtain the convolution result block. In conv3D, the number of layers p is a natural number less than 3, i.e. 0, 1 or 2.
For any plane data block of the convolution result block along the depth direction (say the j-th), 3 input plane data blocks are involved in the moving and computation across the m processing cores, specifically the (j-1)-th, j-th and (j+1)-th plane data blocks contained in the image data block. The computing chip may determine in turn, according to the layer counts of the m processing cores, the input data block that each processing core needs to load, and perform convolution operations with the convolution weights of the m processing cores to obtain the j-th plane data block of the convolution result block; each input data block is drawn from these 3 plane data blocks.
The respective operations of the m processing cores are explained below, taking the computation of the j-th plane data block of the convolution result block as an example. For the 1st processing core, the input data block that it needs to load is determined, according to its layer count p and layer order, from the (j-1)-th, j-th and (j+1)-th plane data blocks (the 3 plane data blocks) contained in the image data block. The 1st processing core then performs a convolution operation with this input data block and its own convolution weights to obtain the j-th intermediate convolution block of the 1st processing core. Optionally, this j-th intermediate convolution block may further be cached in the storage space 104 for use by the 2nd processing core.
For the 2nd processing core, the input data block that it needs to load is likewise determined, according to its layer count p and layer order, from the (j-1)-th, j-th and (j+1)-th plane data blocks contained in the image data block. The 2nd processing core then performs a convolution operation with this input data block and its own convolution weights to obtain the j-th intermediate output block of the 2nd processing core. Next, the j-th intermediate convolution block of the 1st processing core is obtained from the storage space 104 (it may be migrated from the storage space 104 into the 2nd processing core for storage) and summed with the j-th intermediate output block of the 2nd processing core, yielding the j-th intermediate convolution block of the 2nd processing core. Optionally, this block may further be cached in the storage space 104 for use by the 3rd processing core.
By analogy, the j-th intermediate convolution block output by the m-th processing core is the j-th plane data block of the convolution result block. In general, for the i-th processing core, the input data block that it needs to load is determined, according to its layer count p and layer order, from the (j-1)-th, j-th and (j+1)-th plane data blocks contained in the image data block, where j is a positive integer less than or equal to C2. When j = 1, the 0th plane data block is a plane data block consisting entirely of 0s, also referred to as padding (addPad). The input data block is then convolved with the convolution weights of the i-th processing core to obtain the j-th intermediate output block of the i-th processing core. The j-th intermediate convolution block computed by the (i-1)-th processing core is obtained from the storage space and summed with the j-th intermediate output block of the i-th processing core to obtain the j-th intermediate convolution block of the i-th processing core, which is cached in the storage space 104 for use by the next, (i+1)-th, processing core, where i is a positive integer less than or equal to m. When i = 1, the j-th intermediate output block of the 1st processing core is itself the j-th intermediate convolution block of the 1st processing core, and the j-th intermediate convolution block of the (i-1)-th processing core does not exist (or is 0).
To facilitate a better understanding of the invention, the following takes the convolution data block A1 × B1 × C1 = 3 × 3 × 3 and m = 4 as an example; the 4 processing cores are denoted core_1, core_2, core_3 and core_4. Referring to the example described in fig. 3: core_1 is pre-deployed with the 7 convolution weights at the positions labeled 1, all located at the 1st layer of the convolution data block in the depth direction (i.e. depth = 1; 1 layer); core_2 is pre-deployed with the 7 convolution weights at the positions labeled 2, located at the 1st and 2nd layers in the depth direction (i.e. depth = 1 and 2; 2 layers); core_3 is pre-deployed with the 7 convolution weights at the positions labeled 3, located at the 2nd and 3rd layers in the depth direction (i.e. depth = 2 and 3; 2 layers); core_4 is pre-deployed with the 6 convolution weights at the positions labeled 4, all located at the 3rd layer in the depth direction (i.e. depth = 3; 1 layer).
The image data block A2 × B2 × C2 contains C2 plane data blocks A2 × B2 along the depth direction. Whatever C2 is, in conv3D each layer of plane data block of the result is obtained by a convolution operation over 3 plane data blocks. For example, with C2 = 8, the image data block input to the computing chip contains 8 (layers of) plane data blocks A2 × B2, and the convolution result block likewise contains 8 (layers of) plane data blocks A2 × B2 along the depth direction. Fig. 5 is a diagram illustrating the mapping relationship between the image data block and the convolution result block.
As shown in fig. 5, the computing chip (its m processing cores) adds a layer of 0s at the head and the tail of the image data block in the depth direction (shown as addPad, i.e. one layer of plane data block A2 × B2 consisting entirely of 0s). The first addPad layer together with the 1st and 2nd layer plane data blocks A2 × B2 of the image data block yields the 1st (layer of) plane data block A2 × B2 of the convolution result block. The 1st, 2nd and 3rd layer plane data blocks A2 × B2 of the image data block yield the 2nd layer plane data block A2 × B2 of the convolution result block. And so on: the (j-1)-th, j-th and (j+1)-th plane data blocks A2 × B2 of the image data block yield the j-th layer plane data block A2 × B2 of the convolution result block. Optionally, when the number of layers C2 of the image data block in the depth direction is not enough to compute the C2 plane data blocks A2 × B2 of the convolution result block, the conv3D computation can be completed by means of addPad.
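The depth mapping of fig. 5 as a short sketch (C2 = 8 as in the example; for result planes at the ends, the missing neighbour is the all-zero addPad plane):

```python
C2 = 8
for j in range(1, C2 + 1):
    ins = [k if 1 <= k <= C2 else "addPad" for k in (j - 1, j, j + 1)]
    print(f"result plane {j} <- input planes {ins}")
# result plane 1 <- input planes ['addPad', 1, 2]
# result plane 2 <- input planes [1, 2, 3]
# ...
# result plane 8 <- input planes [7, 8, 'addPad']
```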
As can be seen from the above descriptions of fig. 3 and fig. 5, when the computing chip uses the m processing cores to implement the convolution operation between the convolution data block 3 × 3 × 3 and the image data block A2 × B2 × C2, it needs to move and load the image data blocks in turn according to the layer counts and layer order that the convolution weights of the m processing cores occupy in the convolution data block along the depth direction, so as to implement the convolution operation between the convolution data block and the image data block and obtain the convolution result block A2 × B2 × C2, which contains C2 plane data blocks A2 × B2 along the depth direction. Illustratively, to compute the 2nd plane data block A2 × B2 of the convolution result block with the 4 processing cores, the 4 processing cores need to use the 1st, 2nd and 3rd plane data blocks A2 × B2 of the image data block A2 × B2 × C2 along the depth direction.
In a specific implementation, the 1st processing core (core_1) needs to use the depth = 1 layer plane data block A2 × B2 of the image data block as its input data block and convolve it with the 7 convolution weights deployed in core_1 (i.e. the 7 weights at the positions labeled 1 in fig. 3) to obtain the 2nd intermediate convolution block A2 × B2 of core_1. Fig. 6 shows a schematic diagram of this possible convolution operation. As shown in fig. 6, the input data block (input) is the depth = 1 layer plane data block A2 × B2 of the image data block; it is convolved with the convolution weights in core_1, and the 2nd intermediate convolution block A2 × B2 of core_1 is output (output in the figure). Optionally, core_1 may cache the 2nd intermediate convolution block A2 × B2 of core_1 in the storage space 104 for use by core_2.
For the second processing core (core_2), the depth = 1 and 2 layer plane data blocks A2 × B2 of the image data block are used as the input data block and convolved with the 7 convolution weights deployed in core_2 (i.e. the 7 weights at the positions labeled 2 in fig. 3) to obtain the 2nd intermediate output block A2 × B2 of core_2, which is then summed with the 2nd intermediate convolution block A2 × B2 of core_1 to obtain the 2nd intermediate convolution block A2 × B2 of core_2. Fig. 7 shows another possible convolution operation: the input data block input consists of the plane data blocks A2 × B2 at depth = 1 and 2, which are convolved with the convolution weights of core_2 to output the 2nd intermediate output block A2 × B2 of core_2. Then the 2nd intermediate convolution block A2 × B2 of core_1 is obtained from the storage space 104 and summed with the 2nd intermediate output block A2 × B2 of core_2 to obtain the 2nd intermediate convolution block A2 × B2 of core_2. Optionally, core_2 may cache the 2nd intermediate convolution block A2 × B2 of core_2 in the storage space 104 for use by core_3.
For the third processing core (core_3), the depth = 2 and 3 layer plane data blocks A2 × B2 of the image data block are used as the input data block and convolved with the 7 convolution weights deployed in core_3 (i.e. the 7 weights at the positions labeled 3 in fig. 3) to obtain the 2nd intermediate output block A2 × B2 of core_3, which is then summed with the 2nd intermediate convolution block A2 × B2 of core_2 to obtain the 2nd intermediate convolution block A2 × B2 of core_3. Fig. 8 shows another possible convolution operation: the input data block input consists of the plane data blocks A2 × B2 at depth = 2 and 3, which are convolved with the convolution weights of core_3 to output the 2nd intermediate output block A2 × B2 of core_3. Then the 2nd intermediate convolution block A2 × B2 of core_2 is obtained from the storage space 104 and summed with the 2nd intermediate output block A2 × B2 of core_3 to obtain the 2nd intermediate convolution block A2 × B2 of core_3. Optionally, core_3 may cache the 2nd intermediate convolution block A2 × B2 of core_3 in the storage space 104 for use by core_4.
For the fourth processing core (core_4), the depth = 3 layer plane data block A2 × B2 of the image data block is used as the input data block and convolved with the 6 convolution weights deployed in core_4 (i.e. the 6 weights at the positions labeled 4 in fig. 3) to obtain the 2nd intermediate output block A2 × B2 of core_4, which is then summed with the 2nd intermediate convolution block A2 × B2 of core_3 to obtain the 2nd intermediate convolution block A2 × B2 of core_4. Fig. 9 shows another possible convolution operation: the input data block input is the plane data block A2 × B2 at depth = 3, which is convolved with the convolution weights of core_4 to output the 2nd intermediate output block A2 × B2 of core_4. Then the 2nd intermediate convolution block A2 × B2 of core_3 is obtained from the storage space 104 and summed with the 2nd intermediate output block A2 × B2 of core_4 to obtain the 2nd intermediate convolution block A2 × B2 of core_4. At this point, the 2nd intermediate convolution block A2 × B2 of core_4 is exactly the 2nd (layer) plane data block A2 × B2 contained in the convolution result block computed and output by the 4 processing cores of the computing chip. Optionally, core_4 may cache the 2nd intermediate convolution block of core_4 (i.e. the 2nd plane data block of the convolution result block) into the DDR of the computing chip.
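The correctness of the whole scheme rests on convolution being linear in the weights: the per-core partial results must sum to the full conv3D result. A numpy/scipy sketch checking this for the 2nd result plane (the indexing conventions and the weight ownership split are our assumptions, chosen to match fig. 3):

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 4, 4))    # C2 = 5 planes of A2 x B2 = 4 x 4
kernel = rng.standard_normal((3, 3, 3))   # the 3 x 3 x 3 convolution data block

# Partition the 27 weights over 4 cores in load order: 7 / 7 / 7 / 6.
owner = np.repeat(np.arange(4), [7, 7, 7, 6]).reshape(3, 3, 3)

# addPad: one all-zero plane at the head and the tail of the depth axis.
padded = np.concatenate([np.zeros((1, 4, 4)), image, np.zeros((1, 4, 4))])

def result_plane(j, kern3d):
    # j-th result plane (1-indexed): accumulate, over the 3 kernel layers,
    # a 2-D convolution of each layer with input planes j-1, j and j+1.
    return sum(convolve(padded[j - 1 + d], kern3d[d], mode="constant")
               for d in range(3))

full = result_plane(2, kernel)            # the whole kernel on one core
parts = [result_plane(2, np.where(owner == c, kernel, 0.0)) for c in range(4)]
print(np.allclose(full, sum(parts)))      # True: the partials add up
```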
Similarly, the computing chip can move and load in turn, along the depth direction, the C2 plane data blocks A2 × B2 of the image data block, convolving them with the respective convolution weights of the m processing cores to obtain the convolution result block containing C2 plane data blocks A2 × B2. In this example, the specific implementation by which the computing chip obtains the j-th plane data block of the convolution result block through its 4 processing cores is as follows:
for core _1, a plane data block A with depth j-1 layer in the image data block is used2×B2As an input data block, convolution operation is performed by combining 7 convolution weights deployed in core _1 to obtain an intermediate convolution block A of core _12×B2. Please refer to fig. 10, which illustrates a schematic diagram of a possible convolution operation. As shown in fig. 10, the input data block input is a planar data block a with depth j-1 layer in the image data block2×B2Convolution operation is carried out on the convolution weight value in the core _1, and the jth intermediate volume block A of the output putcore _1 is output2×B2. Optionally, core _1 may merge the jth intermediate volume block A of core _12×B2Buffered in memory space 104 for use by core _ 2.
For core_2, the depth = j-1 and j layer plane data blocks A2 × B2 of the image data block are used as the input data block and convolved with the 7 convolution weights deployed in core_2 to obtain the j-th intermediate output block A2 × B2 of core_2, which is then summed with the j-th intermediate convolution block A2 × B2 of core_1 to obtain the j-th intermediate convolution block A2 × B2 of core_2. Fig. 11 shows a schematic diagram of this possible convolution operation: the input data block input consists of the (j-1)-th and j-th layer plane data blocks A2 × B2 of the image data block, which are convolved with the convolution weights in core_2 to output the j-th intermediate output block A2 × B2 of core_2. Further, the j-th intermediate convolution block A2 × B2 of core_1 is obtained from the storage space 104 and summed with the j-th intermediate output block A2 × B2 of core_2 to obtain the j-th intermediate convolution block A2 × B2 of core_2. Optionally, core_2 may cache the j-th intermediate convolution block A2 × B2 of core_2 in the storage space 104 for use by core_3.
For core_3, the depth = j and j+1 layer plane data blocks A2 × B2 of the image data block are used as the input data block and convolved with the 7 convolution weights deployed in core_3 to obtain the j-th intermediate output block A2 × B2 of core_3, which is then summed with the j-th intermediate convolution block A2 × B2 of core_2 to obtain the j-th intermediate convolution block A2 × B2 of core_3. Fig. 12 shows a schematic diagram of this possible convolution operation: the input data block input consists of the j-th and (j+1)-th layer plane data blocks A2 × B2 of the image data block, which are convolved with the convolution weights in core_3 to output the j-th intermediate output block A2 × B2 of core_3. Further, the j-th intermediate convolution block A2 × B2 of core_2 is obtained from the storage space 104 and summed with the j-th intermediate output block A2 × B2 of core_3 to obtain the j-th intermediate convolution block A2 × B2 of core_3. Optionally, core_3 may cache the j-th intermediate convolution block A2 × B2 of core_3 in the storage space 104 for use by core_4.
For core_4, the depth = j+1 layer plane data block A2 × B2 of the image data block is used as the input data block and convolved with the 6 convolution weights deployed in core_4 to obtain the j-th intermediate output block A2 × B2 of core_4, which is then summed with the j-th intermediate convolution block A2 × B2 of core_3 to obtain the j-th intermediate convolution block A2 × B2 of core_4. Fig. 13 shows a schematic diagram of this possible convolution operation: the input data block input is the depth = j+1 layer plane data block A2 × B2 of the image data block, which is convolved with the convolution weights in core_4 to output the j-th intermediate output block A2 × B2 of core_4. Further, the j-th intermediate convolution block A2 × B2 of core_3 is obtained from the storage space 104 and summed with the j-th intermediate output block A2 × B2 of core_4 to obtain the j-th intermediate convolution block A2 × B2 of core_4, which is exactly the j-th plane data block of the convolution result block computed by the 4 processing cores of the computing chip, where j is a positive integer less than or equal to C2. Following this computation principle, the computing chip can compute the convolution result block containing the C2 plane data blocks A2 × B2.
To understand more intuitively how, in the above example, the convolution operation is performed by moving the image data block along the depth direction against the convolution weights of the 4 processing cores, the specific implementation process is shown below in table form. Table 1 is a schematic flow of the convolution operation implemented by the 4 processing cores.
TABLE 1
(Table 1 is reproduced as an image in the original publication. Each of its rows is one logic time slice; its columns show the plane data block being loaded and the operation of each of the 4 processing cores, with entries of the form k(P_i) as explained below.)
In table 1 above, depth = k denotes the k-th plane data block A2 × B2 of the image data block along the depth direction, k being a positive integer less than or equal to C2. P_i denotes an intermediate convolution block of depth i computed by a processing core (also referred to as the i-th intermediate convolution block). Core_n -> Core_(n+1) indicates that the operation result of processing core n is handed over to processing core n+1 for processing core n+1 to operate on.
In table 1, a small square under a processing core indicates that the processing core performs a convolution operation between the k-th plane data block of the image data block and its own convolution weights to obtain its i-th intermediate convolution block. For example, a small square marked 1(P_2) under core_1 represents that the depth = 1 plane data block of the image data block (i.e. the first plane data block of the image data block) serves as the input data block of core_1 and is convolved with the convolution weights of core_1 to obtain the 2nd intermediate convolution block of core_1, denoted P_2. Following the diagonals of the table, the j-th plane data block of the convolution result block is obtained in pipeline fashion across the 4 processing cores. For example, the diagonal flow in table 1 shows specifically how the computing chip performs an internal pipeline operation with the 4 processing cores and thereby computes the 2nd plane data block of the convolution result block (i.e. the plane data block of depth = 2, shown in the table as output 2).
It should be noted that each row of the table corresponds to one logic slice (also referred to as a logic time segment, hereinafter a time period); each row describes the operations performed by the computing chip in that period. For example, the first row represents: in this period the computing chip loads the depth = 1 plane data block of the image data block, i.e. loads the 1st plane data block of the image data block into the storage space 104 (SM). The second row represents: in this period the computing chip loads the 2nd plane data block of the image data block, while core_1 convolves the 1st plane data block of the image data block with the convolution weights of core_1 to obtain the 2nd intermediate convolution block of core_1. The meaning of each further row follows similarly.
It can be seen that with this inter-core pipelining, the m processing cores and the storage space 104 (SM) all run at the same time, making full use of the resources of the whole computing cluster. Inter-core data handover and temporary storage of input data blocks go through the storage space 104. Compared with the prior art, the input data volume can be reduced to 1/m of the original amount, and the intermediate data computed by the processing cores does not need to be stored in the DDR, which avoids the problems of IO bottlenecks, heavy IO traffic and reduced computing efficiency.
In practical applications, each operation of the processing cores in each cluster is designed with corresponding instructions, such as scalar instructions, vector instructions, DDR IO instructions and storage-space IO instructions. The cluster calls the corresponding instructions to implement the data processing method provided by the present invention.
By implementing the embodiment of the invention, the problems of large IO access and storage quantity, IO bottleneck, influence on calculation efficiency and the like in the prior art can be solved.
Based on the foregoing method embodiments, a computing apparatus and a computing device to which the present invention is applicable are described below. Fig. 14 is a schematic structural diagram of a computing apparatus according to an embodiment of the present invention. At least one computing cluster is deployed in the computing apparatus 100 shown in fig. 14, and the computing cluster includes m processing cores. The computing apparatus 100 includes an obtaining unit 102, an allocating unit 104 and a convolution unit 106, wherein:

the obtaining unit 102 is configured to obtain a convolution data block;

the allocating unit 104 is configured to distribute the convolution data block to the m processing cores to obtain the convolution weights of the m processing cores, where each core's convolution weights are a part of the convolution data block and m is a positive integer;

the convolution unit 106 is configured to move the convolution data block in the depth direction over the whole image data block, determine the input data of the m processing cores, and have each processing core perform a convolution operation between its input data and its corresponding convolution weights, so as to invoke the m processing cores to implement the convolution operation between the image data block and the convolution data block and obtain a convolution result block.
In some embodiments, the allocating unit 104 is specifically configured to distribute the convolution data block evenly over the m processing cores to obtain the convolution weights of the m processing cores; where, if the size of the convolution data block is A1 × B1 × C1, the number of convolution weights of each processing core is ⌈(A1 × B1 × C1)/m⌉ (the last processing core taking the remainder), A1 being the height of the convolution data block, B1 its width and C1 its depth.
In still other embodiments, the allocating unit 104 is specifically configured to determine, from the convolution weights of the m processing cores, the number of layers p that each processing core's convolution weights occupy in the convolution data block along the depth direction, and to move and load in turn along the depth direction, according to the layer counts p of the m processing cores, the plane data blocks contained in the image data block, performing convolution operations with the corresponding convolution weights of the m processing cores; the size of the image data block is A2 × B2 × C2, the image data block contains C2 plane data blocks A2 × B2, and A2 is the height of the image data block, B2 the width of the image data block and C2 the depth of the image data block.
In still other embodiments, the number of layers p is a natural number less than or equal to 2.
In still other embodiments, the convolution result block contains C2 plane data blocks A2 × B2, and the allocating unit 104 is specifically configured to move and load in turn, according to the layer counts p of the m processing cores, the (j-1)-th, j-th and (j+1)-th plane data blocks contained in the image data block, and to perform convolution operations with the corresponding convolution weights of the m processing cores to obtain the j-th plane data block A2 × B2 of the convolution result block, j being a positive integer less than or equal to C2.
In still other embodiments, the computing cluster further includes a storage space shared by the m processing cores, used for caching the intermediate convolution block computed by each processing core for use by the next processing core. The allocating unit 104 is specifically configured to invoke the m processing cores to execute the following steps to obtain the j-th plane data block A2 × B2 of the convolution result block:
determining, for the i-th processing core, the input data block that the i-th processing core needs to load according to the layer count p of the i-th processing core, the input data block being drawn from the (j-1)-th, j-th and (j+1)-th plane data blocks contained in the image data block;

performing a convolution operation with the input data block and the convolution weights of the i-th processing core to obtain the j-th intermediate output block of the i-th processing core;

obtaining, from the storage space, the j-th intermediate convolution block computed by the (i-1)-th processing core;

and summing the j-th intermediate convolution block computed by the (i-1)-th processing core with the j-th intermediate output block to obtain the j-th intermediate convolution block of the i-th processing core, and caching it in the storage space for use by the (i+1)-th processing core, i being a positive integer less than or equal to m.
In practical applications, each module or unit involved in the apparatus in the embodiments of the present invention may be specifically implemented by a software program or hardware. When implemented by a software program, each module or unit related to the apparatus is a software module or a software unit, and when implemented by hardware, each module or unit related to the apparatus may be implemented by an application-specific integrated circuit (ASIC), or a Programmable Logic Device (PLD), which may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof, which is not limited in the present invention.
It should be noted that fig. 14 is only one possible implementation manner of the embodiment of the present invention; in practical applications, the computing apparatus may include more or fewer components, which is not limited herein. For content not shown or described in this embodiment, reference may be made to the relevant explanation in the foregoing method embodiment, which is not repeated here.
Fig. 15 is a schematic structural diagram of a computing device according to an embodiment of the present invention. The computing device 600 shown in fig. 15 includes one or more processors 601, a communication interface 602, and a memory 603, and the processors 601, the communication interface 602, and the memory 603 may be connected by a bus, and may also implement communication by other means such as wireless transmission. The embodiment of the present invention is exemplified by being connected through a bus 604, wherein the memory 603 is used for storing instructions, and the processor 601 is used for executing the instructions stored by the memory 603. The memory 603 stores program code, and the processor 601 may call the program code stored in the memory 603 to perform the following operations:
acquiring a convolution data block;
distributing the convolution data block to m processing cores to obtain the convolution weights of the m processing cores, where each core's convolution weights are a part of the convolution data block and m is a positive integer;

and moving the convolution data block in the depth direction over the whole image data block, determining the input data of the m processing cores, and having each processing core perform a convolution operation between its input data and its corresponding convolution weights, so as to invoke the m processing cores to implement the convolution operation between the image data block and the convolution data block and obtain a convolution result block.
In some embodiments, the processor 601 is specifically configured to perform the following operations: evenly distributing the convolution data block to the m processing cores to obtain the convolution weights of the m processing cores; wherein, if the size of the convolution data block is A1 × B1 × C1, the number of convolution weights of each processing core is A1 × B1 × ⌈C1/m⌉, where A1 is the height of the convolution data block, B1 is the width of the convolution data block, and C1 is the depth of the convolution data block.
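As an illustrative check (the ⌈C1/m⌉ layer count above is reconstructed from the surrounding description, since the original formula survives only as an image reference): a 3 × 3 × 4 convolution data block distributed evenly over m = 2 processing cores leaves each core with 3 × 3 × 2 = 18 convolution weights, i.e. two of the four depth layers.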
In still other embodiments, the processor 601 is specifically configured to perform the following operations: determining, according to the convolution weights of the m processing cores, the number p of layers of the convolution data block that the convolution weight of each processing core occupies in the depth direction; and sequentially moving and loading, along the depth direction, the plane data blocks contained in the image data block according to the numbers p of layers corresponding to the m processing cores, and performing the convolution operations corresponding to the convolution weights of the m processing cores; wherein the size of the image data block is A2 × B2 × C2, the image data block contains C2 plane data blocks of size A2 × B2, A2 is the height of the image data block, B2 is the width of the image data block, and C2 is the depth of the image data block.
In still other embodiments, the number of layers p is a natural number less than or equal to 2.
In still other embodiments, the convolution result block contains C2 plane data blocks of size A2 × B2, and the processor 601 is specifically configured to perform the following operations: sequentially moving and loading the (j-1)th, jth and (j+1)th plane data blocks contained in the image data block according to the numbers p of layers corresponding to the m processing cores, and performing the convolution operations with the corresponding convolution weights of the m processing cores to obtain the jth plane data block of size A2 × B2 in the convolution result block, where j is a positive integer less than or equal to C2.
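As a small illustration of this depth-direction sliding window (indices here are zero-based, and the clipping at the first and last planes is an assumption, since the text does not state how the depth boundary is padded):

def planes_for_output(j, C2):
    # Plane data blocks loaded to produce the jth plane of the convolution
    # result block: the (j-1)th, jth and (j+1)th planes, clipped at the
    # ends of the depth range.
    return [k for k in (j - 1, j, j + 1) if 0 <= k < C2]

print(planes_for_output(0, 5))  # [0, 1]     the first plane has no (j-1)th neighbour
print(planes_for_output(2, 5))  # [1, 2, 3]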
In still other embodiments, the computing cluster further includes a storage space shared by the m processing cores, and the storage space is used for caching the intermediate convolution block computed by each processing core for use by the next processing core. The processor 601 is specifically configured to perform the following operations:
calling the m processing cores to execute the following steps to obtain the jth plane data block of size A2 × B2 in the convolution result block:
determining, for the ith processing core, the input data block that the ith processing core needs to load according to the number p of layers corresponding to the ith processing core, where the input data block is one of the (j-1)th, jth and (j+1)th plane data blocks contained in the image data block;
performing a convolution operation on the input data block with the convolution weight of the ith processing core to obtain the jth intermediate output block of the ith processing core;
acquiring, from the storage space, the jth intermediate convolution block computed by the (i-1)th processing core;
and summing the jth intermediate convolution block computed by the (i-1)th processing core and the jth intermediate output block to obtain the jth intermediate convolution block of the ith processing core, and caching the jth intermediate convolution block to the storage space for the (i+1)th processing core to use, where i is a positive integer less than or equal to m.
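By way of illustration only, the following single-process sketch emulates the whole computing-cluster flow described above; NumPy arrays stand in for the processing cores' data, a dictionary stands in for the shared storage space, and the 'same' zero padding and centre-layer alignment in the depth direction are assumptions not stated in the text:

import numpy as np
from scipy.signal import convolve2d

def cluster_depth_convolution(image, kernel, m):
    # Split the kernel depth-wise over m 'cores', slide along the image
    # depth, and chain the per-core partial sums through a shared buffer.
    A2, B2, C2 = image.shape
    C1 = kernel.shape[2]
    slabs = np.array_split(kernel, m, axis=2)               # per-core convolution weights
    offsets = np.cumsum([0] + [s.shape[2] for s in slabs])  # first kernel layer held by each core
    center = C1 // 2                                        # align the middle layer with plane j
    result = np.zeros((A2, B2, C2))
    shared = {}                                             # shared storage space for intermediate blocks
    for j in range(C2):                                     # jth plane of the convolution result block
        for i in range(m):                                  # ith processing core
            partial = np.zeros((A2, B2))                    # jth intermediate output block of core i
            for k in range(slabs[i].shape[2]):              # the p layers this core holds
                src = j + offsets[i] + k - center           # image plane this kernel layer needs
                if 0 <= src < C2:                           # zero padding outside the depth range
                    partial += convolve2d(image[:, :, src], slabs[i][:, :, k], mode='same')
            prev = shared.get((i - 1, j), 0.0)              # (i-1)th core's intermediate block
            shared[(i, j)] = prev + partial                 # cache for the (i+1)th core
        result[:, :, j] = shared[(m - 1, j)]                # the last core holds the finished plane
    return result

out = cluster_depth_convolution(np.random.rand(8, 8, 5), np.random.rand(3, 3, 3), m=2)
print(out.shape)  # (8, 8, 5)

On real hardware each iteration of the inner loop would run on a different processing core, with the dictionary replaced by the cluster's shared storage space.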
For the content that is not shown or not described in the embodiment of the present invention, reference may be made to the related explanation in any one of the embodiments of fig. 1 to fig. 13, which is not described herein again.
It should be understood that, in the embodiment of the present invention, the processor 601 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or may be any conventional processor.
The communication interface 602 may be a wired interface (e.g., an Ethernet interface) or a wireless interface (e.g., a cellular network interface or a wireless local area network interface) for communicating with other units or devices. For example, in the embodiment of the present application, the communication interface 602 may be specifically configured to acquire a static instruction block or a dynamic instruction block.
The memory 603 may include a volatile memory, such as a random access memory (RAM); the memory may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above kinds of memory. The memory may be configured to store a set of program code, so that the processor can call the program code stored in the memory to implement the functions of the above-mentioned functional modules in the embodiments of the present invention.
It should be noted that fig. 15 is only one possible implementation of the embodiment of the present invention, and in practical applications the computing device may include more or fewer components, which is not limited herein. For content not shown or described in the embodiment of the present invention, reference may be made to the relevant explanations in the foregoing method embodiment, which are not repeated here.
Embodiments of the present invention also provide a computer-readable storage medium, which stores instructions that, when executed on a processor, implement the method flow illustrated in fig. 2.
Embodiments of the present invention further provide a computer program product which, when run on a processor, implements the method flow shown in fig. 2.
The computer-readable storage medium may be an internal storage unit of the computing device according to any of the foregoing embodiments, for example, a hard disk or a memory of the computing device. The computer-readable storage medium may also be an external storage device of the computing device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computing device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computing device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computing device, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the terminal device and the units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only one logical function division, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A data processing method, applied to a computing chip on which at least one computing cluster is deployed, wherein the computing cluster comprises m processing cores, and the method comprises the following steps:
acquiring a convolution data block;
distributing the convolution data block to the m processing cores to obtain convolution weights of the m processing cores, wherein each convolution weight is a part of the convolution data block, and m is a positive integer;
and moving the convolution data block in the depth direction of the image data block, determining the input data of the m processing cores, and invoking the m processing cores to perform a convolution operation on the input data of each processing core with the corresponding convolution weight, thereby implementing the convolution operation of the image data block with the convolution data block to obtain a convolution result block.
2. The method of claim 1, wherein the distributing the convolution data block to the m processing cores to obtain the convolution weights of the m processing cores comprises:
evenly distributing the convolution data block to the m processing cores to obtain the convolution weights of the m processing cores;
wherein, if the size of the convolution data block is A1 × B1 × C1, the number of convolution weights of each processing core is A1 × B1 × ⌈C1/m⌉, where A1 is the height of the convolution data block, B1 is the width of the convolution data block, and C1 is the depth of the convolution data block.
3. The method of claim 1, wherein the moving the convolution data block in the depth direction of the image data block and performing the convolution operation with the convolution weights of the m processing cores comprises:
determining, according to the convolution weights of the m processing cores, the number p of layers of the convolution data block that the convolution weight of each processing core occupies in the depth direction;
sequentially moving and loading, along the depth direction, the plane data blocks contained in the image data block according to the numbers p of layers corresponding to the m processing cores, and performing the convolution operations corresponding to the convolution weights of the m processing cores;
wherein the size of the image data block is A2 × B2 × C2, the image data block contains C2 plane data blocks of size A2 × B2, A2 is the height of the image data block, B2 is the width of the image data block, and C2 is the depth of the image data block.
4. The method of claim 3, wherein the number of layers p is a natural number less than or equal to 2.
5. The method of claim 3, wherein the convolution result block contains C2 plane data blocks of size A2 × B2, and the sequentially moving and loading, along the depth direction, the plane data blocks contained in the image data block according to the numbers p of layers corresponding to the m processing cores and performing the convolution operations corresponding to the convolution weights of the m processing cores comprises:
sequentially moving and loading the (j-1)th, jth and (j+1)th plane data blocks contained in the image data block according to the numbers p of layers corresponding to the m processing cores, and performing the convolution operations with the corresponding convolution weights of the m processing cores to obtain the jth plane data block of size A2 × B2 in the convolution result block, where j is a positive integer less than or equal to C2.
6. The method of claim 5, wherein the computing cluster further comprises a storage space shared by the m processing cores, the storage space being used for caching the intermediate convolution block computed by each processing core for use by the next processing core, and
the sequentially moving and loading the (j-1)th, jth and (j+1)th plane data blocks contained in the image data block according to the numbers p of layers corresponding to the m processing cores and performing the convolution operations with the corresponding convolution weights of the m processing cores to obtain the jth plane data block of size A2 × B2 in the convolution result block comprises:
calling the m processing cores to execute the following steps to obtain the jth plane data block of size A2 × B2 in the convolution result block:
determining, for the ith processing core, the input data block that the ith processing core needs to load according to the number p of layers corresponding to the ith processing core, where the input data block is one of the (j-1)th, jth and (j+1)th plane data blocks contained in the image data block;
performing a convolution operation on the input data block with the convolution weight of the ith processing core to obtain the jth intermediate output block of the ith processing core;
acquiring, from the storage space, the jth intermediate convolution block computed by the (i-1)th processing core;
and summing the jth intermediate convolution block computed by the (i-1)th processing core and the jth intermediate output block to obtain the jth intermediate convolution block of the ith processing core, and caching the jth intermediate convolution block to the storage space for the (i+1)th processing core to use, where i is a positive integer less than or equal to m.
7. A computing device, wherein at least one computing cluster is deployed on the computing device, the computing cluster comprises m processing cores, and the device comprises an acquiring unit, an allocation unit and a convolution unit, wherein:
the acquiring unit is used for acquiring a convolution data block;
the allocation unit is configured to distribute the convolution data block to the m processing cores to obtain convolution weights of the m processing cores, wherein each convolution weight is a part of the convolution data block, and m is a positive integer;
the convolution unit is configured to move the convolution data block in the depth direction of the image data block, determine the input data of the m processing cores, and invoke the m processing cores to perform a convolution operation on the input data of each processing core with the corresponding convolution weight, thereby implementing the convolution operation of the image data block with the convolution data block to obtain a convolution result block.
8. A computing chip, wherein a computing cluster comprising m processing cores is deployed on the computing chip, and the computing chip is configured to execute the data processing method of any one of claims 1 to 6.
9. A computing device, comprising a processor, a memory, and a bus, wherein the processor and the memory are connected by the bus, the memory is used for storing instructions, and the processor is used for invoking the instructions stored in the memory to perform the method of any one of claims 1 to 6.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.
CN202010521841.2A 2020-06-09 2020-06-09 Data processing method, related device and computer readable medium Pending CN111767243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010521841.2A CN111767243A (en) 2020-06-09 2020-06-09 Data processing method, related device and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010521841.2A CN111767243A (en) 2020-06-09 2020-06-09 Data processing method, related device and computer readable medium

Publications (1)

Publication Number Publication Date
CN111767243A true CN111767243A (en) 2020-10-13

Family

ID=72720638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010521841.2A Pending CN111767243A (en) 2020-06-09 2020-06-09 Data processing method, related device and computer readable medium

Country Status (1)

Country Link
CN (1) CN111767243A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357894A1 (en) * 2016-06-10 2017-12-14 Apple Inc. Data packing for convolution of artificial neural networks
CN107862650A (en) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 The method of speed-up computation two dimensional image CNN convolution
CN108304923A (en) * 2017-12-06 2018-07-20 腾讯科技(深圳)有限公司 Convolution algorithm processing method and Related product
CN110647975A (en) * 2018-06-27 2020-01-03 龙芯中科技术有限公司 Data processing method, device, equipment and medium
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN109871951A (en) * 2019-03-06 2019-06-11 苏州浪潮智能科技有限公司 A kind of deep learning processor and electronic equipment
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110930290A (en) * 2019-11-13 2020-03-27 东软睿驰汽车技术(沈阳)有限公司 Data processing method and device
CN111199273A (en) * 2019-12-31 2020-05-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837922A (en) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 Computing device, data processing method and related product
WO2023045446A1 (en) * 2021-09-26 2023-03-30 寒武纪(西安)集成电路有限公司 Computing apparatus, data processing method, and related product
CN115187918A (en) * 2022-09-14 2022-10-14 中广核贝谷科技有限公司 Method and system for identifying moving object in monitoring video stream
CN115187918B (en) * 2022-09-14 2022-12-13 中广核贝谷科技有限公司 Method and system for identifying moving object in monitoring video stream

Similar Documents

Publication Publication Date Title
US20230325348A1 (en) Performing concurrent operations in a processing element
CN112214726B (en) Operation accelerator
US9886418B2 (en) Matrix operands for linear algebra operations
CN116541647A (en) Operation accelerator, processing method and related equipment
CN112084038B (en) Memory allocation method and device of neural network
KR20190066473A (en) Method and apparatus for processing convolution operation in neural network
US20210019594A1 (en) Convolutional neural network accelerating device and method
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
US20210019606A1 (en) Cellular neural network integrated circuit having multiple convolution layers of duplicate weights
CN111767243A (en) Data processing method, related device and computer readable medium
CN112668708A (en) Convolution operation device for improving data utilization rate
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN113826118A (en) High throughput neural network operation using inter-layer memory layout transforms
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
CN113837922A (en) Computing device, data processing method and related product
CN111767246B (en) Data processing method, related equipment and computer readable medium
KR20200043617A (en) Artificial neural network module and scheduling method thereof for highly effective operation processing
CN109416743B (en) Three-dimensional convolution device for identifying human actions
JP7108702B2 (en) Processing for multiple input datasets
US20210019602A1 (en) Using and training cellular neural network integrated circuit having multiple convolution layers of duplicate weights in performing artificial intelligence tasks
US11915338B2 (en) Loading apparatus and method for convolution with stride or dilation of 2
CN110929854B (en) Data processing method and device and hardware accelerator
CN113434813A (en) Matrix multiplication method based on neural network and related device
US11544213B2 (en) Neural processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination