WO2021147567A1 - Convolutional operation method and chip - Google Patents

Convolutional operation method and chip Download PDF

Info

Publication number
WO2021147567A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
convolution operation
convolution
weight data
Prior art date
Application number
PCT/CN2020/136383
Other languages
French (fr)
Chinese (zh)
Inventor
王维伟
罗飞
Original Assignee
北京希姆计算科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司
Publication of WO2021147567A1 publication Critical patent/WO2021147567A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present disclosure relates to the field of neural network computing, and in particular to a convolution operation method and chip.
  • the chip is the cornerstone of data processing and fundamentally determines people's ability to process data. From the perspective of application fields, chips follow two main routes. One is the general-purpose route, such as the CPU (Central Processing Unit): such chips offer great flexibility, but their effective computing power on domain-specific algorithms is relatively low. The other is the dedicated-chip route, such as the TPU (Tensor Processing Unit): such chips deliver high effective computing power in certain specific fields, but in flexible, general-purpose fields their processing capability is poor or even absent.
  • CPU: Central Processing Unit
  • TPU: Tensor Processing Unit
  • Neural network is an important model of artificial intelligence, and its core is convolution calculation.
  • existing technical solutions for convolution operations generally fall into two schemes:
  • multi-threaded parallel splitting scheme: this scheme is used on GPUs; the convolution is split into multiple threads that run in parallel, all data and weights are split into as many shares as there are threads, and the convolution is complete once all shares have been computed.
  • an embodiment of the present disclosure provides a convolution operation method used in a chip including multiple processing cores, which is characterized in that it includes:
  • the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
  • the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
  • the processing core executes the sub-task of the convolution operation according to the input data and the sub-weight data to obtain sub-output data.
  • the method further includes:
  • the processing core stores the sub-output data in the system storage space in sequence.
  • the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
  • the size of the sub-weight data is related to the size of the storage space of the processing core.
  • the sub-output data is sub-output data of the output data in the depth direction.
  • embodiments of the present disclosure provide a convolution operation method, including:
  • the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
  • an embodiment of the present disclosure provides a chip including a plurality of processing cores, wherein at least two of the plurality of processing cores execute the convolution operation method described in the first aspect above to complete the convolution operation.
  • an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to execute the computer-readable instructions such that, when running, the processor implements the convolution operation method described in any one of the foregoing first or second aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to make a computer execute the convolution operation method of any one of the foregoing first or second aspect.
  • embodiments of the present disclosure provide a computer program product including computer instructions, where, when the computer instructions are executed by a computing device, the computing device can execute the convolution operation method of any one of the foregoing first or second aspect.
  • an embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in the third aspect.
  • the embodiment of the present disclosure discloses a convolution operation method and chip.
  • the convolution operation method includes: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation; the processing core obtains the input data and sub-weight data from the system storage space according to those storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • the weight data is thus divided into multiple pieces of sub-weight data assigned to multiple processing cores that perform the convolution operation in parallel, which solves the technical problems of poor parallelization and low efficiency of convolution computation in the prior art.
  • Figure 1 is a schematic diagram of the convolution operation process;
  • FIG. 2 is a schematic diagram of the structure of a chip that executes the convolution operation method provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a convolution operation method provided by an embodiment of the disclosure.
  • FIG. 4 is a schematic diagram of the operation of a convolution operation method provided by an embodiment of the disclosure.
  • Fig. 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic diagram of the convolution operation process.
  • the size of the input data (i.e., the input feature map) of the convolution operation is Win*Hin*Cin, where Win represents the width of the input data, Hin represents the height of the input data, and Cin represents the depth of the input data.
  • the weight data (that is, one or more convolution kernels) contains a total of Cout convolution kernels.
  • the size of each convolution kernel is Kw*Kh*Cin, where Kw represents the width of the convolution kernel, Kh represents its height, and Cin represents its depth.
  • during the convolution, each convolution kernel slides over the input data; at each sliding position it performs an element-wise multiply-accumulate with the corresponding block of input data, producing one element of the output data corresponding to that kernel (i.e., one feature point on the output feature map). Since the weight data contains Cout convolution kernels, each kernel performs this multiply-accumulate with the input data at the same position, yielding Cout output elements; these Cout elements form one depth-wise element of the output data, whose depth is Cout. All the convolution kernels slide over the entire input data, and each sliding position yields one element of depth Cout, giving the entire output data.
  • for an element at some output depth l (1 <= l <= Cout), the multiply-accumulate formula is $\mathrm{Dout}^{l} = \sum_{i=1}^{C_{in}} \sum_{j=1}^{K_w} \sum_{k=1}^{K_h} \mathrm{Din}^{i}_{j,k} \cdot w^{l,i}_{j,k}$. Here Dout is a depth-wise element of the output data, and its superscript l is its depth in the output; Din is the block of input data covered by the convolution kernel, its superscript i corresponds to the input-data depth, and j and k correspond to the width and height of the convolution kernel; w is an element of the convolution kernel, i.e., a weight in the neural network computation, and its superscripts l and i correspond to the output-data depth and the input-data depth, respectively.
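To make the index roles concrete, here is a minimal NumPy sketch of this multiply-accumulate. It is a sketch only: it assumes stride 1 and no padding (which the text does not fix) and a width-height-depth array layout.

```python
import numpy as np

def conv2d_naive(din: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Naive convolution. din: Win*Hin*Cin; weights: Cout*Kw*Kh*Cin.

    Returns a Wout*Hout*Cout output, assuming stride 1 and no padding.
    """
    win, hin, cin = din.shape
    cout, kw, kh, _ = weights.shape
    wout, hout = win - kw + 1, hin - kh + 1
    dout = np.zeros((wout, hout, cout))
    for l in range(cout):          # output depth l, one per kernel
        for x in range(wout):      # sliding position along the width
            for y in range(hout):  # sliding position along the height
                # Dout^l = sum over i, j, k of Din^i_{j,k} * w^{l,i}_{j,k}
                block = din[x:x + kw, y:y + kh, :]  # Kw*Kh*Cin input block
                dout[x, y, l] = np.sum(block * weights[l])
    return dout
```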
  • the present disclosure splits the independently executable operations of the convolution into multiple subtasks, each with its own corresponding input data and sub-weight data; the subtasks are allocated to, and executed separately by, the processing cores of a chip that includes multiple processing cores.
  • FIG. 2 is a schematic structural diagram of a chip that executes the convolution operation method provided by an embodiment of the present disclosure.
  • the chip has a multi-core architecture and includes multiple processing cores C1, C2, ..., CM, each capable of processing tasks independently.
  • the processing core can run independently according to its own program and does not need to accept task distribution from the scheduler.
  • the program of the processing core can be dynamically updated by the server, or it can be written into the processing core after the processing core is started, or it can be automatically updated from the system's memory space according to its own initialization program during the operation of the processing core.
  • FIG. 3 is a flowchart of a convolution operation method provided by an embodiment of the disclosure.
  • the convolution operation method in the embodiment of the present disclosure is used in a chip including multiple processing cores as shown in FIG. 2.
  • the following method is described taking one of the multiple processing cores as an example, and includes:
  • Step S301: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data, and the convolution operation subtask is a part of the convolution operation;
  • in this step, the processing core obtains a convolution operation subtask; the subtask is a part of the convolution operation, and its execution order is independent of the convolution operation subtasks of the other processing cores.
  • the convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data required by the subtask, where the storage addresses are addresses in the system storage space. Understandably, the storage addresses of the input data and of the sub-weight data are either the start and end storage addresses of the data, or only the start storage addresses, in which case the convolution operation subtask must also include the size information of the input data and the sub-weight data.
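As an illustration only (the field names below are hypothetical, not taken from the disclosure), the start-address-plus-size variant of such a subtask could be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class ConvSubtask:
    """One convolution operation subtask, addressing system storage.

    Models the start-address-plus-size variant described above; the
    alternative is a start address and an end address for each buffer.
    """
    input_addr: int       # start address of the input data
    input_size: int       # size of the input data
    sub_weight_addr: int  # start address of this core's sub-weight data
    sub_weight_size: int  # size of the sub-weight data
```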
  • Step S302: the processing core obtains the input data and sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
  • the processing core has its own on-core storage space for storing the convolution operation subtask and the input data and sub-weight data that the subtask requires.
  • in this step, the processing core obtains the input data and sub-weight data from the system storage space according to the storage addresses obtained in step S301, and stores them in its own storage space.
  • the weight data includes multiple convolution kernels.
  • as shown in FIG. 1, the complete weight data includes Cout convolution kernels. Since the computation of each convolution kernel against the input data is independent of the others, the multiple convolution kernels in the weight data can be divided into multiple groups, and each group can be processed by one processing core performing its convolution operation separately.
  • the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
  • the number of sub-weight data is equal to the number of processing cores.
  • as shown in FIG. 4, the chip has N processing cores C1, C2, ..., CN, so the weight data is divided into N parts. If it is divided equally, each piece of sub-weight data includes Cout/N convolution kernels; note that this case requires Cout/N to be a positive integer.
  • if Cout/N is not a positive integer, the number of convolution kernels in each piece of sub-weight data can instead be set to $\lceil Cout/N \rceil$, in which case the sub-weight data obtained by one of the processing cores contains fewer than $\lceil Cout/N \rceil$ kernels.
  • assuming Cout/N is a positive integer, the 1st to (Cout/N)-th convolution kernels can serve as the first sub-weight data, the (Cout/N+1)-th to (2Cout/N)-th kernels as the second sub-weight data, ..., and the ((N-1)*Cout/N+1)-th to Cout-th kernels as the N-th sub-weight data.
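A sketch of this grouping by kernel number, including the ceiling rule for the case where Cout is not a multiple of N (it assumes kernels are assigned in number order, with a short final group absorbing the remainder):

```python
import math

def split_kernels(cout: int, n: int) -> list[range]:
    """Assign kernel indices 0..cout-1 to at most n sub-weight groups.

    Each group holds ceil(cout / n) kernels; when cout is not a
    multiple of n, the last group holds fewer.
    """
    per_core = math.ceil(cout / n)
    return [range(s, min(s + per_core, cout))
            for s in range(0, cout, per_core)]

# split_kernels(8, 2)  -> [range(0, 4), range(4, 8)]
# split_kernels(10, 4) -> [range(0, 3), range(3, 6), range(6, 9), range(9, 10)]
```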
  • the number of sub-weight data and the number of processing cores may not be equal. For example, in certain scenarios, some processing cores in the chip are performing other tasks and cannot perform convolution operations. At this time, the input data and the weight data can be divided according to the number of processing cores actually available in the chip, which will not be repeated here.
  • the size of the sub-weight data is related to the size of the storage space of the processing core.
  • if the storage space size of the processing core itself is not considered, the size of the sub-weight data may not match the core's storage space, which in turn makes the processing core inefficient when executing the convolution operation subtask.
  • an appropriate value can be calculated according to the size of the storage space of each processing core, and each piece of sub-weight data can be divided according to this value.
  • in this case, the size of the sub-weight data obtained by each processing core can differ: the weight data is not divided into equal parts, but according to the storage capacity of each available processing core.
  • when calculating the usable storage space of a processing core, the space required by the program corresponding to the convolution operation subtask and the space occupied by the input data must be subtracted from the available space of the core's storage; sub-weight data of a suitable size is then assigned to the core according to the remaining storage space.
  • alternatively, for a processing core whose own storage space is small, the sub-weight data can be further divided into multiple parts, and the processing core computes the corresponding part of its sub-output data from one part at a time; for such a core the computation of the sub-output data is a serial process.
  • when the sub-weight data is further divided, it can be split evenly so that each part is no larger than the core's own storage space, or the size of each part can be set to the size of that storage space.
  • of course, dividing the weight data according to the size of the storage space in the first place avoids this re-division and improves computational efficiency.
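A sketch of that storage-budget calculation, under the assumption that all sizes are measured in the same unit (e.g. bytes); none of these names come from the disclosure:

```python
def kernels_fitting_core(core_storage: int, program_size: int,
                         input_size: int, kernel_size: int) -> int:
    """How many convolution kernels fit in one core's remaining storage.

    The subtask program and the input data are subtracted from the
    core's available storage first; the sub-weight data is then sized
    to the remaining space.
    """
    remaining = core_storage - program_size - input_size
    # If not even one kernel fits, the sub-weight data would have to be
    # further divided and processed serially, as described above.
    return max(remaining // kernel_size, 0)
```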
  • Step S303: the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • after the processing core obtains the input data and sub-weight data required by its own convolution operation subtask, it computes the multiply-accumulate sums of the input data and the sub-weight data in convolution order to obtain the sub-output data.
  • the specific calculation process can be seen in Figure 1.
  • the computation of a single processing core's convolution operation subtask is the same as an ordinary convolution, except that the number of convolution kernels participating is no longer Cout but the number of kernels in the sub-weight data determined as described in step S302; the sub-weight data slides over the input data with the computed stride, multiplying and accumulating to obtain the sub-output data.
  • N processing cores respectively calculate the multiplication and accumulation sum of the sub-weight data and the input data to obtain N sub-output data numbered 1-N.
  • the processing core has completed the subtasks of the convolution operation assigned to itself.
  • the final output data has not yet been obtained at this time, so the method also includes:
  • Step S304: the processing core stores the sub-output data into the system storage space in order.
  • each result obtained by the above convolution operation method is a piece of sub-output data of the output data.
  • as described above, the multiple pieces of sub-output data are portions of the complete output data along the depth direction; no further computation is needed, and they only have to be stored into the system storage space in the depth order of the output data.
  • as shown in FIG. 4, processing core C1 computes the 1st piece of sub-output data, processing core C2 computes the 2nd piece, ..., and processing core CN computes the N-th piece of the output data.
  • the processing core only needs to store the sub-output data in the storage space according to the pre-set storage space address in its own program to obtain the complete output data.
  • the storage address of each piece of sub-output data is related to its position along the depth direction of the output data.
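In array terms, this depth-ordered store amounts to concatenating the sub-outputs along the depth axis; a minimal sketch, assuming the depth axis is last as in the Win*Hin*Cin layout used throughout:

```python
import numpy as np

def assemble_output(sub_outputs: list[np.ndarray]) -> np.ndarray:
    """Merge per-core sub-outputs, each Wout*Hout*(Cout/N) and listed
    in depth order, into the complete Wout*Hout*Cout output."""
    return np.concatenate(sub_outputs, axis=-1)
```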
  • Another embodiment of the present disclosure provides yet another convolution operation method, and the convolution operation method includes:
  • the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
  • the process of dividing the weight data into multiple sub-weight data is also included, and the specific division process may be the same as that described in step S302, which will not be repeated here.
  • the above division can be a logical one: only the storage space of the weight data is partitioned, yielding the start and end storage addresses of each piece of sub-weight data in the system storage space, so that each processing core can obtain its sub-weight data without the data actually being split into multiple pieces.
  • Fig. 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
  • in this example, the weight data is divided equally according to the number of processing cores, that is, into two pieces of sub-weight data in convolution-kernel number order: one containing the 4 convolution kernels numbered 1-4, the other containing the 4 convolution kernels numbered 5-8.
  • C1 and C2 perform their convolution operations in parallel, each outputting one piece of sub-output data.
  • each piece of sub-output data has size 6*6*4; C1 outputs the sub-output data at depths 1-4 of the output data.
  • C2 outputs the sub-output data at depths 5-8 of the output data, and the two pieces are stored in depth order to obtain the complete output data.
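The following self-contained sketch reproduces this two-core example end to end and checks that depth-ordered concatenation recovers the full convolution. The input size 8*8*3 and kernel size 3*3*3 are assumptions chosen so that a stride-1, no-padding convolution yields the 6*6*8 output described above; the figure itself does not state the input dimensions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
din = rng.standard_normal((8, 8, 3))         # assumed input: Win*Hin*Cin
weights = rng.standard_normal((8, 3, 3, 3))  # 8 kernels: Cout*Kw*Kh*Cin

def conv(din: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stride-1, no-padding convolution returning Wout*Hout*Cout."""
    kw, kh = w.shape[1], w.shape[2]
    # windows: (Wout, Hout, Cin, Kw, Kh) sliding blocks of the input
    windows = sliding_window_view(din, (kw, kh), axis=(0, 1))
    return np.einsum('xyijk,ljki->xyl', windows, w)

# Core C1 takes kernels 1-4, core C2 takes kernels 5-8.
sub1 = conv(din, weights[:4])   # 6*6*4, depths 1-4 of the output
sub2 = conv(din, weights[4:])   # 6*6*4, depths 5-8 of the output

# Storing the two pieces in depth order reproduces the full output.
full = np.concatenate([sub1, sub2], axis=-1)  # 6*6*8
assert np.allclose(full, conv(din, weights))
```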
  • the embodiment of the present disclosure discloses a convolution operation method and chip.
  • the convolution operation method includes: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation; the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • the weight data is divided into multiple pieces of sub-weight data and assigned to multiple processing cores that perform the convolution operation in parallel, which solves the technical problems of poor parallelization and low efficiency of convolution computation in the prior art.
  • the embodiment of the present disclosure also provides a chip including a plurality of processing cores, wherein at least two of the plurality of processing cores execute the convolution operation method to complete the convolution operation.
  • An embodiment of the present disclosure also provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions so that, when running, the processor implements the convolution operation method described in any one of the foregoing embodiments.
  • the embodiments of the present disclosure also provide a non-transitory computer-readable storage medium that stores computer instructions, where the computer instructions are used to make a computer execute the convolution operation method of any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computer program product including computer instructions, where, when the computer instructions are executed by a computing device, the computing device can execute the convolution operation method of any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in any of the foregoing embodiments.
  • each block in the flowchart or block diagrams may represent a module, program segment, or part of code, which contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two blocks shown one after another can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure can be implemented in software or hardware; in some cases, the name of a unit does not constitute a limitation on the unit itself.
  • exemplary types of hardware logic components include: Field-Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASIC), Application-Specific Standard Products (ASSP), Systems on Chip (SOC), Complex Programmable Logic Devices (CPLD), and so on.
  • FPGA: Field-Programmable Gate Array
  • ASIC: Application-Specific Integrated Circuit
  • ASSP: Application-Specific Standard Product
  • SOC: System on Chip
  • CPLD: Complex Programmable Logic Device
  • a machine-readable medium may be a tangible medium, which may contain or store a program for use by the instruction execution system, apparatus, or device or in combination with the instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • the machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • RAM: random access memory
  • ROM: read-only memory
  • EPROM (or flash memory): erasable programmable read-only memory
  • CD-ROM: compact disk read-only memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

A convolutional operation method and a chip. The convolutional operation method comprises: a processing core acquires a convolutional operation subtask, where the convolutional operation subtask comprises a storage address of input data and a storage address of weighted sub-data, and the convolutional operation subtask is a part of a convolutional operation (S301); the processing core acquires the input data and the weighted sub-data from a system storage space on the basis of the storage address of the input data and the storage address of the weighted sub-data, where the weighted sub-data is a part of the weighted data of the convolutional operation (S302); and the processing core executes the convolutional operation subtask on the basis of the input data and the weighted sub-data to produce output sub-data (S303). By dividing the weighted data into multiple pieces of weighted sub-data assigned to multiple processing cores that perform the convolutional operation, the method solves the technical problem of poor convolutional operation parallelization and low efficiency in the prior art.

Description

Convolution Operation Method and Chip
This disclosure claims priority to the Chinese patent application No. 202010070481.9, entitled "Convolution Operation Method and Chip", filed on January 21, 2020, which is incorporated into this application by reference in its entirety.
Technical Field
The present disclosure relates to the field of neural network computing, and in particular to a convolution operation method and chip.
Background
With the development of science and technology, human society is rapidly entering the era of intelligence. An important feature of this era is that people obtain ever more kinds of data in ever larger volumes, and require ever higher processing speed.
The chip is the cornerstone of data processing and fundamentally determines people's ability to process data. From the perspective of application fields, chips follow two main routes. One is the general-purpose route, such as the CPU (Central Processing Unit): such chips offer great flexibility, but their effective computing power on domain-specific algorithms is relatively low. The other is the dedicated-chip route, such as the TPU (Tensor Processing Unit): such chips deliver high effective computing power in certain specific fields, but in flexible, general-purpose fields their processing capability is poor or even absent.
The neural network is an important model of artificial intelligence, and its core is convolution computation. Existing technical solutions for convolution operations generally fall into two schemes:
(1) Overall calculation scheme: this scheme is used on a single-core CPU. Following the convolution formula, a single core performs the point-by-point multiplication and accumulation of the input data and the weight data to obtain the final result.
(2) Multi-threaded parallel splitting scheme: this scheme is used on GPUs. The convolution is split into multiple threads that run in parallel; all data and weights are split into as many shares as there are threads, and the convolution is complete once all shares have been computed.
However, the processing granularity of scheme (1) is too coarse: the entire convolution is realized on one processing core, so parallelization is poor and applications with strict latency requirements cannot be satisfied; reducing the latency requires raising the computing power of the processing core, at a high hardware cost. In scheme (2), the split granularity of the input data and weight data is too fine, the splitting process is complicated, and a complex scheduler must be designed, which is inefficient and costly.
Summary
This summary is provided to introduce concepts in a brief form; the concepts are described in detail in the detailed description below. This summary is not intended to identify key or essential features of the claimed technical solution, nor to limit its scope.
To solve the above technical problems of convolution computation in the prior art, embodiments of the present disclosure propose the following technical solutions:
In a first aspect, an embodiment of the present disclosure provides a convolution operation method used in a chip including multiple processing cores, including:
the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
the processing core obtains the input data and the sub-weight data from a system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
Further, the method also includes: the processing core stores the sub-output data into the system storage space in order.
Further, the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
Further, the size of the sub-weight data is related to the size of the storage space of the processing core.
Further, the sub-output data is sub-output data of the output data in the depth direction.
In a second aspect, an embodiment of the present disclosure provides a convolution operation method, including:
obtaining the input data and weight data required by the convolution operation;
dividing the weight data into multiple pieces of sub-weight data, where the weight data includes multiple convolution kernels and each piece of sub-weight data is at least one convolution kernel among the multiple convolution kernels;
inputting the input data and the multiple pieces of sub-weight data into multiple processing cores to perform the convolution operation and obtain multiple pieces of sub-output data;
merging the multiple pieces of sub-output data to obtain the output data.
In a third aspect, an embodiment of the present disclosure provides a chip including multiple processing cores, where at least two of the multiple processing cores execute the convolution operation method of the first aspect to complete the convolution operation.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions, such that when running, the processor implements the convolution operation method of any one of the first or second aspect.
In a fifth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to make a computer execute the convolution operation method of any one of the first or second aspect.
In a sixth aspect, an embodiment of the present disclosure provides a computer program product including computer instructions, where, when the computer instructions are executed by a computing device, the computing device can execute the convolution operation method of any one of the first or second aspect.
In a seventh aspect, an embodiment of the present disclosure provides a computing device including the chip described in the third aspect.
An embodiment of the present disclosure discloses a convolution operation method and chip. The convolution operation method includes: a processing core obtains a convolution operation subtask, where the subtask includes a storage address of input data and a storage address of sub-weight data, and the subtask is a part of the convolution operation; the processing core obtains the input data and sub-weight data from a system storage space according to those storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one of those kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data. With this method, the weight data is divided into multiple pieces of sub-weight data assigned to multiple processing cores that perform the convolution operation in parallel, solving the technical problems of poor parallelization and low efficiency of convolution computation in the prior art.
The above description is only an overview of the technical solutions of the present disclosure. In order to understand the technical means of the present disclosure more clearly so that they can be implemented in accordance with the content of the specification, and to make the above and other objectives, features, and advantages of the present disclosure more apparent, preferred embodiments are described in detail below in conjunction with the drawings.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent in conjunction with the drawings and with reference to the following detailed description. Throughout the drawings, the same or similar reference signs indicate the same or similar elements. It should be understood that the drawings are schematic and that parts and elements are not necessarily drawn to scale.
Figure 1 is a schematic diagram of the convolution operation process;
Figure 2 is a schematic structural diagram of a chip that executes the convolution operation method provided by an embodiment of the present disclosure;
Figure 3 is a flowchart of a convolution operation method provided by an embodiment of the present disclosure;
Figure 4 is a schematic diagram of the operation of a convolution operation method provided by an embodiment of the present disclosure;
Figure 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. It should be understood that the drawings and embodiments of the present disclosure are only exemplary and are not intended to limit its scope of protection.
It should be understood that the steps recorded in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used here are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
Note that the concepts "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, not to limit the order of or interdependence between the functions they perform.
Note that the modifiers "a" and "multiple" in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit their scope.
Figure 1 is a schematic diagram of the convolution operation process. As shown in Figure 1, the size of the input data (i.e., the input feature map) of the convolution operation is Win*Hin*Cin, where Win is the width, Hin the height, and Cin the depth of the input data. The weight data (i.e., one or more convolution kernels) contains a total of Cout convolution kernels, and the size of each convolution kernel is Kw*Kh*Cin, where Kw is the width, Kh the height, and Cin the depth of the convolution kernel. During the convolution, each convolution kernel slides over the input data; at each sliding position it performs an element-wise multiply-accumulate with the corresponding block of input data, producing one element of the output data corresponding to that kernel (i.e., one feature point on the output feature map). Since the weight data contains Cout convolution kernels, each kernel performs this multiply-accumulate with the input data at the same position, yielding Cout output elements; these Cout elements form one depth-wise element of the output data, whose depth is Cout. All the convolution kernels slide over the entire input data, and each sliding position yields one element of depth Cout, giving the entire output data.
For an element at some output depth l (1 <= l <= Cout), the multiply-accumulate formula is:
$\mathrm{Dout}^{l} = \sum_{i=1}^{C_{in}} \sum_{j=1}^{K_w} \sum_{k=1}^{K_h} \mathrm{Din}^{i}_{j,k} \cdot w^{l,i}_{j,k}$
Dout is a depth-wise element of the output data, and its superscript l is its depth in the output; Din is the block of input data covered by the convolution kernel, its superscript i corresponds to the input-data depth, and j and k correspond to the width and height of the convolution kernel; w is an element of the convolution kernel, i.e., a weight in the neural network computation, and its superscripts l and i correspond to the output-data depth and the input-data depth, respectively.
The present disclosure splits the independently executable operations of the convolution into multiple subtasks, each with its own corresponding input data and sub-weight data; the subtasks are allocated to, and executed separately by, the processing cores of a chip that includes multiple processing cores.
Figure 2 is a schematic structural diagram of a chip that executes the convolution operation method provided by an embodiment of the present disclosure. As shown in Figure 2, the chip has a multi-core architecture and includes multiple processing cores C1, C2, ..., CM, each capable of processing tasks independently. A processing core can run independently according to its own program and does not need to accept task distribution from a scheduler. The program of a processing core can be dynamically updated from the server side, written into the core after it starts, or automatically updated from the system memory space by the core's own initialization program while it runs.
Figure 3 is a flowchart of a convolution operation method provided by an embodiment of the present disclosure. The convolution operation method of this embodiment is used in a chip including multiple processing cores as shown in Figure 2. The method below is described taking one of the multiple processing cores as an example, and includes:
Step S301: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
In this step, the processing core obtains a convolution operation subtask; the subtask is a part of the convolution operation, and its execution order is independent of the convolution operation subtasks of the other processing cores.
The convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data required by the subtask, where the storage addresses are addresses in the system storage space. Understandably, the storage addresses of the input data and of the sub-weight data are either the start and end storage addresses of the data, or only the start storage addresses, in which case the convolution operation subtask must also include the size information of the input data and the sub-weight data.
Step S302: the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
The processing core has its own on-core storage space for storing the convolution operation subtask and the input data and sub-weight data that the subtask requires. In this step, the processing core obtains the input data and sub-weight data from the system storage space according to the storage addresses obtained in step S301, and stores them in its own storage space.
Understandably, the weight data includes multiple convolution kernels. As shown in Figure 1, the complete weight data includes Cout convolution kernels. Since the computation of each convolution kernel against the input data is independent of the others, the multiple convolution kernels in the weight data can be divided into multiple groups, and each group can be processed by one processing core performing its convolution operation separately.
Optionally, the number of convolution kernels in the sub-weight data is determined by the number of processing cores. Exemplarily, the number of pieces of sub-weight data equals the number of processing cores. As shown in Figure 4, the chip has N processing cores C1, C2, ..., CN, so the weight data is divided into N parts. If it is divided equally, each piece of sub-weight data includes Cout/N convolution kernels; note that this case requires Cout/N to be a positive integer. If Cout/N is not a positive integer, the number of convolution kernels in each piece of sub-weight data can instead be set to $\lceil Cout/N \rceil$, in which case the sub-weight data obtained by one of the processing cores contains fewer than $\lceil Cout/N \rceil$ kernels. As shown in Figure 4, assuming Cout/N is a positive integer, the 1st to (Cout/N)-th convolution kernels can serve as the first piece of sub-weight data, the (Cout/N+1)-th to (2Cout/N)-th kernels as the second piece, ..., and the ((N-1)*Cout/N+1)-th to Cout-th kernels as the N-th piece. Understandably, the number of pieces of sub-weight data and the number of processing cores may differ: for example, in some scenarios some processing cores of the chip are executing other tasks and cannot perform the convolution operation, in which case the input data and weight data can be divided according to the number of processing cores actually available in the chip, which is not repeated here.
Optionally, the size of the sub-weight data is related to the size of the storage space of the processing core. The preceding optional embodiment does not consider the size of each core's own storage space, so the sub-weight data may not match the core's storage space, making the core inefficient when executing its convolution operation subtask. Instead, an appropriate value can be computed from the size of each processing core's storage space and each piece of sub-weight data divided according to that value; the pieces obtained by different cores may then differ in size, the weight data being divided not evenly but according to the storage capacity of each available core. Exemplarily, when calculating a core's usable storage, the space required by the program corresponding to the convolution operation subtask and the space occupied by the input data must be subtracted from the available space of the core's storage, and sub-weight data of a suitable size is assigned according to the remaining space. Alternatively, for a processing core whose own storage space is small, the sub-weight data can be further divided into multiple parts, and the core computes the corresponding part of its sub-output data from one part at a time; for such a core the computation of the sub-output data is a serial process. When further dividing the sub-weight data, it can be split evenly so that each part is no larger than the core's own storage space, or each part's size can be set to the size of that storage space. Of course, dividing the weight data according to the size of the storage space in the first place avoids this re-division and improves computational efficiency.
Step S303: the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
After the processing core obtains the input data and sub-weight data required by its own convolution operation subtask, it computes the multiply-accumulate sums of the input data and the sub-weight data in the order of the convolution operation to obtain the sub-output data. The specific calculation process is shown in Figure 1: the operation performed by a single processing core is identical to an ordinary convolution, except that the number of convolution kernels participating in the calculation is no longer Cout but the number of kernels in the sub-weight data determined as described in step S302. The sub-weight data slides over the input data according to the stride, computing multiply-accumulate sums to obtain the sub-output data. As shown in Figure 4, the N processing cores each compute the multiply-accumulate sum of their sub-weight data with the input data, yielding N sub-output data numbered 1 to N.
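A single core's subtask can be pictured with the following sketch (numpy and the channel-last (H, W, C) layout are assumptions made for illustration only):

    import numpy as np

    def conv_subtask(x, sub_w, stride=1):
        """x: (Hin, Win, Cin); sub_w: (k, Kh, Kw, Cin), this core's kernels.
        Returns the (Hout, Wout, k) sub-output data."""
        hin, win, cin = x.shape
        k, kh, kw, _ = sub_w.shape
        hout = (hin - kh) // stride + 1
        wout = (win - kw) // stride + 1
        y = np.zeros((hout, wout, k))
        for c in range(k):              # only this core's subset of kernels
            for i in range(hout):
                for j in range(wout):
                    window = x[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                    y[i, j, c] = np.sum(window * sub_w[c])  # multiply-accumulate
        return y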
Through steps S301 to S303 above, the processing core has completed the convolution operation subtask assigned to it. The final output data, however, has not yet been obtained at this point, so the method further includes:
Step S304: the processing core stores the sub-output data into the system storage space in order. Everything produced by the above convolution operation method is a sub-output data of the output data. From the description above, the multiple sub-output data are partial data of the complete output data along the depth direction and require no further computation; they only need to be stored into the system storage space according to the depth storage order of the output data. As shown in Figure 4, processing core C1 computes the 1st sub-output data of the output data, processing core C2 computes the 2nd, ..., and processing core CN computes the N-th. Each processing core only needs to store its sub-output data at the storage address preset in its own program to obtain the complete output data; the storage address of each sub-output data is determined by its position along the depth direction of the output data.
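Assuming a depth-major (Cout, Hout, Wout) output layout, which the "depth storage order" wording suggests although the patent does not pin down a layout, each core's sub-output occupies one contiguous block, and its preset start address follows directly from its depth position (illustrative sketch; all names are assumptions):

    def sub_output_address(base_addr, k, cout_per_core, hout, wout, elem_size=4):
        # Core k's sub-output covers depths [k*cout_per_core, (k+1)*cout_per_core)
        return base_addr + k * cout_per_core * hout * wout * elem_size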
Another embodiment of the present disclosure provides a further convolution operation method, the convolution operation method comprising:
obtaining the input data and the weight data required by the convolution operation;
dividing the weight data into multiple sub-weight data, where the weight data comprises multiple convolution kernels and each sub-weight data is at least one convolution kernel among the multiple convolution kernels;
inputting the input data and the multiple sub-weight data respectively into multiple processing cores to perform the convolution operation, obtaining multiple sub-output data;
merging the multiple sub-output data to obtain the output data.
The above example also includes the process of dividing the weight data into multiple sub-weight data; the specific division may be the same as described in step S302 and is not repeated here. It should further be understood that this division can be purely logical: only the storage space of the weight data is partitioned, yielding the start and end storage addresses of each sub-weight data in the system storage space, so that each processing core can fetch its sub-weight data without the data actually being split into multiple copies.
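A small sketch of such a logical division (illustrative Python; it assumes equal shares and a contiguous, kernel-major weight layout in the system storage space):

    def weight_partitions(weight_base, cout, n_parts, kh, kw, cin, elem_size=4):
        # Return the (start, end) byte addresses of each sub-weight block;
        # the weight data itself is never moved or copied.
        kernel_bytes = kh * kw * cin * elem_size
        per_part = cout // n_parts
        return [(weight_base + i * per_part * kernel_bytes,
                 weight_base + (i + 1) * per_part * kernel_bytes)
                for i in range(n_parts)]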
Figure 5 shows a specific example of the convolution operation method according to an embodiment of the present disclosure. As shown in Figure 5, the chip includes two processing cores C1 and C2. The input data has equal width and height, Win = Hin = 8, and depth Cin = 4; the output data has equal width and height, Wout = Hout = 6, and depth Cout = 8; each convolution kernel has equal width and height, Kw = Kh = 3, and depth Cin = 4; the number of convolution kernels is Cout = 8, and the sliding stride is 1. In this example the weight data is divided evenly by the number of processing cores, i.e. into two sub-weight data in kernel-number order: the first sub-weight data comprising the 4 kernels numbered 1-4, and the second comprising the 4 kernels numbered 5-8. The first sub-weight data and the input data are fed to C1 for convolution, and the second sub-weight data and the input data are fed to C2. C1 and C2 perform their convolutions in parallel and each outputs one sub-output data of size 6*6*4: C1 produces the sub-output data at depths 1-4 of the output data, and C2 the sub-output data at depths 5-8. Storing the two sub-output data in depth order yields the complete output data.
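The Figure 5 numbers can be checked end to end with the sketch below (numpy, a channel-last layout, and random data are all assumptions for illustration; the patent itself prescribes no particular layout):

    import numpy as np

    def conv2d(x, w):  # stride 1, no padding
        kh, kw = w.shape[1], w.shape[2]
        hout, wout = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        windows = np.stack([x[i:i+kh, j:j+kw, :]
                            for i in range(hout) for j in range(wout)])
        return np.einsum('nhwc,khwc->nk', windows, w).reshape(hout, wout, -1)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8, 4))     # Hin = Win = 8, Cin = 4
    w = rng.standard_normal((8, 3, 3, 4))  # Cout = 8 kernels of 3*3*4

    sub1 = conv2d(x, w[:4])   # core C1: kernels 1-4 -> 6*6*4
    sub2 = conv2d(x, w[4:])   # core C2: kernels 5-8 -> 6*6*4
    merged = np.concatenate([sub1, sub2], axis=-1)

    assert merged.shape == (6, 6, 8)
    assert np.allclose(merged, conv2d(x, w))  # matches the unsplit convolution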
The embodiments of the present disclosure disclose a convolution operation method and chip. The convolution operation method includes: the processing core obtains a convolution operation subtask, where the subtask includes the storage address of input data and the storage address of sub-weight data, and the subtask is a part of the convolution operation; the processing core fetches the input data and the sub-weight data from the system storage space according to those storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data comprises multiple convolution kernels, and the sub-weight data is at least one of those kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data. By dividing the weight data into multiple sub-weight data and distributing them to multiple processing cores for parallel convolution, the method solves the prior-art problems of poor parallelization and low efficiency in convolution computation.
An embodiment of the present disclosure further provides a chip including multiple processing cores, at least two of which execute the above convolution operation method to complete a convolution operation.
An embodiment of the present disclosure further provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions such that, when running, the processor implements the convolution operation method of any of the foregoing embodiments.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the convolution operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to execute the convolution operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computing apparatus including the chip of any of the foregoing embodiments.
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. Each block in a flowchart or block diagram may represent a module, program segment, or portion of code containing one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware; in some cases the name of a unit does not constitute a limitation on the unit itself.
The functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium, and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

Claims (10)

  1. A convolution operation method for use in a chip comprising multiple processing cores, characterized by comprising:
    the processing core obtaining a convolution operation subtask, wherein the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
    the processing core obtaining the input data and the sub-weight data from a system storage space according to the storage address of the input data and the storage address of the sub-weight data, wherein the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data comprises multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
    the processing core executing the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  2. The convolution operation method of claim 1, characterized in that the method further comprises:
    the processing core storing the sub-output data into the system storage space in order.
  3. The convolution operation method of claim 1 or 2, characterized in that:
    the number of convolution kernels in the sub-weight data is determined by the number of the processing cores.
  4. The convolution operation method of any one of claims 1-3, characterized in that:
    the size of the sub-weight data is related to the size of the storage space of the processing core.
  5. The convolution operation method of any one of claims 1-4, characterized in that:
    the sub-output data is sub-output data of the output data in the depth direction.
  6. A chip comprising multiple processing cores, wherein at least two of the multiple processing cores execute the convolution operation method of claims 1-5 to complete a convolution operation.
  7. A convolution operation method, characterized by comprising:
    obtaining the input data and the weight data required by the convolution operation;
    dividing the weight data into multiple sub-weight data, wherein the weight data comprises multiple convolution kernels and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
    inputting the input data and the multiple sub-weight data respectively into multiple processing cores to perform the convolution operation, obtaining multiple sub-output data;
    merging the multiple sub-output data to obtain output data.
  8. An electronic device, comprising: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions such that, when running, the processor implements the convolution operation method of any one of claims 1-5 or claim 7.
  9. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the convolution operation method of any one of claims 1-5 or claim 7.
  10. A computing apparatus, characterized by comprising the chip of claim 6.
PCT/CN2020/136383 2020-01-21 2020-12-15 Convolutional operation method and chip WO2021147567A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010070481.9A CN113222136A (en) 2020-01-21 2020-01-21 Convolution operation method and chip
CN202010070481.9 2020-01-21

Publications (1)

Publication Number Publication Date
WO2021147567A1 true WO2021147567A1 (en) 2021-07-29

Family

ID=76991794

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136383 WO2021147567A1 (en) 2020-01-21 2020-12-15 Convolutional operation method and chip

Country Status (2)

Country Link
CN (1) CN113222136A (en)
WO (1) WO2021147567A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837922A (en) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 Computing device, data processing method and related product
CN115858178B (en) * 2023-02-21 2023-06-06 芯砺智能科技(上海)有限公司 Method, device, medium and equipment for sharing resources in convolution calculation


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190051697A (en) * 2017-11-07 2019-05-15 삼성전자주식회사 Method and apparatus for performing devonvolution operation in neural network
CN110473137B (en) * 2019-04-24 2021-09-14 华为技术有限公司 Image processing method and device
CN110689115B (en) * 2019-09-24 2023-03-31 安徽寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114114A1 (en) * 2016-10-21 2018-04-26 Nvidia Corporation Systems and methods for pruning neural networks for resource efficient inference
CN107862650A (en) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 The method of speed-up computation two dimensional image CNN convolution
CN107885700A (en) * 2017-12-29 2018-04-06 中国人民解放军国防科技大学 Multi-core implementation method for large-scale matrix convolution
CN108416434A (en) * 2018-02-07 2018-08-17 复旦大学 The circuit structure accelerated with full articulamentum for the convolutional layer of neural network
CN109165734A (en) * 2018-07-11 2019-01-08 中国人民解放军国防科技大学 Matrix local response normalization vectorization implementation method
CN110009103A (en) * 2019-03-26 2019-07-12 深兰科技(上海)有限公司 A kind of method and apparatus of deep learning convolutional calculation

Also Published As

Publication number Publication date
CN113222136A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
Auten et al. Hardware acceleration of graph neural networks
Li et al. Quantum supremacy circuit simulation on Sunway TaihuLight
US9053067B2 (en) Distributed data scalable adaptive map-reduce framework
Busato et al. An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures
WO2018099084A1 (en) Method, device, chip and system for training neural network model
WO2021147567A1 (en) Convolutional operation method and chip
US20140333638A1 (en) Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
US9164690B2 (en) System, method, and computer program product for copying data between memory locations
WO2017076296A1 (en) Method and device for processing graph data
TW202147188A (en) Method of training neural network model and related product
CN109033439B (en) The treating method and apparatus of stream data
WO2021072732A1 (en) Matrix computing circuit, apparatus and method
CN110069502A (en) Data balancing partition method and computer storage medium based on Spark framework
CN110659278A (en) Graph data distributed processing system based on CPU-GPU heterogeneous architecture
CN113222125A (en) Convolution operation method and chip
CN112784973A (en) Convolution operation circuit, device and method
CN114746871A (en) Neural network training using dataflow graphs and dynamic memory management
Ma et al. FPGA-based AI smart NICs for scalable distributed AI training systems
CN113222099A (en) Convolution operation method and chip
Mana A feature based comparison study of big data scheduling algorithms
Schmidt et al. Load-balanced parallel constraint-based causal structure learning on multi-core systems for high-dimensional data
WO2015143708A1 (en) Method and apparatus for constructing suffix array
WO2021218492A1 (en) Task allocation method and apparatus, electronic device, and computer readable storage medium
Jaspers Acceleration of read alignment with coherent attached FPGA coprocessors
CN114283046A (en) Point cloud file registration method and device based on ICP algorithm and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20916090

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20916090

Country of ref document: EP

Kind code of ref document: A1