WO2021147567A1 - Convolutional operation method and chip - Google Patents

Convolutional operation method and chip Download PDF

Info

Publication number
WO2021147567A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
convolution operation
convolution
weight data
Prior art date
Application number
PCT/CN2020/136383
Other languages
French (fr)
Chinese (zh)
Inventor
王维伟
罗飞
Original Assignee
北京希姆计算科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司
Publication of WO2021147567A1 publication Critical patent/WO2021147567A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present disclosure relates to the field of neural network computing, and in particular to a convolution operation method and chip.
  • the chip is the cornerstone of data processing and fundamentally determines people's ability to process data. From the perspective of application fields, chips follow two main routes. One is the general-purpose route, such as the CPU (Central Processing Unit): such chips offer great flexibility, but their effective computing power on domain-specific algorithms is relatively low. The other is the dedicated-chip route, such as the TPU (Tensor Processing Unit): such chips deliver high effective computing power in certain specific fields, but in flexible, general-purpose fields their processing capability is poor or even absent.
  • CPU: Central Processing Unit
  • TPU: Tensor Processing Unit
  • Neural network is an important model of artificial intelligence, and its core is convolution calculation.
  • existing technical solutions for convolution operations generally fall into two schemes:
  • multi-threaded parallel splitting scheme: this scheme is used on GPUs; the convolution is split into multiple threads that run in parallel, all data and weights are split into as many shares as there are threads, and the convolution is complete once all shares have been computed.
  • an embodiment of the present disclosure provides a convolution operation method used in a chip including multiple processing cores, which is characterized in that it includes:
  • the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
  • the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
  • the processing core executes the sub-task of the convolution operation according to the input data and the sub-weight data to obtain sub-output data.
  • the method further includes:
  • the processing core stores the sub-output data in the system storage space in sequence.
  • the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
  • the size of the sub-weight data is related to the size of the storage space of the processing core.
  • the sub-output data is sub-output data of the output data in the depth direction.
  • embodiments of the present disclosure provide a convolution operation method, including:
  • the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
  • an embodiment of the present disclosure provides a chip including a plurality of processing cores, wherein at least two of the plurality of processing cores execute the convolution operation method described in the first aspect above to complete the convolution operation.
  • an embodiment of the present disclosure provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to execute the computer-readable instructions such that, when running, the processor implements the convolution operation method described in any one of the foregoing first or second aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to make a computer execute the convolution operation method of any one of the foregoing first or second aspect.
  • embodiments of the present disclosure provide a computer program product including computer instructions, where, when the computer instructions are executed by a computing device, the computing device can execute the convolution operation method of any one of the foregoing first or second aspect.
  • an embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in the third aspect.
  • the embodiment of the present disclosure discloses a convolution operation method and chip.
  • the convolution operation method includes: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation; the processing core obtains the input data and sub-weight data from the system storage space according to those storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • the weight data is thus divided into multiple pieces of sub-weight data assigned to multiple processing cores that perform the convolution operation in parallel, which solves the technical problems of poor parallelization and low efficiency of convolution computation in the prior art.
  • Figure 1 is a schematic diagram of the convolution operation process;
  • FIG. 2 is a schematic diagram of the structure of a chip that executes the convolution operation method provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a convolution operation method provided by an embodiment of the disclosure.
  • FIG. 4 is a schematic diagram of the operation of a convolution operation method provided by an embodiment of the disclosure.
  • Fig. 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic diagram of the convolution operation process.
  • the size of the input data (i.e., the input feature map) of the convolution operation is Win*Hin*Cin, where Win represents the width of the input data, Hin represents the height of the input data, and Cin represents the depth of the input data.
  • the weight data (that is, one or more convolution kernels) contains a total of Cout convolution kernels.
  • the size of each convolution kernel is Kw*Kh*Cin, where Kw represents the width of the convolution kernel, Kh represents its height, and Cin represents its depth.
  • during the convolution, each convolution kernel slides over the input data; at each sliding position it performs an element-wise multiply-accumulate with the corresponding block of input data, producing one element of the output data corresponding to that kernel (i.e., one feature point on the output feature map). Since the weight data contains Cout convolution kernels, each kernel performs this multiply-accumulate with the input data at the same position, yielding Cout output elements; these Cout elements form one depth-wise element of the output data, whose depth is Cout. All the convolution kernels slide over the entire input data, and each sliding position yields one element of depth Cout, giving the entire output data.
  • for an element at some output depth l (1 <= l <= Cout), the multiply-accumulate formula is $\mathrm{Dout}^{l} = \sum_{i=1}^{C_{in}} \sum_{j=1}^{K_w} \sum_{k=1}^{K_h} \mathrm{Din}^{i}_{j,k} \cdot w^{l,i}_{j,k}$. Here Dout is a depth-wise element of the output data, and its superscript l is its depth in the output; Din is the block of input data covered by the convolution kernel, its superscript i corresponds to the input-data depth, and j and k correspond to the width and height of the convolution kernel; w is an element of the convolution kernel, i.e., a weight in the neural network computation, and its superscripts l and i correspond to the output-data depth and the input-data depth, respectively.
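To make the index roles concrete, here is a minimal NumPy sketch of this multiply-accumulate. It is a sketch only: it assumes stride 1 and no padding (which the text does not fix) and a width-height-depth array layout.

```python
import numpy as np

def conv2d_naive(din: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Naive convolution. din: Win*Hin*Cin; weights: Cout*Kw*Kh*Cin.

    Returns a Wout*Hout*Cout output, assuming stride 1 and no padding.
    """
    win, hin, cin = din.shape
    cout, kw, kh, _ = weights.shape
    wout, hout = win - kw + 1, hin - kh + 1
    dout = np.zeros((wout, hout, cout))
    for l in range(cout):          # output depth l, one per kernel
        for x in range(wout):      # sliding position along the width
            for y in range(hout):  # sliding position along the height
                # Dout^l = sum over i, j, k of Din^i_{j,k} * w^{l,i}_{j,k}
                block = din[x:x + kw, y:y + kh, :]  # Kw*Kh*Cin input block
                dout[x, y, l] = np.sum(block * weights[l])
    return dout
```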
  • the present disclosure splits the independently executable operations of the convolution into multiple subtasks, each with its own corresponding input data and sub-weight data; the subtasks are allocated to, and executed separately by, the processing cores of a chip that includes multiple processing cores.
  • FIG. 2 is a schematic structural diagram of a chip that executes the convolution operation method provided by an embodiment of the present disclosure.
  • the chip has a multi-core architecture and includes multiple processing cores C1, C2, ..., CM, each capable of processing tasks independently.
  • the processing core can run independently according to its own program and does not need to accept task distribution from the scheduler.
  • the program of the processing core can be dynamically updated by the server, or it can be written into the processing core after the processing core is started, or it can be automatically updated from the system's memory space according to its own initialization program during the operation of the processing core.
  • FIG. 3 is a flowchart of a convolution operation method provided by an embodiment of the disclosure.
  • the convolution operation method in the embodiment of the present disclosure is used in a chip including multiple processing cores as shown in FIG. 2.
  • the following method is described taking one of the multiple processing cores as an example, and includes:
  • Step S301: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data, and the convolution operation subtask is a part of the convolution operation;
  • in this step, the processing core obtains a convolution operation subtask; the subtask is a part of the convolution operation, and its execution order is independent of the convolution operation subtasks of the other processing cores.
  • the convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data required by the subtask, where the storage addresses are addresses in the system storage space. Understandably, the storage addresses of the input data and of the sub-weight data are either the start and end storage addresses of the data, or only the start storage addresses, in which case the convolution operation subtask must also include the size information of the input data and the sub-weight data.
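As an illustration only (the field names below are hypothetical, not taken from the disclosure), the start-address-plus-size variant of such a subtask could be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class ConvSubtask:
    """One convolution operation subtask, addressing system storage.

    Models the start-address-plus-size variant described above; the
    alternative is a start address and an end address for each buffer.
    """
    input_addr: int       # start address of the input data
    input_size: int       # size of the input data
    sub_weight_addr: int  # start address of this core's sub-weight data
    sub_weight_size: int  # size of the sub-weight data
```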
  • Step S302: the processing core obtains the input data and sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
  • the processing core has its own on-core storage space for storing the convolution operation subtask and the input data and sub-weight data that the subtask requires.
  • in this step, the processing core obtains the input data and sub-weight data from the system storage space according to the storage addresses obtained in step S301, and stores them in its own storage space.
  • the weight data includes multiple convolution kernels.
  • as shown in FIG. 1, the complete weight data includes Cout convolution kernels. Since the computation of each convolution kernel against the input data is independent of the others, the multiple convolution kernels in the weight data can be divided into multiple groups, and each group can be processed by one processing core performing its convolution operation separately.
  • the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
  • the number of sub-weight data is equal to the number of processing cores.
  • as shown in FIG. 4, the chip has N processing cores C1, C2, ..., CN, so the weight data is divided into N parts. If it is divided equally, each piece of sub-weight data includes Cout/N convolution kernels; note that this case requires Cout/N to be a positive integer.
  • if Cout/N is not a positive integer, the number of convolution kernels in each piece of sub-weight data can instead be set to $\lceil Cout/N \rceil$, in which case the sub-weight data obtained by one of the processing cores contains fewer than $\lceil Cout/N \rceil$ kernels.
  • assuming Cout/N is a positive integer, the 1st to (Cout/N)-th convolution kernels can serve as the first sub-weight data, the (Cout/N+1)-th to (2Cout/N)-th kernels as the second sub-weight data, ..., and the ((N-1)*Cout/N+1)-th to Cout-th kernels as the N-th sub-weight data.
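A sketch of this grouping by kernel number, including the ceiling rule for the case where Cout is not a multiple of N (it assumes kernels are assigned in number order, with a short final group absorbing the remainder):

```python
import math

def split_kernels(cout: int, n: int) -> list[range]:
    """Assign kernel indices 0..cout-1 to at most n sub-weight groups.

    Each group holds ceil(cout / n) kernels; when cout is not a
    multiple of n, the last group holds fewer.
    """
    per_core = math.ceil(cout / n)
    return [range(s, min(s + per_core, cout))
            for s in range(0, cout, per_core)]

# split_kernels(8, 2)  -> [range(0, 4), range(4, 8)]
# split_kernels(10, 4) -> [range(0, 3), range(3, 6), range(6, 9), range(9, 10)]
```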
  • the number of sub-weight data and the number of processing cores may not be equal. For example, in certain scenarios, some processing cores in the chip are performing other tasks and cannot perform convolution operations. At this time, the input data and the weight data can be divided according to the number of processing cores actually available in the chip, which will not be repeated here.
  • the size of the sub-weight data is related to the size of the storage space of the processing core.
  • if the storage space size of the processing core itself is not considered, the size of the sub-weight data may not match the core's storage space, which in turn makes the processing core inefficient when executing the convolution operation subtask.
  • an appropriate value can be calculated according to the size of the storage space of each processing core, and each piece of sub-weight data can be divided according to this value.
  • in this case, the size of the sub-weight data obtained by each processing core can differ: the weight data is not divided into equal parts, but according to the storage capacity of each available processing core.
  • when calculating the usable storage space of a processing core, the space required by the program corresponding to the convolution operation subtask and the space occupied by the input data must be subtracted from the available space of the core's storage; sub-weight data of a suitable size is then assigned to the core according to the remaining storage space.
  • alternatively, for a processing core whose own storage space is small, the sub-weight data can be further divided into multiple parts, and the processing core computes the corresponding part of its sub-output data from one part at a time; for such a core the computation of the sub-output data is a serial process.
  • when the sub-weight data is further divided, it can be split evenly so that each part is no larger than the core's own storage space, or the size of each part can be set to the size of that storage space.
  • of course, dividing the weight data according to the size of the storage space in the first place avoids this re-division and improves computational efficiency.
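A sketch of that storage-budget calculation, under the assumption that all sizes are measured in the same unit (e.g. bytes); none of these names come from the disclosure:

```python
def kernels_fitting_core(core_storage: int, program_size: int,
                         input_size: int, kernel_size: int) -> int:
    """How many convolution kernels fit in one core's remaining storage.

    The subtask program and the input data are subtracted from the
    core's available storage first; the sub-weight data is then sized
    to the remaining space.
    """
    remaining = core_storage - program_size - input_size
    # If not even one kernel fits, the sub-weight data would have to be
    # further divided and processed serially, as described above.
    return max(remaining // kernel_size, 0)
```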
  • Step S303: the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • after the processing core obtains the input data and sub-weight data required by its own convolution operation subtask, it computes the multiply-accumulate sums of the input data and the sub-weight data in convolution order to obtain the sub-output data.
  • the specific calculation process can be seen in Figure 1.
  • the computation of a single processing core's convolution operation subtask is the same as an ordinary convolution, except that the number of convolution kernels participating is no longer Cout but the number of kernels in the sub-weight data determined as described in step S302; the sub-weight data slides over the input data with the computed stride, multiplying and accumulating to obtain the sub-output data.
  • N processing cores respectively calculate the multiplication and accumulation sum of the sub-weight data and the input data to obtain N sub-output data numbered 1-N.
  • the processing core has completed the subtasks of the convolution operation assigned to itself.
  • the final output data has not yet been obtained at this time, so the method also includes:
  • Step S304: the processing core stores the sub-output data into the system storage space in order.
  • each result obtained by the above convolution operation method is a piece of sub-output data of the output data.
  • as described above, the multiple pieces of sub-output data are portions of the complete output data along the depth direction; no further computation is needed, and they only have to be stored into the system storage space in the depth order of the output data.
  • as shown in FIG. 4, processing core C1 computes the 1st piece of sub-output data, processing core C2 computes the 2nd piece, ..., and processing core CN computes the N-th piece of the output data.
  • the processing core only needs to store the sub-output data in the storage space according to the pre-set storage space address in its own program to obtain the complete output data.
  • the storage address of each piece of sub-output data is related to its position along the depth direction of the output data.
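In array terms, this depth-ordered store amounts to concatenating the sub-outputs along the depth axis; a minimal sketch, assuming the depth axis is last as in the Win*Hin*Cin layout used throughout:

```python
import numpy as np

def assemble_output(sub_outputs: list[np.ndarray]) -> np.ndarray:
    """Merge per-core sub-outputs, each Wout*Hout*(Cout/N) and listed
    in depth order, into the complete Wout*Hout*Cout output."""
    return np.concatenate(sub_outputs, axis=-1)
```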
  • Another embodiment of the present disclosure provides yet another convolution operation method, and the convolution operation method includes:
  • the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
  • the process of dividing the weight data into multiple sub-weight data is also included, and the specific division process may be the same as that described in step S302, which will not be repeated here.
  • the above division can be a logical one: only the storage space of the weight data is partitioned, yielding the start and end storage addresses of each piece of sub-weight data in the system storage space, so that each processing core can obtain its sub-weight data without the data actually being split into multiple pieces.
  • Fig. 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
  • in this example, the weight data is divided equally according to the number of processing cores, that is, into two pieces of sub-weight data in convolution-kernel number order: one containing the 4 convolution kernels numbered 1-4, the other containing the 4 convolution kernels numbered 5-8.
  • C1 and C2 perform their convolution operations in parallel, each outputting one piece of sub-output data.
  • each piece of sub-output data has size 6*6*4; C1 outputs the sub-output data at depths 1-4 of the output data.
  • C2 outputs the sub-output data at depths 5-8 of the output data, and the two pieces are stored in depth order to obtain the complete output data.
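The following self-contained sketch reproduces this two-core example end to end and checks that depth-ordered concatenation recovers the full convolution. The input size 8*8*3 and kernel size 3*3*3 are assumptions chosen so that a stride-1, no-padding convolution yields the 6*6*8 output described above; the figure itself does not state the input dimensions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
din = rng.standard_normal((8, 8, 3))         # assumed input: Win*Hin*Cin
weights = rng.standard_normal((8, 3, 3, 3))  # 8 kernels: Cout*Kw*Kh*Cin

def conv(din: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stride-1, no-padding convolution returning Wout*Hout*Cout."""
    kw, kh = w.shape[1], w.shape[2]
    # windows: (Wout, Hout, Cin, Kw, Kh) sliding blocks of the input
    windows = sliding_window_view(din, (kw, kh), axis=(0, 1))
    return np.einsum('xyijk,ljki->xyl', windows, w)

# Core C1 takes kernels 1-4, core C2 takes kernels 5-8.
sub1 = conv(din, weights[:4])   # 6*6*4, depths 1-4 of the output
sub2 = conv(din, weights[4:])   # 6*6*4, depths 5-8 of the output

# Storing the two pieces in depth order reproduces the full output.
full = np.concatenate([sub1, sub2], axis=-1)  # 6*6*8
assert np.allclose(full, conv(din, weights))
```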
  • the embodiment of the present disclosure discloses a convolution operation method and chip.
  • the convolution operation method includes: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation; the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • the weight data is divided into multiple pieces of sub-weight data and assigned to multiple processing cores that perform the convolution operation in parallel, which solves the technical problems of poor parallelization and low efficiency of convolution computation in the prior art.
  • the embodiment of the present disclosure also provides a chip including a plurality of processing cores, wherein at least two of the plurality of processing cores execute the convolution operation method to complete the convolution operation.
  • An embodiment of the present disclosure also provides an electronic device, including: a memory configured to store computer-readable instructions; and one or more processors configured to run the computer-readable instructions so that, when running, the processor implements the convolution operation method described in any one of the foregoing embodiments.
  • the embodiments of the present disclosure also provide a non-transitory computer-readable storage medium that stores computer instructions, where the computer instructions are used to make a computer execute the convolution operation method of any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computer program product including computer instructions, where, when the computer instructions are executed by a computing device, the computing device can execute the convolution operation method of any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in any of the foregoing embodiments.
  • each block in the flowchart or block diagrams may represent a module, program segment, or part of code, which contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two blocks shown one after another can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure can be implemented in software or hardware; in some cases, the name of a unit does not constitute a limitation on the unit itself.
  • exemplary types of hardware logic components include: Field-Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASIC), Application-Specific Standard Products (ASSP), Systems on Chip (SOC), Complex Programmable Logic Devices (CPLD), and so on.
  • FPGA: Field-Programmable Gate Array
  • ASIC: Application-Specific Integrated Circuit
  • ASSP: Application-Specific Standard Product
  • SOC: System on Chip
  • CPLD: Complex Programmable Logic Device
  • a machine-readable medium may be a tangible medium, which may contain or store a program for use by the instruction execution system, apparatus, or device or in combination with the instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • the machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • RAM: random access memory
  • ROM: read-only memory
  • EPROM (or flash memory): erasable programmable read-only memory
  • CD-ROM: compact disk read-only memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

A convolutional operation method and a chip. The convolutional operation method comprises: a processing core acquires a convolutional operation subtask, where the convolutional operation subtask comprises a storage address of input data and a storage address of weighted sub-data, and the convolutional operation subtask is a part of a convolutional operation (S301); the processing core acquires the input data and the weighted sub-data from a system storage space on the basis of the storage address of the input data and the storage address of the weighted sub-data, where the weighted sub-data is a part of the weighted data of the convolutional operation (S302); and the processing core executes the convolutional operation subtask on the basis of the input data and the weighted sub-data to produce output sub-data (S303). By dividing the weighted data into multiple pieces of weighted sub-data assigned to multiple processing cores that perform the convolutional operation, the method solves the technical problem of poor convolutional operation parallelization and low efficiency in the prior art.

Description

Convolution Operation Method and Chip
This disclosure claims priority to the Chinese patent application No. 202010070481.9, entitled "Convolution Operation Method and Chip", filed on January 21, 2020, which is incorporated into this application by reference in its entirety.
Technical Field
The present disclosure relates to the field of neural network computing, and in particular to a convolution operation method and chip.
Background
With the development of science and technology, human society is rapidly entering the era of intelligence. An important feature of this era is that people obtain ever more kinds of data in ever larger volumes, and require ever higher processing speed.
The chip is the cornerstone of data processing and fundamentally determines people's ability to process data. From the perspective of application fields, chips follow two main routes. One is the general-purpose route, such as the CPU (Central Processing Unit): such chips offer great flexibility, but their effective computing power on domain-specific algorithms is relatively low. The other is the dedicated-chip route, such as the TPU (Tensor Processing Unit): such chips deliver high effective computing power in certain specific fields, but in flexible, general-purpose fields their processing capability is poor or even absent.
The neural network is an important model of artificial intelligence, and its core is convolution computation. Existing technical solutions for convolution operations generally fall into two schemes:
(1) Overall calculation scheme: this scheme is used on a single-core CPU. Following the convolution formula, a single core performs the point-by-point multiplication and accumulation of the input data and the weight data to obtain the final result.
(2) Multi-threaded parallel splitting scheme: this scheme is used on GPUs. The convolution is split into multiple threads that run in parallel; all data and weights are split into as many shares as there are threads, and the convolution is complete once all shares have been computed.
However, the processing granularity of scheme (1) is too coarse: the entire convolution is realized on one processing core, so parallelization is poor and applications with strict latency requirements cannot be satisfied; reducing the latency requires raising the computing power of the processing core, at a high hardware cost. In scheme (2), the split granularity of the input data and weight data is too fine, the splitting process is complicated, and a complex scheduler must be designed, which is inefficient and costly.
Summary
This summary is provided to introduce concepts in a brief form; the concepts are described in detail in the detailed description below. This summary is not intended to identify key or essential features of the claimed technical solution, nor to limit its scope.
To solve the above technical problems of convolution computation in the prior art, embodiments of the present disclosure propose the following technical solutions:
In a first aspect, an embodiment of the present disclosure provides a convolution operation method used in a chip including multiple processing cores, including:
the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
the processing core obtains the input data and the sub-weight data from a system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
Further, the method also includes: the processing core stores the sub-output data into the system storage space in order.
Further, the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
Further, the size of the sub-weight data is related to the size of the storage space of the processing core.
Further, the sub-output data is sub-output data of the output data in the depth direction.
In a second aspect, an embodiment of the present disclosure provides a convolution operation method, including:
obtaining the input data and weight data required by the convolution operation;
dividing the weight data into multiple pieces of sub-weight data, where the weight data includes multiple convolution kernels and each piece of sub-weight data is at least one convolution kernel among the multiple convolution kernels;
inputting the input data and the multiple pieces of sub-weight data into multiple processing cores to perform the convolution operation and obtain multiple pieces of sub-output data;
merging the multiple pieces of sub-output data to obtain the output data.
In a third aspect, an embodiment of the present disclosure provides a chip including multiple processing cores, where at least two of the multiple processing cores execute the convolution operation method of the first aspect to complete the convolution operation.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions, such that when running, the processor implements the convolution operation method of any one of the first or second aspect.
In a fifth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to make a computer execute the convolution operation method of any one of the first or second aspect.
In a sixth aspect, an embodiment of the present disclosure provides a computer program product including computer instructions, where, when the computer instructions are executed by a computing device, the computing device can execute the convolution operation method of any one of the first or second aspect.
In a seventh aspect, an embodiment of the present disclosure provides a computing device including the chip described in the third aspect.
An embodiment of the present disclosure discloses a convolution operation method and chip. The convolution operation method includes: a processing core obtains a convolution operation subtask, where the subtask includes a storage address of input data and a storage address of sub-weight data, and the subtask is a part of the convolution operation; the processing core obtains the input data and sub-weight data from a system storage space according to those storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one of those kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data. With this method, the weight data is divided into multiple pieces of sub-weight data assigned to multiple processing cores that perform the convolution operation in parallel, solving the technical problems of poor parallelization and low efficiency of convolution computation in the prior art.
The above description is only an overview of the technical solutions of the present disclosure. In order to understand the technical means of the present disclosure more clearly so that they can be implemented in accordance with the content of the specification, and to make the above and other objectives, features, and advantages of the present disclosure more apparent, preferred embodiments are described in detail below in conjunction with the drawings.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent in conjunction with the drawings and with reference to the following detailed description. Throughout the drawings, the same or similar reference signs indicate the same or similar elements. It should be understood that the drawings are schematic and that parts and elements are not necessarily drawn to scale.
Figure 1 is a schematic diagram of the convolution operation process;
Figure 2 is a schematic structural diagram of a chip that executes the convolution operation method provided by an embodiment of the present disclosure;
Figure 3 is a flowchart of a convolution operation method provided by an embodiment of the present disclosure;
Figure 4 is a schematic diagram of the operation of a convolution operation method provided by an embodiment of the present disclosure;
Figure 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. It should be understood that the drawings and embodiments of the present disclosure are only exemplary and are not intended to limit its scope of protection.
It should be understood that the steps recorded in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used here are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
Note that the concepts "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, not to limit the order of or interdependence between the functions they perform.
Note that the modifiers "a" and "multiple" in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit their scope.
Figure 1 is a schematic diagram of the convolution operation process. As shown in Figure 1, the size of the input data (i.e., the input feature map) of the convolution operation is Win*Hin*Cin, where Win is the width, Hin the height, and Cin the depth of the input data. The weight data (i.e., one or more convolution kernels) contains a total of Cout convolution kernels, and the size of each convolution kernel is Kw*Kh*Cin, where Kw is the width, Kh the height, and Cin the depth of the convolution kernel. During the convolution, each convolution kernel slides over the input data; at each sliding position it performs an element-wise multiply-accumulate with the corresponding block of input data, producing one element of the output data corresponding to that kernel (i.e., one feature point on the output feature map). Since the weight data contains Cout convolution kernels, each kernel performs this multiply-accumulate with the input data at the same position, yielding Cout output elements; these Cout elements form one depth-wise element of the output data, whose depth is Cout. All the convolution kernels slide over the entire input data, and each sliding position yields one element of depth Cout, giving the entire output data.
For an element at some output depth l (1 <= l <= Cout), the multiply-accumulate formula is:
$\mathrm{Dout}^{l} = \sum_{i=1}^{C_{in}} \sum_{j=1}^{K_w} \sum_{k=1}^{K_h} \mathrm{Din}^{i}_{j,k} \cdot w^{l,i}_{j,k}$
Dout is a depth-wise element of the output data, and its superscript l is its depth in the output; Din is the block of input data covered by the convolution kernel, its superscript i corresponds to the input-data depth, and j and k correspond to the width and height of the convolution kernel; w is an element of the convolution kernel, i.e., a weight in the neural network computation, and its superscripts l and i correspond to the output-data depth and the input-data depth, respectively.
The present disclosure splits the independently executable operations of the convolution into multiple subtasks, each with its own corresponding input data and sub-weight data; the subtasks are allocated to, and executed separately by, the processing cores of a chip that includes multiple processing cores.
Figure 2 is a schematic structural diagram of a chip that executes the convolution operation method provided by an embodiment of the present disclosure. As shown in Figure 2, the chip has a multi-core architecture and includes multiple processing cores C1, C2, ..., CM, each capable of processing tasks independently. A processing core can run independently according to its own program and does not need to accept task distribution from a scheduler. The program of a processing core can be dynamically updated from the server side, written into the core after it starts, or automatically updated from the system memory space by the core's own initialization program while it runs.
Figure 3 is a flowchart of a convolution operation method provided by an embodiment of the present disclosure. The convolution operation method of this embodiment is used in a chip including multiple processing cores as shown in Figure 2. The method below is described taking one of the multiple processing cores as an example, and includes:
Step S301: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
In this step, the processing core obtains a convolution operation subtask; the subtask is a part of the convolution operation, and its execution order is independent of the convolution operation subtasks of the other processing cores.
The convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data required by the subtask, where the storage addresses are addresses in the system storage space. Understandably, the storage addresses of the input data and of the sub-weight data are either the start and end storage addresses of the data, or only the start storage addresses, in which case the convolution operation subtask must also include the size information of the input data and the sub-weight data.
Step S302: the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
The processing core has its own on-core storage space for storing the convolution operation subtask and the input data and sub-weight data that the subtask requires. In this step, the processing core obtains the input data and sub-weight data from the system storage space according to the storage addresses obtained in step S301, and stores them in its own storage space.
Understandably, the weight data includes multiple convolution kernels. As shown in Figure 1, the complete weight data includes Cout convolution kernels. Since the computation of each convolution kernel against the input data is independent of the others, the multiple convolution kernels in the weight data can be divided into multiple groups, and each group can be processed by one processing core performing its convolution operation separately.
Optionally, the number of convolution kernels in the sub-weight data is determined by the number of processing cores. Exemplarily, the number of pieces of sub-weight data equals the number of processing cores. As shown in Figure 4, the chip has N processing cores C1, C2, ..., CN, so the weight data is divided into N parts. If it is divided equally, each piece of sub-weight data includes Cout/N convolution kernels; note that this case requires Cout/N to be a positive integer. If Cout/N is not a positive integer, the number of convolution kernels in each piece of sub-weight data can instead be set to $\lceil Cout/N \rceil$, in which case the sub-weight data obtained by one of the processing cores contains fewer than $\lceil Cout/N \rceil$ kernels. As shown in Figure 4, assuming Cout/N is a positive integer, the 1st to (Cout/N)-th convolution kernels can serve as the first piece of sub-weight data, the (Cout/N+1)-th to (2Cout/N)-th kernels as the second piece, ..., and the ((N-1)*Cout/N+1)-th to Cout-th kernels as the N-th piece. Understandably, the number of pieces of sub-weight data and the number of processing cores may differ: for example, in some scenarios some processing cores of the chip are executing other tasks and cannot perform the convolution operation, in which case the input data and weight data can be divided according to the number of processing cores actually available in the chip, which is not repeated here.
Optionally, the size of the sub-weight data is related to the size of the storage space of the processing core. The preceding optional embodiment does not consider the size of each core's own storage space, so the sub-weight data may not match the core's storage space, making the core inefficient when executing its convolution operation subtask. Instead, an appropriate value can be computed from the size of each processing core's storage space and each piece of sub-weight data divided according to that value; the pieces obtained by different cores may then differ in size, the weight data being divided not evenly but according to the storage capacity of each available core. Exemplarily, when calculating a core's usable storage, the space required by the program corresponding to the convolution operation subtask and the space occupied by the input data must be subtracted from the available space of the core's storage, and sub-weight data of a suitable size is assigned according to the remaining space. Alternatively, for a processing core whose own storage space is small, the sub-weight data can be further divided into multiple parts, and the core computes the corresponding part of its sub-output data from one part at a time; for such a core the computation of the sub-output data is a serial process. When further dividing the sub-weight data, it can be split evenly so that each part is no larger than the core's own storage space, or each part's size can be set to the size of that storage space. Of course, dividing the weight data according to the size of the storage space in the first place avoids this re-division and improves computational efficiency.
Step S303: the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
After the processing core obtains the input data and sub-weight data required by its own convolution operation subtask, it computes the multiply-accumulate sums of the input data and the sub-weight data in the order of the convolution operation to obtain the sub-output data. The specific calculation process is shown in Figure 1: the operation performed by a single processing core is identical to an ordinary convolution, except that the number of convolution kernels participating in the calculation is no longer Cout but the number of kernels in the sub-weight data determined as described in step S302. The sub-weight data slides over the input data according to the stride, computing multiply-accumulate sums to obtain the sub-output data. As shown in Figure 4, the N processing cores each compute the multiply-accumulate sum of their sub-weight data with the input data, yielding N sub-output data numbered 1 to N.
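A single core's subtask can be pictured with the following sketch (numpy and the channel-last (H, W, C) layout are assumptions made for illustration only):

    import numpy as np

    def conv_subtask(x, sub_w, stride=1):
        """x: (Hin, Win, Cin); sub_w: (k, Kh, Kw, Cin), this core's kernels.
        Returns the (Hout, Wout, k) sub-output data."""
        hin, win, cin = x.shape
        k, kh, kw, _ = sub_w.shape
        hout = (hin - kh) // stride + 1
        wout = (win - kw) // stride + 1
        y = np.zeros((hout, wout, k))
        for c in range(k):              # only this core's subset of kernels
            for i in range(hout):
                for j in range(wout):
                    window = x[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                    y[i, j, c] = np.sum(window * sub_w[c])  # multiply-accumulate
        return y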
Through steps S301 to S303 above, the processing core has completed the convolution operation subtask assigned to it. The final output data, however, has not yet been obtained at this point, so the method further includes:
Step S304: the processing core stores the sub-output data into the system storage space in order. Everything produced by the above convolution operation method is a sub-output data of the output data. From the description above, the multiple sub-output data are partial data of the complete output data along the depth direction and require no further computation; they only need to be stored into the system storage space according to the depth storage order of the output data. As shown in Figure 4, processing core C1 computes the 1st sub-output data of the output data, processing core C2 computes the 2nd, ..., and processing core CN computes the N-th. Each processing core only needs to store its sub-output data at the storage address preset in its own program to obtain the complete output data; the storage address of each sub-output data is determined by its position along the depth direction of the output data.
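Assuming a depth-major (Cout, Hout, Wout) output layout, which the "depth storage order" wording suggests although the patent does not pin down a layout, each core's sub-output occupies one contiguous block, and its preset start address follows directly from its depth position (illustrative sketch; all names are assumptions):

    def sub_output_address(base_addr, k, cout_per_core, hout, wout, elem_size=4):
        # Core k's sub-output covers depths [k*cout_per_core, (k+1)*cout_per_core)
        return base_addr + k * cout_per_core * hout * wout * elem_size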
Another embodiment of the present disclosure provides a further convolution operation method, the convolution operation method comprising:
obtaining the input data and the weight data required by the convolution operation;
dividing the weight data into multiple sub-weight data, where the weight data comprises multiple convolution kernels and each sub-weight data is at least one convolution kernel among the multiple convolution kernels;
inputting the input data and the multiple sub-weight data respectively into multiple processing cores to perform the convolution operation, obtaining multiple sub-output data;
merging the multiple sub-output data to obtain the output data.
The above example also includes the process of dividing the weight data into multiple sub-weight data; the specific division may be the same as described in step S302 and is not repeated here. It should further be understood that this division can be purely logical: only the storage space of the weight data is partitioned, yielding the start and end storage addresses of each sub-weight data in the system storage space, so that each processing core can fetch its sub-weight data without the data actually being split into multiple copies.
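A small sketch of such a logical division (illustrative Python; it assumes equal shares and a contiguous, kernel-major weight layout in the system storage space):

    def weight_partitions(weight_base, cout, n_parts, kh, kw, cin, elem_size=4):
        # Return the (start, end) byte addresses of each sub-weight block;
        # the weight data itself is never moved or copied.
        kernel_bytes = kh * kw * cin * elem_size
        per_part = cout // n_parts
        return [(weight_base + i * per_part * kernel_bytes,
                 weight_base + (i + 1) * per_part * kernel_bytes)
                for i in range(n_parts)]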
Figure 5 shows a specific example of the convolution operation method according to an embodiment of the present disclosure. As shown in Figure 5, the chip includes two processing cores C1 and C2. The input data has equal width and height, Win = Hin = 8, and depth Cin = 4; the output data has equal width and height, Wout = Hout = 6, and depth Cout = 8; each convolution kernel has equal width and height, Kw = Kh = 3, and depth Cin = 4; the number of convolution kernels is Cout = 8, and the sliding stride is 1. In this example the weight data is divided evenly by the number of processing cores, i.e. into two sub-weight data in kernel-number order: the first sub-weight data comprising the 4 kernels numbered 1-4, and the second comprising the 4 kernels numbered 5-8. The first sub-weight data and the input data are fed to C1 for convolution, and the second sub-weight data and the input data are fed to C2. C1 and C2 perform their convolutions in parallel and each outputs one sub-output data of size 6*6*4: C1 produces the sub-output data at depths 1-4 of the output data, and C2 the sub-output data at depths 5-8. Storing the two sub-output data in depth order yields the complete output data.
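The Figure 5 numbers can be checked end to end with the sketch below (numpy, a channel-last layout, and random data are all assumptions for illustration; the patent itself prescribes no particular layout):

    import numpy as np

    def conv2d(x, w):  # stride 1, no padding
        kh, kw = w.shape[1], w.shape[2]
        hout, wout = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        windows = np.stack([x[i:i+kh, j:j+kw, :]
                            for i in range(hout) for j in range(wout)])
        return np.einsum('nhwc,khwc->nk', windows, w).reshape(hout, wout, -1)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8, 4))     # Hin = Win = 8, Cin = 4
    w = rng.standard_normal((8, 3, 3, 4))  # Cout = 8 kernels of 3*3*4

    sub1 = conv2d(x, w[:4])   # core C1: kernels 1-4 -> 6*6*4
    sub2 = conv2d(x, w[4:])   # core C2: kernels 5-8 -> 6*6*4
    merged = np.concatenate([sub1, sub2], axis=-1)

    assert merged.shape == (6, 6, 8)
    assert np.allclose(merged, conv2d(x, w))  # matches the unsplit convolution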
The embodiments of the present disclosure disclose a convolution operation method and chip. The convolution operation method includes: the processing core obtains a convolution operation subtask, where the subtask includes the storage address of input data and the storage address of sub-weight data, and the subtask is a part of the convolution operation; the processing core fetches the input data and the sub-weight data from the system storage space according to those storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data comprises multiple convolution kernels, and the sub-weight data is at least one of those kernels; the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data. By dividing the weight data into multiple sub-weight data and distributing them to multiple processing cores for parallel convolution, the method solves the prior-art problems of poor parallelization and low efficiency in convolution computation.
An embodiment of the present disclosure further provides a chip including multiple processing cores, at least two of which execute the above convolution operation method to complete a convolution operation.
An embodiment of the present disclosure further provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions such that, when running, the processor implements the convolution operation method of any of the foregoing embodiments.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the convolution operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to execute the convolution operation method of any of the foregoing embodiments.
An embodiment of the present disclosure provides a computing apparatus including the chip of any of the foregoing embodiments.
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. Each block in a flowchart or block diagram may represent a module, program segment, or portion of code containing one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware; in some cases the name of a unit does not constitute a limitation on the unit itself.
The functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium, and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

Claims (10)

  1. A convolution operation method for use in a chip comprising multiple processing cores, characterized by comprising:
    the processing core obtaining a convolution operation subtask, wherein the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
    the processing core obtaining the input data and the sub-weight data from a system storage space according to the storage address of the input data and the storage address of the sub-weight data, wherein the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data comprises multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
    the processing core executing the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  2. The convolution operation method of claim 1, characterized in that the method further comprises:
    the processing core storing the sub-output data into the system storage space in order.
  3. The convolution operation method of claim 1 or 2, characterized in that:
    the number of convolution kernels in the sub-weight data is determined by the number of the processing cores.
  4. The convolution operation method of any one of claims 1-3, characterized in that:
    the size of the sub-weight data is related to the size of the storage space of the processing core.
  5. The convolution operation method of any one of claims 1-4, characterized in that:
    the sub-output data is sub-output data of the output data in the depth direction.
  6. A chip comprising multiple processing cores, wherein at least two of the multiple processing cores execute the convolution operation method of claims 1-5 to complete a convolution operation.
  7. A convolution operation method, characterized by comprising:
    obtaining the input data and the weight data required by the convolution operation;
    dividing the weight data into multiple sub-weight data, wherein the weight data comprises multiple convolution kernels and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
    inputting the input data and the multiple sub-weight data respectively into multiple processing cores to perform the convolution operation, obtaining multiple sub-output data;
    merging the multiple sub-output data to obtain output data.
  8. An electronic device, comprising: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions such that, when running, the processor implements the convolution operation method of any one of claims 1-5 or claim 7.
  9. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the convolution operation method of any one of claims 1-5 or claim 7.
  10. A computing apparatus, characterized by comprising the chip of claim 6.
PCT/CN2020/136383 2020-01-21 2020-12-15 Convolutional operation method and chip WO2021147567A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010070481.9A CN113222136A (en) 2020-01-21 2020-01-21 Convolution operation method and chip
CN202010070481.9 2020-01-21

Publications (1)

Publication Number Publication Date
WO2021147567A1 true WO2021147567A1 (en) 2021-07-29

Family

ID=76991794

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136383 WO2021147567A1 (en) 2020-01-21 2020-12-15 Convolutional operation method and chip

Country Status (2)

Country Link
CN (1) CN113222136A (en)
WO (1) WO2021147567A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837922A (en) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 Computing device, data processing method and related product
CN115858178B (en) * 2023-02-21 2023-06-06 芯砺智能科技(上海)有限公司 Method, device, medium and equipment for sharing resources in convolution calculation


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190051697A (en) * 2017-11-07 2019-05-15 삼성전자주식회사 Method and apparatus for performing devonvolution operation in neural network
CN110473137B (en) * 2019-04-24 2021-09-14 华为技术有限公司 Image processing method and device
CN110689115B (en) * 2019-09-24 2023-03-31 安徽寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114114A1 (en) * 2016-10-21 2018-04-26 Nvidia Corporation Systems and methods for pruning neural networks for resource efficient inference
CN107862650A (en) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 The method of speed-up computation two dimensional image CNN convolution
CN107885700A (en) * 2017-12-29 2018-04-06 中国人民解放军国防科技大学 Multi-core implementation method for large-scale matrix convolution
CN108416434A (en) * 2018-02-07 2018-08-17 复旦大学 The circuit structure accelerated with full articulamentum for the convolutional layer of neural network
CN109165734A (en) * 2018-07-11 2019-01-08 中国人民解放军国防科技大学 Matrix local response normalization vectorization implementation method
CN110009103A (en) * 2019-03-26 2019-07-12 深兰科技(上海)有限公司 A kind of method and apparatus of deep learning convolutional calculation

Also Published As

Publication number Publication date
CN113222136A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
Auten et al. Hardware acceleration of graph neural networks
Li et al. Quantum supremacy circuit simulation on Sunway TaihuLight
US9053067B2 (en) Distributed data scalable adaptive map-reduce framework
Busato et al. An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures
WO2018099084A1 (en) Method, device, chip and system for training neural network model
WO2021147567A1 (en) Convolutional operation method and chip
US20140333638A1 (en) Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
US9164690B2 (en) System, method, and computer program product for copying data between memory locations
WO2017076296A1 (en) Method and device for processing graph data
TW202147188A (en) Method of training neural network model and related product
CN109033439B (en) The treating method and apparatus of stream data
WO2021072732A1 (en) Matrix computing circuit, apparatus and method
CN110069502A (en) Data balancing partition method and computer storage medium based on Spark framework
CN110659278A (en) Graph data distributed processing system based on CPU-GPU heterogeneous architecture
CN113222125A (en) Convolution operation method and chip
CN112784973A (en) Convolution operation circuit, device and method
CN114746871A (en) Neural network training using dataflow graphs and dynamic memory management
Ma et al. FPGA-based AI smart NICs for scalable distributed AI training systems
CN113222099A (en) Convolution operation method and chip
Mana A feature based comparison study of big data scheduling algorithms
Schmidt et al. Load-balanced parallel constraint-based causal structure learning on multi-core systems for high-dimensional data
WO2015143708A1 (en) Method and apparatus for constructing suffix array
WO2021218492A1 (en) Task allocation method and apparatus, electronic device, and computer readable storage medium
Jaspers Acceleration of read alignment with coherent attached FPGA coprocessors
CN114283046A (en) Point cloud file registration method and device based on ICP algorithm and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20916090

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20916090

Country of ref document: EP

Kind code of ref document: A1