WO2021147567A1 - Convolution operation method and chip - Google Patents

Convolution operation method and chip

Info

Publication number
WO2021147567A1
WO2021147567A1 (PCT/CN2020/136383)
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
convolution operation
convolution
weight data
Prior art date
Application number
PCT/CN2020/136383
Other languages
English (en)
Chinese (zh)
Inventor
王维伟
罗飞
Original Assignee
北京希姆计算科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司
Publication of WO2021147567A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present disclosure relates to the field of neural network computing, and in particular to a convolution operation method and chip.
  • the chip is the cornerstone of data processing; it fundamentally determines people's ability to process data. From the perspective of application fields, chips follow two main routes. One is the general-purpose route, such as the CPU (Central Processing Unit): such chips provide great flexibility, but their effective computing power when processing algorithms of a specific field is relatively low. The other is the dedicated route, such as the TPU (Tensor Processing Unit): such chips deliver higher effective computing power in certain specific fields, but when facing flexible and changeable fields their processing capability is poorer, or they cannot handle the workload at all.
  • the neural network is an important model of artificial intelligence, and its core is the convolution computation.
  • existing technical solutions generally take one of two approaches to convolution operations:
  • Multi-threaded parallel splitting. This scheme, used in GPUs, splits the convolution across multiple threads that run in parallel: all data and weights are split into as many shares as there are threads, and the convolution is complete once all shares have been processed.
  • an embodiment of the present disclosure provides a convolution operation method used in a chip including multiple processing cores, which is characterized in that it includes:
  • the processing core obtains a convolution operation subtask, where the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation;
  • the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, wherein the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel of the plurality of convolution kernels;
  • the processing core executes the sub-task of the convolution operation according to the input data and the sub-weight data to obtain sub-output data.
  • the method further includes:
  • the processing core stores the sub-output data in the system storage space in sequence.
  • the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
  • the size of the sub-weight data is related to the size of the storage space of the processing core.
  • the sub-output data is a portion of the output data in the depth direction.
  • embodiments of the present disclosure provide a convolution operation method, including:
  • the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
  • an embodiment of the present disclosure provides a chip including a plurality of processing cores, wherein at least two of the plurality of processing cores execute the convolution operation method described in the first aspect above to complete the convolution operation.
  • an embodiment of the present disclosure provides an electronic device, including: a memory, configured to store computer-readable instructions; and one or more processors, configured to execute the computer-readable instructions so that, when run, the processor implements the convolution operation method described in any one of the foregoing first or second aspects.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to make a computer execute any one of the convolution operation methods of the foregoing first or second aspect.
  • embodiments of the present disclosure provide a computer program product, characterized by including computer instructions; when the computer instructions are executed by a computing device, the computing device can execute any one of the convolution operation methods described in the foregoing first or second aspect.
  • an embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in the third aspect.
  • the embodiment of the present disclosure discloses a convolution operation method and chip.
  • the convolution operation method includes: the processing core obtains a convolution operation subtask, wherein the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation; the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, wherein the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels; and the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • the weight data is divided into multiple sub-weight data and assigned to multiple processing cores to perform convolution operations in parallel, which solves the technical problems of poor convolution-operation parallelism and low efficiency in the prior art.
  • FIG. 1 is a schematic diagram of the convolution operation process;
  • FIG. 2 is a schematic diagram of the structure of a chip that executes the convolution operation method provided by an embodiment of the present disclosure;
  • FIG. 3 is a flowchart of a convolution operation method provided by an embodiment of the disclosure;
  • FIG. 4 is a schematic diagram of the operation of a convolution operation method provided by an embodiment of the disclosure;
  • FIG. 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic diagram of the convolution operation process.
  • the size of the input data (i.e., the input feature map) of the convolution operation is Win*Hin*Cin, where Win represents the width of the input data, Hin represents the height of the input data, and Cin represents the depth of the input data.
  • the weight data consists of one or more convolution kernels. The size of each convolution kernel is Kw*Kh*Cin, where Kw represents the width of the convolution kernel, Kh represents the height of the convolution kernel, and Cin represents the depth of the convolution kernel.
  • each convolution kernel slides over the input data; at each sliding position, an element-wise multiply-accumulation with the corresponding input data produces one element of the output data corresponding to that convolution kernel (i.e., a feature point on the output feature map). Since the weight data contains Cout convolution kernels, the multiply-accumulation of each kernel with the input data at the same position yields Cout output elements, which together form one element of the output data with depth Cout. As all convolution kernels slide over the entire input data, each sliding position contributes one element of depth Cout, producing the entire output data.
  • this computation can be written as D_out^l = Σ_i Σ_j Σ_k D_in^i(j,k) · w^(l,i)(j,k), where i ranges over the input depth Cin, and j and k range over the kernel width Kw and height Kh:
  • D_out is an element with depth in the output data; its superscript l indicates the position l along the output depth;
  • D_in refers to the data block of the input data covered by the convolution kernel; its superscript i corresponds to the depth of the input data, while j and k correspond to the width and height of the convolution kernel, respectively;
  • w is an element in the convolution kernel, that is, a weight in the neural network computation; its superscripts l and i correspond to the depth of the output data and the depth of the input data, respectively.
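To make the multiply-accumulation concrete, here is a minimal NumPy sketch of the formula above. It is an illustration rather than the patent's implementation; the channel-last data layout, unit stride, and absence of padding are assumptions.

```python
import numpy as np

def conv2d_reference(din, weights, stride=1):
    """Direct convolution: slide each kernel over the input data and
    multiply-accumulate, following the formula above.

    din:     input data of shape (Hin, Win, Cin)
    weights: Cout convolution kernels of shape (Cout, Kh, Kw, Cin)
    returns: output data of shape (Hout, Wout, Cout)
    """
    hin, win, cin = din.shape
    cout, kh, kw, _ = weights.shape
    hout = (hin - kh) // stride + 1
    wout = (win - kw) // stride + 1
    dout = np.zeros((hout, wout, cout))
    for y in range(hout):
        for x in range(wout):
            # input data block covered by the kernel at this sliding position
            window = din[y * stride:y * stride + kh,
                         x * stride:x * stride + kw, :]
            for l in range(cout):
                # one output element per kernel -> an output element of depth Cout
                dout[y, x, l] = np.sum(window * weights[l])
    return dout
```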
  • the present disclosure divides the independently executable operations of the convolution into multiple subtasks, each with its corresponding input data and sub-weight data; the subtasks are allocated to the processing cores of a chip that includes multiple processing cores and are executed separately.
  • FIG. 2 is a schematic structural diagram of a chip that executes the convolution operation method provided by an embodiment of the present disclosure.
  • the chip has a multi-core architecture, including multiple processing cores C1, C2, …, CM, and the multiple processing cores are capable of independently processing tasks.
  • the processing core can run independently according to its own program and does not need to accept task distribution from the scheduler.
  • the program of the processing core can be dynamically updated by the server, or it can be written into the processing core after the processing core is started, or it can be automatically updated from the system's memory space according to its own initialization program during the operation of the processing core.
  • FIG. 3 is a flowchart of a convolution operation method provided by an embodiment of the disclosure.
  • the convolution operation method in the embodiment of the present disclosure is used in a chip including multiple processing cores as shown in FIG. 2.
  • the following describes the method taking one of the multiple processing cores as an example. The method includes:
  • Step S301: the processing core obtains a convolution operation subtask, where the convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data, and the convolution operation subtask is a part of the convolution operation;
  • the processing core obtains a convolution operation subtask
  • the convolution operation subtask is a part of the convolution operation
  • the convolution operation subtask of this processing core and the convolution operation subtasks of the other processing cores are independent of one another in execution order.
  • the convolution operation subtask includes the storage address of the input data and the storage address of the sub-weight data required by the convolution subtask, where these are addresses in the system storage space. The storage address of the input data and the storage address of the sub-weight data may each consist of a start address and an end address; alternatively, they may be start addresses only, in which case the convolution operation subtask must also include the size of the input data and of the sub-weight data.
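As a sketch only, such a subtask descriptor could be expressed as follows; the type and field names are hypothetical, not taken from the patent, and the variant shown carries start addresses plus sizes.

```python
from dataclasses import dataclass

@dataclass
class ConvSubtask:
    """Hypothetical descriptor for one convolution operation subtask."""
    input_addr: int      # start address of the input data in system storage
    input_size: int      # size of the input data (needed when no end address is given)
    subweight_addr: int  # start address of this core's sub-weight data
    subweight_size: int  # size of the sub-weight data
    output_addr: int     # where the sub-output data is written back
```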
  • Step S302 The processing core obtains input data and sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, wherein the input data is the input data of the convolution operation, wherein The sub-weight data is a part of weight data of a convolution operation, the weight data includes a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel among the plurality of convolution kernels;
  • the processing core itself has an internal storage space for storing the convolution operation subtask and the input data and sub-weight data required by that subtask.
  • the processing core obtains the input data and the sub-weight data from the storage space of the system according to the storage address of the input data and the storage address of the sub-weight data obtained in step S301, and stores them in the storage space of the processing core.
  • the weight data includes multiple convolution kernels.
  • the complete weight data includes Cout convolution kernels, and the computation of each convolution kernel with the input data is independent of the others. The multiple convolution kernels in the weight data can therefore be divided into multiple groups, and each group can be assigned a processing core that performs its convolution operations separately.
  • the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
  • the number of sub-weight data is equal to the number of processing cores.
  • suppose the number of processing cores of the chip is N, denoted C1, C2, …, CN; the weight data is then divided into N parts. If it is divided equally, each sub-weight data includes Cout/N convolution kernels. Note that this requires Cout/N to be a positive integer; if Cout/N is not an integer, the number of convolution kernels in each sub-weight data can instead be set to Cout/N rounded up, in which case the sub-weight data obtained by one of the processing cores contains fewer kernels than the others.
  • with an equal division, the 1st to (Cout/N)-th convolution kernels form the first sub-weight data, the (Cout/N+1)-th to (2Cout/N)-th convolution kernels form the second sub-weight data, and so on, until the ((N-1)*Cout/N+1)-th to Cout-th convolution kernels form the N-th sub-weight data.
  • the number of sub-weight data and the number of processing cores may not be equal. For example, in certain scenarios, some processing cores in the chip are performing other tasks and cannot perform convolution operations. At this time, the input data and the weight data can be divided according to the number of processing cores actually available in the chip, which will not be repeated here.
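The grouping just described can be sketched as follows; `split_kernels` is an illustrative helper, not from the patent, and kernels are indexed from 0 rather than 1.

```python
import math

def split_kernels(cout, n_cores):
    """Split kernel indices 0..cout-1 into contiguous groups of
    ceil(cout / n_cores) kernels; the last group may hold fewer."""
    per_core = math.ceil(cout / n_cores)
    return [list(range(start, min(start + per_core, cout)))
            for start in range(0, cout, per_core)]

# split_kernels(8, 2) -> [[0, 1, 2, 3], [4, 5, 6, 7]]   (even split)
# split_kernels(7, 2) -> [[0, 1, 2, 3], [4, 5, 6]]      (one core gets fewer kernels)
```

Passing the number of actually available cores as `n_cores` covers the case where some cores are busy with other tasks.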
  • the size of the sub-weight data is related to the size of the storage space of the processing core.
  • if the size of the processing core's own storage space is not considered, the sub-weight data may be mismatched with the processing core's storage space, which in turn makes the processing core inefficient when executing the convolution operation subtask.
  • an appropriate value can be calculated according to the size of the storage space of each processing core, and each piece of sub-weight data can be divided according to this value.
  • the size of the sub-weight data obtained by each processing core can differ: the weight data is then not divided evenly, but divided according to the storage capacity of each available processing core.
  • when calculating the storage space available in the processing core, the space required by the program corresponding to the convolution operation subtask and the space occupied by the input data must be subtracted from the available space of the processing core's storage; sub-weight data of an appropriate size is then assigned to the processing core according to the remaining storage space.
  • the sub-weight data can be further divided into multiple parts, and the processing core computes a part of the corresponding sub-output data from one part at a time.
  • the process of calculating sub-output data is a serial process.
  • when the sub-weight data is further divided, it can be divided equally such that each piece is no larger than the processing core's own storage space, or the size of each piece can be set equal to that storage space.
  • dividing the weight data according to the size of the storage space avoids having to re-divide it later and improves the efficiency of the computation.
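One way to picture this budgeting, as a sketch only (all sizes are assumed to be known in bytes, and the function name is illustrative):

```python
def plan_weight_passes(subweight_bytes, core_mem_bytes, program_bytes, input_bytes):
    """How many serial passes a core needs over its sub-weight data when that
    data must share the core's storage with the subtask program and the input."""
    budget = core_mem_bytes - program_bytes - input_bytes
    if budget <= 0:
        raise ValueError("program and input data alone exceed the core's storage")
    passes = -(-subweight_bytes // budget)  # ceiling division
    chunk = min(budget, subweight_bytes)    # bytes of sub-weight data held per pass
    return passes, chunk
```

With one pass the whole sub-weight data fits at once; with more passes the core works through its sub-weight data serially, as described above.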
  • Step S303 The processing core executes the subtask of the convolution operation according to the input data and the sub-weight data to obtain sub-output data.
  • after the processing core obtains the input data and sub-weight data required by its own convolution operation subtask, it computes the multiply-accumulate sum of the input data and the sub-weight data in the order of the convolution operation to obtain the sub-output data.
  • the specific calculation process can be seen in Figure 1.
  • the computation of a single processing core's convolution operation subtask is the same as an ordinary convolution operation, except that the number of convolution kernels involved is no longer Cout but the number of kernels in the sub-weight data, determined as described in step S302; the sub-weight data slides over the input data according to the calculated stride, and the multiply-accumulations yield the sub-output data.
  • the N processing cores each compute the multiply-accumulate sum of their sub-weight data with the input data, obtaining N sub-output data numbered 1 to N.
  • at this point, each processing core has completed the convolution operation subtask assigned to it.
  • the final output data has not yet been obtained at this time, so the method also includes:
  • Step S304 The processing core stores the sub-output data in the system storage space in order.
  • the sub-output data obtained by the above convolution operation method are all portions of the output data.
  • the multiple sub-output data are partial data of the complete output data in the depth direction; they need no further processing and only need to be stored in the system storage space in the depth order of the output data.
  • the processing core C1 calculates the first sub-output data of the output data, the processing core C2 calculates the second sub-output data, …, and the processing core CN calculates the N-th sub-output data.
  • each processing core only needs to store its sub-output data at the storage address pre-set in its own program to obtain the complete output data.
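A sketch of this depth-ordered write-back, assuming the array layout of the earlier examples (`kernel_groups` is the grouping produced by the split sketched above):

```python
import numpy as np

def assemble_output(sub_outputs, kernel_groups, cout):
    """Place each core's sub-output at its depth offset in the full output;
    the target location depends only on the piece's position along the depth."""
    hout, wout, _ = sub_outputs[0].shape
    out = np.empty((hout, wout, cout))
    for piece, group in zip(sub_outputs, kernel_groups):
        out[:, :, group[0]:group[0] + len(group)] = piece
    return out
```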
  • the storage address of each sub-output data is related to its position in the depth direction of the output data.
  • Another embodiment of the present disclosure provides yet another convolution operation method, and the convolution operation method includes:
  • the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels;
  • the process of dividing the weight data into multiple sub-weight data is also included, and the specific division process may be the same as that described in step S302, which will not be repeated here.
  • the above division can be a logical division: only the storage space of the weight data is divided, yielding the start and end storage addresses of each sub-weight data in the system storage space so that the processing cores can obtain their sub-weight data, without the data actually being split into multiple pieces.
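A minimal sketch of such a logical split, under the assumptions that every convolution kernel occupies the same number of bytes and the weight data is stored contiguously:

```python
def subweight_addresses(weight_base, kernel_bytes, kernel_groups):
    """Start and end system-storage addresses of each sub-weight data,
    computed without physically moving the weight data."""
    return [(weight_base + group[0] * kernel_bytes,
             weight_base + (group[-1] + 1) * kernel_bytes)
            for group in kernel_groups]

# e.g. 8 kernels of 1 KiB each, split as [[0..3], [4..7]]:
# subweight_addresses(0x1000, 1024, split_kernels(8, 2))
#   -> [(0x1000, 0x2000), (0x2000, 0x3000)]
```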
  • FIG. 5 is a specific example of a convolution operation method according to an embodiment of the present disclosure.
  • in this example, the weight data is divided equally according to the number of processing cores, that is, into two sub-weight data in convolution-kernel number order: the first sub-weight data includes the 4 convolution kernels numbered 1-4, and the second includes the 4 convolution kernels numbered 5-8.
  • C1 and C2 perform convolution operations in parallel, each outputting one sub-output data.
  • the size of each sub-output data is 6*6*4: C1 outputs the sub-output data at depths 1-4 of the output data, and C2 outputs the sub-output data at depths 5-8. The two sub-output data are stored in depth order to obtain the complete output data.
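Replaying this FIG. 5 scenario with the sketches above (reusing conv2d_reference, split_kernels, and assemble_output; the 8*8*3 input size and 3*3 kernels are assumptions chosen only to reproduce the 6*6 spatial output, and zero-based kernel indices stand in for the numbers 1-8):

```python
import numpy as np

rng = np.random.default_rng(0)
din = rng.random((8, 8, 3))         # assumed input: 8x8x3 gives a 6x6 spatial output
weights = rng.random((8, 3, 3, 3))  # Cout = 8 convolution kernels of 3x3x3

groups = split_kernels(8, 2)        # "C1" takes kernels 0-3, "C2" takes kernels 4-7
subs = [conv2d_reference(din, weights[g[0]:g[-1] + 1]) for g in groups]
full = assemble_output(subs, groups, cout=8)

assert full.shape == (6, 6, 8)      # two 6x6x4 pieces stitched along the depth
assert np.allclose(full, conv2d_reference(din, weights))  # matches the one-core result
```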
  • the embodiment of the present disclosure discloses a convolution operation method and chip.
  • the convolution operation method includes: the processing core obtains a convolution operation subtask, wherein the convolution operation subtask includes a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of the convolution operation; the processing core obtains the input data and the sub-weight data from the system storage space according to the storage address of the input data and the storage address of the sub-weight data, wherein the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data includes multiple convolution kernels, and the sub-weight data is at least one convolution kernel among the multiple convolution kernels; and the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
  • the weight data is divided into multiple sub-weight data and assigned to multiple processing cores to perform convolution operations in parallel, which solves the technical problems of poor convolution-operation parallelism and low efficiency in the prior art.
  • the embodiment of the present disclosure also provides a chip including a plurality of processing cores, wherein at least two of the plurality of processing cores execute the convolution operation method to complete the convolution operation.
  • An embodiment of the present disclosure also provides an electronic device, including: a memory, configured to store computer-readable instructions; and one or more processors, configured to run the computer-readable instructions so that the processor implements the convolution operation method described in any one of the foregoing embodiments.
  • the embodiments of the present disclosure also provide a non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to make a computer execute the convolution operation method of any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computer program product, characterized by including computer instructions; when the computer instructions are executed by a computing device, the computing device can execute the convolution operation method of any of the foregoing embodiments.
  • An embodiment of the present disclosure provides a computing device, which is characterized by including the chip described in any of the foregoing embodiments.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the blocks may also occur in an order different from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure can be implemented in software or hardware. In certain circumstances, the name of a unit does not constitute a limitation on the unit itself.
  • exemplary types of hardware logic components include: Field-Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASIC), Application-Specific Standard Products (ASSP), Systems on Chip (SOC), Complex Programmable Logic Devices (CPLD), and so on.
  • a machine-readable medium may be a tangible medium, which may contain or store a program for use by the instruction execution system, apparatus, or device or in combination with the instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • the machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to a convolution operation method and a chip. The convolution operation method comprises the following steps: a processing core acquires a convolution operation subtask, the convolution operation subtask comprising a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask being a part of a convolution operation (S301); the processing core acquires the input data and the sub-weight data from a system storage space on the basis of the storage address of the input data and the storage address of the sub-weight data, the sub-weight data being a part of the weight data of the convolution operation (S302); and the processing core executes the convolution operation subtask on the basis of the input data and the sub-weight data to produce sub-output data (S303). By means of this method, the weight data is divided into multiple pieces of sub-weight data that are assigned to multiple processing cores to perform the convolution operation in parallel, which solves the prior-art technical problems of poor convolution-operation parallelism and low efficiency.
PCT/CN2020/136383 2020-01-21 2020-12-15 Procédé d'opération de convolution et puce WO2021147567A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010070481.9A CN113222136A (zh) 2020-01-21 2020-01-21 卷积运算方法及芯片
CN202010070481.9 2020-01-21

Publications (1)

Publication Number Publication Date
WO2021147567A1 true WO2021147567A1 (fr) 2021-07-29

Family

ID=76991794

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136383 WO2021147567A1 (fr) 2020-01-21 2020-12-15 Procédé d'opération de convolution et puce

Country Status (2)

Country Link
CN (1) CN113222136A (fr)
WO (1) WO2021147567A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837922A (zh) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 计算装置、数据处理方法及相关产品
CN115858178B (zh) * 2023-02-21 2023-06-06 芯砺智能科技(上海)有限公司 一种卷积计算中资源共享的方法、装置、介质及设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862650A (zh) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 加速计算二维图像cnn卷积的方法
CN107885700A (zh) * 2017-12-29 2018-04-06 中国人民解放军国防科技大学 一种大规模矩阵卷积的多核实现方法
US20180114114A1 (en) * 2016-10-21 2018-04-26 Nvidia Corporation Systems and methods for pruning neural networks for resource efficient inference
CN108416434A (zh) * 2018-02-07 2018-08-17 复旦大学 针对神经网络的卷积层与全连接层进行加速的电路结构
CN109165734A (zh) * 2018-07-11 2019-01-08 中国人民解放军国防科技大学 一种矩阵局部响应归一化的向量化实现方法
CN110009103A (zh) * 2019-03-26 2019-07-12 深兰科技(上海)有限公司 一种深度学习卷积计算的方法和装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190051697A (ko) * 2017-11-07 2019-05-15 삼성전자주식회사 뉴럴 네트워크의 디컨벌루션 연산을 수행하는 장치 및 방법
CN110473137B (zh) * 2019-04-24 2021-09-14 华为技术有限公司 图像处理方法和装置
CN110689115B (zh) * 2019-09-24 2023-03-31 安徽寒武纪信息科技有限公司 神经网络模型处理方法、装置、计算机设备及存储介质


Also Published As

Publication number Publication date
CN113222136A (zh) 2021-08-06

Similar Documents

Publication Publication Date Title
Auten et al. Hardware acceleration of graph neural networks
Li et al. Quantum supremacy circuit simulation on Sunway TaihuLight
US9053067B2 (en) Distributed data scalable adaptive map-reduce framework
Busato et al. An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures
WO2021147567A1 (fr) Procédé d'opération de convolution et puce
US20140333638A1 (en) Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
US9164690B2 (en) System, method, and computer program product for copying data between memory locations
WO2017076296A1 (fr) Procédé et dispositif de traitement de données de graphique
CN113435682A (zh) 分布式训练的梯度压缩
TW202147188A (zh) 神經網路模型的訓練方法和相關産品
CN109033439B (zh) 流式数据的处理方法和装置
WO2021072732A1 (fr) Circuit, appareil et procédé de calcul matriciel
CN110069502A (zh) 基于Spark架构的数据均衡分区方法及计算机存储介质
CN110659278A (zh) 基于cpu-gpu异构架构的图数据分布式处理系统
CN113222125A (zh) 卷积运算方法及芯片
CN112784973A (zh) 卷积运算电路、装置以及方法
Ma et al. FPGA-based AI smart NICs for scalable distributed AI training systems
CN113222099A (zh) 卷积运算方法及芯片
Mana A feature based comparison study of big data scheduling algorithms
Schmidt et al. Load-balanced parallel constraint-based causal structure learning on multi-core systems for high-dimensional data
WO2015143708A1 (fr) Procédé et appareil de construction d'un ensemble de suffixes
WO2021218492A1 (fr) Procédé et appareil d'attribution de tâche, dispositif électronique et support d'enregistrement lisible par ordinateur
Jaspers Acceleration of read alignment with coherent attached FPGA coprocessors
US10210136B2 (en) Parallel computer and FFT operation method
CN114691142A (zh) 执行程序的编译方法、芯片、电子设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20916090

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20916090

Country of ref document: EP

Kind code of ref document: A1