CN113222136A - Convolution operation method and chip - Google Patents

Convolution operation method and chip

Info

Publication number
CN113222136A
Authority
CN
China
Prior art keywords
sub
convolution operation
data
weight data
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010070481.9A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Simm Computing Technology Co ltd
Original Assignee
Beijing Simm Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Simm Computing Technology Co ltd filed Critical Beijing Simm Computing Technology Co ltd
Priority to CN202010070481.9A priority Critical patent/CN113222136A/en
Priority to PCT/CN2020/136383 priority patent/WO2021147567A1/en
Publication of CN113222136A publication Critical patent/CN113222136A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The embodiment of the disclosure discloses a convolution operation method and a chip. The convolution operation method comprises the following steps: a processing core acquires a convolution operation subtask, where the convolution operation subtask comprises a storage address of input data and a storage address of sub-weight data, and is a part of a convolution operation; the processing core acquires the input data and the sub-weight data from a system storage space according to the two storage addresses, where the sub-weight data is a part of the weight data of the convolution operation; and the processing core executes the convolution operation subtask on the input data and the sub-weight data to obtain sub-output data. In this way, the weight data is divided into multiple pieces of sub-weight data that are distributed to multiple processing cores for convolution in parallel, solving the technical problems of poor parallelism and low efficiency of convolution operations in the prior art.

Description

Convolution operation method and chip
Technical Field
The present disclosure relates to the field of neural network computing, and in particular, to a convolution operation method and a chip.
Background
With the development of science and technology, human society is rapidly entering an intelligent era. Its defining characteristics are that ever more data is acquired, in ever larger quantities, and that the speed at which that data must be processed keeps rising.
Chips are the cornerstone of data processing; they fundamentally determine the ability to process data. In terms of application fields, chips follow two main routes: one is the general-purpose route, such as the CPU (Central Processing Unit), which provides great flexibility but relatively low computational efficiency on domain-specific algorithms; the other is the special-purpose route, such as the TPU (Tensor Processing Unit), which achieves higher effective computing power in certain specific fields but handles flexible, general-purpose workloads poorly or not at all.
Neural networks are important models of artificial intelligence, and their core is convolution calculation. The prior art generally uses one of two schemes to handle convolution operations:
(1) Overall calculation scheme: used in a single-core CPU, where the single core performs point-by-point multiplication and accumulation of the input data and weight data according to the convolution formula to obtain the final result.
(2) Multithreaded parallel splitting scheme: used in GPUs, where the convolution is split across many threads that run in parallel; the data and weights are divided into computation parts in units of the thread count, and the convolution is complete when all parts have finished.
But the processing granularity of scheme (1) is too coarse: the whole convolution is performed by one processing core, so parallelism is poor, and the scheme cannot meet applications with strict latency requirements; reducing the latency requires raising the computing capability of the processing core, at high hardware cost. In scheme (2), the input data and weight data are split too finely: the splitting process is complex, a complex scheduler must be designed, efficiency is low, and cost is high.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In order to solve the above technical problems of convolution calculation in the prior art, embodiments of the present disclosure provide the following technical solutions:
in a first aspect, an embodiment of the present disclosure provides a convolution operation method, which is used in a chip including multiple processing cores, and includes:
the processing core acquires a convolution operation subtask, wherein the convolution operation subtask comprises a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of convolution operation;
the processing core acquires the input data and the sub-weight data from a system storage space according to a storage address of the input data and a storage address of the sub-weight data, wherein the input data is input data of the convolution operation, the sub-weight data is a part of weight data of the convolution operation, the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel of the plurality of convolution kernels;
and the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
Further, the method further comprises:
and the processing core stores the sub-output data into the system storage space in sequence.
Further, the number of the convolution kernels in the sub-weight data is determined by the number of the processing cores.
Further, the size of the sub-weight data is related to the size of the storage space of the processing core.
Further, the sub-output data is sub-output data of the output data in the depth direction.
In a second aspect, an embodiment of the present disclosure provides a convolution operation method, including:
acquiring input data and weight data required in the convolution operation;
dividing the weight data into a plurality of pieces of sub-weight data, wherein the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel in the plurality of convolution kernels;
respectively inputting the input data and the plurality of sub-weight data into a plurality of processing cores to carry out convolution operation to obtain a plurality of sub-output data;
and combining the sub-output data to obtain output data.
In a third aspect, an embodiment of the present disclosure provides a chip, which includes a plurality of processing cores, where at least two of the plurality of processing cores execute the convolution operation method described in the first aspect above to complete convolution operation.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, comprising: a memory for storing computer-readable instructions; and one or more processors configured to execute the computer-readable instructions such that, when executing them, the processors implement the convolution operation method of any one of the first and second aspects.
In a fifth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the convolution operation method according to any one of the first aspect and the second aspect.
In a sixth aspect, the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform the convolution operation method of any one of the preceding first or second aspects.
In a seventh aspect, an embodiment of the present disclosure provides a computing device, including the chip in the third aspect.
The embodiment of the disclosure discloses a convolution operation method and a chip. The convolution operation method comprises the following steps: a processing core acquires a convolution operation subtask, where the convolution operation subtask comprises a storage address of input data and a storage address of sub-weight data, and is a part of a convolution operation; the processing core acquires the input data and the sub-weight data from a system storage space according to the two storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one of those convolution kernels; and the processing core executes the convolution operation subtask on the input data and the sub-weight data to obtain sub-output data. In this way, the weight data is divided into multiple pieces of sub-weight data that are distributed to multiple processing cores for convolution in parallel, solving the technical problems of poor parallelism and low efficiency of convolution operations in the prior art.
The foregoing is a summary of the present disclosure. To promote a clear understanding of its technical means, specific embodiments are set forth below; the disclosure may also be embodied in other specific forms without departing from its spirit or essential attributes.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of a convolution operation;
FIG. 2 is a schematic diagram of a chip for performing a convolution operation method according to an embodiment of the disclosure;
fig. 3 is a flowchart of a convolution operation method provided in an embodiment of the present disclosure;
fig. 4 is an operation diagram of a convolution operation method according to an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating an exemplary convolution operation method according to an embodiment of the present disclosure;
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a schematic diagram of the convolution operation process. As shown in fig. 1, the input data (i.e., the input feature map) of the convolution operation has size Win × Hin × Cin, where Win is the width, Hin the height, and Cin the depth of the input data. The weight data (i.e., one or more convolution kernels) comprises Cout convolution kernels, each of size Kw × Kh × Cin, where Kw is the width, Kh the height, and Cin the depth of each convolution kernel. During convolution, each convolution kernel slides over the input data; at each sliding position, an element-wise multiply-accumulate is performed between the kernel and the corresponding input data, yielding one element of the output data (i.e., one feature point on the output feature map) for that kernel. Since the weight data comprises Cout convolution kernels, each of which performs a multiply-accumulate with the input data at the same position, Cout output elements are obtained per position; together they form one output element of depth Cout. All convolution kernels slide over the entire input data, each sliding position producing an element of depth Cout, so that the entire output data is obtained.
For an element at a given output depth l (1 ≤ l ≤ Cout), the multiply-accumulate formula is:

$$D_{out}^{l} = \sum_{i=1}^{C_{in}} \sum_{j=1}^{K_w} \sum_{k=1}^{K_h} D_{in}^{i}(j,k) \cdot W^{l,i}(j,k)$$
where $D_{out}^{l}$ is an element (with depth) of the output data, the superscript l indicating its output depth; $D_{in}^{i}(j,k)$ is the data block of the input data covered by the convolution kernel, with the superscript i indexing the input depth and j and k indexing the width and height within the kernel; and $W^{l,i}(j,k)$ is an element of the convolution kernel, i.e. a weight in the neural network computation, whose superscripts l and i index the output depth and input depth, respectively.
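For illustration only, the following is a minimal NumPy sketch of the multiply-accumulate above for a single output element; the array layout, 0-based indexing, and function name are illustrative assumptions rather than anything from the disclosure.

```python
import numpy as np

def conv_output_element(din_block, weights, l):
    """Multiply-accumulate for one output element at output depth l.

    din_block: the (Kh, Kw, Cin) patch of input data under the kernel
    weights:   the full (Cout, Kh, Kw, Cin) weight data
    l:         output-depth index, 0 <= l < Cout (0-based here,
               whereas the formula above is 1-based)
    """
    # Element-wise product of the input patch with the l-th kernel,
    # summed over kernel width, kernel height and input depth.
    return np.sum(din_block * weights[l])
```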
The method of the present disclosure divides the independently computable parts of the convolution operation into multiple subtasks, each with its own input data and sub-weight data; the subtasks are assigned to the processing cores of a chip comprising a plurality of processing cores and executed individually.
Fig. 2 is a schematic structural diagram of a chip for performing the convolution operation method provided by an embodiment of the present disclosure. As shown in fig. 2, the chip has a multi-core architecture comprising a plurality of processing cores C1, C2, …, CM, each capable of processing tasks independently. The processing cores run independently according to their own programs and need no task distribution by a scheduler. A processing core's program can be dynamically updated by a server, written into the core after it starts up, or updated automatically from the system storage space by the core's initialization program while the core is running.
Fig. 3 is a flowchart of a convolution operation method according to an embodiment of the present disclosure. The convolution operation method is used in a chip comprising a plurality of processing cores as shown in fig. 2; it is described below taking one of the processing cores as an example, and comprises:
step S301, the processing core acquires a convolution operation subtask, wherein the convolution operation subtask comprises a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of convolution operation;
in this step, the processing core obtains a convolution sub-task, which is a part of the convolution operation, and the convolution sub-task is not related to convolution sub-tasks of other processing cores in the operation order.
The convolution operation subtask comprises the storage address of the input data required by the subtask and the storage address of the sub-weight data, both addresses referring to the system storage space. Each address may be given as a start address together with an end address; alternatively, only start addresses are given, in which case the convolution operation subtask must also include size information for the input data and the sub-weight data.
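As a sketch only, a subtask descriptor consistent with this description might look as follows; the field names and the choice of start address plus size (rather than start and end addresses) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ConvSubTask:
    """Hypothetical descriptor of one convolution operation subtask.

    All addresses refer to the system storage space. Sizes are included
    because only start addresses are stored in this variant.
    """
    input_addr: int    # start address of the input data
    input_size: int    # size of the input data, in bytes
    weight_addr: int   # start address of the sub-weight data
    weight_size: int   # size of the sub-weight data, in bytes
    output_addr: int   # where the sub-output data is to be stored
```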
Step S302, the processing core acquires input data and sub-weight data from a system storage space according to a storage address of the input data and a storage address of the sub-weight data, wherein the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel of the plurality of convolution kernels;
the processing core is provided with a storage space in the processing core and is used for storing the convolution operation subtask and input data and sub-weight data required by the convolution operation subtask. In this step, the processing core acquires the input data and the sub-weight data from the system memory space according to the memory address of the input data and the memory address of the sub-weight data obtained in step S301, and stores them in the memory space of the processing core.
It is to be understood that the weight data comprises a plurality of convolution kernels: as shown in fig. 1, the complete weight data contains Cout convolution kernels. Since each convolution kernel's computation with the input data is independent of the others, the convolution kernels in the weight data can be divided into several groups, each group performing its convolution separately on one processing core.
Optionally, the number of convolution kernels in each piece of sub-weight data is determined by the number of processing cores. Illustratively, the number of pieces of sub-weight data equals the number of processing cores. As shown in fig. 4, the chip has N processing cores, numbered C1, C2, …, CN, and the weight data is divided into N pieces. If the weight data is divided evenly, each piece of sub-weight data contains Cout/N convolution kernels; note that this requires Cout/N to be a positive integer. If Cout/N is not a positive integer, each piece of sub-weight data may instead contain ⌈Cout/N⌉ convolution kernels, with the sub-weight data acquired by one of the processing cores containing fewer than ⌈Cout/N⌉ kernels. As shown in fig. 4, assuming Cout/N is a positive integer, the 1st to (Cout/N)th convolution kernels may form the first piece of sub-weight data, the (Cout/N+1)th to (2Cout/N)th kernels the second piece, …, and the ((N-1)·Cout/N+1)th to Cout-th kernels the Nth piece. It can be understood that the number of pieces of sub-weight data need not equal the number of processing cores; for example, in some scenarios some processing cores in the chip are occupied by other tasks and cannot perform convolution, in which case the input data and weight data are divided according to the number of processing cores actually available in the chip, which is not described again here.
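The even division just described, including the ⌈Cout/N⌉ case for a non-integer quotient, can be sketched as follows; the function name and the (start, end) range representation are illustrative.

```python
import math

def split_weight_data(cout, n_cores):
    """Split cout convolution kernels into groups for n_cores cores.

    Returns (start, end) kernel-index ranges, 0-based and end-exclusive.
    When cout is not divisible by n_cores, every group holds
    ceil(cout / n_cores) kernels except the last, which holds fewer,
    matching the uneven case described above.
    """
    per_core = math.ceil(cout / n_cores)
    return [(s, min(s + per_core, cout)) for s in range(0, cout, per_core)]
```

For example, split_weight_data(8, 2) yields [(0, 4), (4, 8)], the division used in the example of fig. 5, while split_weight_data(8, 3) yields [(0, 3), (3, 6), (6, 8)].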
Optionally, the size of each piece of sub-weight data is related to the size of the processing core's storage space. The alternative embodiment above does not consider the processing core's own storage capacity, so a piece of sub-weight data may not match the core's storage space, making execution of the convolution subtask inefficient. In that case an appropriate size can be computed from each processing core's storage capacity and the sub-weight data divided accordingly; the pieces of sub-weight data acquired by different cores may then differ in size, i.e. the weight data is divided not evenly but according to the storage capacity of each available processing core. For example, when computing a core's available capacity, the space required by the program for the convolution operation subtask and the space occupied by the input data are subtracted from the core's storage space, and the sub-weight data is sized to the remaining space. Alternatively, for a processing core with a small storage space, its piece of sub-weight data can be divided further into several parts, and the core computes the corresponding part of its sub-output data from one part at a time; for such a core the computation of the sub-output data is then a serial process. When sub-weight data is divided further, it can be divided evenly so that no part exceeds the core's storage space, or each part can be sized exactly to the storage space. Of course, dividing the weight data according to storage capacity in the first place avoids this secondary division and improves data-computation efficiency.
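One plausible capacity-aware division along the lines of this paragraph is sketched below; the greedy per-core policy and byte-granularity bookkeeping are assumptions, and the serial fall-back for cores with too little storage is only signalled, not implemented.

```python
def split_by_capacity(cout, kernel_bytes, free_bytes_per_core):
    """Greedy sketch: each core takes as many whole convolution kernels
    as fit into its free storage (after the subtask program and input
    data have been accounted for), in core order.
    """
    groups, start = [], 0
    for free in free_bytes_per_core:
        if start >= cout:
            break
        take = min(free // kernel_bytes, cout - start)
        if take > 0:
            groups.append((start, start + take))
            start += take
    if start < cout:
        # Remaining kernels would need the serial, part-by-part
        # processing described above for small-memory cores.
        raise NotImplementedError("secondary division required")
    return groups
```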
Step S303, the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
After the processing core obtains the input data and sub-weight data required by its convolution operation subtask, it computes the multiply-accumulate sums of the input data and sub-weight data in the usual convolution order to obtain the sub-output data. The computation is as shown in fig. 1: a single processing core's convolution subtask proceeds exactly like an ordinary convolution, except that the number of convolution kernels involved is no longer Cout but the number of kernels in the sub-weight data, determined as described in step S302. The sub-weight data slides over the input data with the convolution stride, performing a multiply-accumulate at each position, to produce the sub-output data. As shown in fig. 4, the N processing cores each compute the multiply-accumulate of their sub-weight data with the input data, yielding N pieces of sub-output data numbered 1 to N.
Through steps S301 to S303 above, the processing core has completed the convolution subtask assigned to it. However, the final output data has not yet been obtained, so the method further comprises:
step S304, the processing core stores the sub-output data into the system storage space in sequence. The sub-output data which are all output data and are obtained by the convolution operation method can be known from the description, the sub-output data are partial data of complete output data in the depth direction, other operations are not needed, and the sub-output data only need to be stored in a system storage space according to the depth storage sequence of the output data. As shown in FIG. 4, processing core C 11 st sub-output data of the calculation output dataProcessing core C2Compute 2 nd sub-output data of output data, … …, processing core CNAnd calculating the Nth sub-output data of the output data, wherein the processing core only needs to store the sub-output data into the storage space according to the preset storage space address in the self program to obtain the complete output data, and the storage address of each sub-output data is related to the position of the sub-output data in the depth direction of the output data.
Another embodiment of the present disclosure provides a method of convolution operation, where the method of convolution operation includes:
acquiring input data and weight data required in the convolution operation;
dividing the weight data into a plurality of pieces of sub-weight data, wherein the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel in the plurality of convolution kernels;
respectively inputting the input data and the plurality of sub-weight data into a plurality of processing cores to carry out convolution operation to obtain a plurality of sub-output data;
and combining the sub-output data to obtain output data.
The above example further includes a process of dividing the weight data into multiple pieces of sub-weight data; the specific division can be the same as described in step S302 and is not repeated here. It can also be understood that the division may be purely logical: only the storage space of the weight data is divided, yielding each piece's start and end address in the system storage space, so that the processing cores can fetch their sub-weight data without the data actually being copied into multiple parts.
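A sketch of such a logical division, computing only addresses and moving no data, might look as follows; the contiguous kernel-major weight layout is an assumption.

```python
import math

def sub_weight_addresses(weight_base_addr, cout, kernel_bytes, n_cores):
    """Logical division: start and end addresses of each piece of
    sub-weight data in the system storage space, assuming the cout
    kernels are stored contiguously, one after another.
    """
    per_core = math.ceil(cout / n_cores)
    return [(weight_base_addr + s * kernel_bytes,
             weight_base_addr + min(s + per_core, cout) * kernel_bytes)
            for s in range(0, cout, per_core)]
```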
Fig. 5 shows a specific example of the convolution operation method provided by an embodiment of the present disclosure. As shown in fig. 5, the chip contains two processing cores, C1 and C2. The input data has equal width and height, Win = Hin = 8, and depth Cin = 4. The output data has equal width and height, Wout = Hout = 6, and depth Cout = 8. The convolution kernels have equal width and height, Kw = Kh = 3, depth Cin = 4, number Cout = 8, and sliding stride 1. In this example the weight data is divided evenly by the number of processing cores, in kernel-number order, into two pieces of sub-weight data: the first piece comprises the 4 convolution kernels numbered 1 to 4, and the second piece the 4 kernels numbered 5 to 8. The first piece of sub-weight data and the input data are sent to C1 for convolution, and the second piece of sub-weight data and the input data are sent to C2. C1 and C2 perform their convolutions in parallel and each output one piece of sub-output data of size 6 × 6 × 4: C1 outputs the sub-output data at depths 1 to 4 of the output data, and C2 outputs the sub-output data at depths 5 to 8. Storing the two pieces in depth order yields the complete output data.
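The example can be checked numerically with a naive reference convolution; the code below is a verification sketch (unit stride, no padding, random data) rather than anything from the disclosure.

```python
import numpy as np

def naive_conv(x, w):
    """Valid convolution: x is (Hin, Win, Cin), w is (Cout, Kh, Kw, Cin)."""
    hin, win, _ = x.shape
    cout, kh, kw, _ = w.shape
    hout, wout = hin - kh + 1, win - kw + 1
    y = np.zeros((hout, wout, cout))
    for i in range(hout):
        for j in range(wout):
            patch = x[i:i + kh, j:j + kw, :]   # input block under the kernel
            for l in range(cout):
                y[i, j, l] = np.sum(patch * w[l])
    return y

x = np.random.rand(8, 8, 4)       # Win = Hin = 8, Cin = 4
w = np.random.rand(8, 3, 3, 4)    # Cout = 8 kernels of size 3 x 3 x 4

full = naive_conv(x, w)           # (6, 6, 8)
core1 = naive_conv(x, w[:4])      # kernels 1-4 on C1 -> (6, 6, 4)
core2 = naive_conv(x, w[4:])      # kernels 5-8 on C2 -> (6, 6, 4)
assert np.allclose(full, np.concatenate([core1, core2], axis=-1))
```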
The embodiment of the disclosure discloses a convolution operation method and a chip. The convolution operation method comprises the following steps: a processing core acquires a convolution operation subtask, where the convolution operation subtask comprises a storage address of input data and a storage address of sub-weight data, and is a part of a convolution operation; the processing core acquires the input data and the sub-weight data from a system storage space according to the two storage addresses, where the input data is the input data of the convolution operation, the sub-weight data is a part of the weight data of the convolution operation, the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one of those convolution kernels; and the processing core executes the convolution operation subtask on the input data and the sub-weight data to obtain sub-output data. In this way, the weight data is divided into multiple pieces of sub-weight data that are distributed to multiple processing cores for convolution in parallel, solving the technical problems of poor parallelism and low efficiency of convolution operations in the prior art.
The embodiment of the present disclosure further provides a chip including a plurality of processing cores, where at least two of the plurality of processing cores execute the convolution operation method to complete convolution operation.
An embodiment of the present disclosure further provides an electronic device, comprising: a memory for storing computer-readable instructions; and one or more processors configured to execute the computer-readable instructions such that, when executing them, the processors implement the convolution operation method of any one of the preceding embodiments.
The present disclosure also provides a non-transitory computer-readable storage medium, which stores computer instructions for causing a computer to execute the convolution operation method in any one of the foregoing embodiments.
The embodiment of the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform any of the convolution operation methods of the preceding embodiments.
The embodiment of the present disclosure provides a computing device, which is characterized by comprising the chip in any one of the foregoing embodiments.
The flowchart and block diagrams in the figures of the present disclosure illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims (10)

1. A convolution operation method for use in a chip including a plurality of processing cores, comprising:
the processing core acquires a convolution operation subtask, wherein the convolution operation subtask comprises a storage address of input data and a storage address of sub-weight data, and the convolution operation subtask is a part of convolution operation;
the processing core acquires the input data and the sub-weight data from a system storage space according to a storage address of the input data and a storage address of the sub-weight data, wherein the input data is input data of the convolution operation, the sub-weight data is a part of weight data of the convolution operation, the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel of the plurality of convolution kernels;
and the processing core executes the convolution operation subtask according to the input data and the sub-weight data to obtain sub-output data.
2. The convolution operation method of claim 1, further comprising:
and the processing core stores the sub-output data into the system storage space in sequence.
3. The convolution operation method according to claim 1 or 2, characterized in that:
the number of convolution kernels in the sub-weight data is determined by the number of processing cores.
4. A convolution operation method according to any one of claims 1 to 3, characterized by:
the size of the sub-weight data is related to the size of the memory space of the processing core.
5. The convolution operation method of any one of claims 1-4, wherein:
the sub-output data is sub-output data of the output data in the depth direction.
6. A chip comprising a plurality of processing cores, wherein at least two of the plurality of processing cores perform the convolution operation method of any one of claims 1 to 5 to complete a convolution operation.
7. A method of convolution operation, comprising:
acquiring input data and weight data required in the convolution operation;
dividing the weight data into a plurality of pieces of sub-weight data, wherein the weight data comprises a plurality of convolution kernels, and the sub-weight data is at least one convolution kernel in the plurality of convolution kernels;
respectively inputting the input data and the plurality of sub-weight data into a plurality of processing cores to carry out convolution operation to obtain a plurality of sub-output data;
and combining the sub-output data to obtain output data.
8. An electronic device, comprising: a memory for storing computer-readable instructions; and one or more processors configured to execute the computer-readable instructions such that, when executing them, the processors implement the convolution operation method of any one of claims 1 to 5 or claim 7.
9. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the convolution operation method of any one of claims 1 to 5 or claim 7.
10. A computing device comprising the chip of claim 6.
CN202010070481.9A 2020-01-21 2020-01-21 Convolution operation method and chip Pending CN113222136A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010070481.9A CN113222136A (en) 2020-01-21 2020-01-21 Convolution operation method and chip
PCT/CN2020/136383 WO2021147567A1 (en) 2020-01-21 2020-12-15 Convolutional operation method and chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010070481.9A CN113222136A (en) 2020-01-21 2020-01-21 Convolution operation method and chip

Publications (1)

Publication Number Publication Date
CN113222136A (en) 2021-08-06

Family

ID=76991794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070481.9A Pending CN113222136A (en) 2020-01-21 2020-01-21 Convolution operation method and chip

Country Status (2)

Country Link
CN (1) CN113222136A (en)
WO (1) WO2021147567A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837922A (en) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 Computing device, data processing method and related product
CN115858178A (en) * 2023-02-21 2023-03-28 芯砺智能科技(上海)有限公司 Method, device, medium and equipment for resource sharing in convolution calculation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885700A (en) * 2017-12-29 2018-04-06 中国人民解放军国防科技大学 Multi-core implementation method for large-scale matrix convolution
US20190138898A1 (en) * 2017-11-07 2019-05-09 Samsung Electronics Co., Ltd. Method and apparatus with neural network performing deconvolution
CN110473137A (en) * 2019-04-24 2019-11-19 华为技术有限公司 Image processing method and device
CN110689115A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11315018B2 (en) * 2016-10-21 2022-04-26 Nvidia Corporation Systems and methods for pruning neural networks for resource efficient inference
CN107862650B (en) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN108416434B (en) * 2018-02-07 2021-06-04 复旦大学 Circuit structure for accelerating convolutional layer and full-connection layer of neural network
CN109165734B (en) * 2018-07-11 2021-04-02 中国人民解放军国防科技大学 Matrix local response normalization vectorization implementation method
CN110009103B (en) * 2019-03-26 2021-06-29 深兰科技(上海)有限公司 Deep learning convolution calculation method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138898A1 (en) * 2017-11-07 2019-05-09 Samsung Electronics Co., Ltd. Method and apparatus with neural network performing deconvolution
CN107885700A (en) * 2017-12-29 2018-04-06 中国人民解放军国防科技大学 Multi-core implementation method for large-scale matrix convolution
CN110473137A (en) * 2019-04-24 2019-11-19 华为技术有限公司 Image processing method and device
CN110689115A (en) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Neural network model processing method and device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837922A (en) * 2021-09-26 2021-12-24 安徽寒武纪信息科技有限公司 Computing device, data processing method and related product
CN115858178A (en) * 2023-02-21 2023-03-28 芯砺智能科技(上海)有限公司 Method, device, medium and equipment for resource sharing in convolution calculation
CN115858178B (en) * 2023-02-21 2023-06-06 芯砺智能科技(上海)有限公司 Method, device, medium and equipment for sharing resources in convolution calculation

Also Published As

Publication number Publication date
WO2021147567A1 (en) 2021-07-29

Similar Documents

Publication Publication Date Title
CN109993299B (en) Data training method and device, storage medium and electronic device
US10691996B2 (en) Hardware accelerator for compressed LSTM
US9152601B2 (en) Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
US8959138B2 (en) Distributed data scalable adaptive map-reduce framework
US8887165B2 (en) Real time system task configuration optimization system for multi-core processors, and method and program
CN113222125A (en) Convolution operation method and chip
Peng et al. GLU3. 0: Fast GPU-based parallel sparse LU factorization for circuit simulation
CN113994350A (en) Generating parallel computing schemes for neural networks
CN111338695B (en) Data processing method based on pipeline technology and related product
CN111290852A (en) Method, system and computer readable medium for scheduling task graph operations
CN112784973A (en) Convolution operation circuit, device and method
CN113222136A (en) Convolution operation method and chip
CN113222099A (en) Convolution operation method and chip
CN114503126A (en) Matrix operation circuit, device and method
Krömer et al. A comparison of many-threaded differential evolution and genetic algorithms on CUDA
Aksenova et al. The models and methods of optimal control of three work-stealing deques located in a shared memory
Schmidt et al. Load-balanced parallel constraint-based causal structure learning on multi-core systems for high-dimensional data
Al Maruf et al. Optimizing DNNs Model Partitioning for Enhanced Performance on Edge Devices.
CN114283046A (en) Point cloud file registration method and device based on ICP algorithm and storage medium
Siládi et al. Adapted parallel Quine-McCluskey algorithm using GPGPU
CN114691142A (en) Compiling method of execution program, chip, electronic device, and computer-readable storage medium
TWI753728B (en) Architecture and cluster of processing elements and method of convolution operation
Souissi et al. Optimization of matching and scheduling on heterogeneous CPU/FPGA architectures
CARMONA et al. REINSURANCE ANALYTICS USING SERIAL AND PARALLEL COMPUTATION ON THE MULTIOBJECTIVE EVOLUTIONARY ALGORITHM SPEA2
Soiman et al. A parallel accelerated approach of HMM Forward Algorithm for IBM Roadrunner clusters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination