CN112132275A - Parallel computing method and device - Google Patents

Parallel computing method and device Download PDF

Info

Publication number
CN112132275A
Authority
CN
China
Prior art keywords
calculation
parallelism
image
convolution
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011059959.4A
Other languages
Chinese (zh)
Other versions
CN112132275B (en)
Inventor
王丹阳
林军
谢逍茹
陶为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fengxing Technology Co ltd
Original Assignee
Nanjing Fengxing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fengxing Technology Co ltd filed Critical Nanjing Fengxing Technology Co ltd
Priority to CN202011059959.4A priority Critical patent/CN112132275B/en
Publication of CN112132275A publication Critical patent/CN112132275A/en
Application granted granted Critical
Publication of CN112132275B publication Critical patent/CN112132275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a parallel computing method and apparatus for a sparse neural network processor. The parallel computing method comprises the following steps: a convolution calculation unit acquires image data to be processed, wherein the image data to be processed comprises an image channel and an image size; a parallelism is generated according to the image channel and the image size; convolution calculation is performed on the image data to be processed according to the parallelism to obtain unit calculation results; a processing unit processes the unit calculation results to obtain a convolution calculation result; and an accumulator accumulates the convolution calculation result. Through parallel computing, the method and apparatus can process one or more image channels simultaneously and improve the utilization rate of the convolution calculation units in a sparse neural network processor.

Description

Parallel computing method and device
Technical Field
The invention relates to the technical field of convolutional neural network acceleration, and in particular to a parallel computing method and apparatus for a sparse neural network.
Background
Convolutional neural networks (CNNs), including deep convolutional neural networks (DCNNs), are mainly used for image processing and may also be applied to other types of input, such as audio. A sparse neural network is a sparse convolutional neural network; it converts samples into a suitable sparse representation, which simplifies the learning task and reduces the complexity of the model.
A traditional sparse neural network processor runs large network models that require massive amounts of calculation to complete their tasks, and its essence is a convolution calculation method. In the traditional convolution calculation method, the weight data required by the convolution calculation must be stored online while a single image is being calculated, so the operation efficiency is low; for convolution layers with a large number of channels, this also consumes a large share of the processor's storage resources.
The convolution calculation method of the existing sparse neural network processor therefore suffers from low operation efficiency and wasted storage resources, which leads to a great waste of the processor's computing power. The present application is proposed based on this application scenario.
Disclosure of Invention
Based on the above problems, an object of the present application is to provide a parallel computing method and apparatus that improve the utilization rate of the convolution calculation units, so as to solve the technical problems in the prior art.
In a first aspect, an embodiment of the present application provides a parallel computing method for a sparse neural network processor, comprising the following steps:
the convolution calculation unit acquires image data to be processed, wherein the image data to be processed comprises: image channel and image size; generating parallelism according to the image channel and the image size; performing convolution calculation on the image data according to the parallelism to obtain a unit calculation result;
the processing unit processes the unit calculation result to obtain a channel calculation result;
the accumulator accumulates the channel calculation results.
In a second aspect, an embodiment of the present application provides a parallel computing apparatus for a sparse neural network processor, in which:
the convolution calculation unit acquires image data to be processed, wherein the image data to be processed comprises: image channel and image size; generating parallelism according to the image channel and the image size; performing convolution calculation on the image data according to the parallelism to obtain a unit calculation result;
the processing unit processes the unit calculation result to obtain a channel calculation result;
the accumulator accumulates the channel calculation results.
According to the above technical solutions, the present application provides a parallel computing method and apparatus for a sparse neural network processor. In the parallel computing method, the convolution calculation unit acquires image data to be processed, wherein the image data to be processed comprises an image channel and an image size; a parallelism is generated according to the image channel and the image size; the convolution calculation unit performs convolution calculation on the image data according to the parallelism to obtain unit calculation results; the processing unit processes the unit calculation results to obtain channel calculation results; and the accumulator accumulates the channel calculation results. Through parallel computing, the method and apparatus can process one or more image channels simultaneously and improve the utilization rate of the convolution calculation units in a sparse neural network processor.
Drawings
In order to explain the technical solution of the application more clearly, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows the detailed implementation steps of a parallel computing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a parallel computing method for generating a first parallelism according to an embodiment of the present application;
fig. 3 is a schematic diagram of a parallel computing method for generating a second parallelism according to an embodiment of the present application;
fig. 4 is a schematic diagram of a parallel computing method for generating a third parallelism according to an embodiment of the present application;
fig. 5 is a schematic diagram of a parallel computing method for generating a fourth parallelism according to an embodiment of the present application;
fig. 6 is a parallel computing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions in the embodiments of the present application better understood and make the above objects, features and advantages of the embodiments of the present application more comprehensible, the technical solutions in the embodiments of the present application are described in further detail below with reference to the accompanying drawings. It should be apparent that the described exemplary embodiments are only some embodiments of the present application, and not all embodiments.
In order to facilitate understanding of the present application, some terms in the embodiments of the present application are first explained so as to be easily understood by those skilled in the art.
(1) Convolution calculation:
the convolution calculation includes:
an input matrix comprising four dimensions: the number of samples, the image height, the image width, and the number of image channels;
an output matrix comprising four dimensions: the number of samples, the image height, the image width, and the number of image channels; the image height and image width of the output matrix change during the calculation, and the number of image channels changes as well;
a convolution kernel (weight matrix) comprising four dimensions: the convolution kernel height, the convolution kernel width, the number of input channels, and the number of output channels (the number of convolution kernels); the meaning of the convolution kernel dimensions differs from that of the dimensions of the input matrix and the output matrix.
The number of input channels of the convolution kernel is determined by the number of channels of the input matrix; the number of channels of the output matrix is determined by the number of output channels of the convolution kernel.
For example, a convolution calculation with an input channel number of 128, an output channel number of 128, and a convolution kernel size of 3 × 3 is performed as follows:
(1) the convolution calculation unit carries out the kernel-internal operation;
(2) accumulation between input channels:
the kernel-internal operation values of the convolution kernels corresponding to the 128 input channels are added to obtain one output channel; this operation is repeated 128 times to obtain 128 output channels, completing the 128-input-channel, 128-output-channel, 3 × 3 convolution operation.
Next, an application scenario proposed by the present application is introduced. Convolutional neural networks (CNNs) are mainly used for image processing and may also be applied to other types of input, such as audio. A sparse neural network is a sparse convolutional neural network; it converts samples into a suitable sparse representation, which simplifies the learning task and reduces the complexity of the model.
A traditional sparse neural network processor runs large network models that require massive amounts of calculation to complete their tasks, and its essence is a convolution calculation method. In the traditional convolution calculation method, the weight data required by the convolution calculation must be stored online while a single image is being calculated, so the operation efficiency is low; for convolution layers with a large number of channels, this also consumes a large share of the processor's storage resources.
When a sparse neural network processor is designed, the storage capacity of its internal storage unit is fixed and cannot be expanded without limit. The memory space an image occupies in a storage unit is calculated as: storage = image size × number of image channels. During the deep processing of an image, a reduction of the image size is always accompanied by an increase of the number of image channels. For example, an image may have a size of 416 × 416 with, in this application scene, 3 image channels; after deep processing, the image size becomes 13 × 13 and, in this application scene, the number of image channels is 1024. When the image size is smaller than the maximum size that the hardware calculation unit can process, the utilization rate of the convolution calculation unit drops, causing a great waste of computing power.
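The storage formula can be checked with a quick calculation using the two sizes above (a sketch; the helper name is hypothetical):

```python
# storage = image height x image width x number of image channels
def storage(height, width, channels):
    return height * width * channels

print(storage(416, 416, 3))     # 519168 values for the 416 x 416 x 3 input image
print(storage(13, 13, 1024))    # 173056 values for the 13 x 13 x 1024 feature map
# The deep feature map occupies less storage per image, so compute and storage
# provisioned for the large fixed-size image go under-used as the size shrinks.
```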
At present, when image recognition or structured video analysis is performed, a preprocessor in the computer processes the image to be processed into one or more images of fixed size and then sends these fixed-size images to a convolutional neural network processor for processing. The purpose of the preprocessing is to make the apparent characteristics of each image (such as color distribution, overall brightness and size) as consistent as possible without changing the essential information carried by the image, so as to facilitate the subsequent processing. In the conventional convolution operation method, a convolution calculation unit (CCU) processes fixed-size images in a one-to-one manner: one convolution calculation unit has the capability of processing one fixed-size image. As the image size shrinks, however, the convolution calculation unit can no longer be fully utilized, and its low utilization rate makes the operation efficiency of the processor low. At the same time, the storage space the processor provides for a fixed-size image can no longer be fully utilized either, and the low utilization rate of the storage space wastes the processor's storage resources. The convolution operation method of the conventional convolutional neural network processor therefore suffers from low operation efficiency and wasted storage resources, which leads to a great waste of the processor's computing power. The present application is proposed based on this application scenario.
Referring to fig. 1, fig. 1 shows a parallel computing method for a sparse neural network processor, comprising the steps of:
s1: the convolution calculation unit acquires image data to be processed, wherein the image data to be processed comprises an image channel and an image size; generating parallelism according to the image channel and the image size; and configuring the convolution calculation unit as a calculation group to perform convolution calculation on the image according to the parallelism, and outputting a calculation result of the unit.
The generating parallelism according to the image size and the image channel comprises the following steps:
s11: generating preliminary parallelism according to the image channel;
In a possible embodiment, when the acquired image channel is 1-256, a first parallelism is generated;
when the acquired image channel is 257-512, a second parallelism is generated;
when the acquired image channel is 513-1024, a third parallelism is generated;
when the acquired image channel is greater than 1024, a fourth parallelism is generated;
s12: adjusting the preliminary parallelism according to the image size to generate the parallelism;
s121: when the image size is smaller than the preliminary parallelism support image size;
adjusting the preliminary parallelism to generate a parallelism;
s122: when the image size is larger than the preliminary parallelism support image size; splitting the image into image sizes which can be supported by the preliminary parallelism;
adjusting the preliminary parallelism to generate a parallelism;
in a possible embodiment, when the acquisition image channels are 1-256, the first parallelism is generated, the support image size is 64 x 8;
when the size of the acquired image is (33-). multidot.8, keeping the parallelism as a first parallelism;
when the size of the acquired image is (17-32) × 8, adjusting the parallelism to be a second parallelism;
when the size of the obtained image is (9-16) × 8, adjusting the parallelism to be a third parallelism;
and when the size of the acquired image is (1-8) × 8, adjusting the parallelism to be a fourth parallelism.
In a possible embodiment, when the acquired image channel is 257-512 and the second parallelism is generated, the supported image size is 32 × 8;
when the acquired image size is (17-32) × 8, the parallelism is kept as the second parallelism;
when the acquired image size is (9-16) × 8, the parallelism is adjusted to the third parallelism;
when the acquired image size is (1-8) × 8, the parallelism is adjusted to the fourth parallelism.
In a possible embodiment, when the acquired image channel is 513-1024 and the third parallelism is generated, the supported image size is 16 × 8;
when the acquired image size is (9-16) × 8, the parallelism is kept as the third parallelism;
when the acquired image size is (1-8) × 8, the parallelism is adjusted to the fourth parallelism.
And if the acquired image size is larger than the maximum image size which can be supported by the parallelism, splitting the image into the image sizes which can be supported by the parallelism.
In a possible embodiment, when the acquired image channel is 257-512 and the second parallelism is generated, the supported image size is 32 × 8;
when the acquired image size is 128 × 128, which is larger than the image size that the parallelism can support, the image to be processed is split into 64 images with an image size of 32 × 8 for processing.
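The channel and size rules above can be gathered into a short sketch. This is an illustrative reconstruction of the described rules, not code from the patent; all names and the tiling arithmetic are assumptions.

```python
PARALLELISM_VALUE = {1: 1, 2: 2, 3: 4, 4: 8}     # parallelism level -> value N
SUPPORTED_WIDTH   = {1: 64, 2: 32, 3: 16, 4: 8}  # level -> max width (height fixed at 8)

def preliminary_level(channels):
    if channels <= 256:
        return 1
    if channels <= 512:
        return 2
    if channels <= 1024:
        return 3
    return 4                                      # more than 1024 image channels

def width_level(width):
    if width > 32:
        return 1
    if width > 16:
        return 2
    if width > 8:
        return 3
    return 4

def generate_parallelism(channels, width, height=8):
    level = preliminary_level(channels)           # S11: preliminary parallelism
    max_w = SUPPORTED_WIDTH[level]
    if width > max_w or height > 8:
        # S122: image larger than supported, split into max_w x 8 tiles
        tiles = -(-width // max_w) * -(-height // 8)   # ceiling division
        return PARALLELISM_VALUE[level], tiles
    level = max(level, width_level(width))        # S121: smaller images raise parallelism
    return PARALLELISM_VALUE[level], 1

print(generate_parallelism(128, 64))        # (1, 1): first parallelism kept
print(generate_parallelism(128, 16))        # (4, 1): adjusted to third parallelism
print(generate_parallelism(400, 128, 128))  # (2, 64): split into 64 tiles of 32 x 8
```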
The parallelism includes: a first degree of parallelism, the first degree of parallelism having a value of 1;
a second parallelism, the value of which is 2;
a third parallelism, the value of which is 4;
a fourth parallelism, the fourth parallelism having a value of 8;
and when the value of the generated parallelism is N, the M convolution calculation units are averagely configured to perform convolution calculation on the N image channels by the N calculation groups to obtain N unit calculation results.
In a feasible embodiment, M is 8, and the number of convolution calculation units is 8;
the convolution calculation unit includes: a first calculation unit 11, a second calculation unit 12, a third calculation unit 13, a fourth calculation unit 14, a fifth calculation unit 15, a sixth calculation unit 16, a seventh calculation unit 17, and an eighth calculation unit 18.
When the value of the parallelism is 1, the 8 calculation units are configured as 1 calculation group that performs convolution calculation on 1 image channel and outputs 8 unit calculation results;
when the value of the parallelism is 2, the 8 calculation units are evenly configured into 2 calculation groups that perform convolution calculation on 2 image channels; that is, the first calculation unit 11, the second calculation unit 12, the third calculation unit 13 and the fourth calculation unit 14 are configured as 1 calculation group, and the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17 and the eighth calculation unit 18 are configured as 1 calculation group, with each group outputting 4 unit calculation results;
when the value of the parallelism is 4, the 8 calculation units are evenly configured into 4 calculation groups that perform convolution calculation on 4 image channels; that is, the first calculation unit 11 and the fifth calculation unit 15 are configured as 1 calculation group, the third calculation unit 13 and the seventh calculation unit 17 as 1 calculation group, the second calculation unit 12 and the sixth calculation unit 16 as 1 calculation group, and the fourth calculation unit 14 and the eighth calculation unit 18 as 1 calculation group, with each group outputting 2 unit calculation results;
when the value of the parallelism is 8, the 8 calculation units are evenly configured into 8 calculation groups that perform convolution calculation on 8 image channels, with each group outputting 1 unit calculation result.
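The four groupings just described can be written out explicitly, as in the following sketch (the unit labels follow the reference numerals in the text; the mapping itself is an assumption drawn from this description):

```python
UNITS = [11, 12, 13, 14, 15, 16, 17, 18]   # reference numerals of the 8 units

GROUPINGS = {
    1: [[11, 12, 13, 14, 15, 16, 17, 18]],        # 1 group of 8 units, 1 channel
    2: [[11, 12, 13, 14], [15, 16, 17, 18]],      # 2 groups of 4 units, 2 channels
    4: [[11, 15], [13, 17], [12, 16], [14, 18]],  # 4 groups of 2 units, 4 channels
    8: [[u] for u in UNITS],                      # 8 groups of 1 unit, 8 channels
}

for n, groups in GROUPINGS.items():
    per_group = len(UNITS) // n
    print(f"parallelism {n}: {n} group(s) of {per_group} unit(s):", groups)
```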
S2: the processing unit processes the unit calculation result to obtain a channel calculation result;
the processing unit is further configured to:
s21: if N is smaller than M, the calculation result of the connector combination unit obtains a channel calculation result, and if N is equal to M, the unit calculation result is equal to the channel calculation result;
s22: if N is greater than 1, the adder adds the channel calculation results.
When N is smaller than M, several convolution calculation units cooperatively process 1 image channel, and the connector combines their unit calculation results to obtain a channel calculation result.
In a feasible embodiment, when N is 1, the parallelism value is 1: the 8 calculation units cooperatively process 1 image channel, and the connector combines the 8 unit calculation results to output 1 channel calculation result. The parallelism value is not within the second threshold range, so 1 calculation group completes the convolution calculation of 1 image channel and outputs 1 channel calculation result; since there is only 1 channel calculation result, no adder is needed, and the channel calculation result is equal to the convolution calculation result.
When N is 2, the parallelism value is 2: 4 calculation units cooperatively calculate 1 image channel, and the connector combines 4 unit calculation results to output 2 channel calculation results. The parallelism value is within the second threshold range, so the 2 calculation groups complete the convolution calculation of 2 image channels and output the convolution calculation results of 2 image channels, and the adder adds the 2 channel calculation results to obtain the convolution calculation result.
When N is 4, the parallelism value is 4: 2 calculation units cooperatively calculate 1 image channel, and the connector combines 2 unit calculation results to output 4 channel calculation results. The parallelism value is within the second threshold range, so the 4 calculation groups complete the convolution calculation of 4 image channels and output the convolution calculation results of 4 image channels, and the adder adds the 4 channel calculation results to obtain the convolution calculation result.
When N is 8, the parallelism value is 8, which is not within the first threshold range, so the unit calculation results of the 8 calculation units are output directly and are equal to the channel calculation results. The parallelism value is within the second threshold range, so the 8 calculation groups complete the convolution calculation of 8 image channels and output 8 channel calculation results, and the adder is called to add the 8 channel calculation results to obtain the convolution calculation result.
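The connector and adder stage can be modelled with a short sketch. Here, combining unit results is modelled as concatenating image tiles, which is an assumption about what the connector does; the function name and shapes are likewise hypothetical.

```python
import numpy as np

M = 8  # number of convolution calculation units

def process(unit_results, n):
    """Connector + adder stage for parallelism value n: the M units form n
    groups, each producing one channel result from M // n unit results."""
    per_group = M // n
    # connector: combine each group's unit results into one channel result
    channel_results = [
        np.concatenate(unit_results[g * per_group:(g + 1) * per_group])
        for g in range(n)
    ]
    if n == 1:
        return channel_results[0]            # single channel result, no adder needed
    return np.sum(channel_results, axis=0)   # adder: add the n channel results

unit_results = [np.ones((8, 8)) for _ in range(M)]  # dummy 8 x 8 unit outputs
print(process(unit_results, 1).shape)  # (64, 8): 8 tiles combined into 1 channel result
print(process(unit_results, 4).shape)  # (16, 8): 4 channel results, then summed
```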
S3: the accumulator accumulates the convolution calculation results.
In a possible embodiment, a convolution calculation is shown with an input channel number of 128, an output channel number of 128, and a convolution kernel size of 3 x 3;
referring to fig. 2, fig. 2 shows a schematic diagram of a parallel computing method for generating a first parallelism, in a feasible embodiment, the obtained image channel is 128, and the first parallelism is generated; acquiring an image size of 64 x 8, and keeping the parallelism as a first parallelism; the value of the first parallelism is 1, 8 convolution calculation units are configured to perform convolution calculation on 1 image channel by 1 calculation group, an image to be processed with an image size of 64 x 8 is split into 8 images with an image size of 8 x 8 to be processed, each convolution calculation unit completes convolution calculation with an image size of 8 x 8, and the connector combines calculation results of 8 units to obtain calculation results of 1 channel because 8 convolution calculation units cooperatively process one image. The above operation is repeated 128 times to obtain 128 convolution calculation results, and the accumulator accumulates the convolution calculation results in the time direction.
Referring to fig. 3, fig. 3 shows a schematic diagram of a parallel computing method for generating the second parallelism, and in a possible embodiment, the obtained image channel is 128, and the first parallelism is generated; acquiring an image size of 32 x 8, and adjusting the parallelism to be a second parallelism; the value of the second parallelism is 2, 8 convolution calculation units are configured to perform convolution calculation on 2 image channels by 2 calculation groups, an image to be processed with the image size of 32 × 8 is split into 4 images with the image size of 8 × 8 to be processed, each convolution calculation unit completes convolution calculation with the image size of 8 × 8, because the 4 convolution calculation units cooperatively process one image, the connector combines calculation results of the 4 units to obtain calculation results of 1 channel, and the adder adds calculation results of the 2 channels to obtain a convolution calculation result. Repeating the above operations 64 times to obtain 64 convolution calculation results, and accumulating the convolution calculation results in the time direction by the accumulator.
Referring to fig. 4, fig. 4 shows a schematic diagram of a parallel computing method for generating a third parallelism, and in a possible embodiment, the obtained image channel is 128, and the first parallelism is generated; acquiring an image size of 16 x 8, and adjusting the parallelism to a third parallelism; the value of the third parallelism is 4, the 8 convolution calculation units are configured to perform convolution calculation on 4 image channels by 4 calculation groups, the image to be processed with the image size of 16 × 8 is split into 2 images with the image size of 8 × 8 to be processed, each convolution calculation unit completes convolution calculation with the image size of 8 × 8, because the 2 convolution calculation units cooperatively process one image, the connector combines the calculation results of the 2 units to obtain 1 channel calculation result, and the adder adds the calculation results of the 4 channels to obtain a convolution calculation result. Repeating the above operation 32 times to obtain 32 convolution calculation results, and accumulating the convolution calculation results in the time direction by the accumulator.
Referring to fig. 5, fig. 5 shows a schematic diagram of a parallel computing method for generating a fourth parallelism, in a feasible embodiment, the acquired image channel is 128, and the first parallelism is generated; acquiring the size of an image as 8 x 8, and adjusting the parallelism to be a fourth parallelism; the value of the fourth parallelism is 8, 8 convolution calculation units perform convolution calculation on 8 image channels, each convolution calculation unit completes convolution calculation of 8 × 8 image size to obtain 8 channel calculation results, and the adder adds the 8 channel calculation results to obtain a convolution calculation result. The above work is repeated 16 times to obtain 32 convolution calculation results, and the accumulator accumulates the convolution calculation results in the time direction.
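The schedules of the four embodiments follow one pattern, which the following sketch tabulates (derived from the numbers above; the variable names are hypothetical):

```python
# 128 input channels, 8 convolution units, each unit handling one 8 x 8 tile
IN_CHANNELS, TOTAL_UNITS, TILE_WIDTH = 128, 8, 8

for width, parallelism in [(64, 1), (32, 2), (16, 4), (8, 8)]:
    units_per_channel = TOTAL_UNITS // parallelism
    tiles_per_channel = width // TILE_WIDTH      # 8 x 8 tiles per image channel
    assert tiles_per_channel == units_per_channel
    passes = IN_CHANNELS // parallelism          # repetitions accumulated over time
    print(f"{width} x 8 image: parallelism {parallelism}, "
          f"{units_per_channel} unit(s) per channel, {passes} passes")
# 64 x 8 image: parallelism 1, 8 unit(s) per channel, 128 passes
# 32 x 8 image: parallelism 2, 4 unit(s) per channel, 64 passes
# 16 x 8 image: parallelism 4, 2 unit(s) per channel, 32 passes
# 8 x 8 image: parallelism 8, 1 unit(s) per channel, 16 passes
```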
Referring to fig. 6, fig. 6 shows a parallel computing device for a sparse neural network processor, comprising: convolution calculating unit 1, processing unit 2 and accumulator 3, processing unit 2 includes: a connector 21 and an adder 22.
The convolution calculation unit 1 acquires image data to be processed, wherein the image data to be processed comprises an image channel and an image size; a parallelism is generated according to the image channel and the image size; the convolution calculation units are configured into calculation groups that perform convolution calculation on the image according to the parallelism and output unit calculation results.
The generating parallelism according to image channels and image sizes comprises the following steps:
the convolution calculation unit 1 generates a preliminary parallelism according to the image channel;
In a possible embodiment, when the acquired image channel is 1-256, a first parallelism is generated;
when the acquired image channel is 257-512, a second parallelism is generated;
when the acquired image channel is 513-1024, a third parallelism is generated;
when the acquired image channel is greater than 1024, a fourth parallelism is generated;
the convolution calculating unit 1 adjusts the preliminary parallelism according to the image size to generate the parallelism;
in a possible embodiment, when the acquisition image channels are 1-256, the first parallelism is generated, the support image size is 64 x 8;
when the size of the acquired image is (33-). multidot.8, keeping the parallelism as a first parallelism;
when the size of the acquired image is (17-32) × 8, adjusting the parallelism to be a second parallelism;
when the size of the obtained image is (9-16) × 8, adjusting the parallelism to be a third parallelism;
and when the size of the acquired image is (1-8) × 8, adjusting the parallelism to be a fourth parallelism.
In a possible embodiment, when the acquired image channel is 257-512 and the second parallelism is generated, the supported image size is 32 × 8;
when the acquired image size is (17-32) × 8, the parallelism is kept as the second parallelism;
when the acquired image size is (9-16) × 8, the parallelism is adjusted to the third parallelism;
when the acquired image size is (1-8) × 8, the parallelism is adjusted to the fourth parallelism.
In a possible embodiment, when the acquired image channel is 513-1024 and the third parallelism is generated, the supported image size is 16 × 8;
when the acquired image size is (9-16) × 8, the parallelism is kept as the third parallelism;
when the acquired image size is (1-8) × 8, the parallelism is adjusted to the fourth parallelism.
In a possible embodiment, when the acquired image channel is 257-512 and the second parallelism is generated, the supported image size is 32 × 8;
when the acquired image size is 128 × 128, which is larger than the image size that the parallelism can support, the image to be processed is split into 64 images with an image size of 32 × 8 for processing.
The parallelism includes: a first degree of parallelism, the first degree of parallelism having a value of 1;
a second parallelism, the value of which is 2;
a third parallelism, the value of which is 4;
a fourth parallelism, the fourth parallelism having a value of 8;
and when the value of the generated parallelism is N, the M convolution calculation units are averagely configured to perform convolution calculation on the N images by the N calculation groups to obtain N unit calculation results.
In a feasible embodiment, M is 8, and the number of convolution calculation units is 8;
when the value of the parallelism is 1, 8 computing units are configured to perform convolution calculation on 1 image channel by 1 computing group, and 8 unit computing results are output;
when the value of the parallelism is 2, that is, the first calculation unit 11, the second calculation unit 12, the third calculation unit 13, and the fourth calculation unit 14 are configured as 1 calculation group, the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17, and the eighth calculation unit 18 are configured as 1 calculation group, each group outputting 4 unit calculation results;
when the value of the parallelism is 4, the 8 calculation units are averagely configured to perform convolution calculation on the 4 images by 4 calculation groups, that is, the first calculation unit 11 and the fifth calculation unit 15 are configured as 1 calculation group, the third calculation unit 13 and the seventh calculation unit 17 are configured as 1 calculation group, the second calculation unit 12 and the sixth calculation unit 16 are configured as 1 calculation group, and the fourth calculation unit 14 and the eighth calculation unit are configured as 1 calculation group, and each group outputs 2 unit calculation results.
When the parallelism value is 8, 8 computing units are averagely configured to perform convolution operation on 8 image channels by 8 computing groups, and each group outputs 1 unit computing result.
The processing unit 2 processes the unit calculation result to obtain a channel calculation result; the processing unit 2 comprises a connector 21 and an adder 22;
the connector 21 combining unit obtains a channel calculation result by calculating a result;
the adder 22 adds the channel calculation results to obtain a convolution calculation result.
The processing unit 2 is further configured to:
if N is less than M, the connector 21 combines the unit calculation results to obtain a channel calculation result, and if N is equal to M, the unit calculation result is equal to the channel calculation result;
if N is greater than 1, the adder 22 adds the channel calculation results.
And when N is smaller than M, the plurality of convolution computing units cooperatively process 1 image channel, and the connector 21 combines the computing results of the units to obtain a channel computing result.
In addition, the parallel computing apparatus according to the present invention can perform convolution operations under the first parallelism, the second parallelism, the third parallelism and the fourth parallelism at the same time, and a high-parallelism result can be accumulated on the basis of a low-parallelism result.
The convolution calculation unit 1 includes: a first calculation unit 11, a second calculation unit 12, a third calculation unit 13, a fourth calculation unit 14, a fifth calculation unit 15, a sixth calculation unit 16, a seventh calculation unit 17, and an eighth calculation unit 18.
In a feasible embodiment, when N is 1, the parallelism value is 1: the 8 calculation units cooperatively process 1 image channel, and the connector 21 combines the 8 unit calculation results to output 1 channel calculation result. The parallelism value is not within the second threshold range, so 1 calculation group completes the convolution calculation of 1 channel and outputs the channel calculation result of 1 image; since only 1 channel calculation result is produced, the adder 22 is not called, and the channel calculation result is equal to the convolution calculation result.
When N is 2, the parallelism value is 2: the first calculation unit 11, the second calculation unit 12, the third calculation unit 13 and the fourth calculation unit 14 are configured as 1 calculation group, and the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17 and the eighth calculation unit 18 are configured as 1 calculation group. 4 calculation units cooperatively calculate 1 image channel, and the connector 21 combines 4 unit calculation results to output 2 channel calculation results. The parallelism value is within the second threshold range, so the 2 calculation groups complete the convolution calculation of 2 channels and output the convolution calculation results of 2 images, and the adder 22 adds the 2 channel calculation results. The specific process is: the result of the first calculation unit 11 is added to that of the fifth calculation unit 15, the second calculation unit 12 to the sixth calculation unit 16, the third calculation unit 13 to the seventh calculation unit 17, and the fourth calculation unit 14 to the eighth calculation unit 18, obtaining the convolution calculation result.
When N is 4, the parallelism value is 4: the first calculation unit 11 and the fifth calculation unit 15 are configured as 1 calculation group, the third calculation unit 13 and the seventh calculation unit 17 as 1 calculation group, the second calculation unit 12 and the sixth calculation unit 16 as 1 calculation group, and the fourth calculation unit 14 and the eighth calculation unit 18 as 1 calculation group. 2 calculation units cooperatively calculate 1 image channel, and the connector 21 combines 2 unit calculation results to output 4 channel calculation results. The parallelism value is within the second threshold range, so the 4 calculation groups complete the convolution calculation of 4 channels and output the convolution calculation results of 4 images. The adder 22 adds the 4 channel calculation results on the basis of the pairwise sums already formed at parallelism 2: the sum of units 11 and 15 is added to the sum of units 13 and 17, and the sum of units 12 and 16 is added to the sum of units 14 and 18, obtaining the convolution calculation result.
When N is 8, the parallelism value is 8, which is not within the first threshold range, so the unit calculation results of the 8 calculation units are output directly and are equal to the channel calculation results. The parallelism value is within the second threshold range, so the 8 calculation groups complete the convolution calculation of 8 channels and output the channel calculation results of 8 images. The adder 22 adds the 8 channel calculation results on the basis of the sums already formed at parallelism 4: the sum of units 11, 15, 13 and 17 is added to the sum of units 12, 16, 14 and 18, obtaining the convolution calculation result.
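The reuse of lower-parallelism sums described above forms a small adder tree, which the following sketch traces with dummy scalar results (the dictionary keys are hypothetical labels):

```python
# dummy unit results, labelled by the units' reference numerals
r = {u: float(u) for u in (11, 12, 13, 14, 15, 16, 17, 18)}

# parallelism 2: add the two groups unit by unit
p2 = {"11+15": r[11] + r[15], "12+16": r[12] + r[16],
      "13+17": r[13] + r[17], "14+18": r[14] + r[18]}

# parallelism 4: built on the parallelism-2 sums
p4 = {"(11+15)+(13+17)": p2["11+15"] + p2["13+17"],
      "(12+16)+(14+18)": p2["12+16"] + p2["14+18"]}

# parallelism 8: built on the parallelism-4 sums, covering all 8 units
p8 = p4["(11+15)+(13+17)"] + p4["(12+16)+(14+18)"]
print(p8)   # 116.0, the sum over all eight dummy results
```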
A selector 4 is used to distinguish the convolution operation results under different parallelisms;
when the parallel apparatus operates under the four parallelisms at the same time, the selector 4 inputs the convolution operation results calculated under the different parallelisms into the accumulator 3 for accumulation.
The accumulator 3 accumulates the convolution calculation results. For specific accumulation processes, see the above examples.
The present application provides a parallel computing method and apparatus for a sparse neural network processor that also support non-sparse networks. They change the original computing mode, in which 1 convolution calculation unit processes 1 image channel, into one in which multiple convolution calculation units process one or more image channels. In the parallel computing process, a lower parallelism handles a larger image size, while a higher parallelism supports a larger image channel count. For example, if one convolution calculation unit can support a convolution image size of 8 × 8 and 256 image channels, the characteristics jointly supported by 8 convolution calculation units are as follows: at parallelism 1, a convolution image size of 64 × 8 and 256 image channels are supported, i.e. the 8 convolution calculation units complete the convolution of 1 channel of size 64 × 8, repeated 256 times; at parallelism 2, a convolution image size of 32 × 8 and 512 input and output channels are supported, i.e. the 8 convolution calculation units are divided into two groups that complete the 2-channel convolution of size 32 × 8, repeated 512 times; at parallelism 4, a convolution image size of 16 × 8 and 1024 input and output channels are supported; and at parallelism 8, a convolution image size of 8 × 8 and 2048 input and output channels are supported.
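The combinations described in this paragraph can be tabulated as follows (the units-per-channel column is derived from the grouping of the 8 units):

Parallelism   Supported image size   Supported input/output channels   Units per channel
1             64 × 8                 256                               8
2             32 × 8                 512                               4
4             16 × 8                 1024                              2
8             8 × 8                  2048                              1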
Embodiments of the present application also provide a computer program product comprising one or more computer program instructions. When the computer program instructions are loaded and executed by a computer, the processes or functions according to the various embodiments described above in the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. When the method is run on a computer, the method provided by the embodiment of the application is executed by the computer.
The present embodiment also provides a computer-readable storage medium, which can store computer program instructions; when the program instructions are executed, all the steps of the parallel computing method according to the above-mentioned embodiments of the present application can be implemented. The computer-readable storage medium includes a magnetic disk, an optical disk, a read-only memory ROM, a random access memory RAM, and the like. In the above embodiments, all or part may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product, which is not limited. Those skilled in the art will also appreciate that the various illustrative logical blocks and steps (step) set forth herein may be implemented in electronic hardware, computer software, or combinations of both. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the functions in various ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The various illustrative logical units and circuits described in this application may be implemented or operated through the design of a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration. The steps of a method or algorithm described in this application may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software units may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a UE. In the alternative, the processor and the storage medium may reside in different components in the UE. It should be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not mean the execution sequence; the execution sequence of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the present application.
Furthermore, the terms "first," "second," "third," and the like in the description and in the claims of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by way of software plus a required general hardware platform. Based on such understanding, the technical solutions in the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method of the embodiments or some portions thereof in the embodiments of the present application. The same and similar parts in the various embodiments in this specification may be referred to each other. In particular, as for the network device/node or the device, since it is basically similar to the method embodiment, the description is simple, and the relevant points can be referred to the description in the method embodiment.
The above embodiments of the present application do not limit the scope of the present application.
It should be understood that the terms "first," "second," "third," and the like in the description and in the claims of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances and can be implemented in sequences other than those illustrated or otherwise described herein with respect to the embodiments of the application, for example.
Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or conventionally used in the art.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A parallel computing method for a sparse neural network processor, comprising the following steps:
the convolution calculation unit acquires image data to be processed, wherein the image data to be processed comprises: image channel and image size; generating parallelism according to the image channel and the image size; performing convolution calculation on the image data according to the parallelism to obtain a unit calculation result;
the processing unit processes the unit calculation result to obtain a convolution calculation result;
and the accumulator accumulates the convolution calculation result.
2. A parallel computing method according to claim 1, characterized in that:
when the value of the generated parallelism is N, the M convolution calculation units are evenly configured into N calculation groups that perform convolution operation on N image channels to obtain N unit calculation results.
3. A parallel computing method according to claim 2, wherein the processing unit comprises: a connector and an adder;
the connector combines the unit calculation results to obtain a channel calculation result;
and the adder adds the channel calculation results to obtain a convolution calculation result.
4. A parallel computing method according to claim 3, characterized in that:
if N is smaller than M, the connector combines the unit calculation results to obtain a channel calculation result;
if N equals M, the cell computation result equals the channel computation result;
and if N is larger than 1, the adder adds the channel calculation results to obtain a convolution calculation result.
5. A parallel computing apparatus for a sparse neural network processor, comprising:
a convolution calculation unit (1) for obtaining image data to be processed, the image data to be processed comprising: image channel and image size; generating parallelism according to the image channel and the image size; performing convolution calculation on the image data to be processed according to the parallelism to obtain a unit calculation result;
the processing unit (2) is used for processing the unit calculation result to obtain a convolution calculation result;
and an accumulator (3) for accumulating the convolution calculation result.
6. A parallel computing arrangement according to claim 5, wherein the convolution calculation unit (1) is further configured to:
when the value of the generated parallelism is N, the M convolution calculation units are evenly configured into N calculation groups that perform convolution operation on N image channels to obtain N unit calculation results.
7. A parallel computing device according to claim 6, characterized in that said processing unit (2) comprises: a connector (21) and an adder (22);
the connector (21) combines the unit calculation results to obtain a channel calculation result;
and the adder (22) adds the channel calculation results to obtain a convolution calculation result.
8. A parallel computing apparatus according to claim 7, wherein the processing unit (2) is further configured to:
if N is smaller than M, the connector combines the unit calculation results to obtain a channel calculation result;
if N equals M, the cell computation result equals the channel computation result;
if N is greater than 1, the adder adds the channel calculation results.
9. A parallel computing device according to claim 8, wherein the parallel computing device is operable with a first degree of parallelism, a second degree of parallelism, a third degree of parallelism, and a fourth degree of parallelism simultaneously.
10. A parallel computing arrangement according to claim 9, further comprising a selector (4), wherein when the parallel computing arrangement is operating at four degrees of parallelism simultaneously, the selector (4) distinguishes convolution results at different degrees of parallelism, and inputs the convolution results at different degrees of parallelism to the accumulator (3) for accumulation.
CN202011059959.4A 2020-09-30 2020-09-30 Parallel computing method and device Active CN112132275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011059959.4A CN112132275B (en) 2020-09-30 2020-09-30 Parallel computing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011059959.4A CN112132275B (en) 2020-09-30 2020-09-30 Parallel computing method and device

Publications (2)

Publication Number Publication Date
CN112132275A true CN112132275A (en) 2020-12-25
CN112132275B CN112132275B (en) 2024-06-18

Family

ID=73843382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011059959.4A Active CN112132275B (en) 2020-09-30 2020-09-30 Parallel computing method and device

Country Status (1)

Country Link
CN (1) CN112132275B (en)

Citations (6)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
WO2019136751A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
CN110610227A (en) * 2018-06-15 2019-12-24 北京深鉴智能科技有限公司 Artificial neural network adjusting method and neural network computing platform
CN109409511A (en) * 2018-09-25 2019-03-01 西安交通大学 A kind of convolution algorithm data stream scheduling method for dynamic reconfigurable array
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN111416743A (en) * 2020-03-19 2020-07-14 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium

Also Published As

Publication number Publication date
CN112132275B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
CN108765247B (en) Image processing method, device, storage medium and equipment
CN109543830B (en) Splitting accumulator for convolutional neural network accelerator
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN112668708B (en) Convolution operation device for improving data utilization rate
EP3637327B1 (en) Computing device and method
CN112286864B (en) Sparse data processing method and system for accelerating operation of reconfigurable processor
CN112200300A (en) Convolutional neural network operation method and device
CN111931925B (en) Acceleration system of binary neural network based on FPGA
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN111240746A (en) Floating point data inverse quantization and quantization method and equipment
CN111709415B (en) Target detection method, device, computer equipment and storage medium
CN110019184B (en) Method for compressing and decompressing ordered integer array
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
CN109844774B (en) Parallel deconvolution computing method, single-engine computing method and related products
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN111008691A (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN112149047A (en) Data processing method and device, storage medium and electronic device
CN117725963A (en) Method, system and device for converting model reasoning calculation
CN115130672B (en) Software and hardware collaborative optimization convolutional neural network calculation method and device
CN112132275A (en) Parallel computing method and device
CN116167425A (en) Neural network acceleration method, device, equipment and medium
CN112765540A (en) Data processing method and device and related products
CN113407904B (en) Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN113159297A (en) Neural network compression method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant