CN112132275B - Parallel computing method and device - Google Patents

Parallel computing method and device

Info

Publication number
CN112132275B
CN112132275B (application CN202011059959.4A)
Authority
CN
China
Prior art keywords
parallelism
convolution
unit
image
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011059959.4A
Other languages
Chinese (zh)
Other versions
CN112132275A (en)
Inventor
王丹阳
林军
谢逍茹
陶为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fengxing Technology Co ltd
Original Assignee
Nanjing Fengxing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fengxing Technology Co ltd
Priority to CN202011059959.4A
Publication of CN112132275A
Application granted
Publication of CN112132275B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a parallel computing method and device for a sparse neural network processor. The parallel computing method comprises the following steps: a convolution computing unit obtains image data to be processed, the image data to be processed comprising an image channel and an image size; generates a parallelism according to the image channel and the image size; and performs convolution calculation on the image data to be processed according to the parallelism to obtain unit calculation results. A processing unit processes the unit calculation results to obtain convolution calculation results, and an accumulator accumulates the convolution calculation results. With the parallel computing method and device, one or more image channels can be processed simultaneously through parallel computing, which improves the utilization rate of the convolution computing units in the sparse neural network processor.

Description

Parallel computing method and device
Technical Field
The invention relates to the technical field of convolutional neural network acceleration, in particular to a parallel computing method and device for a sparse neural network.
Background
Convolutional neural networks (CNN, or deep convolutional neural networks, DCNN) are mainly used for image processing but can also be applied to other types of input, such as audio. A sparse neural network is a sparse convolutional neural network; it can convert samples into a suitable sparse representation, thereby simplifying the learning task and reducing the complexity of the model.
A traditional sparse neural network processor has a large network model and requires massive computation to complete its tasks; its essence is a convolution calculation method. The traditional convolution calculation method of a sparse neural network processor needs to store the weight data required by the convolution calculation online while a single image is being calculated, so the operation efficiency is low, and a convolution layer with many channels consumes a large share of the processor's storage resources.
The convolution calculation method of the existing sparse neural network processor therefore suffers from low operation efficiency and wasted storage resources, which causes a great waste of the processor's computing power. The present application is proposed on the basis of this application scenario.
Disclosure of Invention
In view of the above problems, the present application aims to provide a parallel computing method and device for improving the utilization rate of convolution computing units, so as to solve the technical problems in the prior art.
In a first aspect, an embodiment of the present application provides a parallel computing method for a sparse neural network processor, comprising the following steps:
a convolution computing unit obtains image data to be processed, the image data to be processed comprising an image channel and an image size; generates a parallelism according to the image channel and the image size; and performs convolution calculation on the image data according to the parallelism to obtain unit calculation results;
a processing unit processes the unit calculation results to obtain channel calculation results;
an accumulator accumulates the channel calculation results.
In a second aspect, an embodiment of the present application provides a parallel computing device for a sparse neural network processor, in which:
a convolution computing unit obtains image data to be processed, the image data to be processed comprising an image channel and an image size; generates a parallelism according to the image channel and the image size; and performs convolution calculation on the image data according to the parallelism to obtain unit calculation results;
a processing unit processes the unit calculation results to obtain channel calculation results;
and an accumulator accumulates the channel calculation results.
As can be seen from the above technical solutions, the present application provides a parallel computing method and device for a sparse neural network processor. The parallel computing method comprises the following steps: a convolution computing unit obtains image data to be processed, the image data to be processed comprising an image channel and an image size; generates a parallelism according to the image channel and the image size; the convolution computing unit performs convolution calculation on the image data according to the parallelism to obtain unit calculation results; a processing unit processes the unit calculation results to obtain channel calculation results; and an accumulator accumulates the channel calculation results. With the parallel computing method and device, one or more image channels can be processed simultaneously through parallel computing, which improves the utilization rate of the convolution computing units in the sparse neural network processor.
Drawings
For a clearer description of the technical solutions of the application, the drawings required by the embodiments are briefly described below. It is obvious that a person skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a diagram showing steps for implementing a parallel computing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a parallel computing method for generating a first parallelism according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a parallel computing method for generating a second parallelism according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a parallel computing method for generating a third parallelism according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a parallel computing method for generating a fourth parallelism according to an embodiment of the application;
FIG. 6 is a schematic diagram of a parallel computing device according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solution in the embodiments of the present application and make the above objects, features and advantages of the embodiments of the present application more comprehensible, the technical solution in the embodiments of the present application is described in further detail below with reference to the accompanying drawings. It will be apparent that the described exemplary embodiments are only some, but not all, embodiments of the application.
To facilitate an understanding of the application, some of the terms used in the embodiments of the application are first explained in order to facilitate an understanding by those skilled in the art.
(1) Convolution calculation;
The convolution calculation includes:
an input matrix comprising four dimensions: sample number, image height, image width, and image channel number;
An output matrix, comprising four dimensions: sample number, image height, image width and image channel number; during the calculation, the image height and image width of the output matrix change, and the number of image channels changes as well;
A convolution kernel (weight matrix), comprising four dimensions: convolution kernel height, convolution kernel width, number of input channels and number of output channels (number of convolution kernels); the meaning of the convolution kernel dimensions differs from that of the dimensions of the input matrix and the output matrix.
The number of input channels of the convolution kernel is determined by the number of channels of the input matrix; the number of channels of the output matrix is determined by the number of output channels of the convolution kernel.
For example, a convolution calculation with 128 input channels and 128 output channels and a convolution kernel size 3*3 is performed as follows:
(1) The convolution calculation unit carries out convolution kernel internal operation;
(2) Accumulation between input channels:
The convolution kernel internal operation values corresponding to the 128 input channels are added to obtain one output channel; the above operation is repeated 128 times to obtain 128 output channels, which completes the 128 x 128 x 3 x 3 convolution operation.
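As an illustration only, the two steps above can be sketched in NumPy-style Python; the array names and shapes are assumptions for this example, not taken from the patent:

    import numpy as np

    # Hypothetical data: 128 input channels, 128 output channels, 3*3 kernels,
    # and a single 3*3 input window per channel (one output pixel position).
    windows = np.random.rand(128, 3, 3)       # one 3*3 window per input channel
    kernels = np.random.rand(128, 128, 3, 3)  # [output ch, input ch, kh, kw]

    outputs = np.zeros(128)
    for oc in range(128):  # repeated 128 times -> 128 output channels
        # Step (1): convolution kernel internal operation for each input channel
        per_channel = [(windows[ic] * kernels[oc, ic]).sum() for ic in range(128)]
        # Step (2): accumulation between the 128 input channels
        outputs[oc] = sum(per_channel)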
Next, the application scenario of the present application is introduced. Convolutional neural networks (CNN, or deep convolutional neural networks, DCNN) are mainly used for image processing but can also be applied to other types of input, such as audio. A sparse neural network is a sparse convolutional neural network; it can convert samples into a suitable sparse representation, thereby simplifying the learning task and reducing the complexity of the model.
A traditional sparse neural network processor has a large network model and requires massive computation to complete its tasks; its essence is a convolution calculation method. The traditional convolution calculation method of a sparse neural network processor needs to store the weight data required by the convolution calculation online while a single image is being calculated, so the operation efficiency is low, and a convolution layer with many channels consumes a large share of the processor's storage resources.
When a sparse neural network processor is designed, the memory amount of its internal memory unit is determined and cannot be expanded afterwards. The memory capacity of the memory unit occupied by one image is calculated as: memory = image size x number of image channels. During the deep processing of an image, the image size keeps decreasing while the number of image channels keeps increasing; for example, an image of size 416 x 416 has 3 image channels in this application scenario, and after deep processing the image size becomes 13 x 13 while the number of image channels becomes 1024. When the image size is smaller than the maximum size that the hardware computing unit can process, the utilization rate of the convolution computing unit drops, causing a great waste of computing power.
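A quick check of the memory formula above, counting elements rather than bytes (the unit is an assumption made for this illustration):

    # memory = image size x number of image channels
    early = 416 * 416 * 3   # 519,168 elements before deep processing
    late = 13 * 13 * 1024   # 173,056 elements after deep processing
    print(early, late)      # the deep layer needs only a third of the storage
                            # that the 416 x 416 input required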
At present, when image recognition or structured video analysis is performed, a preprocessor in the computer processes the image to be processed into one or more images of a fixed size and then sends these fixed-size images to a convolutional neural network processor for processing. The purpose of the preprocessing is to make the apparent characteristics of each image (such as color distribution, overall brightness and size) as consistent as possible without changing the essential information carried by the image, so as to facilitate subsequent processing. In the existing convolution operation method, the convolution computing units (Convolutional Count Unit, CCU) that process fixed-size images work in a one-to-one mode: one convolution computing unit has the capability of processing one fixed-size image. However, as the image size decreases, a convolution computing unit can no longer be fully utilized, and its low utilization rate makes the operation efficiency of the processor low. At the same time, as the image size decreases, the storage space that the processor provides for a fixed-size image can no longer be fully utilized, and the low utilization rate of the storage space wastes the processor's storage resources. The convolution operation method of the conventional convolutional neural network processor therefore suffers from low operation efficiency and wasted storage resources, which causes a great waste of the processor's computing power. The present application is proposed on the basis of this application scenario.
Referring to fig. 1, fig. 1 shows a parallel computing method for a sparse neural network processor, comprising the steps of:
S1: a convolution computing unit acquires image data to be processed, the image data to be processed comprising an image channel and an image size; a parallelism is generated according to the image channel and the image size; and the convolution computing units are configured into computing groups according to the parallelism to perform convolution calculation on the image and output unit calculation results.
The generating parallelism according to the image size and the image channel comprises the following steps:
S11: generating preliminary parallelism according to the image channel;
In a feasible embodiment, when the acquired image channel is 1-256, a first parallelism is generated;
when the acquired image channel is 257-512, a second parallelism is generated;
when the acquired image channel is 513-1024, a third parallelism is generated;
when the acquired image channel exceeds 1024, a fourth parallelism is generated;
S12: adjusting the preliminary parallelism according to the image size to generate parallelism;
S121: when the image size is smaller than the image size supported by the preliminary parallelism,
the preliminary parallelism is adjusted to generate the parallelism;
S122: when the image size is greater than the image size supported by the preliminary parallelism, the image is split into image sizes that the preliminary parallelism can support,
and the preliminary parallelism is adjusted to generate the parallelism;
In a feasible embodiment, when the acquired image channel is 1-256 and the first parallelism is generated, the supported image size is 64 x 8;
when the acquired image size is (33-64) x 8, the parallelism is kept as the first parallelism;
when the acquired image size is (17-32) x 8, adjusting the parallelism to be the second parallelism;
when the acquired image size is (9-16) x 8, adjusting the parallelism to be a third parallelism;
and when the acquired image size is (1-8) x 8, adjusting the parallelism to be the fourth parallelism.
In a feasible embodiment, when the acquired image channel is 257-512 and the second parallelism is generated, the supported image size is 32×8;
When the acquired image size is (17-32) x 8, the parallelism is kept as the second parallelism;
when the acquired image size is (9-16) x 8, adjusting the parallelism to be a third parallelism;
and when the acquired image size is (1-8) x 8, adjusting the parallelism to be the fourth parallelism.
In a feasible embodiment, when the acquired image channels are 513-1024 and the third parallelism is generated, the supported image size is 16×8;
when the acquired image size is (9-16) x 8, the parallelism is kept as the third parallelism;
and when the acquired image size is (1-8) x 8, adjusting the parallelism to be the fourth parallelism.
If the acquired image size is larger than the maximum image size supported by the parallelism, splitting the image into the image size supported by the parallelism.
In a feasible embodiment, when the acquired image channel is 257-512, and the second parallelism is generated, the supporting image size is 32×8;
When the acquired image size is 128×128, the image to be processed is split into 64 images with the image size of 32×8 for processing because the image size is larger than the image size which can be supported by the parallelism.
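A minimal sketch of the selection logic above, assuming (from the described embodiments) that the second image dimension is fixed at 8 and only the first dimension drives the adjustment; the function names and the tiling shortcut are illustrative, not taken from the patent:

    def preliminary_parallelism(channels: int) -> int:
        # S11: preliminary parallelism from the image channel count
        if channels <= 256:
            return 1   # first parallelism
        if channels <= 512:
            return 2   # second parallelism
        if channels <= 1024:
            return 4   # third parallelism
        return 8       # fourth parallelism

    def adjust_parallelism(width: int, prelim: int) -> int:
        # Supported width halves as parallelism doubles: 64, 32, 16, 8
        supported = 64 // prelim
        if width > supported:
            width = supported  # S122: split the image into supported tiles first
        parallelism = prelim
        # S121: a smaller image raises the parallelism so all units stay busy
        while parallelism < 8 and width <= 64 // (2 * parallelism):
            parallelism *= 2
        return parallelism

For instance, 128 channels with a 32 x 8 image gives a preliminary parallelism of 1 adjusted up to 2, matching FIG. 3; a 128 x 128 image at 257-512 channels is first split into 32 x 8 tiles and keeps the second parallelism, matching the embodiment above.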
The parallelism includes: a first parallelism, the value of which is 1;
a second parallelism, the value of which is 2;
A third parallelism, the value of which is 4;
a fourth parallelism, the value of which is 8;
When the value of the generated parallelism is N, the M convolution computing units are evenly configured into N computing groups to perform convolution calculation on N image channels, thereby obtaining N unit calculation results.
In a feasible embodiment, m=8, and the number of convolution calculation units is 8;
The convolution calculation unit includes: the first calculation unit 11, the second calculation unit 12, the third calculation unit 13, the fourth calculation unit 14, the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17, and the eighth calculation unit 18.
When the value of the parallelism is 1, the 8 computing units are configured into 1 computing group to perform convolution calculation on 1 image channel, and 8 unit calculation results are output;
when the value of the parallelism is 2, the 8 computing units are evenly configured into 2 computing groups to perform convolution calculation on 2 image channels, that is, the first calculation unit 11, the second calculation unit 12, the third calculation unit 13 and the fourth calculation unit 14 are configured as 1 computing group, and the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17 and the eighth calculation unit 18 are configured as 1 computing group, each group outputting 4 unit calculation results;
when the value of the parallelism is 4, the 8 computing units are evenly configured into 4 computing groups to perform convolution calculation on 4 image channels, that is, the first calculation unit 11 and the fifth calculation unit 15 are configured as 1 computing group, the third calculation unit 13 and the seventh calculation unit 17 are configured as 1 computing group, the second calculation unit 12 and the sixth calculation unit 16 are configured as 1 computing group, and the fourth calculation unit 14 and the eighth calculation unit 18 are configured as 1 computing group, each group outputting 2 unit calculation results;
when the value of the parallelism is 8, the 8 computing units are evenly configured into 8 computing groups to perform convolution calculation on 8 image channels, each group outputting 1 unit calculation result.
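The four configurations for M = 8 can be tabulated directly; unit numbers 11-18 stand for the first through eighth calculation units, and the dictionary form below is only an illustration:

    # Computing groups per parallelism value N; each group handles one image
    # channel, and each unit in a group contributes one unit calculation result.
    GROUPS = {
        1: [[11, 12, 13, 14, 15, 16, 17, 18]],
        2: [[11, 12, 13, 14], [15, 16, 17, 18]],
        4: [[11, 15], [13, 17], [12, 16], [14, 18]],
        8: [[11], [12], [13], [14], [15], [16], [17], [18]],
    }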
S2: the processing unit processes the unit calculation result to obtain a channel calculation result;
The processing unit is further configured to:
S21: if N is smaller than M, the connector combines the unit calculation results to obtain channel calculation results; if N is equal to M, the unit calculation results are equal to the channel calculation results;
S22: if N is greater than 1, the adder adds the channel calculation results.
When N is smaller than M, multiple convolution computing units cooperatively process 1 image channel, and the connector combines their unit calculation results to obtain the channel calculation result.
In a feasible embodiment, when N = 1, the value of the parallelism is 1 and the 8 computing units cooperatively process 1 image channel; the connector combines the 8 unit calculation results and outputs 1 channel calculation result. The value of the parallelism is not within the second threshold range: the 1 computing group completes the convolution calculation of 1 image channel and outputs the 1 channel calculation result of the 1 image. Since there is only 1 channel calculation result, no adder is needed and the channel calculation result is equal to the convolution calculation result.
When N = 2, the value of the parallelism is 2 and 4 computing units cooperatively calculate 1 image channel; the connector combines 4 unit calculation results and outputs 2 channel calculation results. The value of the parallelism is within the second threshold range: the 2 computing groups complete the convolution calculation of 2 image channels and output the convolution calculation results of the 2 image channels, and the adder adds the 2 channel calculation results to obtain the convolution calculation result.
When N = 4, the value of the parallelism is 4 and 2 computing units cooperatively calculate 1 image channel; the connector combines 2 unit calculation results and outputs 4 channel calculation results. The value of the parallelism is within the second threshold range: the 4 computing groups complete the convolution calculation of 4 image channels and output the convolution calculation results of the 4 image channels, and the adder adds the 4 channel calculation results to obtain the convolution calculation result.
When N = 8, the value of the parallelism is 8 and is not within the first threshold range, so the unit calculation results of the 8 computing units are output directly and the unit calculation results are equal to the channel calculation results. The value of the parallelism is within the second threshold range: the 8 computing groups complete the convolution calculation of 8 image channels and output the channel calculation results of the 8 images, and the adder is invoked to add the 8 channel calculation results to obtain the convolution calculation result.
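Taken together, the connector and adder behaviour can be sketched as follows; modelling "combine" as concatenation of per-tile outputs is an assumption of this sketch, and unit_results is a hypothetical nested list shaped like the computing groups above:

    def process(unit_results, n: int, m: int = 8):
        # unit_results: N groups of per-unit outputs (each output a list).
        # Connector: for N < M the units of a group cooperated on one image
        # channel, so their tile outputs are combined into one channel result.
        if n < m:
            channel_results = [sum(group, []) for group in unit_results]
        else:  # N == M: each unit result already is a channel result
            channel_results = [group[0] for group in unit_results]
        if n == 1:
            return channel_results[0]  # only one channel result; no adder needed
        # Adder: the N channel results (one per image channel) are added
        # element-wise to form the convolution calculation result.
        return [sum(vals) for vals in zip(*channel_results)]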
S3: an accumulator accumulates the convolution calculation results.
In a feasible embodiment, a convolution calculation with 128 input channels, 128 output channels, and 3*3 convolution kernel size is shown;
Referring to FIG. 2, FIG. 2 shows a schematic diagram of the parallel computing method when the first parallelism is generated. In a feasible embodiment, the acquired image channel is 128, so the first parallelism is generated; the acquired image size is 64 x 8, so the parallelism is kept as the first parallelism. The value of the first parallelism is 1, so the 8 convolution computing units are configured as 1 computing group to perform convolution calculation on 1 image channel: the image to be processed, of size 64 x 8, is split into 8 images of size 8 x 8, and each convolution computing unit completes one convolution calculation of image size 8 x 8. Because the 8 convolution computing units cooperatively process one image, the connector combines the 8 unit calculation results to obtain 1 channel calculation result; since there is only one channel calculation result, the channel calculation result is equal to the convolution calculation result. The above operation is repeated 128 times to obtain 128 convolution calculation results, and the accumulator accumulates the convolution calculation results in the time direction.
Referring to FIG. 3, FIG. 3 shows a schematic diagram of the parallel computing method when the second parallelism is generated. In a feasible embodiment, the acquired image channel is 128, so the first parallelism is generated; the acquired image size is 32 x 8, so the parallelism is adjusted to the second parallelism. The value of the second parallelism is 2, so the 8 convolution computing units are configured as 2 computing groups to perform convolution calculation on 2 image channels: the image to be processed, of size 32 x 8, is split into 4 images of size 8 x 8, and each convolution computing unit completes one convolution calculation of image size 8 x 8. Because 4 convolution computing units cooperatively process one image, the connector combines the 4 unit calculation results to obtain 1 channel calculation result, and the adder adds the 2 channel calculation results to obtain the convolution calculation result. The above operation is repeated 64 times to obtain 64 convolution calculation results, and the accumulator accumulates the convolution calculation results in the time direction.
Referring to FIG. 4, FIG. 4 shows a schematic diagram of the parallel computing method when the third parallelism is generated. In a feasible embodiment, the acquired image channel is 128, so the first parallelism is generated; the acquired image size is 16 x 8, so the parallelism is adjusted to the third parallelism. The value of the third parallelism is 4, so the 8 convolution computing units are configured as 4 computing groups to perform convolution calculation on 4 image channels: the image to be processed, of size 16 x 8, is split into 2 images of size 8 x 8, and each convolution computing unit completes one convolution calculation of image size 8 x 8. Because 2 convolution computing units cooperatively process one image, the connector combines the 2 unit calculation results to obtain 1 channel calculation result, and the adder adds the 4 channel calculation results to obtain the convolution calculation result. The above operation is repeated 32 times to obtain 32 convolution calculation results, and the accumulator accumulates the convolution calculation results in the time direction.
Referring to FIG. 5, FIG. 5 shows a schematic diagram of the parallel computing method when the fourth parallelism is generated. In a feasible embodiment, the acquired image channel is 128, so the first parallelism is generated; the acquired image size is 8 x 8, so the parallelism is adjusted to the fourth parallelism. The value of the fourth parallelism is 8, so the 8 convolution computing units perform convolution calculation on 8 image channels; each convolution computing unit completes one convolution calculation of image size 8 x 8, yielding 8 channel calculation results, and the adder adds the 8 channel calculation results to obtain the convolution calculation result. The above operation is repeated 16 times to obtain 16 convolution calculation results, and the accumulator accumulates the convolution calculation results in the time direction.
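Across FIGS. 2-5, the repeat count for this 128-channel layer scales inversely with the parallelism; a trivial check of the bookkeeping:

    for n in (1, 2, 4, 8):
        print(n, 128 // n)  # 128, 64, 32 and 16 repetitions and results, respectively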
Referring to fig. 6, fig. 6 shows a parallel computing device for a sparse neural network processor, comprising: a convolution calculation unit 1, a processing unit 2 and an accumulator 3, the processing unit 2 comprising: a connector 21 and an adder 22.
The convolution computing unit 1 acquires image data to be processed, the image data to be processed comprising an image channel and an image size; generates a parallelism according to the image channel and the image size; and the convolution computing units are configured into computing groups according to the parallelism to perform convolution calculation on the image and output unit calculation results.
The generating parallelism according to the image channel and the image size comprises the following steps:
the convolution computing unit 1 generates preliminary parallelism according to the image channel;
In a feasible embodiment, when the acquired image channel is 1-256, a first parallelism is generated;
when the acquired image channel is 257-512, a second parallelism is generated;
when the acquired image channel is 513-1024, a third parallelism is generated;
when the acquired image channel exceeds 1024, a fourth parallelism is generated;
The convolution computing unit 1 adjusts the preliminary parallelism according to the image size to generate parallelism;
In a feasible embodiment, when the acquired image channel is 1-256 and the first parallelism is generated, the supported image size is 64 x 8;
when the acquired image size is (33-64) x 8, the parallelism is kept as the first parallelism;
when the acquired image size is (17-32) x 8, adjusting the parallelism to be the second parallelism;
when the acquired image size is (9-16) x 8, adjusting the parallelism to be a third parallelism;
and when the acquired image size is (1-8) x 8, adjusting the parallelism to be the fourth parallelism.
In a feasible embodiment, when the acquired image channel is 257-512 and the second parallelism is generated, the supported image size is 32×8;
When the acquired image size is (17-32) x 8, the parallelism is kept as the second parallelism;
when the acquired image size is (9-16) x 8, adjusting the parallelism to be a third parallelism;
and when the acquired image size is (1-8) x 8, adjusting the parallelism to be the fourth parallelism.
In a feasible embodiment, when the acquired image channels are 513-1024 and the third parallelism is generated, the supported image size is 16×8;
when the acquired image size is (9-16) x 8, the parallelism is kept as the third parallelism;
and when the acquired image size is (1-8) x 8, adjusting the parallelism to be the fourth parallelism.
In a feasible embodiment, when the acquired image channel is 257-512, and the second parallelism is generated, the supporting image size is 32×8;
When the acquired image size is 128×128, the image to be processed is split into 64 images with the image size of 32×8 for processing because the image size is larger than the image size which can be supported by the parallelism.
The parallelism includes: a first parallelism, the value of which is 1;
a second parallelism, the value of which is 2;
A third parallelism, the value of which is 4;
a fourth parallelism, the value of which is 8;
When the value of the generated parallelism is N, the M convolution computing units are evenly configured into N computing groups to perform convolution calculation on N image channels, thereby obtaining N unit calculation results.
In a feasible embodiment, m=8, and the number of convolution calculation units is 8;
When the value of the parallelism is 1, 8 computing units are configured into 1 computing group to carry out convolution computation on 1 image channel, and 8 unit computing results are output;
When the value of the parallelism is 2, the 8 computing units are evenly configured into 2 computing groups to perform convolution calculation on 2 image channels, that is, the first calculation unit 11, the second calculation unit 12, the third calculation unit 13 and the fourth calculation unit 14 are configured as 1 computing group, and the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17 and the eighth calculation unit 18 are configured as 1 computing group, each group outputting 4 unit calculation results;
when the value of the parallelism is 4, the 8 computing units are evenly configured into 4 computing groups to perform convolution calculation on 4 image channels, that is, the first calculation unit 11 and the fifth calculation unit 15 are configured as 1 computing group, the third calculation unit 13 and the seventh calculation unit 17 are configured as 1 computing group, the second calculation unit 12 and the sixth calculation unit 16 are configured as 1 computing group, and the fourth calculation unit 14 and the eighth calculation unit 18 are configured as 1 computing group, each group outputting 2 unit calculation results;
when the value of the parallelism is 8, the 8 computing units are evenly configured into 8 computing groups to perform convolution operation on the 8 image channels, each group outputting 1 unit calculation result.
The processing unit 2 processes the unit calculation result to obtain a channel calculation result; the processing unit 2 comprises a connector 21 and an adder 22;
the connector 21 combines the unit calculation results to obtain a channel calculation result;
the adder 22 adds the channel calculation results to obtain a convolution calculation result.
The processing unit 2 is further configured to:
If N is smaller than M, the connector 21 combines the unit calculation results to obtain a channel calculation result, and if N is equal to M, the unit calculation result is equal to the channel calculation result;
if N is greater than 1, the adder 22 adds the channel calculation results.
When N is smaller than M, the plurality of convolution computing units cooperatively process 1 image channel, and the connector 21 combines the unit computing results to obtain a channel computing result.
It should be noted that the parallel computing device of the present application can perform convolution operations under the first parallelism, the second parallelism, the third parallelism and the fourth parallelism, and a result at high parallelism can be obtained by accumulation on the basis of the results at low parallelism.
The convolution computing unit 1 includes: the first calculation unit 11, the second calculation unit 12, the third calculation unit 13, the fourth calculation unit 14, the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17, and the eighth calculation unit 18.
In a feasible embodiment, when N = 1, the value of the parallelism is 1 and the 8 computing units cooperatively process 1 image channel; the connector 21 combines the 8 unit calculation results and outputs 1 channel calculation result. The value of the parallelism is not within the second threshold range: the 1 computing group completes the convolution calculation of 1 channel and outputs the channel calculation result of 1 image. Since there is only 1 channel calculation result, the adder 22 is not invoked and the channel calculation result is equal to the convolution calculation result.
When N = 2, the value of the parallelism is 2; the first calculation unit 11, the second calculation unit 12, the third calculation unit 13 and the fourth calculation unit 14 are configured as 1 computing group, and the fifth calculation unit 15, the sixth calculation unit 16, the seventh calculation unit 17 and the eighth calculation unit 18 are configured as 1 computing group. 4 computing units cooperatively calculate 1 image channel; the connector 21 combines 4 unit calculation results and outputs 2 channel calculation results. The value of the parallelism is within the second threshold range: the 2 computing groups complete the convolution calculation of 2 channels and output the convolution results of 2 images, and the adder 22 adds the 2 channel calculation results. The specific process is as follows: the results of the first calculation unit 11 and the fifth calculation unit 15 are added, the results of the second calculation unit 12 and the sixth calculation unit 16 are added, the results of the third calculation unit 13 and the seventh calculation unit 17 are added, and the results of the fourth calculation unit 14 and the eighth calculation unit 18 are added, obtaining the convolution calculation result.
When N = 4, the value of the parallelism is 4; the first calculation unit 11 and the fifth calculation unit 15 are configured as 1 computing group, the third calculation unit 13 and the seventh calculation unit 17 are configured as 1 computing group, the second calculation unit 12 and the sixth calculation unit 16 are configured as 1 computing group, and the fourth calculation unit 14 and the eighth calculation unit 18 are configured as 1 computing group. 2 computing units cooperatively calculate 1 image channel, and the connector 21 combines 2 unit calculation results. The value of the parallelism is within the second threshold range: the 4 computing groups complete the convolution calculation of 4 channels and output the convolution calculation results of 4 images, and the adder 22 adds the 4 channel calculation results. The addition of the 4 channel calculation results is performed on the basis of the 2-channel addition results obtained at parallelism 2, that is, the sum of the first calculation unit 11 and the fifth calculation unit 15 is added to the sum of the third calculation unit 13 and the seventh calculation unit 17, and the sum of the second calculation unit 12 and the sixth calculation unit 16 is added to the sum of the fourth calculation unit 14 and the eighth calculation unit 18, obtaining the convolution calculation result.
When N = 8, the value of the parallelism is 8 and is not within the first threshold range, so the unit calculation results of the 8 computing units are output directly and the unit calculation results are equal to the channel calculation results. The value of the parallelism is within the second threshold range: the 8 computing groups complete the convolution calculation of 8 channels and output the channel calculation results of 8 images, and the adder 22 adds the 8 channel calculation results. The addition of the 8 channel calculation results is performed on the basis of the 4-channel addition results obtained at parallelism 4, that is, the partial sum from the first, fifth, third and seventh calculation units is added to the partial sum from the second, sixth, fourth and eighth calculation units, obtaining the convolution calculation result.
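The incremental additions described for N = 2, 4 and 8 form a binary adder tree over the eight unit results; a sketch of that structure, with indices 0-7 standing for units 11-18:

    def adder_tree(u):
        # Parallelism-2 stage: pairwise sums 11+15, 12+16, 13+17, 14+18
        s2 = [u[0] + u[4], u[1] + u[5], u[2] + u[6], u[3] + u[7]]
        # Parallelism-4 stage: built on the parallelism-2 results
        s4 = [s2[0] + s2[2], s2[1] + s2[3]]
        # Parallelism-8 stage: built on the parallelism-4 results
        return s4[0] + s4[1]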
A selector 4 is used to distinguish the convolution operation results under different parallelism conditions;
when the parallel device operates under the four parallelisms simultaneously, the selector 4 inputs the convolution operation results calculated under the different parallelism conditions into the accumulator 3 for accumulation.
The accumulator 3 accumulates the convolution calculation results. For a specific accumulation procedure, reference is made to the above embodiments.
The application discloses a parallel computing method and device for a sparse neural network processor (a non-sparse network is also supported). They change the original processing mode, in which 1 convolution computing unit processes 1 image channel, so that multiple convolution computing units process one or more image channels: in the parallel computing process, a larger picture size is processed at low parallelism, and a larger number of image channels is supported at high parallelism. For example, if one convolution computing unit can support a convolution image size of 8 x 8 and 256 image channels, the joint support characteristics of the 8 convolution computing units are as follows: when the parallelism is 1, a convolution image size of 64 x 8 and 256 image channels are supported, i.e. the 8 convolution computing units complete the convolution of 1 channel of size 64 x 8 and repeat it 256 times; when the parallelism is 2, a convolution image size of 32 x 8 and 512 input/output channels are supported, i.e. the 8 convolution computing units are divided into two groups, complete the convolution of 2 channels of size 32 x 8, and repeat it 512 times; when the parallelism is 4, a convolution image size of 16 x 8 and 1024 input/output channels are supported; and when the parallelism is 8, a convolution image size of 8 x 8 and 2048 input/output channels are supported. As the above examples show, when the number of processed images is the same, the parallel computing method can calculate images of larger size while occupying the same storage space, i.e. the same number of channels; and when the processed image sizes are the same, it can support a larger storage space, i.e. a larger number of channels.
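The joint support characteristics enumerated above follow one scaling rule, which the short loop below checks; it is an illustration only, not part of the patent:

    for p in (1, 2, 4, 8):
        width = 64 // p     # supported convolution image width (second dimension fixed at 8)
        channels = 256 * p  # supported input/output channels
        print(p, width, channels)  # (1, 64, 256), (2, 32, 512), (4, 16, 1024), (8, 8, 2048)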
Embodiments of the present application also provide a computer program product comprising one or more computer program instructions. When the computer program instructions are loaded and executed by a computer, the processes or functions according to the embodiments of the present application described above are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. When run on a computer, the computer program product causes the computer to perform the methods provided by the embodiments of the application.
The present embodiments also provide a computer-readable storage medium storing computer program instructions that, when executed, implement all the steps of the image processing method of the above embodiments of the present application. The computer-readable storage medium includes a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), and the like. The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof; when implemented in software, they may be realized in whole or in part in the form of a computer program product. It will also be appreciated by those of skill in the art that the various illustrative logical blocks and steps described herein may be implemented in electronic hardware, computer software, or combinations of both. Whether such functionality is implemented as hardware or software depends upon the particular application and the design requirements of the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementation should not be understood as going beyond the scope of the present application.
The various illustrative logical blocks and circuits described in this application may be implemented or performed with a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the general-purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration. The steps of a method or algorithm described in connection with the present application may be embodied directly in hardware, in a software element executed by a processor, or in a combination of the two. The software elements may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium; in the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a UE; in the alternative, the processor and the storage medium may reside in different components of a UE. It should be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the present application.
Furthermore, the terms first, second, third and the like in the description and in the claims and in the above drawings, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present application may be implemented in software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present application, or the parts contributing to the prior art, may be embodied essentially in the form of a software product; the software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the embodiments or of some parts of the embodiments of the present application. The same or similar parts among the various embodiments in this specification may refer to each other. In particular, for the network device/node or apparatus embodiments, since they are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
The above embodiments of the present application do not limit the scope of the present application.
It should be understood that the terms "first," "second," "third," and the like in the description, the claims and the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments described herein can also be implemented in orders other than those illustrated or described herein.
Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such that the inclusion of a list of elements is not necessarily limited to those elements expressly listed, but may include other elements not expressly listed or routinely used for such techniques.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (10)

1. A parallel computing method, which is suitable for a sparse neural network processor, comprising the following steps:
the convolution computing unit obtains image data to be processed, the image data to be processed comprising: an image channel and an image size; generates a preliminary parallelism according to the image channel, and adjusts the preliminary parallelism according to the size relation between the image size and the image size supported by the preliminary parallelism to generate a parallelism; and performs convolution calculation on the image data according to the parallelism to obtain unit calculation results;
The processing unit processes the unit calculation result to obtain a convolution calculation result;
an accumulator accumulates the convolution calculation results.
2. A parallel computing method according to claim 1, characterized in that:
when the value of the generated parallelism is N, M convolution computing units are evenly configured into N computing groups to perform convolution operation on N image channels, obtaining N unit calculation results.
3. A parallel computing method according to claim 2, characterized in that the processing unit comprises: a connector and an adder;
the connector combines the unit calculation results to obtain a channel calculation result;
And the adder adds the channel calculation results to obtain a convolution calculation result.
4. A parallel computing method according to claim 3, characterized in that:
if N is smaller than M, the connector combines the unit calculation results to obtain a channel calculation result;
if N is equal to M, the unit calculation result is equal to the channel calculation result;
If N is greater than 1, the adder adds the channel calculation results to obtain a convolution calculation result.
5. A parallel computing device for a sparse neural network processor, comprising:
A convolution computing unit (1) for acquiring image data to be processed, the image data to be processed comprising: an image channel and an image size; generating a preliminary parallelism according to the image channel, and adjusting the preliminary parallelism according to the size relation between the image size and the image size supported by the preliminary parallelism to generate a parallelism; and performing convolution calculation on the image data to be processed according to the parallelism to obtain unit calculation results;
A processing unit (2) for processing the unit calculation result to obtain a convolution calculation result;
and an accumulator (3) for accumulating the convolution calculation result.
6. A parallel computing device according to claim 5, characterized in that the convolution computing unit (1) is further configured to:
when the value of the generated parallelism is N, M convolution computing units are evenly configured into N computing groups to perform convolution operation on N image channels, obtaining N unit calculation results.
7. A parallel computing arrangement according to claim 6, characterized in that the processing unit (2) comprises: a connector (21) and an adder (22);
The connector (21) combines the unit calculation results to obtain a channel calculation result;
The adder (22) adds the channel calculation results to obtain a convolution calculation result.
8. A parallel computing arrangement according to claim 7, characterized in that the processing unit (2) is further configured to:
if N is smaller than M, the connector combines the unit calculation results to obtain a channel calculation result;
if N is equal to M, the unit calculation result is equal to the channel calculation result;
if N is greater than 1, the adder adds the channel calculation results.
9. The parallel computing device of claim 8, wherein the parallel computing device operates with a first degree of parallelism, a second degree of parallelism, a third degree of parallelism, and a fourth degree of parallelism at the same time.
10. A parallel computing device according to claim 9, further comprising a selector (4), wherein, when the parallel computing device operates at the four parallelisms simultaneously, the selector (4) distinguishes the convolution operation results under the different parallelism conditions and inputs the convolution operation results calculated under the different parallelism conditions to the accumulator (3) for accumulation.
CN202011059959.4A 2020-09-30 2020-09-30 Parallel computing method and device Active CN112132275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011059959.4A CN112132275B (en) 2020-09-30 2020-09-30 Parallel computing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011059959.4A CN112132275B (en) 2020-09-30 2020-09-30 Parallel computing method and device

Publications (2)

Publication Number Publication Date
CN112132275A CN112132275A (en) 2020-12-25
CN112132275B 2024-06-18

Family

ID=73843382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011059959.4A Active CN112132275B (en) 2020-09-30 2020-09-30 Parallel computing method and device

Country Status (1)

Country Link
CN (1) CN112132275B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN111416743A (en) * 2020-03-19 2020-07-14 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109416755B (en) * 2018-01-15 2021-11-23 深圳鲲云信息科技有限公司 Artificial intelligence parallel processing method and device, readable storage medium and terminal
CN110610227B (en) * 2018-06-15 2022-07-26 赛灵思电子科技(北京)有限公司 Artificial neural network adjusting method and neural network computing platform
CN109409511B (en) * 2018-09-25 2020-07-28 西安交通大学 Convolution operation data flow scheduling method for dynamic reconfigurable array
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN111416743A (en) * 2020-03-19 2020-07-14 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium

Also Published As

Publication number Publication date
CN112132275A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
US11262982B2 (en) Computation circuit including a plurality of processing elements coupled to a common accumulator, a computation device and a system including the same
Wang et al. A fast implementation of adaptive histogram equalization
US11734554B2 (en) Pooling processing method and system applied to convolutional neural network
CN112286864B (en) Sparse data processing method and system for accelerating operation of reconfigurable processor
CN111107274B (en) Image brightness statistical method and imaging device
CN111240746A (en) Floating point data inverse quantization and quantization method and equipment
CN111709415B (en) Target detection method, device, computer equipment and storage medium
CN110109646A (en) Data processing method, device and adder and multiplier and storage medium
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN110019184B (en) Method for compressing and decompressing ordered integer array
CN116194933A (en) Processing system, processing method, and processing program
CN109844774B (en) Parallel deconvolution computing method, single-engine computing method and related products
CN112149047A (en) Data processing method and device, storage medium and electronic device
CN112132275B (en) Parallel computing method and device
CN111738424B (en) Neural network processing method and device, electronic equipment and storage medium
CN111814972B (en) Neural network convolution operation acceleration method based on FPGA
CN115130672B (en) Software and hardware collaborative optimization convolutional neural network calculation method and device
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
CN116167425A (en) Neural network acceleration method, device, equipment and medium
CN116245765A (en) Image denoising method and system based on enhanced depth expansion convolutional neural network
CN110163793B (en) Convolution calculation acceleration method and device
CN114037054A (en) Data processing method, device, chip, equipment and medium
CN113159297A (en) Neural network compression method and device, computer equipment and storage medium
CN114780501A (en) Data processing method, electronic device and computer program product
CN111461144A (en) Method for accelerating convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant