CN111931927A - Method and device for reducing occupation of computing resources in NPU - Google Patents

Method and device for reducing occupation of computing resources in NPU

Info

Publication number
CN111931927A
Authority
CN
China
Prior art keywords
kernel
convolution
size
convolution kernel
npu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011114887.9A
Other languages
Chinese (zh)
Other versions
CN111931927B (en)
Inventor
戴舒诣
范名超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aojie Intelligent Technology Shanghai Co ltd
Original Assignee
Aojie Intelligent Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aojie Intelligent Technology Shanghai Co ltd filed Critical Aojie Intelligent Technology Shanghai Co ltd
Priority to CN202011114887.9A priority Critical patent/CN111931927B/en
Publication of CN111931927A publication Critical patent/CN111931927A/en
Application granted granted Critical
Publication of CN111931927B publication Critical patent/CN111931927B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a method and a device for reducing computing resource occupation in an NPU (neural network processing unit), where the NPU comprises a systolic computing array (PE array) module for convolution operation. First, the shared division coefficient kernel_size of average pooling is obtained; the kernel_size is then expanded into a first convolution kernel, and the first convolution kernel is input to the PE array module, through which the average pooling is realized. No independent module is set up for average pooling; instead, the operation is mapped onto the PE array module originally used mainly for convolution, i.e., converted into a convolution operation of the same type, so that the original PE array module can directly compute an arbitrary average pooling layer. This avoids the occupation of hardware computing resources by an independent average pooling module, relieves the shortage of hardware computing resources, and reduces the power consumption of the NPU.

Description

Method and device for reducing occupation of computing resources in NPU
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for reducing computing resource occupation in an NPU.
Background
Artificial Intelligence (AI) is a branch of computer science that studies the design principles and implementation methods of intelligent machines, giving machines the capabilities of perception, reasoning and decision making. With the rapid development of artificial intelligence technology, neural networks (e.g., deep neural networks, convolutional neural networks, etc.) have in recent years achieved great success in processing and analyzing various kinds of media information such as images, videos and voice.
With the development of neural networks, technicians have begun to design hardware devices, such as dedicated processors, for neural networks in order to obtain better performance. An NPU accelerator is a dedicated processor capable of supporting multiple neural networks; it is optimized for AI algorithms, has acceleration units dedicated to them, and implements the fast calculation of certain specific formulas of AI algorithms at the hardware level. In the NPU, the PE array (Processing Element array) module is the key module implementing the core convolution operations of a convolutional neural network; specifically, it implements the dot-multiplication and addition of an image and a kernel. The PE array occupies a large amount of computing resources in the hardware FPGA/chip, and performs parallel operations while working so as to achieve efficient computation.
However, in the process of implementing the scheme of the present application, the inventor found that, besides the PE array module for core convolution calculation, which occupies most of the hardware computing resources, the module implementing the average pooling operation of a pooling layer also needs considerable hardware computing resources, which easily causes a shortage of FPGA hardware computing resources and an increase in power consumption.
In the prior art, when facing a shortage of hardware resources, optimization is often performed inside the independent average pooling module, saving computing resources by adjusting the segmented calculation flow. For example, in one prior art, in view of the hardware resource problem, the computing unit that processes one path of data in the average pooling module may be reduced to an adder and a comparator to lower the hardware overhead; however, since at least N paths of data need to be processed in parallel, the hardware actually needs N computing units, and the computing resources occupied by the average pooling module remain large. In another prior art, although the overall flow of the accelerator is adjusted, an independent average pooling module is still used; the remaining hardware computing resources of the convolution module are still consumed or contended for, and the overall power consumption increases while the accelerator operates.
Disclosure of Invention
The application provides a method and a device for reducing the occupation of computing resources in an NPU (neural network processing unit), so as to relieve the shortage of hardware computing resources in the NPU.
According to a first aspect of embodiments of the present application, there is provided a method for reducing computing resource occupation in an NPU, the method being used for a neural network processing unit (NPU), wherein the NPU includes a systolic computing array (PE array) module for convolution operation;
the method comprises the following steps:
obtaining the shared division coefficient kernel_size of average pooling, wherein the kernel_size is the size of the pooling kernel used for the average pooling;
expanding the kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution calculation;
inputting the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
Optionally, expanding the kernel_size into a first convolution kernel includes:
acquiring the size of the pooling kernel and the reciprocal of the kernel_size;
generating the first convolution kernel, wherein the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
Optionally:
the PE array module consists of a plurality of data computing blocks tile, and each tile consists of a plurality of data computing slices slice; when the PE array module works, the input feature graph is divided into a plurality of copies with the same quantity as the tiles, and the tiles are respectively input to carry out convolution operation independently; slice pulses in each column of the PE array module share the same weight, slice pulses in each tile share the same image, and the convolution result is output in a pulsating mode.
Optionally:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; when convolution operation is carried out, each multiplier is used for carrying out point multiplication on the input divided two-dimensional images corresponding to different channels and the weights respectively, obtaining multi-channel sum through the addition tree, and sending the sum to the accumulator for accumulation; when the average pooling is carried out, only one multiplier in the plurality of parallel multipliers is used for carrying out dot multiplication on the multiplication factor w in the first convolution kernel and the image pixel point corresponding to the multiplication factor w, and the dot multiplication result is directly input into the accumulator so as to accumulate all dot multiplication results on the first convolution kernel plane.
Optionally, when the average pooling is a global average pooling, the method further includes:
dividing the input feature map into as many parts as there are tiles, one part per tile;
dividing the input first convolution kernel correspondingly into parts of the same size as the divided input feature map; each tile performs average pooling on its divided image, and the divided pooling results output by the slices of each column of the PE array are summed.
Optionally, before obtaining the shared division coefficient kernel_size of average pooling, the method further includes:
when the running layer of the NPU is a convolution layer, directly transmitting the convolution kernel output by the weight cache to the PE array module to realize convolution operation;
when the running layer of the NPU is an average pooling layer, performing the step of obtaining the shared division coefficient kernel_size of average pooling and the steps thereafter.
According to a second aspect of the embodiments of the present application, there is provided an apparatus for reducing computing resource occupation in an NPU, the apparatus being used for a neural network processing unit (NPU), wherein the NPU includes a systolic computing array (PE array) module for convolution operation;
the device comprises:
a convolution kernel obtaining unit, configured to obtain the shared division coefficient kernel_size of average pooling, wherein the kernel_size is the size of the pooling kernel used for the average pooling, and to expand the kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution calculation;
a convolution kernel input unit, configured to input the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
Optionally, when expanding the kernel_size into the first convolution kernel, the convolution kernel obtaining unit is specifically configured to:
acquire the size of the pooling kernel and the reciprocal of the kernel_size;
generate the first convolution kernel, wherein the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
Optionally:
the PE array module consists of a plurality of data computing blocks tile, and each tile consists of a plurality of data computing slices slice; when the PE array module works, the input feature graph is divided into a plurality of copies with the same quantity as the tiles, and the tiles are respectively input to carry out convolution operation independently; slice pulses in each column of the PE array module share the same weight, slice pulses in each tile share the same image, and the convolution result is output in a pulsating mode.
Optionally:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; when convolution operation is carried out, each multiplier is used for carrying out point multiplication on the input divided two-dimensional images corresponding to different channels and the weights respectively, obtaining multi-channel sum through the addition tree, and sending the sum to the accumulator for accumulation; when the average pooling is carried out, only one multiplier in the plurality of parallel multipliers is used for carrying out dot multiplication on the multiplication factor w in the first convolution kernel and the image pixel point corresponding to the multiplication factor w, and the dot multiplication result is directly input into the accumulator so as to accumulate all dot multiplication results on the first convolution kernel plane.
Optionally, the apparatus further comprises:
a global average pooling acceleration unit, configured to, when the average pooling is global average pooling, divide the input feature map into as many parts as there are tiles, one part per tile, divide the input first convolution kernel correspondingly into parts of the same size as the divided input feature map, have each tile perform average pooling on its divided image, and sum the divided pooling results output by the slices of each column of the PE array.
Optionally, the apparatus further comprises:
a weight selection unit, configured to, before the convolution kernel obtaining unit is triggered, directly transmit the convolution kernel output by the weight cache to the PE array module to realize convolution operation when the running layer of the NPU is a convolution layer, and to trigger the convolution kernel obtaining unit and the convolution kernel input unit when the running layer of the NPU is an average pooling layer.
According to a third aspect of the embodiments of the present application, there is provided a neural network processing unit (NPU), where the NPU includes any one of the above-mentioned devices for reducing the occupation of computing resources in the NPU and the systolic computing array (PE array) module for convolution operation.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the application provides a method for reducing the occupation of computing resources in an NPU, wherein the NPU comprises a ripple computing array PE array module for convolution operation. Firstly, obtaining an average pooled shared division coefficient kernel _ size, then expanding the kernel _ size into a first convolution kernel, namely a convolution kernel for deep convolution calculation, and inputting the first convolution kernel into the PE array module so as to realize the average pooling through the PE array module. Compared with the prior art in which a neural network accelerator can use extra computing resources for the average power, and set independent modules, an independent module is not set for the average power in the scheme of the application, but an original PE array module for convolution operation in an NPU is utilized, and on the basis that the module can process convolution operation, the average power is mapped to the PE array module which is originally mainly used for realizing convolution operation under the condition that the output efficiency of the average power in the NPU is not reduced, that is, the average power is converted into one type of convolution operation, so that the original PE array module can directly realize arbitrary average power layer calculation, occupation of hardware computing resources by an independent average pooling module is avoided, the condition of shortage of hardware computing resources is relieved, and the power consumption of the NPU is also reduced.
In addition, aiming at the special scene of global average potential in the average potential, the method and the device make some adjustments on the basis of the original PE array, sum the parallel pooling results of the segmented images, namely, after the input image is segmented, multiply-add operation is performed on each part of the images in parallel, and the data pulsation output characteristic of the PE array is utilized, and only one accumulator is used for each column, so that the higher-level parallel operation is realized, and the operational efficiency of the global average potential is improved under the condition of reducing the resource power consumption to the maximum extent.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can obtain other drawings from them without inventive effort. Furthermore, these descriptions should not be construed as limiting the embodiments; elements with the same reference number designation are identified as similar elements throughout the figures, and the drawings are not to scale unless otherwise specified.
FIG. 1 is a schematic diagram of an average pooling calculation process;
FIG. 2 is a schematic diagram of a depth convolution calculation process;
fig. 3 is a schematic flow chart of a method for reducing occupation of computing resources in an NPU according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a PE array in an embodiment of the present application;
FIG. 5 is a schematic diagram of a slice used in convolution operation in the embodiment of the present application;
FIG. 6 is a schematic diagram of a convolution kernel extended according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a weight selector in an embodiment of the present application;
FIG. 8 is a schematic diagram of a slice used in average pooling in an embodiment of the present application;
FIG. 9 is a schematic diagram of a global average pooling calculation process;
FIG. 10 is a diagram illustrating PE array applied to global average pooling in an embodiment of the present application;
fig. 11 is a schematic diagram of an apparatus for reducing occupation of computing resources in an NPU according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described in detail below with reference to the drawings in the embodiments of the present application. When referring to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise specified. It should be apparent that the described examples are only a part of the examples of the present application and not all of them; the embodiments described in the following exemplary examples do not represent all embodiments consistent with the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and the like in the description, claims, and drawings of the embodiments of the present application are used for distinguishing between different objects and not for limiting a particular order. In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," should not be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
The embodiment of the application can be applied to many fields in artificial intelligence, such as intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe cities and other fields.
In particular, the embodiment of the application can be applied to the fields of image classification, image retrieval, image semantic segmentation, image super-resolution, natural language processing and the like which need to use a (deep) neural network.
For the convenience of understanding, the following briefly introduces terms and concepts related to deep neural networks, convolution kernels, pooling layers, and the like:
A Deep Neural Network (DNN), also called a multi-layer neural network, may be understood as a neural network with multiple hidden layers. According to the positions of the layers, the layers of a DNN can be divided into three categories: input layer, hidden layers and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is connected with any neuron of the (i+1)-th layer. In a deep neural network, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices; its final purpose is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network may consist of many neural network layers: layers of several different types, usually convolutional, activation, pooling and fully-connected layers, alternate, the depth of each filter in the network increases from left to right, and the output part usually consists of one or more fully-connected layers. The convolutional neural network contains a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer that performs convolution processing on the input signal in a convolutional neural network. In a convolutional layer, one neuron may be connected to only a portion of the neurons of the neighboring layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of a number of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights may be understood as the way image information is extracted being independent of location. The convolution kernel can be initialized in the form of a matrix of random size, and reasonable weights can be learned during the training of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing the connections between layers of the convolutional neural network, while also reducing the risk of overfitting.
Convolutional layers:
it may comprise a number of convolution operators, also called kernels, which in image processing act as a filter to extract specific information from the input image matrix, which may be essentially a weight matrix, usually predefined, whose size should be related to the size of the image. Different weight matrices may be used to extract different features in the image, e.g., one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, etc. The plurality of weight matrices have the same size (row × column), the sizes of the convolution feature maps extracted by the plurality of weight matrices having the same size are also the same, and the extracted plurality of convolution feature maps having the same size are combined to form the output of the convolution operation. The weight values in the weight matrixes need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used for extracting information from an input image, so that the convolutional neural network can carry out correct prediction.
A pooling layer:
since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after a convolutional layer, either one convolutional layer followed by one pooling layer or multiple convolutional layers followed by one or more pooling layers. The role of pooling is embodied in down-sampling: the significant features are reserved, the feature dimension is reduced, and the receptive field of the kernel is increased. In the image processing process, the purpose of the pooling layer is to reduce the space size of the image so as to simplify the network computation complexity and avoid the occurrence of overfitting to a certain extent, and to perform feature compression and extract main features. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. An average pooling (averaging) operator may calculate pixel values in an image over a certain range to generate an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
A neural network layer:
after the convolutional layer/pooling layer processing, the convolutional neural network is not enough to output the required output information. Since, as mentioned above, the convolutional/pooling layers only extract features and reduce the parameters introduced by the input image. However, in order to generate the final output information (required class information or other relevant information), the convolutional neural network needs to generate one or a set of the required number of classes of outputs using the neural network layer. Therefore, the neural network layer may include a plurality of hidden layers and an output layer, and parameters included in the plurality of hidden layers may be obtained by pre-training according to training data related to a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
Referring to fig. 1 and 2, average pooling moves a kernel equal to or smaller than the feature map size over the 2D image (two-dimensional image) of each channel of the input feature map, sums the points of the feature map covered by the kernel frame (e.g., a, b, c, d in fig. 1), and divides the sum by the number of points covered by the kernel to obtain the average value of the pixels within the kernel frame.
Depthwise convolution first performs the corresponding dot multiplication between the pixel points of the 2D image of each channel of the input feature map (e.g., a, b, c, d in fig. 2) and the multiplication coefficients of the kernel (e.g., w0, w1, w2, w3 in fig. 2), and then performs the accumulation.
The inventor found, in the process of implementing the scheme of the application, that average pooling and depthwise convolution actually require essentially the same calculation operations algorithmically; only the order of the steps differs. From the transformation of the formulas shown in fig. 1 and fig. 2, it can be seen that the division coefficient kernel_size used when averaging in average pooling (i.e., the shared division coefficient) can be converted into the multiplication coefficients (w0, w1, w2, w3) of the kernel: the dot multiplication of the multiplication factors in the kernel with the corresponding pixel points of the feature map is performed first, and the summation of the dot-multiplication results afterwards. Therefore, the average pooling operation, which originally needed to be supported by an independent hardware module, can be converted into a depthwise convolution calculation already supported by the PE, relying only on the original PE array (systolic computing array) module, with no independent average pooling module needed, thereby avoiding the consumption of extra hardware computing resources.
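As an illustrative aside (not part of the patent disclosure; Python/NumPy, the 4x4 image, the 2x2 pooling kernel and stride 2 are all assumptions made here for demonstration), the following sketch checks this equivalence numerically: average pooling matches a depthwise convolution whose kernel is filled with the constant factor 1/kernel_size.

```python
import numpy as np

def average_pooling_2x2(image):
    """Reference average pooling: sum the points inside each 2x2 window,
    then divide by kernel_size (here 2*2 = 4)."""
    h, w = image.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = image[i:i+2, j:j+2].sum() / 4.0
    return out

def depthwise_conv_2x2(image, kernel):
    """Depthwise convolution with stride 2: dot-multiply each 2x2 window
    with the kernel first, then accumulate the products."""
    h, w = image.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = (image[i:i+2, j:j+2] * kernel).sum()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.full((2, 2), 1.0 / 4.0)  # every factor w is 1/kernel_size
assert np.allclose(average_pooling_2x2(image), depthwise_conv_2x2(image, kernel))
```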
Fig. 3 is a schematic flowchart of a method for reducing occupation of computing resources in an NPU according to an embodiment of the present application.
The method can be used for a neural network processing unit (NPU), where the NPU comprises a systolic computing array (PE array) module used for convolution operation.
The following first describes the PE array module in this embodiment. In this embodiment or some other embodiments of the present application, the PE array module may be composed of a plurality of data computing blocks, tiles (the number of tiles may be denoted tile_num), where each tile is composed of a plurality of data computing slices; when the PE array module works, the input feature map is divided into as many parts as there are tiles (i.e., tile_num parts), and each part is input to a tile to perform convolution operation independently; the slices in each column of the PE array module systolically share the same weight, the slices in each tile systolically share the same image, and the convolution result is output in a pulsating manner.
For example, see fig. 4; fig. 4 is a schematic diagram of the PE array in the embodiment of the present application. The PE array is a systolic-matrix hardware computation module designed for efficient pipelined computation of general 3D convolution; fig. 4 shows its general construction and data flow. The PE array is composed of multiple tiles (i.e., PE tiles); 4 PE tiles are shown in fig. 4. When the NPU performs convolution operation, the input feature map is divided into tile_num parts (Data0, Data1 and so on in fig. 4), and each part is input to a tile to perform convolution operation independently. Each tile is composed of a plurality of slices (i.e., PE slices); the slices in each column systolically share the same weight (each weight is a kernel, i.e., an independent 3D convolution kernel; when average pooling is performed, the slices in each tile systolically share the same image), and the convolution result is output in a pulsating manner, realizing efficient parallel computation on pulsating data. For example, in fig. 4, the PE slice0 of each of PE tiles 0-3 constitutes a column and shares the same weight 0.
In FIG. 4, weight0 to weight15 are hardware buffers used to store different 3D kernels; their number is independent of the number of input channels and is determined by the hardware resources expected to be consumed. In practice it represents the number of output image channels processed at once in the hardware. The first convolution kernel is stored in this buffer; one convolution layer has N 3D kernels, N corresponding to the number of output image channels, and the weight buffers are used to store these different 3D kernels.
It should be noted that the specific implementation of the PE array module, such as the number of tiles and slices, is not limited by this embodiment; those skilled in the art may select and design it according to different requirements/scenarios, and these choices and designs may be used herein without departing from the spirit and scope of the present application.
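Purely for illustration, a behavioral Python/NumPy sketch of this data partitioning is given below; the 4-tile geometry, the helper names (split_feature_map, conv3d_valid, pe_array) and the height-wise split are assumptions of this sketch, which models the logical division and the column-shared kernels rather than the cycle-level systolic timing.

```python
import numpy as np

TILE_NUM = 4  # PE tiles, as drawn in fig. 4 (illustrative assumption)

def split_feature_map(feature_map, tile_num=TILE_NUM):
    """Divide the input feature map (channels, height, width) into tile_num
    parts along the height; each part (Data0, Data1, ... in fig. 4) feeds
    one PE tile, which convolves it independently."""
    return np.array_split(feature_map, tile_num, axis=1)

def conv3d_valid(image, kernel):
    """Dot-multiply each 3D window of an image part with one 3D kernel and
    accumulate (stride 1, no padding): the work of one slice over time."""
    c, h, w = image.shape
    kc, kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[:, i:i + kh, j:j + kw] * kernel).sum()
    return out

def pe_array(feature_map, weights):
    """Tile t processes part t; column j of every tile shares weights[j],
    one independent 3D convolution kernel per output channel."""
    parts = split_feature_map(feature_map)
    return [[conv3d_valid(part, w) for w in weights] for part in parts]

x = np.random.rand(3, 16, 8)                            # 3-channel feature map
kernels = [np.random.rand(3, 3, 3) for _ in range(2)]   # two weight buffers
results = pe_array(x, kernels)                          # results[tile][column]
```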
The following describes the data computing slice in the PE array module of this embodiment. As an example, see fig. 5; fig. 5 is a schematic diagram of a slice used in convolution operation in this embodiment of the present application. In this embodiment or some other embodiments of the present application, each slice includes an addition tree, an accumulator, and a plurality of parallel multipliers. When convolution operation is performed, each multiplier dot-multiplies the divided two-dimensional image (2D image) of a different input channel with the corresponding weight, the multi-channel sum is obtained through the addition tree, and the sum is sent to the accumulator for accumulation. When average pooling is performed, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel point corresponding to it, and the dot-multiplication result is input directly into the accumulator so as to accumulate all dot-multiplication results over the first convolution kernel plane.
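As a rough functional model only (the function names, the 2x2 window and the sample values are assumptions; the hardware pipelining is not modeled), the two modes of a slice can be sketched as follows:

```python
import numpy as np

def slice_convolution_step(pixels, weights):
    """Convolution mode: the parallel multipliers dot-multiply the pixels of
    the different input channels with their weights, and the addition tree
    sums the products into one multi-channel partial sum."""
    return float(np.dot(pixels, weights))

def slice_pooling_step(pixel, w):
    """Average-pooling mode: only one multiplier is active; its product
    bypasses the addition tree and goes straight to the accumulator."""
    return pixel * w

# The accumulator, common to both modes, adds partial sums until the whole
# kernel plane is covered (here: a 2x2 pooling window, w = 1/kernel_size = 1/4).
acc = 0.0
for pixel in (1.0, 2.0, 3.0, 4.0):   # a, b, c, d under the pooling kernel
    acc += slice_pooling_step(pixel, 0.25)
print(acc)  # 2.5 == (1 + 2 + 3 + 4) / 4, the average of the window
```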
The method in this embodiment may include the steps of:
s301, obtaining an average pooled shared division coefficient kernel _ size, where the kernel _ size is a size of a pooled kernel used for the average pooling.
The pooling core is a kernel used for average pooling, which is essentially a weight matrix, so the size of the pooling core is the number of rows x the number of columns of the matrix.
S302, expanding the kernel_size into a first convolution kernel, where the first convolution kernel is a convolution kernel used for depthwise convolution calculation.
By observing fig. 1 and fig. 2, the inventors found that average pooling and depthwise convolution actually need to perform substantially the same computational operations algorithmically, only in a different order, so the conversion can be made.
In this embodiment or some other embodiments of the present application, expanding the kernel_size into the first convolution kernel may specifically include:
acquiring the size of the pooling kernel and the reciprocal of the kernel_size;
generating the first convolution kernel, where the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
For example, see fig. 6, where w in fig. 6 is the reciprocal of kernel_size; this scalar can be expanded into a matrix, resulting in a convolution kernel.
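A one-function sketch of this expansion (the helper name expand_to_first_kernel is invented here for illustration; the sketch assumes, as stated above, that kernel_size is the row count times the column count of the pooling kernel):

```python
import numpy as np

def expand_to_first_kernel(pool_rows, pool_cols):
    """Expand the shared division coefficient into the first convolution
    kernel: a matrix of the same size as the pooling kernel in which every
    multiplication factor w equals the reciprocal of kernel_size (fig. 6)."""
    kernel_size = pool_rows * pool_cols   # rows x columns of the pooling kernel
    w = 1.0 / kernel_size
    return np.full((pool_rows, pool_cols), w)

print(expand_to_first_kernel(3, 3))   # a 3x3 kernel, every entry 1/9
```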
S303, inputting the first convolution kernel into the PE array module to realize the average pooling through the PE array module.
Because the PE array module is a module for convolution operation, after converting pooling into convolution, the PE array module can be used to implement pooling operation.
Compared with the prior art, in which a neural network accelerator uses extra computing resources for average pooling and sets up an independent module, the scheme of the application does not set up an independent module for average pooling. Instead, the original PE array module for convolution operations in the NPU is utilized: on the basis that this module can already process convolution operations, and without reducing the output efficiency of average pooling in the NPU, average pooling is mapped onto the PE array module originally used mainly for convolution, that is, converted into one type of convolution operation, so that the original PE array module can directly compute an arbitrary average pooling layer. This avoids the occupation of hardware computing resources by an independent average pooling module, relieves the shortage of hardware computing resources, and also reduces the power consumption of the NPU.
In this embodiment or some other embodiments of the present application, in order not to affect the original function of the PE array, a pre-judgment step may be added: when the average pooling operation needs to be performed, the method is executed; when an ordinary convolution operation needs to be performed, the original flow is followed.
In other words, before obtaining the shared division coefficient kernel_size of average pooling, the method may further include:
when the running layer of the NPU is a convolution layer, directly transmitting the convolution kernel output by the weight cache to the PE array module to realize convolution operation;
when the running layer of the NPU is an average pooling layer, performing the step of obtaining the shared division coefficient kernel_size of average pooling and the steps thereafter.
The present application is further illustrated below with reference to examples:
for PE array, in order to make the averaging discharging mapping operable on PE array to output correct result and reduce the storage time of NPU data and memory power consumption of this layer, in a specific implementation, a weight mux (weight selector) module may be added to a portion of PE array module where weight is input, as shown in fig. 7 for example.
When the NPU runs a normal convolution layer, the weight mux module directly outputs the kernel output by the weight buffer to the PE array. When the NPU runs an average pooling layer, the system inputs the multiplication factor w of the kernel, obtained from kernel_size, as a parameter (weight_avg_scalar) to the weight mux module; the weight mux module expands this parameter into a kernel structure in depthwise convolution form and then inputs it into the PE array for convolution operation, so that the average pooling layer is realized without adjusting the PE array structure. In other words, no weights are cached in bulk in the weight buffer for the pooling layer; only a register of a few bits is needed to store weight_avg_scalar, which is then expanded in real time by the weight mux and fed into the PE.
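A behavioral sketch of this selection logic (the function name weight_mux mirrors the module name in fig. 7; the string layer tags and the Python/NumPy modeling are assumptions of this sketch, not the hardware interface):

```python
import numpy as np

def weight_mux(layer_type, weight_buffer_kernel, weight_avg_scalar, pool_shape):
    """Pass the weight-buffer kernel straight through for a normal
    convolution layer; for an average pooling layer, expand the few-bit
    scalar weight_avg_scalar in real time into a depthwise-style kernel,
    so no weights need to be cached in the weight buffer."""
    if layer_type == "conv":
        return weight_buffer_kernel
    if layer_type == "avg_pool":
        return np.full(pool_shape, weight_avg_scalar)
    raise ValueError(f"unknown layer type: {layer_type}")

# Average pooling layer with a 2x2 pooling kernel: only w = 1/kernel_size
# is stored, and the full kernel is generated on the fly.
kernel = weight_mux("avg_pool", None, 1.0 / 4.0, (2, 2))
```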
For the slice, fig. 5 has already shown a schematic diagram for convolution operation. When the calculation of the average pooling layer is implemented, as shown for example in fig. 8, since the calculation is converted into depthwise convolution and the convolution of the image with the kernel is performed relatively independently for each channel, only one of the plurality of parallel multipliers in the PE slice is used to perform the corresponding dot multiplication of the input image and kernel, which saves power; this realizes the dot multiplication of the kernel multiplication factor w with the image pixel point corresponding to it (for example, a*w0, b*w1 and c*w2 in fig. 2). In fig. 8, a selector (mux) is added to the original slice. When convolution operation is performed, the mux sends the sum obtained by the addition tree to the accumulator for accumulation; when average pooling is performed, the mux inputs the dot product of the single multiplier into the accumulator, and all dot products on the kernel plane are accumulated to obtain the average pooling result a*w0 + b*w1 + c*w2 + d*w3 shown in fig. 2.
In addition, average pooling has a special case: the global average pooling layer, whose kernel is the same size as the image, as shown in fig. 9; that is, the pixel values of the whole image in the 2D plane are averaged. The inventor found, in the process of implementing the scheme of the present application, that when the picture size is large, a long serial accumulation calculation is required whether the pooling layer is implemented with the PE array or in another independent pooling module.
For this particular case, in this embodiment or some other embodiments of the present application, when the average pooling is global average pooling, the method may further include:
dividing the input feature map into as many parts as there are tiles, one part per tile;
dividing the input first convolution kernel correspondingly into parts of the same size as the divided input feature map; each tile performs average pooling on its divided image, and the divided pooling results output by the slices of each column of the PE array are summed.
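A minimal sketch of this split for one 2D plane (the height-wise division into 4 parts, the helper name and the NumPy modeling are illustrative assumptions, not the patent's hardware):

```python
import numpy as np

def global_avg_pool_split(plane, tile_num=4):
    """Global average pooling of one 2D plane, split across tiles: the image
    is divided into tile_num parts, every kernel part carries the same factor
    1/kernel_size, each tile multiply-adds its own part in parallel, and the
    per-tile partial results are summed at the end."""
    h, w = plane.shape
    factor = 1.0 / (h * w)                          # kernel_size = H * W
    parts = np.array_split(plane, tile_num, axis=0)
    partials = [(p * factor).sum() for p in parts]  # parallel per-tile pooling
    return sum(partials)

x = np.arange(64, dtype=float).reshape(8, 8)
assert np.isclose(global_avg_pool_split(x), x.mean())
```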
Considering that the kernel realized by mapping to depthwise convolution has the same size as the feature map and identical factors, and that the accumulator in the PE array must accumulate many more times during global pooling, in order to improve efficiency the image and the kernel can be correspondingly divided into as many parts as there are tiles, and parallel multiply-add calculation performed on the different parts of the input feature map.
As an example, in a specific implementation, a slice_column_acc module may be added to the PE array to sum the accumulation results of the tiles, so as to improve the calculation efficiency of the PE array module for global pooling.
For example, as shown in fig. 10, a newly added slice_column_acc module is connected to the output of the PE slices of each column of the PE array (one column is shown by a dashed box in the figure). When the current layer is global average pooling, the input feature map is divided into as many parts as there are tiles, the input kernel is correspondingly divided to the same size as the image parts, each PE tile performs average pooling on its divided picture, and the slice_column_acc connected to each column of the PE array sums the divided pooling results. Considering the systolic flow of the data stream inside the PE array, the slices in each column output their pooling results in a pulsating manner, so each slice_column_acc needs only one data selector and one accumulator to accumulate the pooling result data of the slices in its column in a time-shared way; the whole PE array adds only column_num adders, shortening the latency of the whole pooling process by a factor of tile_num. This pipelined operation flow further improves the operational efficiency of global pooling while saving hardware computing resources.
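A purely behavioral sketch of the time-shared accumulation in one column (the list of four partial results stands in for the pulsating outputs of the column's slices):

```python
def slice_column_acc(column_outputs):
    """One data selector plus one accumulator per PE array column: as the
    column's slices output their partial pooling results pulse by pulse,
    the selector picks the current slice and the single accumulator adds
    it, so each column needs only one adder."""
    acc = 0.0
    for partial in column_outputs:   # one pulsating slice output per cycle
        acc += partial
    return acc

# Partial pooling results of four tiles for one column (one output channel):
print(slice_column_acc([0.5, 1.0, 0.25, 0.75]))  # 2.5
```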
Aiming at the special scenario of global average pooling, the application makes some adjustments on the basis of the original PE array and sums the parallel pooling results of the divided images: after the input image is divided, multiply-add operations are performed on each partial image in parallel, and by exploiting the pulsating data output characteristic of the PE array, only one accumulator is used per column. This realizes a higher level of parallelism and improves the operational efficiency of global average pooling while reducing resource power consumption to the greatest extent.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 11 is a schematic diagram of an apparatus for reducing occupation of computing resources in an NPU according to an embodiment of the present application.
The device is used for a neural network processing unit (NPU), where the NPU comprises a systolic computing array (PE array) module for convolution operation;
the device comprises:
a convolution kernel obtaining unit 1101, configured to obtain the shared division coefficient kernel_size of average pooling, where the kernel_size is the size of the pooling kernel used for the average pooling, and to expand the kernel_size into a first convolution kernel, where the first convolution kernel is a convolution kernel for depthwise convolution calculation;
a convolution kernel input unit 1102, configured to input the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
In this embodiment or some other embodiments of the present application, when expanding the kernel_size into the first convolution kernel, the convolution kernel obtaining unit may be specifically configured to:
acquire the size of the pooling kernel and the reciprocal of the kernel_size;
generate the first convolution kernel, where the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
In this embodiment or some other embodiments of the present application, specifically, the PE array module may be composed of a plurality of data computing blocks (tiles), each tile composed of a plurality of data computing slices; when the PE array module works, the input feature map is divided into as many parts as there are tiles, and each part is input to a tile to perform convolution operation independently; the slices in each column of the PE array module systolically share the same weight, the slices in each tile systolically share the same image, and the convolution result is output in a pulsating manner.
In this embodiment or some other embodiments of the present application, specifically, each slice may include an addition tree, an accumulator, and a plurality of parallel multipliers; when convolution operation is performed, each multiplier dot-multiplies the input divided two-dimensional image of a different channel with the corresponding weight, the multi-channel sum is obtained through the addition tree, and the sum is sent to the accumulator for accumulation; when the average pooling is performed, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel point corresponding to it, and the dot-multiplication result is input directly into the accumulator so as to accumulate all dot-multiplication results over the first convolution kernel plane.
In this embodiment or some other embodiments of the present application, the apparatus may further include:
a global average pooling acceleration unit, configured to, when the average pooling is global average pooling, divide the input feature map into as many parts as there are tiles, one part per tile, divide the input first convolution kernel correspondingly into parts of the same size as the divided input feature map, have each tile perform average pooling on its divided image, and sum the divided pooling results output by the slices of each column of the PE array.
In this embodiment or some other embodiments of the present application, the apparatus may further include:
a weight selection unit, configured to, before the convolution kernel obtaining unit is triggered, directly transmit the convolution kernel output by the weight cache to the PE array module to realize convolution operation when the running layer of the NPU is a convolution layer, and to trigger the convolution kernel obtaining unit and the convolution kernel input unit when the running layer of the NPU is an average pooling layer.
The embodiment provides a method for reducing the occupation of computing resources in an NPU, where the NPU includes a systolic computing array (PE array) module for convolution operation. First, the shared division coefficient kernel_size of average pooling is obtained; then the kernel_size is expanded into a first convolution kernel, i.e., a convolution kernel for depthwise convolution calculation; and the first convolution kernel is input to the PE array module so as to realize the average pooling through the PE array module. Compared with the prior art, in which a neural network accelerator uses extra computing resources for average pooling and sets up an independent module, the scheme of the application does not set up an independent module for average pooling. Instead, the original PE array module for convolution operations in the NPU is utilized: on the basis that this module can already process convolution operations, and without reducing the output efficiency of average pooling in the NPU, average pooling is mapped onto the PE array module originally used mainly for convolution, that is, converted into one type of convolution operation, so that the original PE array module can directly compute an arbitrary average pooling layer. This avoids the occupation of hardware computing resources by an independent average pooling module, relieves the shortage of hardware computing resources, and also reduces the power consumption of the NPU.
In addition, for the special scenario of global average pooling within average pooling, the application makes some adjustments on the basis of the original PE array and sums the parallel pooling results of the divided images: after the input image is divided, multiply-add operations are performed on each partial image in parallel, and by exploiting the pulsating data output characteristic of the PE array, only one accumulator is used per column. This realizes a higher level of parallelism and improves the operational efficiency of global average pooling while reducing resource power consumption to the greatest extent.
Regarding the apparatus in the foregoing embodiments, the specific manner in which each unit/module performs its operations has been described in detail in the embodiments of the related method and is not repeated here. In the present application, the names of the above units/modules do not limit the units/modules themselves; in practical implementations, these units/modules may appear under other names, and as long as their functions are similar to those in the present application, they fall within the scope of the claims of the present application and their equivalents.
In addition, the application also provides a neural network processing unit (NPU), where the NPU comprises any one of the above devices for reducing the occupation of computing resources in the NPU and the systolic computing array (PE array) module for convolution operation.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A method for reducing computational resource usage in an NPU, comprising:
the method is used for a neural network processing unit (NPU), wherein the NPU comprises a systolic computing array (PE array) module used for convolution operation;
the method comprises the following steps:
obtaining the shared division coefficient kernel_size of average pooling, wherein the kernel_size is the size of a pooling kernel used for the average pooling;
expanding the kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution calculation;
inputting the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
2. The method of claim 1, wherein expanding the kernel_size into a first convolution kernel comprises:
acquiring the size of the pooling kernel and the reciprocal of the kernel_size;
generating the first convolution kernel, wherein the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
3. The method of claim 2, wherein:
the PE array module consists of a plurality of data computing blocks (tiles), and each tile consists of a plurality of data computing slices; when the PE array module works, the input feature map is divided into as many parts as there are tiles, and each part is input to a tile to perform convolution operation independently; the slices in each column of the PE array module systolically share the same weight, the slices in each tile systolically share the same image, and the convolution result is output in a pulsating manner.
4. The method of claim 3, wherein:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; when convolution operation is performed, each multiplier dot-multiplies the input divided two-dimensional image of a different channel with the corresponding weight, the multi-channel sum is obtained through the addition tree, and the sum is sent to the accumulator for accumulation; when the average pooling is performed, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel point corresponding to it, and the dot-multiplication result is input directly into the accumulator so as to accumulate all dot-multiplication results over the first convolution kernel plane.
5. The method of claim 3, wherein when the average pooling is a global average pooling, the method further comprises:
dividing the input feature map into as many parts as there are tiles, one part per tile;
dividing the input first convolution kernel correspondingly into parts of the same size as the divided input feature map; each tile performs average pooling on its divided image, and the divided pooling results output by the slices of each column of the PE array are summed.
6. The method of claim 1, wherein, before obtaining the shared division coefficient kernel_size of the average pooling, the method further comprises:
when the currently running layer of the NPU is a convolution layer, passing the convolution kernel output by the weight cache directly to the PE array module to perform the convolution operation; and
when the currently running layer of the NPU is an average pooling layer, performing the step of obtaining the shared division coefficient kernel_size of the average pooling and the subsequent steps.
7. An apparatus for reducing the occupation of computing resources in an NPU, wherein:
the apparatus is applied to a neural network processing unit (NPU), and the NPU comprises a systolic computation array (PE array) module for convolution operations;
the apparatus comprises:
a convolution kernel obtaining unit, configured to obtain a shared division coefficient kernel_size of average pooling, wherein kernel_size is the size of the pooling kernel used for the average pooling, and to expand kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution computation; and
a convolution kernel input unit, configured to input the first convolution kernel into the PE array module, so that the average pooling is implemented by the PE array module.
8. The apparatus of claim 7, wherein, when expanding kernel_size into the first convolution kernel, the convolution kernel obtaining unit is specifically configured to:
obtain the size of the pooling kernel and the reciprocal of kernel_size; and
generate the first convolution kernel, wherein the first convolution kernel has the same size as the pooling kernel, and each multiplication factor w in the first convolution kernel equals the reciprocal of kernel_size.
9. The apparatus of claim 8, wherein:
the PE array module consists of a plurality of data computation blocks (tiles), and each tile consists of a plurality of data computation slices (slices); when the PE array module operates, the input feature map is divided into as many parts as there are tiles, and the parts are input to the tiles to perform convolution operations independently; the slices in each column of the PE array module systolically share the same weights, the slices in each tile systolically share the same image, and the convolution results are output in a systolic manner.
10. The apparatus of claim 9, wherein:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; during a convolution operation, each multiplier dot-multiplies the weights with the divided two-dimensional input image of a different channel, the addition tree produces a multi-channel sum, and the sum is sent to the accumulator for accumulation; during the average pooling, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel corresponding to w, and the dot-product result is input directly into the accumulator, so that all dot-product results over the first convolution kernel plane are accumulated.
11. The apparatus of claim 9, further comprising:
a global average pooling acceleration unit, configured to, when the average pooling is global average pooling, divide the input feature map into as many parts as there are tiles, one part per tile, divide the input first convolution kernel to the same size as the divided input feature map, perform average pooling on the divided images in each tile, and sum the partial pooled results output by the slices of each column of the PE array.
12. The apparatus of claim 7, further comprising:
a weight selection unit, configured to, before the convolution kernel obtaining unit is triggered, pass the convolution kernel output by the weight cache directly to the PE array module to perform the convolution operation when the currently running layer of the NPU is a convolution layer, and to trigger the convolution kernel obtaining unit and the convolution kernel input unit when the currently running layer of the NPU is an average pooling layer.
13. A neural network processing unit (NPU), comprising the apparatus for reducing the occupation of computing resources in an NPU according to any one of claims 7 to 12, and the systolic computation array (PE array) module for convolution operations.
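
For illustration only, and not as part of the claims: the expansion of claims 1 and 2 can be sketched numerically. The sketch below assumes kernel_size denotes the number of elements in the pooling window, so that a depthwise convolution whose every multiplication factor w equals 1/kernel_size reproduces average pooling exactly; the function names are illustrative, not from the specification.

```python
# Minimal sketch of claims 1-2: average pooling expressed as a depthwise
# convolution with an all-(1/kernel_size) kernel. Assumes kernel_size is the
# element count of the pooling window; names are illustrative.
import numpy as np

def expand_to_conv_kernel(pool_h, pool_w):
    kernel_size = pool_h * pool_w          # shared division coefficient
    w = 1.0 / kernel_size                  # reciprocal of kernel_size
    return np.full((pool_h, pool_w), w)    # the "first convolution kernel"

def depthwise_avg_pool(feature_map, pool_h, pool_w):
    # Depthwise: one kernel per channel, stride equal to the pooling window.
    c, h, width = feature_map.shape
    kernel = expand_to_conv_kernel(pool_h, pool_w)
    out = np.zeros((c, h // pool_h, width // pool_w))
    for ch in range(c):
        for i in range(0, h - pool_h + 1, pool_h):
            for j in range(0, width - pool_w + 1, pool_w):
                window = feature_map[ch, i:i + pool_h, j:j + pool_w]
                out[ch, i // pool_h, j // pool_w] = np.sum(window * kernel)
    return out

fm = np.random.rand(3, 8, 8)
expected = fm.reshape(3, 4, 2, 4, 2).mean(axis=(2, 4))  # reference 2x2 pooling
assert np.allclose(depthwise_avg_pool(fm, 2, 2), expected)
```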
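Similarly hedged, a behavioural sketch of the tile/slice data partitioning described in claims 3 and 9: the input feature map is divided into as many parts as there are tiles, each tile convolves its part independently, and every column of slices applies the same kernel. Systolic timing is not modelled, and the partitioning axis is an assumption; only the data movement is shown.

```python
# Dataflow sketch of claims 3/9 (data partitioning only, no systolic timing).
import numpy as np

def conv2d_valid(image, kernel):
    # Plain "valid" 2-D convolution: dot products of kernel-sized windows.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def pe_array_conv(feature_map, column_kernels, num_tiles):
    # Divide the input feature map into as many parts as there are tiles
    # (here: evenly along the height, an assumption of this sketch).
    parts = np.split(feature_map, num_tiles, axis=0)
    # Each tile works on its own image share independently, while every
    # column of slices shares the same kernel (weights).
    return [[conv2d_valid(part, k) for k in column_kernels] for part in parts]
```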
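A minimal behavioural model of one slice from claims 4 and 10, assuming a slice with N parallel multipliers feeding an addition tree and an accumulator; in average-pooling mode a single multiplier writes straight to the accumulator. This models the dataflow only, not the hardware design.

```python
# Behavioural sketch of a "slice" per claims 4/10; class and method names
# are illustrative assumptions, not from the specification.
class Slice:
    def __init__(self, num_multipliers=8):
        self.num_multipliers = num_multipliers
        self.accumulator = 0.0

    def conv_step(self, pixels, weights):
        """Convolution mode: each multiplier dot-multiplies one channel's
        pixel with its weight; the addition tree sums across channels and
        the multi-channel sum is sent to the accumulator."""
        products = [p * w for p, w in zip(pixels, weights)]  # parallel multipliers
        self.accumulator += sum(products)                    # addition tree + accumulate

    def avg_pool_step(self, pixel, w):
        """Average-pooling mode: only one multiplier is active; its product
        bypasses the addition tree and goes straight to the accumulator,
        accumulating all dot products over the first-convolution-kernel plane."""
        self.accumulator += pixel * w
```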
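A sketch of the global-average-pooling split of claims 5 and 11, under the assumption that the feature map (and the first convolution kernel, which in the global case matches the feature-map size) is divided evenly among the tiles along the height, with the per-tile partial pooled results summed afterwards, as the PE-array columns would do.

```python
# Global average pooling split across tiles, per claims 5/11 (assumed even
# height-wise split; names are illustrative).
import numpy as np

def global_avg_pool_tiled(feature_map, num_tiles):
    c, h, w = feature_map.shape
    kernel_w = 1.0 / (h * w)                          # reciprocal of the full kernel_size
    parts = np.split(feature_map, num_tiles, axis=1)  # one feature-map share per tile
    # Each tile pools its share with its share of the first convolution kernel...
    partial = [np.sum(p * kernel_w, axis=(1, 2)) for p in parts]
    # ...and the partial pooled results of the columns are summed.
    return np.sum(partial, axis=0)                    # shape (c,)

fm = np.random.rand(4, 8, 8)
assert np.allclose(global_avg_pool_tiled(fm, 4), fm.mean(axis=(1, 2)))
```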
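Finally, the weight selection of claims 6 and 12 amounts to a dispatch on the type of the currently running layer, sketched below; the layer descriptor fields and the weight_cache interface are assumptions, and expand_to_conv_kernel is the illustrative helper from the first sketch above.

```python
# Weight-selection sketch for claims 6/12; all interfaces are hypothetical.
def select_weights(layer, weight_cache):
    """Route weights to the PE array module depending on the running layer."""
    if layer.type == "conv":
        # Convolution layer: pass the cached convolution kernel straight through.
        return weight_cache.fetch(layer)
    if layer.type == "avg_pool":
        # Average-pooling layer: expand the shared division coefficient
        # kernel_size into the first convolution kernel instead.
        return expand_to_conv_kernel(layer.pool_h, layer.pool_w)
    raise ValueError(f"unsupported layer type: {layer.type}")
```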
CN202011114887.9A 2020-10-19 2020-10-19 Method and device for reducing occupation of computing resources in NPU Active CN111931927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011114887.9A CN111931927B (en) 2020-10-19 2020-10-19 Method and device for reducing occupation of computing resources in NPU

Publications (2)

Publication Number Publication Date
CN111931927A true CN111931927A (en) 2020-11-13
CN111931927B CN111931927B (en) 2021-02-19

Family

ID=73333737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011114887.9A Active CN111931927B (en) 2020-10-19 2020-10-19 Method and device for reducing occupation of computing resources in NPU

Country Status (1)

Country Link
CN (1) CN111931927B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388917A (en) * 2018-02-26 2018-08-10 Northeastern University Hyperspectral image classification method based on an improved deep learning model
CN108537331A (en) * 2018-04-04 2018-09-14 Tsinghua University Reconfigurable convolutional neural network acceleration circuit based on asynchronous logic
CN109002885A (en) * 2018-07-24 2018-12-14 Jinan Inspur Hi-Tech Investment and Development Co., Ltd. Convolutional neural network pooling unit and pooling calculation method
CN109447237A (en) * 2018-09-05 2019-03-08 Zhejiang Changxing Descartes Technology Co., Ltd. Pooling calculation method, electronic device and storage medium based on statistical outliers
CN110097174A (en) * 2019-04-22 2019-08-06 Xi'an Jiaotong University Convolutional neural network implementation method, system and device based on FPGA and row-output priority
CN110738308A (en) * 2019-09-23 2020-01-31 Chen Xiaobai Neural network accelerator

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIU, Yue: "Research and Implementation of FPGA-based Convolutional Neural Network Acceleration Methods", China Master's Theses Full-text Database *
LIU, Wanjun: "Research on the Learning Performance of Convolutional Neural Networks with Different Pooling Models", Journal of Image and Graphics *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361699A (en) * 2021-07-16 2021-09-07 Arm Technology (China) Co., Ltd. Multiplication circuit, system on chip and electronic device
WO2024119952A1 (en) * 2022-12-06 2024-06-13 Beijing Horizon Information Technology Co., Ltd. Neural network model compiling method and apparatus, storage medium and electronic device
CN116029332A (en) * 2023-02-22 2023-04-28 Nanjing University On-chip fine-tuning method and device based on LSTM network
CN116029332B (en) * 2023-02-22 2023-08-22 Nanjing University On-chip fine-tuning method and device based on LSTM network

Also Published As

Publication number Publication date
CN111931927B (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN111931927B (en) Method and device for reducing occupation of computing resources in NPU
CN108765247B (en) Image processing method, device, storage medium and equipment
EP3746945B1 (en) Improving performance of neural network arrays
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
US10394929B2 (en) Adaptive execution engine for convolution computing systems
CN108376387B (en) Image deblurring method based on aggregation expansion convolution network
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN112163601B (en) Image classification method, system, computer device and storage medium
CN111915660B (en) Binocular disparity matching method and system based on shared features and attention up-sampling
CN112070664B (en) Image processing method and device
WO2023146523A1 (en) Event-based extraction of features in a convolutional spiking neural network
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN113326930B (en) Data processing method, neural network training method, related device and equipment
CN113034391A (en) Multi-mode fusion underwater image enhancement method, system and application
CN115660955A (en) Super-resolution reconstruction model, method, equipment and storage medium for efficient multi-attention feature fusion
CN115272670A (en) SAR image ship instance segmentation method based on mask attention interaction
CN109740619B (en) Neural network terminal operation method and device for target recognition
CN113239949A (en) Data reconstruction method based on 1D packet convolutional neural network
CN114846382A (en) Microscope and method with convolutional neural network implementation
CN115861062A (en) Multi-scale learning wavelet attention mechanism network and image super-resolution reconstruction method
CN112529064B (en) Efficient real-time semantic segmentation method
CN113657587A (en) FPGA-based deformable convolution acceleration method and device
CN114154621A (en) Convolutional neural network image processing method and device based on FPGA
CN116888605A (en) Operation method, training method and device of neural network model
CN111914996A (en) Method for extracting data features and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant