CN109165734B - Matrix local response normalization vectorization implementation method - Google Patents

Matrix local response normalization vectorization implementation method

Info

Publication number
CN109165734B
CN109165734B (application CN201810758431.2A)
Authority
CN
China
Prior art keywords
calculation
matrix
processing unit
vector processing
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810758431.2A
Other languages
Chinese (zh)
Other versions
CN109165734A (en)
Inventor
陈书明
李斌
陈海燕
扈啸
杨超
张军阳
陈伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201810758431.2A
Publication of CN109165734A
Application granted
Publication of CN109165734B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention discloses a vectorization implementation method for matrix local response normalization. An input feature map is loaded into a vector processor and block-transposed so that the data points along each channel direction in the multiple blocks of post-convolution calculation results are gathered into a single column; each vector processing unit VPE then performs the local response normalization operation on the transposed matrix, the VPEs complete the calculation of their respective output feature maps simultaneously, and the calculation results of each VPE are stored back by rows. The invention realizes the vectorization of matrix local response normalization and has the advantages of a simple implementation method, good parallelism, and high processing efficiency.

Description

Matrix local response normalization vectorization implementation method
Technical Field
The invention relates to the technical field of machine learning based on a convolutional neural network, in particular to a vectorization implementation method for matrix local response normalization.
Background
With the rise of deep learning technology, target recognition based on convolutional neural networks is widely used in image recognition, speech recognition, natural language processing, and other fields. The convolutional neural network is the most widely applied model among current deep learning algorithms and also the one with the best recognition rate; a convolutional neural network model generally comprises matrix convolution, activation functions, max-value or average-value pooling, local response normalization, and similar operations.
Normalization is one of the commonly used modules in convolutional neural network models. Local response normalization (LRN) is a mechanism that performs "lateral inhibition" on its input: it creates competition among the activities of local neurons, so that values with larger responses become relatively larger while neurons with smaller feedback are suppressed, thereby enhancing the generalization capability of the model. In neurobiology, lateral inhibition refers to activated neurons suppressing adjacent neurons, and the principle of local response normalization is to mimic this phenomenon of biologically active neurons inhibiting their neighbours; that is, local suppression is achieved by borrowing the idea of lateral inhibition, which is especially useful when the ReLU nonlinear activation function is used. By simulating the lateral-inhibition mechanism of the biological nervous system, the LRN layer creates a competition mechanism for the activity of local neurons, making larger response values correspondingly larger and improving the generalization capability of the model; local response normalization is akin to smoothing and can raise the recognition rate by 1-2%.
The main parameters needed by a local response normalization layer are: norm_region: selects normalization across adjacent channels or over spatial regions within a channel; the default is inter-channel normalization. local_size: has two meanings, (1) for inter-channel normalization it is the number of channels summed over; (2) for intra-channel normalization it is the side length of the summation window; the default value is 5. alpha: a scaling factor, with a default value of 1. beta: the exponent term, with a default value of 0.75.
In the inter-channel normalization mode of local response normalization, the local region ranges over adjacent channels, with no spatial extent; in the intra-channel normalization mode, the local region extends spatially but is confined to a single channel. Local response normalization can effectively prevent overfitting; because data across channels are not independent, it improves the generalization capability of the model and reduces the need for overfitting-control layers such as dropout.
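For concreteness, the following is a minimal Python sketch of the inter-channel mode applied to the channel vector of a single pixel, using the parameter defaults listed above; the function name and the bias term k = 2 (the common AlexNet/Caffe convention) are illustrative assumptions rather than anything mandated by this patent.

    import numpy as np

    def lrn_channel_vector(a, local_size=5, k=2.0, alpha=1.0, beta=0.75):
        """Inter-channel LRN at one spatial position (illustrative sketch).
        a: shape (N,), the responses of all N channels at that position."""
        N = a.shape[0]
        half = local_size // 2
        b = np.empty_like(a, dtype=np.float64)
        for i in range(N):
            lo, hi = max(0, i - half), min(N - 1, i + half)  # clamped channel window
            b[i] = a[i] / (k + alpha * np.sum(a[lo:hi + 1] ** 2)) ** beta
        return b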
With the continual emergence of high-density, real-time applications such as the solution of large dense linear systems, high-definition video encoding and decoding, 4G communication, and digital image processing, computer architectures have changed markedly: new architectures such as GPU many-core architectures, heterogeneous multi-core architectures, and vector processor architectures keep appearing. These architectures integrate multiple processor cores on a single chip, each core containing abundant processing units, which greatly improves the computing performance of the chip. The vector processor is one such new architecture; as shown in fig. 1, it generally comprises a vector processing unit (VPU) and a scalar processing unit (SPU), where the vector processing unit usually contains a number of parallel vector processing elements (VPEs) that can exchange data through reduction and shuffle operations, and all VPEs perform the same operation in SIMD fashion.
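To illustrate the SIMD semantics just described, here is a toy emulation (purely illustrative, with an assumed lane count p = 16) in which the p VPE lanes each apply the same operation to their own slice of a long vector:

    import numpy as np

    p = 16                                     # assumed number of VPE lanes
    long_vector = np.arange(64, dtype=np.float64)

    # Process p elements per step: one element per VPE, same operation on all lanes.
    out = np.empty_like(long_vector)
    for step in range(0, long_vector.size, p):
        lanes = long_vector[step:step + p]     # p data loaded simultaneously
        out[step:step + p] = lanes * lanes     # every VPE performs the same square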
However, the local response normalization operation is both computation-intensive and memory-access-intensive; without a reasonable computing method, its computational advantages are hard to realize even on high-performance computing equipment. In the prior art, local response normalization is usually performed without vectorization, i.e., only one pixel of one map can be operated on at a time, so the computation efficiency is low and the approach cannot exploit the multi-core parallelism of a vector processor architecture.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above technical problems in the prior art, the invention provides a vectorization implementation method for matrix local response normalization that is simple to implement, offers good parallelism and high processing efficiency, can efficiently vectorize the local response normalization of multiple input feature maps, and can improve the parallelism of a multi-core vector processor and the operation efficiency of the processor.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a vectorization implementation method for matrix local response normalization comprises the steps of accessing an input characteristic diagram into a vector processor and conducting block transposition, enabling data points corresponding to the same channel direction in a plurality of calculation results after convolution calculation to be stored in a row, conducting local response normalization operation on a reversed matrix by using each vector processing unit VPE, enabling each vector processing unit VPE to complete calculation of each output characteristic diagram at the same time, and enabling calculation results of each vector processing unit VPE to be stored in a row.
As a further improvement of the invention, the method comprises the following steps (a brief code sketch of steps S3-S4 follows the list):
s1, determining the size M multiplied by N of an output characteristic diagram which can be calculated by a vector processing unit VPE according to the number p of the vector processing unit VPE in a vector processor, the number of convolution kernels, the size of an input characteristic diagram and the weight group number N of convolution which is performed on input data obtained by calculating results after convolution;
s2, inputting an input characteristic diagram into a vector processor, and respectively distributing the input characteristic diagram to each vector processing unit VPE;
s3, each vector processing unit VPE carries out parallel processing, and each vector processing unit VPE carries out block transposition on a plurality of blocks of calculation results after convolution calculation to obtain an N multiplied by M transposed matrix;
s4, performing local response normalization calculation on the transposed matrix according to columns, storing the normalized result according to the rows, and outputting the normalized result with the size of M multiplied by N;
s5, repeatedly executing the steps S3-S4 until the calculation of the p output feature maps is completed in parallel.
As a further improvement of the present invention, the specific steps of step S1 are: when the input feature map size is W×H×N, the amount of calculation of the single-core vector processing unit VPE is M×N×p per pass, where N is an integer multiple of p; the configuration is such that the output feature map each vector processing unit VPE can calculate is M×N, M×N×p does not exceed the in-core storage capacity, and W and H are integer multiples of M and N respectively, so that a whole number of passes can be performed.
As a further improvement of the present invention, in step S2 the input feature map is specifically placed into the out-of-core DDR of the vector processor and the input feature map data are distributed evenly to the vector processing units VPE; if there is surplus data that cannot be distributed evenly, the surplus is processed in a further pass by several of the vector processing units VPE.
As a further improvement of the invention: in step S3, the n/p blocks of post-convolution calculation results are specifically block-transposed according to the storage positions of the input feature map, so that the data points along the n channel directions are stored in one column.
As a further improvement of the invention: in step S4, one column of the transposed matrix is taken at a time, and the normalization calculation proceeds sequentially from top to bottom.
As a further improvement of the invention: in step S4, the normalization operation is specifically performed using the following formula:
$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Bigl( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \bigl( a^{j}_{x,y} \bigr)^{2} \Bigr)^{\beta} \qquad (1) $$
wherein k, n, α, and β are hyper-parameters; a^i_{x,y} denotes the output of the i-th kernel at position (x, y) after applying the activation function ReLU; N is the number of channels; a is the output result of the convolution layer, a four-dimensional array [batch, height, width, channel], where batch is the number of batches, height is the picture height, width is the picture width, and channel is the depth of the processed picture.
As a further improvement of the invention: in the local response normalization calculation of step S4, when the sum of squares of several adjacent elements is computed, the current result is obtained from the previous result by subtracting the square of the element that leaves the window and adding the square of the element that enters it.
As a further improvement of the invention: the input feature map is specifically a result of pooling or convolution calculation.
Compared with the prior art, the invention has the advantages that:
1) The invention takes the data after the convolution and pooling operations as the input feature map data and performs a transposition that gathers the results of all channels together, so that the results computed simultaneously by the PEs are stored by columns; every value in the long vector can then be operated on, realizing the vectorization of local response normalization.
2) In this vectorization implementation method, an optimal implementation is determined from the architectural features of the vector processor and the number and scale of the input feature maps, which effectively improves the parallel operation of the vector processor. Different input feature maps are handed to different PEs for processing, and there are no correlated operations across PEs, so as many input feature maps can be computed simultaneously as there are PEs. The local normalization calculation is simple to implement and convenient to operate, can fully exploit the instruction-, data-, and task-level parallelism of the vector processor, and gives full play to the high-performance computing capability of a vector processor with multiple PE operation units.
Drawings
FIG. 1 is a schematic diagram of the architecture of a vector processor.
Fig. 2 is a schematic diagram of an implementation flow of the matrix local response normalization vectorization implementation method in this embodiment.
Fig. 3 is a schematic diagram of the principle of transposing the local response normalization input data in this embodiment.
Fig. 4 is a schematic diagram of the principle of the normalization calculation on the transposed matrix in this embodiment.
Fig. 5 is a schematic diagram of an implementation principle of the local response normalization calculation according to the present embodiment.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
The invention discloses a vectorization implementation method for matrix local response normalization: the input feature maps are distributed to the vector processing units VPE of a vector processor; the VPEs calculate the output feature maps in parallel; during calculation each VPE block-transposes its input data so that the data points corresponding to each channel direction in the multiple blocks of post-convolution calculation results are gathered into one column; and after the transposed matrix is obtained, the local response normalization operation is performed on it column by column.
The matrix local response normalization vectorization implementation method is suitable for the local response normalization processing that follows convolution and pooling operations executed on a vector processor. The invention builds on convolution performed on a vector processor: because multiple cores process the data simultaneously, the results of one element's multiple channels are stored in blocks at their positions in the DDR.
As shown in fig. 2, the steps of the matrix local response normalization vectorization implementation method in this embodiment include:
s1, determining the size M multiplied by N of an output characteristic diagram which can be calculated by a vector processing unit VPE according to the number p of the vector processing unit VPE in a vector processor, the number of convolution kernels, the size of an input characteristic diagram and the weight group number N of convolution which is performed on input data obtained by calculating results after convolution;
s2, inputting the input characteristic graph into a vector processor, and respectively distributing the input characteristic graph to each vector processing unit VPE;
s3, each vector processing unit VPE performs parallel processing, and each vector processing unit VPE performs block transposition on a plurality of blocks of calculation results after convolution calculation to obtain an NxM transposed matrix;
s4, carrying out local response normalization calculation on the post-conversion matrix according to columns, storing the normalized result according to the rows, and outputting the normalized result with the size of M multiplied by N;
s5, repeatedly executing the steps S3-S4 until the calculation of the p output feature maps is completed in parallel.
The specific steps of step S1 in this embodiment are: when the input feature map size is W×H×N, the amount of calculation of the single-core vector processing unit VPE is M×N×p per pass, where N is an integer multiple of p; the configuration is such that the output feature map each VPE can calculate is M×N, M×N×p does not exceed the in-core storage capacity, and W and H are integer multiples of M and N respectively, so that a whole number of passes can be performed. If the data to be processed contains several pictures, these maps need to be distributed over the cores; for a specific problem, the number of cores determines the size of the input feature map, and the output feature map that a vector processing unit VPE can calculate can be determined from the actual requirements by the number p of PEs, the number of convolution kernels, the size of the input feature map, and the number n of weight groups.
For a vector processor, the number p of VPEs of a single core is fixed; a given neural network is computed on those VPEs, and once the network is fixed the scale of every layer is fixed as well. The local response normalization operation follows the convolution calculation of a convolution layer, and the number n of weight groups is the number of channels. Once the input feature map, i.e., the size of the local response normalization input data, is determined (W×H×N), the computation volume of a single core (M×N×p) can be determined from the in-core storage capacity, and the output feature map M×N that each PE can calculate is chosen such that M×N×p does not exceed the in-core storage capacity and W and H are integer multiples of M and N respectively, enabling a whole number of passes.
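As a small sketch of the sizing rule just stated (the function name, the core capacity value, and H = 64 are assumptions for illustration):

    def tile_is_valid(M, N, W, H, p, core_capacity):
        """Check the sizing constraints described above: the per-pass workload
        M*N*p must fit in the in-core storage, and W and H must be whole
        multiples of M and N so an integer number of passes covers the input."""
        return (M * N * p <= core_capacity) and (W % M == 0) and (H % N == 0)

    # e.g. the embodiment's numbers with a hypothetical 32K-word core store:
    print(tile_is_valid(M=18, N=64, W=1152, H=64, p=16, core_capacity=32768))  # True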
In this embodiment, in step S2 the input feature map is specifically placed into the out-of-core DDR of the vector processor and the input feature map data are distributed evenly to the vector processing units VPE, ensuring that the cores run fully in parallel; if there is surplus data that cannot be distributed evenly, it is processed in a further pass by several of the vector processing units VPE.
In this embodiment, in step S3 the n/p blocks of post-convolution calculation results are block-transposed according to the storage positions of the input feature map, so that the data points along the n channel directions end up stored in one column. A vector processor is characterized by p PEs operating simultaneously, storing and loading p data at a time, and those p data are stored by rows to form a long vector. The normalization operation follows the convolution, whose results as computed by the p PEs are stored together in sequence, i.e., by rows. Since the normalization calculation has to be performed across all p simultaneously computed values, and row-wise storage does not allow operating on each individual value inside the long vector, this embodiment block-transposes the n/p blocks of data so that the simultaneously computed results are stored by columns; based on the transposed matrix, the normalization operation can then be vectorized. In the embodiment shown in fig. 3, the input feature map is transposed into N×M data blocks and the results of a simultaneous computation are stored in a column; for example, the channel-direction data points [a11, a12, ..., a1n] form the first column of the transposed matrix. After convolution the size of the feature map is M×N, and the transpose operation can be completed with a dedicated instruction.
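One plausible rendering of this block transposition in NumPy follows; the exact storage layout is fixed by the convolution pass on the real processor, so the (M x p)-block assumption and the names here are illustrative only.

    import numpy as np

    def block_transpose(rows, M, n, p):
        """Sketch of the n/p-block transposition: 'rows' holds the convolution
        results stored row-wise as n//p consecutive M x p blocks (p PEs write
        one row of p results at a time). Transposing each block and stacking
        gives an n x M matrix whose columns hold the n channel-direction
        points of one pixel."""
        blocks = rows.reshape(n // p, M, p)            # split the row-wise stream into blocks
        return np.concatenate([b.T for b in blocks])   # each block p x M; stacked: n x M

    M, n, p = 18, 64, 16                               # the embodiment's numbers
    rows = np.arange(M * n, dtype=np.float64).reshape(n // p * M, p)
    t = block_transpose(rows, M, n, p)                 # t.shape == (64, 18)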
In this embodiment, when the local response normalization calculation is performed on the transposed matrix by columns in step S4, one column of the transposed matrix is taken at a time and the normalization calculation proceeds sequentially from top to bottom. As shown in fig. 4, in the specific embodiment the first column [a11, a12, ..., a1n] of the left-hand transposed matrix is taken and all its elements are normalized in order from top to bottom, yielding the first column of the normalized matrix on the right; the remaining columns are normalized on the same principle.
As shown in fig. 5, in step S4 the normalization operation is performed according to the following formula:

$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Bigl( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \bigl( a^{j}_{x,y} \bigr)^{2} \Bigr)^{\beta} \qquad (1) $$
wherein k, n, α, and β are hyper-parameters, typically k = 2, n = 5, α = 1e-4, and β = 0.75; a^i_{x,y} denotes the output of the i-th kernel at position (x, y) after applying the activation function ReLU; N is the number of channels; a is the output result of the convolution layer (including the convolution operation and pooling operation), a four-dimensional array [batch, height, width, channel], where batch is the number of batches (each batch here being one picture), height is the picture height, width is the picture width, and channel is the number of channels, i.e., the depth of the processed picture. a^i_{x,y} corresponds to one position [a, b, c, d] in this output, i.e., the point at height b and width c under channel d of the a-th picture. The summation Σ runs along the channel direction, i.e., the squares of the point values are summed along the third (channel) dimension of a: for a given point, the squares of the corresponding points of the n/2 preceding channels (bounded below by channel 0) and the n/2 following channels (bounded above by channel N-1) are accumulated, n+1 points in total. The number of input channels is taken as the number of 3-dimensional matrices, and the direction of superposition is likewise the channel direction.
In the above local response normalization calculation, in the inter-channel normalization mode the local region ranges over adjacent channels without spatial extent, while in the intra-channel normalization mode the local region extends spatially but is confined to an independent channel.
The matrix local response normalization vectorization thus processes p pixel points simultaneously using formula (1): the transposed N×M matrix is normalized column by column with formula (1), the normalized data are stored by rows, and after repeating this M times a normalization result of size M×N is output.
In the present embodiment, in the local response normalization calculation of step S4, when the sum of squares of several adjacent elements is calculated, the current result is obtained from the previous result by subtracting the square of the element that leaves the window and adding the square of the element that enters it. The normalization operation needs the sum of squares of several adjacent elements; to avoid redundant operations, this embodiment does not recompute the full sum of squares every time. For example, if a segment [a1, a2, a3, a4, a5, a6, a7] of the input feature map is to be normalized with a window of five elements, then b1 = a1² + a2² + a3² + a4² + a5², and the next window sum is b2 = b1 − a1² + a6², so only the square of the departing element a1 and that of the arriving element a6 need to be computed.
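A short sketch of this running-window update, applied to the clamped channel window of formula (1); for interior windows it reduces exactly to the subtract-one-square, add-one-square step of the example above (the function name is hypothetical).

    import numpy as np

    def windowed_sq_sums(a, local_size=5):
        """Sum of squares over a sliding channel window, updated incrementally:
        add the square entering the window, subtract the square leaving it."""
        C = a.shape[0]
        half = local_size // 2
        sq = a.astype(np.float64) ** 2
        sums = np.empty(C)
        s = sq[: half + 1].sum()          # window for i = 0 covers channels 0 .. half
        sums[0] = s
        for i in range(1, C):
            if i + half < C:              # square entering the window
                s += sq[i + half]
            if i - half - 1 >= 0:         # square leaving the window
                s -= sq[i - half - 1]
            sums[i] = s
        return sums

    a = np.arange(1.0, 8.0)               # [a1 .. a7] as in the example above
    assert np.allclose(windowed_sq_sums(a),
                       [sum(x * x for x in a[max(0, i - 2): i + 3]) for i in range(7)])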
In a specific application embodiment, the procedure of the matrix local response normalization vectorization implementation method of the invention is as follows:
(1) According to the number p of vector processing units VPE in the vector processor, the size of the input feature map, and the number n of weight groups of the convolution performed before normalization, the output feature map that each PE can calculate is determined. Here p = 16, n = 64, and the input feature map is 1152×16, so 16 maps can be calculated at the same time, and the size of the output feature map each PE can calculate is determined to be 18×64.
(2) The input feature map is placed into the DDR of the vector processor.
(3) Block transposition is performed according to the storage positions of the input feature map; the size of each transposed block is 64×18, and the blocks are placed in the AM.
(4) The transposed 64×18 block is subjected to the local response normalization operation column by column.
(5) Step (4) is repeated 18 times; when the operation finishes, a calculation result of size 18×64 is output.
(6) Steps (3) to (5) are repeated until the calculation of all 16 output feature maps is completed.
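For reference, the shapes in this embodiment fit together as follows; this is a plain arithmetic check using only the numbers from steps (1)-(6).

    p, n = 16, 64               # 16 VPEs; 64 weight groups, i.e. channels
    M, N = 18, 64               # output tile each PE computes
    assert M * n == 1152        # 18 x 64 = 1152, matching the 1152 dimension of the input
    assert n % p == 0           # n/p = 4 blocks are block-transposed per map
    assert (N, M) == (64, 18)   # transposed block placed in the AM, step (3)
    assert M == 18              # step (4) repeated 18 times, one column per pass
    assert (M, N) == (18, 64)   # normalized result per map; 16 maps finish in parallel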
The foregoing is merely a preferred embodiment of the invention and is not to be construed as limiting the invention in any way. Although the invention has been disclosed above by way of preferred embodiments, it is not limited to them. Any simple modification, equivalent change, or adaptation of the above embodiments that relies on the technical substance of the invention and does not depart from the content of its technical scheme shall fall within the protection scope of the technical scheme of the invention.

Claims (8)

1. A matrix local response normalization vectorization implementation method, characterized by loading an input feature map into a vector processor and block-transposing it, so that the data points along each channel direction in the multiple blocks of post-convolution calculation results are gathered into one column; performing the local response normalization operation on the transposed matrix with each vector processing unit VPE, each vector processing unit VPE simultaneously completing the calculation of its output feature map, and the calculation results of each vector processing unit VPE being stored by rows; the method comprises the following steps:
s1, determining the size M multiplied by N of an output characteristic diagram which can be calculated by a vector processing unit VPE according to the number p of the vector processing unit VPE in a vector processor, the number of convolution kernels, the size of an input characteristic diagram and the weight group number N of convolution of input data obtained by calculating a result after the convolution;
s2, inputting an input characteristic diagram into a vector processor, and respectively distributing the input characteristic diagram to each vector processing unit VPE;
s3, each vector processing unit VPE carries out parallel processing, and each vector processing unit VPE carries out block transposition on a plurality of blocks of calculation results after convolution calculation to obtain an N multiplied by M transposed matrix;
s4, performing local response normalization calculation on the transposed matrix according to columns, storing the normalized result according to the rows, and outputting the normalized result with the size of M multiplied by N;
s5, repeatedly executing the steps S3-S4 until the calculation of the p output feature maps is completed in parallel.
2. The matrix local response normalization vectorization implementation method according to claim 1, wherein step S1 specifically comprises: when the input feature map size is W×H×N, the amount of calculation of the single-core vector processing unit VPE is M×N×p per pass, where N is an integer multiple of p; the configuration is such that the output feature map each vector processing unit VPE can calculate is M×N, M×N×p does not exceed the in-core storage capacity, and W and H are integer multiples of M and N respectively, so that a whole number of passes can be performed.
3. The matrix local response normalization vectorization implementation method according to claim 1, wherein: in step S2, the input feature map is specifically input to the out-of-core DDR of the vector processor, the input feature map data are distributed evenly to the vector processing units VPE, and if there is surplus data that cannot be distributed evenly, the surplus data are processed by several of the vector processing units VPE.
4. The matrix local response normalization vectorization implementation method according to claim 1, wherein: in step S3, the n/p blocks of post-convolution calculation results are specifically block-transposed according to the storage positions of the input feature map, so that the data points along the n channel directions are stored in one column.
5. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein: in step S4, one column of the transposed matrix is taken at a time, and the normalization calculation proceeds sequentially from top to bottom.
6. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein the normalization operation in step S4 is performed using the following formula:

$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Bigl( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \bigl( a^{j}_{x,y} \bigr)^{2} \Bigr)^{\beta} $$
wherein k, n, α, and β are hyper-parameters; a^i_{x,y} denotes the output of the i-th kernel at position (x, y) after applying the activation function ReLU; N is the number of channels; a is the output result of the convolution layer, a four-dimensional array [batch, height, width, channel], where batch is the number of batches, height is the picture height, width is the picture width, and channel is the depth of the processed picture.
7. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein in step S4, when the sum of squares of several adjacent elements is calculated in the local response normalization calculation, the current calculation result is obtained from the previous calculation result by subtracting the square of the element leaving the window and adding the square of the element entering it.
8. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein the input feature map is a result of pooling or convolution calculation.
CN201810758431.2A 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method Active CN109165734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810758431.2A CN109165734B (en) 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method


Publications (2)

Publication Number Publication Date
CN109165734A CN109165734A (en) 2019-01-08
CN109165734B (en) 2021-04-02

Family

ID=64897593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810758431.2A Active CN109165734B (en) 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method

Country Status (1)

Country Link
CN (1) CN109165734B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580450A (en) * 2019-08-12 2019-12-17 西安理工大学 traffic sign identification method based on convolutional neural network
CN110766157B (en) * 2019-10-21 2022-03-18 中国人民解放军国防科技大学 Multi-sample neural network forward propagation vectorization implementation method
CN110796236B (en) * 2019-10-21 2022-06-17 中国人民解放军国防科技大学 Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN113222136A (en) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 Convolution operation method and chip


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100073383A1 (en) * 2008-09-25 2010-03-25 Sergey Sidorov Cloth simulation pipeline

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411773A (en) * 2011-07-28 2012-04-11 中国人民解放军国防科学技术大学 Vector-processor-oriented mean-residual normalized product correlation vectoring method
CN107680092A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of detection of container lock and method for early warning based on deep learning
CN108205703A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Multi-input multi-output matrix average value pooling vectorization implementation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a High-Performance DMA Transfer Scheme for GPDSP Scientific Computing; Wang Zhanli; China Excellent Masters' Theses Full-text Database, Information Science and Technology; 2017-03-15 (No. 03); I137-120 *

Also Published As

Publication number Publication date
CN109165734A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
CN109165734B (en) Matrix local response normalization vectorization implementation method
US11720800B2 (en) Efficient data layouts for convolutional neural networks
CN109543830B (en) Splitting accumulator for convolutional neural network accelerator
CN108205702B (en) Parallel processing method for multi-input multi-output matrix convolution
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN116113941A (en) Neural network accelerator, acceleration method and device
CN110796236B (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
US20210150363A1 (en) Apparatus and method for multi-phase pruning for neural network with multi-sparsity levels
CN111738276A (en) Image processing method, device and equipment based on multi-core convolutional neural network
CN113010213A (en) Simplified instruction set storage and calculation integrated neural network coprocessor based on resistance change memristor
KR20220071723A (en) Method and apparatus for performing deep learning operations
JP7122041B2 (en) Joint Sparsity Method Based on Mixed Granularity Used in Neural Networks
CN113516580B (en) Method and device for improving neural network image processing efficiency and NPU
KR20230104235A (en) Method and system for convolution with workload-balanced activation sparsity
CN113888390A (en) Feature map processing method and device, electronic equipment and computer readable medium
CN110765413A (en) Matrix summation structure and neural network computing platform
KR102548283B1 (en) Convolutional neural network computing device
US20240028869A1 (en) Reconfigurable processing elements for artificial intelligence accelerators and methods for operating the same
Fazlali et al. GPU-based Parallel Technique for Solving the N-Similarity Problem in Textual Data Mining
DE202023104860U1 (en) Apparatus for matrix calculation using data conversion in a computing accelerator
Kondo et al. Accelerating Convolution Neural Networks in Embedded Intel GPUs
DE202023106035U1 (en) Device for compressing weight blocks in neural networks in a computing accelerator
US20200104669A1 (en) Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations
KR20240025827A (en) In memory computing(imc) processor and operating method of imc processor
York _. Connection Machine Implementation of the Boundary Contour System

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant