WO2020107616A1 - Parallel computing method and device - Google Patents

Parallel computing method and device

Info

Publication number
WO2020107616A1
WO2020107616A1 (PCT/CN2018/124831)
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
input
matrices
layer
output
Prior art date
Application number
PCT/CN2018/124831
Other languages
English (en)
French (fr)
Inventor
苏岚
顾鹏
Original Assignee
深圳云天励飞技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳云天励飞技术有限公司
Publication of WO2020107616A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • This application relates to the field of computers, and in particular to a parallel computing method and device.
  • Neural networks are widely used in fields such as pattern recognition, image processing, function approximation, and optimization. Because of their high computational speed, they are receiving increasing attention from both academia and industry.
  • A vector processor is a processor specially designed for highly pipelined operation that can operate efficiently on an entire vector matrix row by row. Therefore, deep learning tasks in current neural networks are mostly processed by vector processors (such as GPUs, vector DSPs, and CPUs with SIMD extended instruction sets).
  • the present application provides a parallel calculation method and device, which can enable the vector processor to improve calculation performance when processing small-sized data.
  • the present application provides a parallel computing method, which includes the following steps:
  • the N input matrices of the target layer of the convolutional neural network are horizontally spliced to obtain a first spliced input matrix, wherein the target layer includes a convolutional layer and a pooling layer, and the input matrix includes a calculation identifier;
  • the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes convolution kernel parameters, the pooling calculation identifier includes pooling window parameters, and the value of N is determined by the bit width of the vector processor;
  • the vector processor performs horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix, where the spliced output matrix includes N output matrices;
  • the N output matrices are selected from the spliced output matrix, and the N output matrices are used as the N input matrices of the next layer of the convolutional neural network.
  • when the target layer is a convolutional layer, horizontally splicing the N input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix includes:
  • when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is 1, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix;
  • when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is 1, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is half the convolution kernel width rounded down;
  • when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step.
  • when the target layer is a pooling layer, horizontally splicing the N different input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix includes:
  • when the input matrix is a non-filled pooling layer matrix and the pooling window width is s, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • when the input matrix is a filled pooling layer matrix and the pooling window width is s, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
  • using the N output matrices as the N input matrices of the next layer of the convolutional neural network includes:
  • when the matrix width of the output matrix is greater than half the width of the input matrix, the N output matrices are used as the N input matrices of the next layer of the convolutional neural network;
  • using the N output matrices as the N input matrices of the next layer of the convolutional neural network further includes:
  • when the matrix width of the output matrix is less than or equal to half the width of the input matrix, every two rows of the output matrix are joined end to end into one row to obtain N second splicing matrices, and the N second splicing matrices are used as the N input matrices of the next layer of the convolutional neural network.
  • In a second aspect, the present application provides a parallel computing device. The device includes a splicing unit, a computing unit, and an output unit, wherein,
  • the splicing unit is used for horizontally splicing the N input matrices of the target layer of the convolutional neural network to obtain a first spliced input matrix, wherein the target layer includes a convolutional layer and a pooling layer, the input matrix contains a calculation identifier, the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes convolution kernel parameters, the pooling calculation identifier includes pooling window parameters, and the value of N is determined by the bit width of the vector processor;
  • the computing unit is configured to use a vector processor to perform horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix, where the spliced output matrix includes N output matrices;
  • the output unit is configured to select the N output matrices from the spliced output matrix, and use the N output matrices as the N input matrices of the next layer of the convolutional neural network.
  • when the target layer is a convolutional layer, the splicing unit is specifically configured to, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is 1, splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix;
  • the splicing unit is specifically configured to, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • the splicing unit is specifically configured to, when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is half the convolution kernel width rounded down;
  • the splicing unit is specifically configured to, when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step.
  • when the target layer is a pooling layer, the splicing unit is specifically configured to, when the input matrix is a non-filled pooling layer matrix and the pooling window width is s, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • the splicing unit is specifically configured to, when the input matrix is a filled pooling layer matrix and the pooling window width is s, splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
  • the output unit is specifically configured to use the N output matrices as N input matrices of the next layer of the convolutional neural network when the matrix width of the output matrix is greater than half the width of the input matrix.
  • the output unit is specifically configured to, when the matrix width of the output matrix is less than or equal to half the width of the input matrix, join every two rows of the output matrix end to end into one row to obtain N second splicing matrices;
  • the output unit is specifically configured to use the N second splicing matrices as N input matrices of the next layer of the convolutional neural network.
  • In the above method, the N input matrices of the target layer of the convolutional neural network are horizontally spliced to obtain a first spliced input matrix, the vector processor then performs horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix, the N output matrices are selected from the spliced output matrix, and the N output matrices are used as the N input matrices of the next layer of the convolutional neural network.
  • In this way, multiple small-sized input matrices are spliced in the horizontal direction to obtain a splicing matrix, which greatly lengthens the vector processor's pipeline, reduces the extra load and store overhead incurred when the processor switches the pipeline to a new row, and further improves the computing performance of the vector processor when processing small-sized data.
  • FIG. 1 is a schematic flowchart of a parallel computing method provided by this application
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided by this application.
  • FIG. 3a is a schematic diagram of a 6×6 convolutional layer input matrix including a convolution calculation identifier provided by this application;
  • FIG. 3b is a schematic diagram of a pooling layer input matrix with a size of 4×4 including a pooling calculation identifier provided by this application;
  • FIG. 4a is a schematic diagram of a splicing matrix of non-filled convolutional layer data provided by the present application;
  • FIG. 4b is a schematic diagram of a splicing matrix of filled convolutional layer data provided by this application;
  • FIG. 5a is a schematic diagram of a splicing matrix of non-filled pooling layer data provided by this application;
  • FIG. 5b is a schematic diagram of a splicing matrix of filled pooling layer data provided by the present application;
  • FIG. 6 is a schematic diagram of the computation flow of a convolution kernel performing sliding convolution across the splicing boundary of a spliced matrix provided by this application;
  • FIG. 7 is a schematic diagram of a second splicing matrix provided by this application.
  • FIG. 8 is a schematic structural diagram of a parallel computing device provided by this application.
  • FIG. 9 is a schematic block diagram of a structure of an electronic device provided by this application.
  • A vector processor system (Vector Processor System, VPS) is a parallel processing computer system oriented to vector-type parallel computing and based mainly on a pipeline structure.
  • It adopts parallel processing structures such as look-ahead control and overlapped operation, arithmetic pipelines, and interleaved parallel memory access, which play an important role in increasing computation speed; in actual operation, however, the potential of parallel processing still cannot be fully exploited.
  • Vector operations are well suited to the structural characteristics of pipelined computers.
  • Combining vector-type parallel computing with a pipeline structure can largely overcome the shortcomings of conventional pipelined computers, such as excessive instruction processing volume, uneven memory access, serious dependency stalls, and poor pipeline flow, and can fully exploit the potential of the parallel processing structure.
  • However, when the vector processor computes each component of a vector, read-write data dependencies arise, which makes the pipeline inefficient; if a multi-functional pipeline is used, pipeline switching must be performed frequently.
  • this application proposes a parallel computing method and device, wherein the specific steps of the method refer to FIG. 1.
  • FIG. 1 is a schematic flowchart of a parallel computing method provided by this application. As shown in FIG. 1, the parallel computing method provided by this application includes the following steps:
  • S101: Splice the N input matrices of the target layer of the convolutional neural network in the horizontal direction to obtain a first spliced input matrix.
  • the target layer includes a convolutional layer and a pooling layer.
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided by the present application.
  • the convolutional neural network The CNN includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer.
  • The CNN structure shown in FIG. 2 has two convolutional layers and two pooling layers; FIG. 2 is only an example, and a convolutional neural network can have more convolutional layers and pooling layers. However, the number of convolutional layers is generally the same as the number of pooling layers.
  • the output matrix of the convolutional layer serves as the input matrix of the pooling layer
  • the output matrix of the pooling layer will serve as the input matrix of the next convolutional layer.
  • the parallel computing method provided by the present application is directed to the processing of the convolutional layer and the pooling layer.
  • The input layer, the output layer, and the fully connected layer may be processed according to the prior art, and so are not described further in this application.
  • The computation of the convolutional layers and pooling layers of a convolutional neural network model accounts for more than 85% of the computation of the entire model. Therefore, although this application proposes a parallel computing method only for the convolutional layer and the pooling layer, it can greatly improve the computing performance of the entire convolutional neural network model.
  • the input matrix includes a calculation identifier, the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes a convolution kernel parameter, and the pooling calculation identifier Includes pooling window parameters.
  • the input matrix may be a convolutional layer input matrix and a pooling layer input matrix.
  • the input matrix may be a pixel matrix obtained after the input picture passes through the convolutional neural network input layer.
  • FIG. 3a, provided by this application, is a 6×6 convolutional layer input matrix containing a convolution calculation identifier; the convolution kernel size of the input matrix is 3×3 and the sliding step size is 1.
  • FIG. 3b is a pooling layer input matrix of size 4×4 including a pooling calculation identifier provided by the present application.
  • The pooling window size of the input matrix is 2×2, and the sliding step size is 1.
  • FIGS. 3a and 3b are for illustration only, and do not constitute a specific limitation.
  • The value of N is determined by the bit width of the vector processor. It should be understood that although the vector processor operates as a pipeline, and the more vector operations and the longer the vectors each pipeline contains the faster it runs, the length of vector operations each pipeline can contain is limited; that is, the data bandwidth of one instruction executed by the vector processor is limited. For example, a 1024-bit vector processor can process 256 bytes at a time, while a convolutional layer input matrix may be a 48×48 matrix. In other words, to process such data, the vector processor requires 48 pipeline row switches, each occupying only 48 bytes.
  • The method provided in this application therefore splices multiple input matrices in the horizontal direction. Continuing the example above, processing five 48×48 input matrices separately requires 240 pipeline row switches, each occupying 48 bytes, whereas the five 48×48 input matrices can be spliced into one 240×48 splicing matrix.
  • To process such a splicing matrix, the processor needs only 48 pipeline row switches, although each switch now occupies 240 bytes of vector memory. Therefore, the splicing method greatly reduces the number of pipeline row switches, which reduces the time the processor spends on loading and storing during pipeline switching and further improves the computing performance of the processor.
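  • As a minimal sketch of the bookkeeping in the example above (an illustration, not part of the patent), the following Python snippet counts the pipeline row switches and the bytes consumed per row before and after splicing; the 1024-bit / 256-byte / 48×48 / N = 5 figures are taken directly from the text.

```python
# Pipeline bookkeeping for the worked example in the text.
VECTOR_BYTES = 256      # bytes one instruction of a 1024-bit vector processor handles
MATRIX_W = 48           # input matrix width (bytes per row)
MATRIX_H = 48           # input matrix height (rows)
N = 5                   # number of matrices spliced side by side

# Without splicing: one pipeline row switch per row of every matrix.
switches_unspliced = N * MATRIX_H          # 5 * 48 = 240 switches
bytes_per_row_unspliced = MATRIX_W         # only 48 of 256 bytes used per switch

# With horizontal splicing: the five matrices form one 240x48 matrix,
# so each row switch now feeds 240 bytes into the vector unit.
spliced_w = N * MATRIX_W                   # 240
switches_spliced = MATRIX_H                # 48 switches
bytes_per_row_spliced = spliced_w          # 240 of 256 bytes used per switch

print(switches_unspliced, switches_spliced)            # 240 48
print(bytes_per_row_unspliced, bytes_per_row_spliced)  # 48 240
```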
  • The convolutional layer is computed by sliding the convolution kernel step by step over the input matrix and performing a convolution operation; each slide yields one result as one element of the output matrix. Therefore, when the sliding step is not 1, directly splicing multiple input matrices in the horizontal direction may cause some sliding computations of individual input matrices to be missed; that is, the splicing boundary of adjacent input matrices may affect the convolution results.
  • To make the convolution results of the splicing matrix exactly equivalent to the convolution results obtained by computing each input matrix individually, this application proposes the following splicing method. When the target layer is a convolutional layer, horizontally splicing the N input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix includes:
  • when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is 1, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix;
  • when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • when the input matrix is a filled convolutional layer matrix, the same row-end filling and end-to-end splicing are applied, with k equal to half the convolution kernel width rounded down when the sliding step is 1, and to the matrix width of the input matrix modulo the convolution kernel sliding step when the sliding step is greater than 1.
  • FIG. 4a is a schematic diagram of a splicing matrix of non-filled convolutional layer data provided by this application.
  • The input matrices are 5×5 non-filled convolutional layer matrices.
  • The size of the convolution kernel is 3×3.
  • When the sliding step is 1, the splicing matrix at this time may be a 15×5 matrix; when the sliding step is greater than 1, the splicing matrix at this time may be a 15×5 matrix.
  • FIG. 4b is a schematic diagram of a splicing matrix of filled convolutional layer data provided by this application.
  • The input matrices are 6×6 filled convolutional layer matrices, and the size of the convolution kernel is 3×3.
  • When the sliding step is 1, the splicing matrix at this time may be a 15×5 matrix; when the sliding step is greater than 1, the splicing matrix at this time may be a 15×5 matrix.
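  • As an illustration only (not the patented implementation), the Python/NumPy sketch below shows one way to derive the fill amount k from the cases described above and to splice matrices horizontally; the helper names conv_pad_amount and splice_horizontally are hypothetical and introduced here purely for clarity.

```python
import numpy as np

def conv_pad_amount(width, kernel_w, stride, filled_layer):
    """Number k of zeros appended to each row before splicing,
    following the four convolutional-layer cases described above."""
    if not filled_layer and stride == 1:
        return 0                      # direct end-to-end splicing, no zeros
    if stride > 1:
        return width % stride         # input matrix width modulo sliding step
    return kernel_w // 2              # filled layer, step 1: half kernel width, rounded down

def splice_horizontally(matrices, k):
    """Append k zero columns to every matrix, then join them end to end."""
    padded = [np.pad(m, ((0, 0), (0, k))) for m in matrices]
    return np.concatenate(padded, axis=1)

# Example matching FIG. 4a: three 5x5 non-filled convolutional-layer inputs,
# 3x3 kernel, sliding step 1 -> k = 0, so the spliced matrix has 15 columns
# (written 15x5, width x height, in the text above).
inputs = [np.arange(25).reshape(5, 5) + i for i in range(3)]
k = conv_pad_amount(width=5, kernel_w=3, stride=1, filled_layer=False)
print(k, splice_horizontally(inputs, k).shape)   # 0 (5, 15)
# With a sliding step of 2, k = 5 % 2 = 1 zero column would be appended per matrix.
```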
  • The pooling layer is computed by sliding the pooling window step by step over the input matrix and performing an average pooling, max pooling, or random pooling operation; each slide yields one result as one element of the output matrix. Therefore, when the sliding step is greater than 1, directly splicing multiple input matrices in the horizontal direction may cause some sliding computations of individual input matrices to be missed; that is, the splicing boundary of adjacent input matrices may affect the pooling results. To further improve the accuracy and reliability of the results, the pooling results of the splicing matrix should be exactly equivalent to the pooling results obtained by computing each input matrix individually.
  • For this purpose, this application proposes the following splicing method. When the target layer is a pooling layer, horizontally splicing the N different input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix includes:
  • when the input matrix is a non-filled pooling layer matrix and the pooling window width is s, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • when the input matrix is a filled pooling layer matrix and the pooling window width is s, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
  • FIG. 5a is a schematic diagram of a splicing matrix of non-filled pooling layer data provided by this application.
  • The input matrices are 5×5 non-filled pooling layer matrices.
  • The pooling window size is 3×3; when the sliding step is 1, the splicing matrix at this time may be a 15×5 matrix.
  • FIG. 5b is a schematic diagram of a splicing matrix of filled pooling layer data provided by the present application.
  • The input matrices are 5×5 filled pooling layer matrices, and the pooling window size is 3×3.
  • When the sliding step is greater than 1, the splicing matrix at this time may be a 15×5 matrix.
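  • A small illustrative sketch of the pooling path, assuming the FIG. 5a configuration (three 5×5 non-filled pooling inputs, 3×3 window, sliding step 1, so no zero filling is needed); the max_pool helper is a plain reference routine introduced here, not the vector-processor kernel. The pooled columns that straddle a splicing boundary correspond to the invalid results that are removed later in step S103.

```python
import numpy as np

def max_pool(mat, win, stride):
    """Plain max pooling with a win x win window (reference only)."""
    rows = (mat.shape[0] - win) // stride + 1
    cols = (mat.shape[1] - win) // stride + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = mat[r*stride:r*stride+win, c*stride:c*stride+win].max()
    return out

# Three 5x5 non-filled pooling-layer inputs spliced into a 15-column matrix.
inputs = [np.random.randint(0, 9, (5, 5)) for _ in range(3)]
spliced = np.concatenate(inputs, axis=1)          # shape (5, 15), i.e. 15x5 in the text
pooled = max_pool(spliced, win=3, stride=1)        # shape (3, 13)

# Columns 0-2 of the pooled result belong to the first input matrix;
# boundary-straddling columns (e.g. 3-4) are invalid and would be filtered out.
assert np.allclose(pooled[:, 0:3], max_pool(inputs[0], win=3, stride=1))
```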
  • S102: The vector processor performs horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix.
  • The spliced output matrix includes N output matrices. It can be understood that, because the input matrices differ, the splicing matrix is spliced in different ways. If the end of each row of the N input matrices is filled with k zeros to obtain N filled matrices and the filled matrices are used for splicing, the resulting output will contain invalid computation results. Therefore, the spliced output matrix needs further processing: the invalid results are removed from the spliced output matrix to obtain the N output matrices.
  • For example, FIG. 6 is a schematic diagram of the computation flow of a convolution kernel performing sliding convolution across the splicing boundary of a splicing matrix provided by this application.
  • The input matrices shown in FIG. 6 are non-filled convolutional layer matrices with a size of 5×5.
  • The convolution kernel size is 3×3 and the sliding step size is 2; therefore, the convolution result of a single input matrix should be a 2×2 output matrix.
  • According to the matrix splicing method provided by this application, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, the end of each row of the N input matrices is filled with k zeros to obtain N filled matrices, and the N filled matrices are spliced end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step.
  • Here k is therefore 1; that is, each row of the input matrix is filled with 1 zero at its end to obtain a filled matrix, and the filled matrices are spliced end to end in the horizontal direction, giving an 11×5 splicing matrix.
  • The convolution result of the 11×5 splicing matrix should be a 5×2 spliced output matrix.
  • The spliced output matrix contains two 2×2 output matrices (the light gray areas in the figure) and one 1×2 invalid matrix (the white area in the figure); therefore, after removing the invalid matrix from the spliced output matrix, the two output matrices are obtained.
  • FIG. 6 clearly shows the process of the convolution kernel sliding across the splicing boundary; as can be seen from FIG. 6, the convolution result of the splicing matrix is exactly equivalent to the convolution result obtained by computing a single input matrix individually. It should be understood that FIG. 6 is only for illustration, and does not constitute a specific limitation.
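  • The equivalence illustrated in FIG. 6 can be checked with a short NumPy sketch (an illustration under the stated assumptions, not the patented vector-processor implementation): two 5×5 matrices, a 3×3 kernel, sliding step 2, and k = 5 mod 2 = 1 zero column at the splicing boundary give an 11-column spliced matrix; columns 0-1 and 3-4 of the spliced result reproduce the per-matrix convolutions, while column 2 is the invalid boundary result removed in step S103. The conv2d helper is a hypothetical reference routine.

```python
import numpy as np

def conv2d(mat, kernel, stride):
    """Direct (valid) 2-D sliding convolution used only as a reference."""
    kh, kw = kernel.shape
    rows = (mat.shape[0] - kh) // stride + 1
    cols = (mat.shape[1] - kw) // stride + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = mat[r*stride:r*stride+kh, c*stride:c*stride+kw]
            out[r, c] = (patch * kernel).sum()
    return out

kernel = np.ones((3, 3))
stride = 2
a = np.random.randint(0, 9, (5, 5))
b = np.random.randint(0, 9, (5, 5))

# k = 5 % 2 = 1 zero column at the splicing boundary -> the 11-wide matrix of FIG. 6
# (no trailing zeros are needed after the last matrix).
zero_col = np.zeros((5, 1), dtype=a.dtype)
spliced = np.concatenate([a, zero_col, b], axis=1)     # shape (5, 11)

spliced_out = conv2d(spliced, kernel, stride)          # shape (2, 5), i.e. 5x2 in the text
# Columns 0-1 belong to `a`, columns 3-4 to `b`; column 2 straddles the boundary.
assert np.allclose(spliced_out[:, 0:2], conv2d(a, kernel, stride))
assert np.allclose(spliced_out[:, 3:5], conv2d(b, kernel, stride))
```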
  • S103: Select the N output matrices from the spliced output matrix, and use the N output matrices as the N input matrices of the next layer of the convolutional neural network.
  • Using the N output matrices as the N input matrices of the next layer of the convolutional neural network includes: when the matrix width of the output matrix is greater than half the width of the input matrix, using the N output matrices as the N input matrices of the next layer of the convolutional neural network.
  • It further includes: when the matrix width of the output matrix is less than or equal to half the width of the input matrix, joining every two rows of the output matrix end to end into one row to obtain N second splicing matrices, and using the N second splicing matrices as the N input matrices of the next layer of the convolutional neural network. In other words, when the width of the output matrix is greater than half the width of the input matrix, the output matrix is used directly as the input matrix of the next layer; otherwise, every two rows are first merged into one.
  • FIG. 7 provides a schematic diagram of a second splicing matrix. It should be understood that FIG. 7 is only used as an example and does not constitute a specific limitation.
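  • A minimal sketch of this output-unit rule (the helper name is hypothetical and not from the patent): if the output matrix is wider than half the input width it is passed through unchanged, otherwise every two consecutive rows are joined end to end into one row to form the second splicing matrix, assuming an even number of rows as in FIG. 7.

```python
import numpy as np

def prepare_next_layer_input(out_mat, in_width):
    """Pass the output through if it is wider than half the input width;
    otherwise merge every two consecutive rows end to end into one row."""
    if out_mat.shape[1] > in_width / 2:
        return out_mat
    rows, cols = out_mat.shape
    return out_mat.reshape(rows // 2, 2 * cols)   # row-major reshape joins row pairs

# A 4x4 output from an 8-wide input: 4 <= 8/2, so rows are merged pairwise
# into a 2x8 second splicing matrix before feeding the next layer.
out = np.arange(16).reshape(4, 4)
print(prepare_next_layer_input(out, in_width=8).shape)   # (2, 8)
```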
  • Since a convolutional neural network has multiple convolutional layers and pooling layers, every output is used as the input of the next layer and the parallel computing method proposed in this application continues to be applied, until the computation of all convolutional layers and pooling layers is finished; the extracted feature data are then fed into the fully connected layer and the output layer to finally obtain the classification result.
  • Compared with an ordinary computing method, the parallel computing method provided in this application performs data splicing for the convolution or pooling computation of every layer, which greatly reduces the number of pipeline row switches of the processor, reduces the time spent on loading and storing during pipeline switching, and further improves the computing performance of the processor.
  • In the above method, the N input matrices of the target layer of the convolutional neural network are horizontally spliced to obtain a first spliced input matrix, the vector processor then performs horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix, the N output matrices are selected from the spliced output matrix, and the N output matrices are used as the N input matrices of the next layer of the convolutional neural network.
  • The above method splices multiple small-sized input matrices into rows to obtain a splicing matrix, thereby greatly lengthening the pipeline processed by the vector processor, reducing the extra load and store overhead incurred when the processor switches the pipeline to a new row, and further improving the computing performance of the vector processor when processing small-sized data.
  • the parallel computing device provided by the present application includes a splicing unit 810, a computing unit 820, and an output unit 830, where,
  • the stitching unit 810 is configured to stitch N input matrices of the target layer of the convolutional neural network in the horizontal direction to obtain a first stitched input matrix.
  • the calculation unit 820 is configured to use a vector processor to perform calculation processing on the first splicing input matrix in the horizontal direction according to the calculation identifier to obtain a splicing output matrix.
  • the output unit 830 is configured to filter the N output matrices from the spliced output matrix, and use the N output matrices as N input matrices of the next layer of the convolutional neural network.
  • the target layer includes a convolutional layer and a pooling layer.
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided by the present application.
  • the convolutional neural network The CNN includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer.
  • The CNN structure shown in FIG. 2 has two convolutional layers and two pooling layers; FIG. 2 is only an example, and a convolutional neural network can have more convolutional layers and pooling layers. However, the number of convolutional layers is generally the same as the number of pooling layers.
  • the output matrix of the convolutional layer serves as the input matrix of the pooling layer
  • the output matrix of the pooling layer will serve as the input matrix of the next convolutional layer.
  • the parallel computing method provided by the present application is directed to the processing of the convolutional layer and the pooling layer.
  • The input layer, the output layer, and the fully connected layer may be processed according to the prior art, and so are not described further in this application.
  • The computation of the convolutional layers and pooling layers of a convolutional neural network model accounts for more than 85% of the computation of the entire model. Therefore, although this application proposes a parallel computing method only for the convolutional layer and the pooling layer, it can greatly improve the computing performance of the entire convolutional neural network model.
  • the input matrix includes a calculation identifier, the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes a convolution kernel parameter, and the pooling calculation identifier Includes pooling window parameters.
  • the input matrix may be a convolutional layer input matrix and a pooling layer input matrix.
  • the input matrix may be a pixel matrix obtained after the input picture passes through the convolutional neural network input layer.
  • FIG. 3a, provided by this application, is a 6×6 convolutional layer input matrix containing a convolution calculation identifier; the convolution kernel size of the input matrix is 3×3 and the sliding step size is 1.
  • FIG. 3b is a pooling layer input matrix of size 4×4 including a pooling calculation identifier provided by the present application.
  • The pooling window size of the input matrix is 2×2, and the sliding step size is 1.
  • FIGS. 3a and 3b are for illustration only, and do not constitute a specific limitation.
  • The value of N is determined by the bit width of the vector processor. It should be understood that although the vector processor operates as a pipeline, and the more vector operations and the longer the vectors each pipeline contains the faster it runs, the length of vector operations each pipeline can contain is limited; that is, the data bandwidth of one instruction executed by the vector processor is limited. For example, a 1024-bit vector processor can process 256 bytes at a time, while a convolutional layer input matrix may be a 48×48 matrix. In other words, to process such data, the vector processor requires 48 pipeline row switches, each occupying only 48 bytes.
  • The method provided in this application therefore splices multiple input matrices in the horizontal direction. Continuing the example above, processing five 48×48 input matrices separately requires 240 pipeline row switches, each occupying 48 bytes, whereas the five 48×48 input matrices can be spliced into one 240×48 splicing matrix.
  • To process such a splicing matrix, the processor needs only 48 pipeline row switches, although each switch now occupies 240 bytes of vector memory. Therefore, the splicing method greatly reduces the number of pipeline row switches, which reduces the time the processor spends on loading and storing during pipeline switching and further improves the computing performance of the processor.
  • The convolutional layer is computed by sliding the convolution kernel step by step over the input matrix and performing a convolution operation; each slide yields one result as one element of the output matrix. Therefore, when the sliding step is not 1, directly splicing multiple input matrices in the horizontal direction may cause some sliding computations of individual input matrices to be missed; that is, the splicing boundary of adjacent input matrices may affect the convolution results.
  • So that the convolution results of the splicing matrix are exactly equivalent to the convolution results obtained by computing a single input matrix individually, the following splicing rules are used.
  • When the target layer is a convolutional layer, the splicing unit 810 is specifically configured to, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is 1, splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix;
  • the splicing unit 810 is specifically configured to, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • the splicing unit 810 is specifically configured to, when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is half the convolution kernel width rounded down;
  • the splicing unit 810 is specifically configured to, when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step.
  • FIG. 4a is a schematic diagram of a splicing matrix of non-filled convolutional layer data provided by this application.
  • The input matrices are 5×5 non-filled convolutional layer matrices.
  • The size of the convolution kernel is 3×3.
  • When the sliding step is 1, the splicing matrix at this time may be a 15×5 matrix.
  • When the sliding step is greater than 1, the splicing matrix at this time may be a 17×5 matrix.
  • FIG. 4b is a schematic diagram, provided by the present application, of a splicing matrix whose input matrices are filled convolutional layer data.
  • The input matrices are 5×5 filled convolutional layer matrices, and the size of the convolution kernel is 3×3.
  • When the sliding step is 1, the splicing matrix at this time may be a 17×5 matrix, and when the sliding step is greater than 1, the splicing matrix at this time may be a 17×5 matrix.
  • The pooling layer is computed by sliding the pooling window step by step over the input matrix and performing an average pooling, max pooling, or random pooling operation; each slide yields one result as one element of the output matrix. Therefore, when the sliding step is greater than 1, directly splicing multiple input matrices in the horizontal direction may cause some sliding computations of individual input matrices to be missed; that is, the splicing boundary of adjacent input matrices may affect the pooling results. To further improve the accuracy and reliability of the results, the pooling results of the splicing matrix should be exactly equivalent to the pooling results obtained by computing each input matrix individually.
  • When the target layer is a pooling layer, the splicing unit 810 is specifically configured to, when the input matrix is a non-filled pooling layer matrix and the pooling window width is s, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
  • the splicing unit 810 is specifically configured to, when the input matrix is a filled pooling layer matrix and the pooling window width is s, splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
  • FIG. 5a is a schematic diagram of a splicing matrix of non-filled pooling layer data provided by this application.
  • The input matrices are 5×5 non-filled pooling layer matrices.
  • The pooling window size is 3×3.
  • When the sliding step is 1, the splicing matrix at this time may be a 15×5 matrix.
  • When the sliding step is greater than 1, the splicing matrix at this time may be a 15×5 matrix.
  • The spliced output matrix includes N output matrices. It can be understood that, because the input matrices differ, the splicing matrix is spliced in different ways. If the end of each row of the N input matrices is filled with k zeros to obtain N filled matrices and the filled matrices are used for splicing, the resulting output will contain invalid computation results. Therefore, the spliced output matrix needs further processing: the invalid results are removed from the spliced output matrix to obtain the N output matrices.
  • FIG. 6 is a schematic diagram of the computation flow of a convolution kernel performing sliding convolution across the splicing boundary of a splicing matrix proposed in this application.
  • The input matrices shown in FIG. 6 are non-filled convolutional layer matrices with a size of 5×5.
  • The convolution kernel size is 3×3 and the sliding step size is 2; therefore, the convolution result of a single input matrix should be a 2×2 output matrix.
  • According to the matrix splicing method provided by this application, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, the end of each row of the N input matrices is filled with k zeros to obtain N filled matrices, and the N filled matrices are spliced end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step.
  • Here k is therefore 1; that is, each row of the input matrix is filled with 1 zero at its end to obtain a filled matrix, and the filled matrices are spliced end to end in the horizontal direction, giving an 11×5 splicing matrix.
  • The convolution result of the 11×5 splicing matrix should be a 5×2 spliced output matrix.
  • The spliced output matrix contains two 2×2 output matrices (the light gray areas in the figure) and one 1×2 invalid matrix (the white area in the figure); therefore, after removing the invalid matrix from the spliced output matrix, the two output matrices are obtained.
  • FIG. 6 clearly shows the process of the convolution kernel sliding across the splicing boundary; as can be seen from FIG. 6, the convolution result of the splicing matrix is exactly equivalent to the convolution result obtained by computing a single input matrix individually. It should be understood that FIG. 6 is only for illustration, and does not constitute a specific limitation.
  • The output unit 830 is specifically configured to, when the matrix width of the output matrix is greater than half the width of the input matrix, use the N output matrices as the N input matrices of the next layer of the convolutional neural network.
  • The output unit 830 is specifically configured to, when the matrix width of the output matrix is less than or equal to half the width of the input matrix, join every two rows of the output matrix end to end into one row to obtain N second splicing matrices; the output unit 830 is specifically configured to use the N second splicing matrices as the N input matrices of the next layer of the convolutional neural network.
  • FIG. 7 provides a schematic diagram of a second splicing matrix. It should be understood that FIG. 7 is only used as an example and does not constitute a specific limitation.
  • Since a convolutional neural network has multiple convolutional layers and pooling layers, every output is used as the input of the next layer and the parallel computing method proposed in this application continues to be applied, until the computation of all convolutional layers and pooling layers is finished; the extracted feature data are then fed into the fully connected layer and the output layer to finally obtain the classification result.
  • Compared with an ordinary computing method, the parallel computing method provided in this application performs data splicing for the convolution or pooling computation of every layer, which greatly reduces the number of pipeline row switches of the processor, reduces the time spent on loading and storing during pipeline switching, and further improves the computing performance of the processor.
  • In the above method, the N input matrices of the target layer of the convolutional neural network are horizontally spliced to obtain a first spliced input matrix, the vector processor then performs horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix, the N output matrices are selected from the spliced output matrix, and the N output matrices are used as the N input matrices of the next layer of the convolutional neural network.
  • The above method splices multiple small-sized input matrices horizontally to obtain a splicing matrix, thereby greatly lengthening the pipeline processed by the vector processor, reducing the extra load and store overhead incurred when the processor switches the pipeline to a new row, and further improving the computing performance of the vector processor when processing small-sized data.
  • the electronic device in this embodiment may include: one or more processors 911, a memory 912, and a communication interface 913.
  • the processor 911, the memory 912 and the communication interface 913 may be connected through a bus 914.
  • The processor 911 includes one or more general-purpose processors, where the general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, a microcontroller, a main processor, a controller, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.
  • the processor 911 is used to execute program instructions stored in the memory 912.
  • The memory 912 may include volatile memory, such as random access memory (Random Access Memory, RAM); the memory may also include non-volatile memory, such as read-only memory (Read-Only Memory, ROM), flash memory, a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); the memory may also include a combination of the above types of memory.
  • The memory 912 may be centralized storage or distributed storage, which is not specifically limited here. It can be understood that the memory 912 is used to store computer programs, such as computer program instructions.
  • the communication interface 913 may be a wired interface (such as an Ethernet interface) or a wireless interface (such as a cellular network interface or using a wireless local area network interface) for communicating with other computer devices or users.
  • The communication interface 913 may adopt a protocol family based on the network communication protocol TCP/IP (Transmission Control Protocol/Internet Protocol), for example the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on.
  • TCP/IP: Transmission Control Protocol/Internet Protocol
  • RFC: Remote Function Call
  • SOAP: Simple Object Access Protocol
  • SNMP: Simple Network Management Protocol
  • CORBA: Common Object Request Broker Architecture
  • the communication interface 913 is a wireless interface
  • cellular communication can be utilized according to the Global System for Mobile (GSM) or Code Division Multiple Access (CDMA) standards, and therefore includes wireless for data transmission Modem, electronic processing equipment, one or more digital memory devices, and dual antennas.
  • GSM Global System for Mobile
  • CDMA Code Division Multiple Access
  • the processor 911, the memory 912, the communication interface 913, and the bus 914 may execute the implementation manner described in any embodiment of the parallel computing method provided by the embodiment of the present application, and details are not described herein again.
  • the disclosed method and device may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • The division of the units is only a division of logical functions, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be indirect couplings or communication connections through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The foregoing storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A parallel computing method and device. The method includes: splicing N input matrices of a target layer of a convolutional neural network in the horizontal direction to obtain a first spliced input matrix (S101); a vector processor performing horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix (S102); and selecting the N output matrices from the spliced output matrix and using the N output matrices as the N input matrices of the next layer of the convolutional neural network (S103). By splicing the input matrices in the horizontal direction, the method reduces the extra load and store overhead incurred when the processor switches the pipeline to a new row, and improves the computing performance of a vector processor when processing small-sized data.

Description

Parallel computing method and device
This application claims priority to the Chinese patent application No. 201811417046.8, entitled "Parallel computing method and device", filed with the Chinese Patent Office on November 26, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computers, and in particular to a parallel computing method and device.
Background
Neural networks are widely used in fields such as pattern recognition, image processing, function approximation, and optimization. Because of their high computational speed, they are receiving increasing attention from both academia and industry. A vector processor is a processor specially designed for highly pipelined operation that can operate efficiently on an entire vector matrix row by row, so the deep learning tasks in current neural networks are mostly processed by vector processors (such as GPUs, vector DSPs, and CPUs with SIMD extended instruction sets).
In convolutional neural network models there are application scenarios, for example in computer vision, in which the input matrices of the model are very small but the model is executed very frequently. When a vector processor handles this case, problems such as wasted vector-register bit width and interrupted computation pipelines arise, which degrade the computing performance of the processor.
Summary of the Application
This application provides a parallel computing method and device that can improve the computing performance of a vector processor when processing small-sized data.
In a first aspect, this application provides a parallel computing method, which includes the following steps:
splicing N input matrices of a target layer of a convolutional neural network in the horizontal direction to obtain a first spliced input matrix, where the target layer includes a convolutional layer and a pooling layer, the input matrix contains a calculation identifier, the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes convolution kernel parameters, the pooling calculation identifier includes pooling window parameters, and the value of N is determined by the bit width of the vector processor;
the vector processor performing horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix, where the spliced output matrix contains N output matrices;
selecting the N output matrices from the spliced output matrix, and using the N output matrices as the N input matrices of the next layer of the convolutional neural network.
Optionally, when the target layer is a convolutional layer, splicing the N input matrices of the target layer of the convolutional neural network in the horizontal direction to obtain the first spliced input matrix includes:
when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is 1, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix;
when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is 1, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is half the convolution kernel width rounded down;
when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step.
Optionally, when the target layer is a pooling layer, splicing the N different input matrices of the target layer of the convolutional neural network in the horizontal direction to obtain the first spliced input matrix includes:
when the input matrix is a non-filled pooling layer matrix and the pooling window width is s, filling the end of each row of the N input matrices with k zeros to obtain N filled matrices, and splicing the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
when the input matrix is a filled pooling layer matrix and the pooling window width is s, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
Optionally, using the N output matrices as the N input matrices of the next layer of the convolutional neural network includes:
when the matrix width of the output matrix is greater than half the width of the input matrix, using the N output matrices as the N input matrices of the next layer of the convolutional neural network.
Optionally, using the N output matrices as the N input matrices of the next layer of the convolutional neural network further includes:
when the matrix width of the output matrix is less than or equal to half the width of the input matrix, joining every two rows of the output matrix end to end into one row to obtain N second splicing matrices;
using the N second splicing matrices as the N input matrices of the next layer of the convolutional neural network.
In a second aspect, this application provides a parallel computing device. The device includes a splicing unit, a computing unit, and an output unit, where
the splicing unit is configured to splice N input matrices of a target layer of a convolutional neural network in the horizontal direction to obtain a first spliced input matrix, where the target layer includes a convolutional layer and a pooling layer, the input matrix contains a calculation identifier, the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes convolution kernel parameters, the pooling calculation identifier includes pooling window parameters, and the value of N is determined by the bit width of the vector processor;
the computing unit is configured to use a vector processor to perform horizontal computation on the first spliced input matrix according to the calculation identifier to obtain a spliced output matrix, where the spliced output matrix contains N output matrices;
the output unit is configured to select the N output matrices from the spliced output matrix and use the N output matrices as the N input matrices of the next layer of the convolutional neural network.
Optionally, when the target layer is a convolutional layer, the splicing unit is specifically configured to, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is 1, splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix;
the splicing unit is specifically configured to, when the input matrix is a non-filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
the splicing unit is specifically configured to, when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is half the convolution kernel width rounded down;
the splicing unit is specifically configured to, when the input matrix is a filled convolutional layer matrix and the convolution kernel sliding step is greater than 1, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step.
Optionally, when the target layer is a pooling layer, the splicing unit is specifically configured to, when the input matrix is a non-filled pooling layer matrix and the pooling window width is s, fill the end of each row of the N input matrices with k zeros to obtain N filled matrices and splice the N filled matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the matrix width of the input matrix modulo the convolution kernel sliding step;
the splicing unit is specifically configured to, when the input matrix is a filled pooling layer matrix and the pooling window width is s, splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
Optionally, the output unit is specifically configured to, when the matrix width of the output matrix is greater than half the width of the input matrix, use the N output matrices as the N input matrices of the next layer of the convolutional neural network.
Optionally, the output unit is specifically configured to, when the matrix width of the output matrix is less than or equal to half the width of the input matrix, join every two rows of the output matrix end to end into one row to obtain N second splicing matrices;
the output unit is specifically configured to use the N second splicing matrices as the N input matrices of the next layer of the convolutional neural network.
In the above method, the N input matrices of the target layer of the convolutional neural network are spliced in the horizontal direction to obtain the first spliced input matrix, the vector processor then performs horizontal computation on the first spliced input matrix according to the calculation identifier to obtain the spliced output matrix, the N output matrices are selected from the spliced output matrix, and the N output matrices are used as the N input matrices of the next layer of the convolutional neural network. With this scheme, multiple small-sized input matrices are spliced in the horizontal direction to obtain a splicing matrix, which greatly lengthens the vector processor's pipeline, reduces the extra load and store overhead incurred when the processor switches the pipeline to a new row, and further improves the computing performance of the vector processor when processing small-sized data.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a parallel computing method provided by this application;
FIG. 2 is a schematic structural diagram of a convolutional neural network provided by this application;
FIG. 3a is a schematic diagram of a 6×6 convolutional layer input matrix containing a convolution calculation identifier provided by this application;
FIG. 3b is a schematic diagram of a 4×4 pooling layer input matrix containing a pooling calculation identifier provided by this application;
FIG. 4a is a schematic diagram of a splicing matrix whose input matrices are non-filled convolutional layer data, provided by this application;
FIG. 4b is a schematic diagram of a splicing matrix whose input matrices are filled convolutional layer data, provided by this application;
FIG. 5a is a schematic diagram of a splicing matrix whose input matrices are non-filled pooling layer data, provided by this application;
FIG. 5b is a schematic diagram of a splicing matrix whose input matrices are filled pooling layer data, provided by this application;
FIG. 6 is a schematic diagram of the computation flow of a convolution kernel performing sliding convolution across the splicing boundary of a splicing matrix, provided by this application;
FIG. 7 is a schematic diagram of a second splicing matrix provided by this application;
FIG. 8 is a schematic structural diagram of a parallel computing device provided by this application;
FIG. 9 is a schematic structural block diagram of an electronic device provided by this application.
Detailed Description
This application is described in further detail below through specific embodiments with reference to the drawings. In the following embodiments, many details are described so that this application can be better understood. However, those skilled in the art will readily recognize that some of these features can be omitted in different cases or replaced by other methods. In some cases, some operations related to this application are not shown or described in the specification, in order to avoid the core of this application being overwhelmed by excessive description. For those skilled in the art, a detailed description of these related operations is not necessary; they can fully understand them from the description in the specification and the general technical knowledge of the field.
To make this application better understood, the vector processor is briefly introduced below.
A vector processor system (Vector Processor System, VPS) is a parallel processing computer system oriented to vector-type parallel computing and based mainly on a pipeline structure. It adopts parallel processing structures such as look-ahead control and overlapped operation, arithmetic pipelines, and interleaved parallel memory access, which play an important role in increasing computation speed, but in actual operation the potential of parallel processing still cannot be fully exploited. Vector operations are well suited to the structural characteristics of pipelined computers. Combining vector-type parallel computing with a pipeline structure can largely overcome the shortcomings of conventional pipelined computers, such as excessive instruction processing volume, uneven memory access, serious dependency stalls, and poor pipeline flow, and can fully exploit the potential of the parallel processing structure, significantly increasing computation speed. However, when a vector processor computes each component of a vector, read-write data dependencies arise, which makes the pipeline inefficient; if a multi-function pipeline is used, pipeline switching must be performed frequently.
In summary, for a vector processing system, only by developing and adopting vector-type parallel algorithms, so that programs contain more vector operations and longer vectors, can the computation speed be increased. Based on this idea, this application proposes a parallel computing method and device; the specific steps of the method are shown in FIG. 1.
图1是本申请提供的一种并行计算方法的流程示意图。如图1所示,本申请提供的并行计算方法包括以下步骤:
S101:将卷积神经网络目标层的N个输入矩阵进行水平方向的拼接,获得第一拼接输入矩阵。
在本申请具体的实施方式中,所述目标层包括卷积层和池化层,例如,图2是本申请提供的一种卷积神经网络的结构示意图,由图2可知,卷积神经网络CNN包括输入层、卷积层、池化层、全连接层以及输出层,可以理解的是,图2显示的CNN结构拥有两个卷积层以及两个池化层,图2仅仅用于举例说明,卷积神经网络可以拥有更多的卷积层以及池化层。但是卷积层的数量一般与池化层的数量相同,也就是说,卷积层的输出矩阵作为池化层的输入矩阵,池化层的输出矩阵将作为下一层卷积层输入矩阵。应理解,本申请提供的并行计算方法是针对于卷积层和池化层进行处理的,输入层、输出层及全连接层的处理方式可以是按照现有技术进行处理,因此本申请不再赘述。但是,卷积神经网络模型的卷积层和池化层计算量约占整个模型计算量的85%以上,因此,本申请虽然只针对卷积层和池化层提出了一种并行 计算方法,但是可以极大地提升整个卷积神经网络模型的计算性能。
In a specific embodiment of the present application, the input matrix contains a calculation identifier, the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes convolution kernel parameters, and the pooling calculation identifier includes pooling window parameters. The input matrix may be a convolutional layer input matrix or a pooling layer input matrix, and may be a pixel matrix obtained after an input picture passes through the input layer of the convolutional neural network. For example, FIG. 3a is a 6×6 convolutional layer input matrix containing a convolution calculation identifier provided by the present application, whose convolution kernel size is 3×3 and sliding stride is 1. FIG. 3b is a 4×4 pooling layer input matrix containing a pooling calculation identifier provided by the present application, whose pooling window size is 2×2 and sliding stride is 1. It should be understood that FIG. 3a and FIG. 3b are merely examples and do not constitute a specific limitation.
In a specific embodiment of the present application, the value of N is determined by the bit width of the vector processor. It should be understood that although the vector processor works in a pipelined manner, and the more vector operations and the longer the vectors contained in each pipeline pass, the higher the operation speed, the vector operation length that each pipeline pass can contain is limited; that is, the data bandwidth of a single instruction executed by the vector processor is limited. For example, a 1024-bit vector processor can process 256 bytes at a time, while a convolutional layer input matrix may be a 48×48 matrix. To process such a piece of data, the vector processor needs to switch the pipeline to a new row 48 times, and each row occupies only 48 bytes. The method provided by the present application therefore splices multiple input matrices in the horizontal direction. Continuing the above example, processing five 48×48 input matrices requires 240 pipeline row switches with 48 bytes occupied each time, whereas after the five 48×48 input matrices are spliced into one 240×48 spliced matrix, the processor needs only 48 pipeline row switches to process the spliced matrix, with each row occupying 240 bytes of the vector memory. The splicing approach thus greatly reduces the number of pipeline row switches, which reduces the time consumed by the load and store steps during pipeline switching and further improves the computing performance of the processor.
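A minimal sketch of this row-switch arithmetic, assuming one pipeline pass per matrix row and reusing the 48×48, N=5 figures from the example above (the helper name is illustrative only, not part of the patent):

```python
# Sketch: count pipeline row switches with and without horizontal splicing.
# Assumes one pipeline pass per matrix row, as in the example above.

def row_switches(height: int, num_matrices: int, spliced: bool) -> int:
    """Number of pipeline row switches needed to process num_matrices
    matrices of the given height."""
    if spliced:
        # After horizontal splicing all matrices share rows: one pass per row.
        return height
    # Unspliced: each matrix is walked row by row on its own.
    return height * num_matrices

n, h = 5, 48
print(row_switches(h, n, spliced=False))  # 240 switches, 48 bytes per row
print(row_switches(h, n, spliced=True))   # 48 switches, 240 bytes per row
```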
In a specific embodiment of the present application, because a convolutional layer is computed by sliding the convolution kernel step by step over the input matrix, each slide produces one calculation result as one element of the output matrix. When the sliding stride is not 1, directly splicing multiple input matrices in the horizontal direction may therefore cause the sliding computation to miss positions on individual input matrices; that is, the splicing boundary between adjacent input matrices may affect the convolution result. To further improve the accuracy and reliability of the processing result, so that the convolution result of the spliced matrix is truly and exactly equivalent to the convolution result obtained by computing each input matrix separately, the present application proposes the following splicing method. When the target layer is a convolutional layer, horizontally splicing the N input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix includes: when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is 1, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix; when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel; when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is 1, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of rounding down half the width of the convolution kernel; when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel.
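A minimal NumPy sketch of the four splicing rules above, for illustration only (the function name, argument names, and the per-rule branching are assumptions made for this sketch; the patent itself specifies no code):

```python
import numpy as np

def splice_conv_inputs(mats, kernel_w, stride, padded_layer):
    """Horizontally splice N convolutional-layer input matrices end to end,
    appending k trailing zeros to each row where the rules above require it."""
    width = mats[0].shape[1]
    if not padded_layer and stride == 1:
        k = 0                      # rule 1: direct end-to-end splicing
    elif padded_layer and stride == 1:
        k = kernel_w // 2          # rule 3: floor of half the kernel width
    else:
        k = width % stride         # rules 2 and 4: matrix width modulo the stride
    padded = [np.pad(m, ((0, 0), (0, k))) for m in mats]
    return np.hstack(padded)

# Three 5x5 non-padded matrices with a 3x3 kernel and stride 1 give a spliced
# matrix of 15 columns (written 15x5 in the description of FIG. 4a).
mats = [np.arange(25).reshape(5, 5) for _ in range(3)]
print(splice_conv_inputs(mats, kernel_w=3, stride=1, padded_layer=False).shape)  # (5, 15)
```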
The following uses FIG. 4a and FIG. 4b as examples to illustrate the data splicing rules when the target layer is a convolutional layer. FIG. 4a is a schematic diagram, provided by the present application, of a spliced matrix whose input matrices are non-padded convolutional layer data. As can be seen from FIG. 4a, the input matrices are 5×5 non-padded convolutional layer matrices and the convolution kernel size is 3×3; when the sliding stride is 1, the spliced matrix may be a 15×5 matrix, and when the sliding stride is greater than 1, the spliced matrix may be a 15×5 matrix. FIG. 4b is a schematic diagram, provided by the present application, of a spliced matrix whose input matrices are padded convolutional layer data. As can be seen from FIG. 4b, the input matrices are 6×6 padded convolutional layer matrices and the convolution kernel size is 3×3; when the sliding stride is 1, the spliced matrix may be a 15×5 matrix, and when the sliding stride is greater than 1, the spliced matrix may be a 15×5 matrix. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
In a specific embodiment of the present application, because a pooling layer is computed by sliding the pooling window step by step over the input matrix to perform average pooling, max pooling, or stochastic pooling, each slide produces one calculation result as one element of the output matrix. When the sliding stride is greater than 1, directly splicing multiple input matrices in the horizontal direction may therefore cause the sliding computation to miss positions on individual input matrices; that is, the splicing boundary between adjacent input matrices may affect the pooling result. To further improve the accuracy and reliability of the processing result, so that the pooling result of the spliced matrix is truly and exactly equivalent to the pooling result obtained by computing each input matrix separately, the present application proposes the following splicing method. When the target layer is a pooling layer, horizontally splicing the N different input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix includes: when the input matrices are non-padded pooling layer matrices and the pooling window width is s, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel; when the input matrices are padded pooling layer matrices and the pooling window width is s, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
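A corresponding sketch for the pooling-layer rule, again illustrative only (names are assumptions, and k is taken as the matrix width modulo the sliding stride, as worded in the paragraph above):

```python
import numpy as np

def splice_pool_inputs(mats, stride, padded_layer):
    """Splice N pooling-layer input matrices horizontally.

    Non-padded pooling matrices get k trailing zeros per row, where k is the
    matrix width modulo the sliding stride; padded ones are joined directly."""
    width = mats[0].shape[1]
    k = 0 if padded_layer else width % stride
    padded = [np.pad(m, ((0, 0), (0, k))) for m in mats]
    return np.hstack(padded)

mats = [np.ones((5, 5)) for _ in range(3)]
print(splice_pool_inputs(mats, stride=2, padded_layer=False).shape)  # (5, 18)
```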
The following uses FIG. 5a and FIG. 5b as examples to illustrate the data splicing rules when the target layer is a pooling layer. FIG. 5a is a schematic diagram, provided by the present application, of a spliced matrix whose input matrices are non-padded pooling layer data. As can be seen from FIG. 5a, the input matrices are 5×5 non-padded pooling layer matrices and the pooling window size is 3×3; when the sliding stride is 1, the spliced matrix may be a 15×5 matrix. FIG. 5b is a schematic diagram, provided by the present application, of a spliced matrix whose input matrices are padded pooling layer data. As can be seen from FIG. 5b, the input matrices are 5×5 padded pooling layer matrices and the pooling window size is 3×3; when the sliding stride is greater than 1, the spliced matrix may be a 15×5 matrix. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
S102: The vector processor performs calculation processing on the first spliced input matrix in the horizontal direction according to the calculation identifier to obtain a spliced output matrix.
In a specific embodiment of the present application, the spliced output matrix contains N output matrices. It can be understood that, because the input matrices differ, the splicing manner of the spliced matrix also differs; if k zeros are appended to the end of each row of the N input matrices to obtain N padded matrices and the padded matrices are used for splicing, the output will contain invalid calculation results. The spliced output matrix therefore needs to be further processed: the invalid calculation results are removed from the spliced output matrix to obtain the N output matrices.
For example, FIG. 6 is a schematic diagram, provided by the present application, of the calculation flow in which the convolution kernel performs sliding convolution across the splicing boundary of the spliced matrix. In FIG. 6 the input matrices are 5×5 non-padded convolutional layer matrices, the convolution kernel size is 3×3, and the sliding stride is 2, so the convolution result of a single input matrix should be a 2×2 output matrix. According to the matrix splicing method provided by the present application, when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, k zeros are appended to the end of each row of the N input matrices to obtain N padded matrices, and the N padded matrices are spliced end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel. In this case k is 1; that is, one zero is appended to the end of each row of the input matrix to obtain the padded matrix, and the padded matrices are spliced end to end in the horizontal direction to obtain an 11×5 spliced matrix. It can be understood that the convolution result of the 11×5 spliced matrix should be a 5×2 spliced output matrix. As can be seen from FIG. 6, the spliced output matrix contains two 2×2 output matrices (the light gray areas in the figure) and one 1×2 invalid matrix (the white area in the figure); after the invalid matrix is removed from the spliced output matrix, the two output matrices are obtained. FIG. 6 clearly shows the process of the convolution kernel sliding across the splicing boundary, and it can be seen from FIG. 6 that the convolution result of the spliced matrix is truly and exactly equivalent to the convolution result obtained by computing each input matrix separately. It should be understood that FIG. 6 is merely an example and does not constitute a specific limitation.
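The boundary behaviour described for FIG. 6 can be reproduced with the following sketch, assuming N=2 input matrices, a plain valid convolution, and the column indexing implied by the stride-2 example above (all names are illustrative and not taken from the patent):

```python
import numpy as np

def conv2d(x, k, stride):
    """Plain valid convolution (no kernel flipping), same stride in both axes."""
    kh, kw = k.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)
    return out

rng = np.random.default_rng(0)
a, b = rng.integers(0, 9, (5, 5)), rng.integers(0, 9, (5, 5))
kernel = np.ones((3, 3))
stride = 2

# Splice: pad k = 5 % 2 = 1 zero column after the first matrix, as in FIG. 6.
spliced = np.hstack([np.pad(a, ((0, 0), (0, 1))), b])   # 5 rows, 11 columns
out = conv2d(spliced, kernel, stride)                    # 2 rows, 5 columns

# Output columns 0-1 belong to the first matrix, column 2 straddles the
# splicing boundary (invalid), and columns 3-4 belong to the second matrix.
assert np.array_equal(out[:, 0:2], conv2d(a, kernel, stride))
assert np.array_equal(out[:, 3:5], conv2d(b, kernel, stride))
```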
S103: Filter the N output matrices out of the spliced output matrix, and use the N output matrices as the N input matrices of the next layer of the convolutional neural network.
In a specific embodiment of the present application, using the N output matrices as the N input matrices of the next layer of the convolutional neural network includes: when the matrix width of the output matrices is greater than half the width of the input matrices, using the N output matrices as the N input matrices of the next layer of the convolutional neural network. Using the N output matrices as the N input matrices of the next layer of the convolutional neural network further includes: when the matrix width of the output matrices is less than or equal to half the width of the input matrices, splicing every two rows of each output matrix end to end into one row to obtain N second spliced matrices, and using the N second spliced matrices as the N input matrices of the next layer of the convolutional neural network. In other words, when the width of the output matrix is greater than half the width of the input matrix, the output matrix is used directly as the input matrix of the next layer of the convolutional neural network; when the width of the output matrix is less than or equal to half the width of the input matrix, every two rows of the output matrix are first spliced end to end into one row, and the result is then used as the input matrix of the next layer. FIG. 7 provides a schematic diagram of a second spliced matrix; it should be understood that FIG. 7 is merely an example and does not constitute a specific limitation. It can be understood that, because a convolutional neural network has multiple convolutional layers and pooling layers, each piece of output data serves as the input data of the next layer and continues to be computed with the parallel computing method proposed by the present application until all convolutional layers and pooling layers have been processed, after which the extracted feature data are fed into the fully connected layer and the output layer to finally obtain the classification result. Compared with an ordinary computing method, the parallel computing method provided by the present application performs data splicing for the convolution or pooling computation of every layer, which greatly reduces the number of pipeline row switches of the processor, reduces the time consumed by the load and store steps during pipeline switching, and further improves the computing performance of the processor.
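A rough sketch of the second splicing step, which merges every two rows of an output matrix end to end into one row (the helper name is an assumption; the sketch simply relies on row-major reshaping):

```python
import numpy as np

def second_splice(out_mat):
    """Merge every two rows of an output matrix end to end into one row,
    producing the second spliced matrix used as the next layer's input."""
    rows, cols = out_mat.shape
    assert rows % 2 == 0, "assumes an even number of rows, as in FIG. 7"
    return out_mat.reshape(rows // 2, 2 * cols)

out = np.arange(16).reshape(4, 4)   # a 4x4 output matrix
print(second_splice(out))           # 2 rows of 8: rows 0+1 and rows 2+3 joined
```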
In the above method, the N input matrices of the target layer of the convolutional neural network are horizontally spliced to obtain the first spliced input matrix; the vector processor then performs calculation processing on the first spliced input matrix in the horizontal direction according to the calculation identifier to obtain the spliced output matrix; the N output matrices are filtered out of the spliced output matrix and used as the N input matrices of the next layer of the convolutional neural network. The above method splices multiple small input matrices row by row into one spliced matrix, which greatly lengthens the computation pipeline of the vector processor, reduces the extra load and store overhead incurred each time the processor switches the pipeline to a new row, and further improves the computing performance of the vector processor on small-sized data.
FIG. 8 is a schematic structural diagram of a parallel computing device provided by the present application. As can be seen from FIG. 8, the parallel computing device provided by the present application includes a splicing unit 810, a calculation unit 820, and an output unit 830, wherein
the splicing unit 810 is configured to horizontally splice the N input matrices of the target layer of the convolutional neural network to obtain a first spliced input matrix;
the calculation unit 820 is configured to use the vector processor to perform calculation processing on the first spliced input matrix in the horizontal direction according to the calculation identifier to obtain a spliced output matrix;
the output unit 830 is configured to filter the N output matrices out of the spliced output matrix and use the N output matrices as the N input matrices of the next layer of the convolutional neural network.
In a specific embodiment of the present application, the target layer includes a convolutional layer and a pooling layer. For example, FIG. 2 is a schematic structural diagram of a convolutional neural network provided by the present application. As can be seen from FIG. 2, the convolutional neural network (CNN) includes an input layer, convolutional layers, pooling layers, a fully connected layer, and an output layer. It can be understood that the CNN structure shown in FIG. 2 has two convolutional layers and two pooling layers; FIG. 2 is merely an example, and a convolutional neural network may have more convolutional layers and pooling layers. The number of convolutional layers is generally the same as the number of pooling layers; that is, the output matrix of a convolutional layer serves as the input matrix of a pooling layer, and the output matrix of the pooling layer serves as the input matrix of the next convolutional layer. It should be understood that the parallel computing method provided by the present application targets the convolutional layers and pooling layers; the input layer, output layer, and fully connected layer may be processed according to the prior art and are therefore not described again in the present application. Nevertheless, the convolutional layers and pooling layers of a convolutional neural network model account for roughly 85% or more of the computation of the entire model, so although the present application proposes a parallel computing method only for the convolutional layers and pooling layers, it can greatly improve the computing performance of the entire convolutional neural network model.
In a specific embodiment of the present application, the input matrix contains a calculation identifier, the calculation identifier includes a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier includes convolution kernel parameters, and the pooling calculation identifier includes pooling window parameters. The input matrix may be a convolutional layer input matrix or a pooling layer input matrix, and may be a pixel matrix obtained after an input picture passes through the input layer of the convolutional neural network. For example, FIG. 3a is a 6×6 convolutional layer input matrix containing a convolution calculation identifier provided by the present application, whose convolution kernel size is 3×3 and sliding stride is 1. FIG. 3b is a 4×4 pooling layer input matrix containing a pooling calculation identifier provided by the present application, whose pooling window size is 2×2 and sliding stride is 1. It should be understood that FIG. 3a and FIG. 3b are merely examples and do not constitute a specific limitation.
In a specific embodiment of the present application, the value of N is determined by the bit width of the vector processor. It should be understood that although the vector processor works in a pipelined manner, and the more vector operations and the longer the vectors contained in each pipeline pass, the higher the operation speed, the vector operation length that each pipeline pass can contain is limited; that is, the data bandwidth of a single instruction executed by the vector processor is limited. For example, a 1024-bit vector processor can process 256 bytes at a time, while a convolutional layer input matrix may be a 48×48 matrix. To process such a piece of data, the vector processor needs to switch the pipeline to a new row 48 times, and each row occupies only 48 bytes. The method provided by the present application therefore splices multiple input matrices in the horizontal direction. Continuing the above example, processing five 48×48 input matrices requires 240 pipeline row switches with 48 bytes occupied each time, whereas after the five 48×48 input matrices are spliced into one 240×48 spliced matrix, the processor needs only 48 pipeline row switches to process the spliced matrix, with each row occupying 240 bytes of the vector memory. The splicing approach thus greatly reduces the number of pipeline row switches, which reduces the time consumed by the load and store steps during pipeline switching and further improves the computing performance of the processor.
In a specific embodiment of the present application, because a convolutional layer is computed by sliding the convolution kernel step by step over the input matrix, each slide produces one calculation result as one element of the output matrix. When the sliding stride is not 1, directly splicing multiple input matrices in the horizontal direction may therefore cause the sliding computation to miss positions on individual input matrices; that is, the splicing boundary between adjacent input matrices may affect the convolution result. To further improve the accuracy and reliability of the processing result, so that the convolution result of the spliced matrix is truly and exactly equivalent to the convolution result obtained by computing each input matrix separately, when the target layer is a convolutional layer, the splicing unit 810 is specifically configured to splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is 1; the splicing unit 810 is specifically configured to, when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, append k zeros to the end of each row of the N input matrices to obtain N padded matrices and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel; the splicing unit 810 is specifically configured to, when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is 1, append k zeros to the end of each row of the N input matrices to obtain N padded matrices and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of rounding down half the width of the convolution kernel; the splicing unit 810 is specifically configured to, when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, append k zeros to the end of each row of the N input matrices to obtain N padded matrices and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel.
The following uses FIG. 4a and FIG. 4b as examples to illustrate the data splicing rules when the target layer is a convolutional layer. FIG. 4a is a schematic diagram, provided by the present application, of a spliced matrix whose input matrices are non-padded convolutional layer data. As can be seen from FIG. 4a, the input matrices are 5×5 non-padded convolutional layer matrices and the convolution kernel size is 3×3; when the sliding stride is 1, the spliced matrix may be a 15×5 matrix, and when the sliding stride is greater than 1, the spliced matrix may be a 17×5 matrix. FIG. 4b is a schematic diagram, provided by the present application, of a spliced matrix whose input matrices are padded convolutional layer data. As can be seen from FIG. 4b, the input matrices are 5×5 padded convolutional layer matrices and the convolution kernel size is 3×3; when the sliding stride is 1, the spliced matrix may be a 17×5 matrix, and when the sliding stride is greater than 1, the spliced matrix may be a 17×5 matrix. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
In a specific embodiment of the present application, because a pooling layer is computed by sliding the pooling window step by step over the input matrix to perform average pooling, max pooling, or stochastic pooling, each slide produces one calculation result as one element of the output matrix. When the sliding stride is greater than 1, directly splicing multiple input matrices in the horizontal direction may therefore cause the sliding computation to miss positions on individual input matrices; that is, the splicing boundary between adjacent input matrices may affect the pooling result. To further improve the accuracy and reliability of the processing result, so that the pooling result of the spliced matrix is truly and exactly equivalent to the pooling result obtained by computing each input matrix separately, when the target layer is a pooling layer, the splicing unit 810 is specifically configured to, when the input matrices are non-padded pooling layer matrices and the pooling window width is s, append k zeros to the end of each row of the N input matrices to obtain N padded matrices and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel; the splicing unit 810 is specifically configured to splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix when the input matrices are padded pooling layer matrices and the pooling window width is s.
The following uses FIG. 5a and FIG. 5b as examples to illustrate the data splicing rules when the target layer is a pooling layer. FIG. 5a is a schematic diagram, provided by the present application, of a spliced matrix whose input matrices are non-padded pooling layer data. As can be seen from FIG. 5a, the input matrices are 5×5 non-padded pooling layer matrices and the pooling window size is 3×3; when the sliding stride is 1, the spliced matrix may be a 15×5 matrix, and when the sliding stride is greater than 1, the spliced matrix may be a 15×5 matrix. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
In a specific embodiment of the present application, the spliced output matrix contains N output matrices. It can be understood that, because the input matrices differ, the splicing manner of the spliced matrix also differs; if k zeros are appended to the end of each row of the N input matrices to obtain N padded matrices and the padded matrices are used for splicing, the output will contain invalid calculation results. The spliced output matrix therefore needs to be further processed: the invalid calculation results are removed from the spliced output matrix to obtain the N output matrices.
For example, FIG. 6 is a schematic diagram, proposed by the present application, of the calculation flow in which the convolution kernel performs sliding convolution across the splicing boundary of the spliced matrix. In FIG. 6 the input matrices are 5×5 non-padded convolutional layer matrices, the convolution kernel size is 3×3, and the sliding stride is 2, so the convolution result of a single input matrix should be a 2×2 output matrix. According to the matrix splicing method provided by the present application, when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, k zeros are appended to the end of each row of the N input matrices to obtain N padded matrices, and the N padded matrices are spliced end to end in the horizontal direction to obtain the first spliced input matrix, where k is the result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel. In this case k is 1; that is, one zero is appended to the end of each row of the input matrix to obtain the padded matrix, and the padded matrices are spliced end to end in the horizontal direction to obtain an 11×5 spliced matrix. It can be understood that the convolution result of the 11×5 spliced matrix should be a 5×2 spliced output matrix. As can be seen from FIG. 6, the spliced output matrix contains two 2×2 output matrices (the light gray areas in the figure) and one 1×2 invalid matrix (the white area in the figure); after the invalid matrix is removed from the spliced output matrix, the two output matrices are obtained. FIG. 6 clearly shows the process of the convolution kernel sliding across the splicing boundary, and it can be seen from FIG. 6 that the convolution result of the spliced matrix is truly and exactly equivalent to the convolution result obtained by computing each input matrix separately. It should be understood that FIG. 6 is merely an example and does not constitute a specific limitation.
In a specific embodiment of the present application, the output unit 830 is specifically configured to use the N output matrices as the N input matrices of the next layer of the convolutional neural network when the matrix width of the output matrices is greater than half the width of the input matrices. The output unit 830 is specifically configured to, when the matrix width of the output matrices is less than or equal to half the width of the input matrices, splice every two rows of each output matrix end to end and merge them into one row to obtain N second spliced matrices; the output unit 830 is specifically configured to use the N second spliced matrices as the N input matrices of the next layer of the convolutional neural network. In other words, when the width of the output matrix is greater than half the width of the input matrix, the output matrix is used directly as the input matrix of the next layer of the convolutional neural network; when the width of the output matrix is less than or equal to half the width of the input matrix, every two rows of the output matrix are first spliced end to end into one row, and the result is then used as the input matrix of the next layer. FIG. 7 provides a schematic diagram of a second spliced matrix; it should be understood that FIG. 7 is merely an example and does not constitute a specific limitation. It can be understood that, because a convolutional neural network has multiple convolutional layers and pooling layers, each piece of output data serves as the input data of the next layer and continues to be computed with the parallel computing method proposed by the present application until all convolutional layers and pooling layers have been processed, after which the extracted feature data are fed into the fully connected layer and the output layer to finally obtain the classification result. Compared with an ordinary computing method, the parallel computing method provided by the present application performs data splicing for the convolution or pooling computation of every layer, which greatly reduces the number of pipeline row switches of the processor, reduces the time consumed by the load and store steps during pipeline switching, and further improves the computing performance of the processor.
In the above solution, the N input matrices of the target layer of the convolutional neural network are horizontally spliced to obtain the first spliced input matrix; the vector processor then performs calculation processing on the first spliced input matrix in the horizontal direction according to the calculation identifier to obtain the spliced output matrix; the N output matrices are filtered out of the spliced output matrix and used as the N input matrices of the next layer of the convolutional neural network. This scheme splices multiple small input matrices horizontally into one spliced matrix, which greatly lengthens the computation pipeline of the vector processor, reduces the extra load and store overhead incurred each time the processor switches the pipeline to a new row, and further improves the computing performance of the vector processor on small-sized data.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of an electronic device provided by the present application. As shown in the figure, the electronic device in this embodiment may include one or more processors 911, a memory 912, and a communication interface 913, where the processor 911, the memory 912, and the communication interface 913 may be connected to one another through a bus 914.
The processor 911 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a microprocessor, a microcontroller, a host processor, a controller, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The processor 911 is configured to execute the program instructions stored in the memory 912.
The memory 912 may include a volatile memory, such as a random access memory (Random Access Memory, RAM); the memory may also include a non-volatile memory, such as a read-only memory (Read-Only Memory, ROM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); the memory may also include a combination of the above types of memory. The memory 912 may use centralized storage or distributed storage, which is not specifically limited here. It can be understood that the memory 912 is configured to store a computer program, for example, computer program instructions. In this embodiment of the present application, the memory 912 may provide instructions and data to the processor 911.
The communication interface 913 may be a wired interface (for example, an Ethernet interface) or a wireless interface (for example, a cellular network interface or a wireless local area network interface), and is configured to communicate with other computer devices or users. When the communication interface 913 is a wired interface, it may use a protocol family on top of TCP/IP (Transmission Control Protocol/Internet Protocol), for example, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and the like. When the communication interface 913 is a wireless interface, cellular communication may be used according to the Global System for Mobile Communication (GSM) or Code Division Multiple Access (CDMA) standard; the interface therefore includes a wireless modem for data transmission, an electronic processing device, one or more digital memory devices, and dual antennas.
In this embodiment of the present application, the processor 911, the memory 912, the communication interface 913, and the bus 914 may execute the implementations described in any embodiment of the parallel computing method provided in the embodiments of the present application, and details are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed method and device may be implemented in other manners. For example, the device embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may physically exist separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The foregoing is merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

  1. A parallel computing method, characterized by comprising:
    horizontally splicing N input matrices of a target layer of a convolutional neural network to obtain a first spliced input matrix, wherein the target layer comprises a convolutional layer and a pooling layer, the input matrix contains a calculation identifier, the calculation identifier comprises a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier comprises convolution kernel parameters, the pooling calculation identifier comprises pooling window parameters, and a value of N is determined by a bit width of a vector processor;
    performing, by the vector processor, calculation processing on the first spliced input matrix in the horizontal direction according to the calculation identifier to obtain a spliced output matrix, wherein the spliced output matrix contains N output matrices; and
    filtering the N output matrices out of the spliced output matrix, and using the N output matrices as N input matrices of a next layer of the convolutional neural network.
  2. The method according to claim 1, characterized in that, when the target layer is a convolutional layer, the horizontally splicing the N input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix comprises:
    when the input matrices are non-padded convolutional layer matrices and a sliding stride of a convolution kernel is 1, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix;
    when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of a matrix width of the input matrix modulo the sliding stride of the convolution kernel;
    when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is 1, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of rounding down half the width of the convolution kernel; and
    when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel.
  3. The method according to claim 2, characterized in that, when the target layer is a pooling layer, the horizontally splicing the N different input matrices of the target layer of the convolutional neural network to obtain the first spliced input matrix comprises:
    when the input matrices are non-padded pooling layer matrices and a pooling window width is s, appending k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splicing the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel; and
    when the input matrices are padded pooling layer matrices and the pooling window width is s, splicing the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix.
  4. The method according to claim 3, characterized in that using the N output matrices as the N input matrices of the next layer of the convolutional neural network comprises:
    when the matrix width of the output matrices is greater than half the width of the input matrices, using the N output matrices as the N input matrices of the next layer of the convolutional neural network.
  5. The method according to claim 4, characterized in that using the N output matrices as the N input matrices of the next layer of the convolutional neural network further comprises:
    when the matrix width of the output matrices is less than or equal to half the width of the input matrices, splicing every two rows of each output matrix end to end into one row to obtain N second spliced matrices; and
    using the N second spliced matrices as the N input matrices of the next layer of the convolutional neural network.
  6. A parallel computing device, characterized in that the device comprises a splicing unit, a calculation unit, and an output unit, wherein
    the splicing unit is configured to horizontally splice N input matrices of a target layer of a convolutional neural network to obtain a first spliced input matrix, wherein the target layer comprises a convolutional layer and a pooling layer, the input matrix contains a calculation identifier, the calculation identifier comprises a convolution calculation identifier and a pooling calculation identifier, the convolution calculation identifier comprises convolution kernel parameters, the pooling calculation identifier comprises pooling window parameters, and a value of N is determined by a bit width of a vector processor;
    the calculation unit is configured to use the vector processor to perform calculation processing on the first spliced input matrix in the horizontal direction according to the calculation identifier to obtain a spliced output matrix, wherein the spliced output matrix contains N output matrices; and
    the output unit is configured to filter the N output matrices out of the spliced output matrix and use the N output matrices as N input matrices of a next layer of the convolutional neural network.
  7. The device according to claim 6, characterized in that, when the target layer is a convolutional layer, the splicing unit is specifically configured to splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix when the input matrices are non-padded convolutional layer matrices and a sliding stride of a convolution kernel is 1;
    the splicing unit is specifically configured to, when the input matrices are non-padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, append k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of a matrix width of the input matrix modulo the sliding stride of the convolution kernel;
    the splicing unit is specifically configured to, when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is 1, append k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of rounding down half the width of the convolution kernel; and
    the splicing unit is specifically configured to, when the input matrices are padded convolutional layer matrices and the sliding stride of the convolution kernel is greater than 1, append k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel.
  8. The device according to claim 7, characterized in that, when the target layer is a pooling layer, the splicing unit is specifically configured to, when the input matrices are non-padded pooling layer matrices and a pooling window width is s, append k zeros to the end of each row of the N input matrices to obtain N padded matrices, and splice the N padded matrices end to end in the horizontal direction to obtain the first spliced input matrix, wherein k is a result of the matrix width of the input matrix modulo the sliding stride of the convolution kernel; and
    the splicing unit is specifically configured to splice the N input matrices end to end in the horizontal direction to obtain the first spliced input matrix when the input matrices are padded pooling layer matrices and the pooling window width is s.
  9. The device according to claim 8, characterized in that the output unit is specifically configured to use the N output matrices as the N input matrices of the next layer of the convolutional neural network when the matrix width of the output matrices is greater than half the width of the input matrices.
  10. The device according to claim 9, characterized in that the output unit is specifically configured to, when the matrix width of the output matrices is less than or equal to half the width of the input matrices, splice every two rows of each output matrix end to end into one row to obtain N second spliced matrices; and
    the output unit is specifically configured to use the N second spliced matrices as the N input matrices of the next layer of the convolutional neural network.
PCT/CN2018/124831 2018-11-26 2018-12-28 Parallel computing method and device WO2020107616A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811417046.8A CN111222624B (zh) 2018-11-26 2018-11-26 Parallel computing method and device
CN201811417046.8 2018-11-26

Publications (1)

Publication Number Publication Date
WO2020107616A1 true WO2020107616A1 (zh) 2020-06-04

Family

ID=70830288

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124831 WO2020107616A1 (zh) 2018-11-26 2018-12-28 一种并行计算方法及装置

Country Status (2)

Country Link
CN (1) CN111222624B (zh)
WO (1) WO2020107616A1 (zh)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6959300B1 (en) * 1998-12-10 2005-10-25 At&T Corp. Data compression method and apparatus
CN104915322B (zh) * 2015-06-09 2018-05-01 National University of Defense Technology Convolutional neural network hardware acceleration method
CN106782602B (zh) * 2016-12-01 2020-03-17 Nanjing University of Posts and Telecommunications Speech emotion recognition method based on deep neural network
CN107292256B (zh) * 2017-06-14 2019-12-24 Xidian University Facial expression recognition method based on auxiliary-task deep convolutional wavelet neural network
CN108280514B (zh) * 2018-01-05 2020-10-16 University of Science and Technology of China FPGA-based sparse neural network acceleration system and design method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156807A (zh) * 2015-04-02 2016-11-23 Huazhong University of Science and Technology Training method and device for convolutional neural network model
US20180182083A1 (en) * 2016-12-27 2018-06-28 Intel IP Corporation Convolutional neural network for wide-angle camera images
CN107368886A (zh) * 2017-02-23 2017-11-21 奥瞳系统科技有限公司 Neural network system based on reusing small-scale convolutional neural network modules
CN108572850A (zh) * 2017-03-09 2018-09-25 Google LLC Vector processing unit
CN108334910A (zh) * 2018-03-30 2018-07-27 国信优易数据有限公司 Event detection model training method and event detection method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919405A (zh) * 2020-07-07 2022-01-11 Huawei Technologies Co., Ltd. Data processing method and apparatus, and related device
CN113919405B (zh) * 2020-07-07 2024-01-19 Huawei Technologies Co., Ltd. Data processing method and apparatus, and related device
CN113312285A (zh) * 2021-06-11 2021-08-27 Xi'an Microelectronics Technology Institute Convolutional neural network accelerator and working method thereof
CN113312285B (zh) * 2021-06-11 2023-08-18 Xi'an Microelectronics Technology Institute Convolutional neural network accelerator and working method thereof

Also Published As

Publication number Publication date
CN111222624B (zh) 2022-04-29
CN111222624A (zh) 2020-06-02


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 18941740; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 18941740; Country of ref document: EP; Kind code of ref document: A1