WO2023065983A1 - Computing apparatus, neural network processing device, chip and data processing method - Google Patents

Computing apparatus, neural network processing device, chip and data processing method

Info

Publication number
WO2023065983A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
data
multiply
blocks
accumulators
Application number
PCT/CN2022/121442
Other languages
English (en)
Chinese (zh)
Inventor
孙炜
祝叶华
Original Assignee
Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2023065983A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present application relate to the field of data processing, and more specifically, relate to a computing device, a neural network processing device, a chip, and a method for processing data.
  • the essence of many data processing tasks is matrix multiplication.
  • the essence of convolution operation tasks and fully connected operation tasks is matrix multiplication.
  • the traditional matrix multiplication method is based on the vector inner product.
  • the vector-inner-product method repeatedly schedules the same element data many times, resulting in a large volume of data to be transmitted during the matrix multiplication operation and high transmission power consumption.
  • the present application provides a computing device, a neural network processing device, a chip and a method for processing data, so as to reduce the amount of data to be transmitted and the transmission power consumption required for matrix multiplication operations.
  • a computing device, including: a data memory configured to store matrix data of a first matrix and a second matrix, where the first matrix is an m×s matrix and the second matrix is an s×n matrix; the matrix data includes s groups of data, where each group of data includes a column vector in the first matrix and the row vector in the second matrix with the same index as that column vector, and s, m and n are all positive integers greater than or equal to 1;
  • a calculation module, connected to the data memory and used to perform the multiplication operation of the first matrix and the second matrix;
  • a scheduling module, connected to the data memory and used to control the data memory to input the s groups of data to the calculation module; the calculation module performs a vector outer product operation based on the column vector and the row vector in each of the s groups of data, to obtain s m×n intermediate result matrices corresponding to the s groups of data one-to-one, and adds the corresponding elements in the s m×n intermediate result matrices to obtain the product result of the first matrix and the second matrix.
  • a neural network processing device includes a computing device, and the computing device includes: a data memory for storing matrix data of a first matrix and a second matrix, where the first matrix is an m×s matrix, the second matrix is an s×n matrix, and the matrix data includes s groups of data, each group including a column vector in the first matrix and the row vector in the second matrix with the same index as that column vector, where s, m, and n are all positive integers greater than or equal to 1; a calculation module, connected to the data memory and used to perform the multiplication operation of the first matrix and the second matrix; and a scheduling module, connected to the data memory and used to control the data memory to input the s groups of data to the calculation module; the calculation module performs a vector outer product operation based on the column vector and row vector in each of the s groups of data to obtain s m×n intermediate result matrices corresponding to the s groups of data one-to-one, and adds the corresponding elements to obtain the product result of the first matrix and the second matrix.
  • in a third aspect, a chip includes a computing device, and the computing device includes: a data memory for storing matrix data of a first matrix and a second matrix, where the first matrix is an m×s matrix, the second matrix is an s×n matrix, and the matrix data includes s groups of data, each group including a column vector in the first matrix and the row vector in the second matrix with the same index as that column vector, where s, m, and n are all positive integers greater than or equal to 1; a calculation module, connected to the data memory and used to perform the multiplication operation of the first matrix and the second matrix; and a scheduling module, connected to the data memory and used to control the data memory to input the s groups of data to the calculation module; the calculation module performs a vector outer product operation based on the column vector and row vector in each of the s groups of data to obtain s m×n intermediate result matrices corresponding to the s groups of data one-to-one, and adds the corresponding elements to obtain the product result of the first matrix and the second matrix.
  • in a fourth aspect, an electronic device includes a computing device, and the computing device includes: a data memory for storing matrix data of a first matrix and a second matrix, where the first matrix is an m×s matrix, the second matrix is an s×n matrix, and the matrix data includes s groups of data, each group including a column vector in the first matrix and the row vector in the second matrix with the same index as that column vector, where s, m, and n are all positive integers greater than or equal to 1; a calculation module, connected to the data memory and used to perform the multiplication operation of the first matrix and the second matrix; and a scheduling module, connected to the data memory and used to control the data memory to input the s groups of data to the calculation module; the calculation module performs a vector outer product operation based on the column vector and row vector in each of the s groups of data to obtain s m×n intermediate result matrices corresponding to the s groups of data one-to-one, and adds the corresponding elements to obtain the product result of the first matrix and the second matrix.
  • a method for processing data is applied to a computing device, and the computing device includes: a data memory for storing matrix data of a first matrix and a second matrix, where the first matrix is an m×s matrix, the second matrix is an s×n matrix, and the matrix data includes s groups of data, each group including a column vector in the first matrix and the row vector in the second matrix with the same index as that column vector, where s, m, and n are all positive integers greater than or equal to 1; and a calculation module, connected to the data memory and used to perform the multiplication of the first matrix and the second matrix; the method includes: controlling the data memory to input the s groups of data to the calculation module; performing a vector outer product operation based on the column vector and row vector in each of the s groups of data to obtain s m×n intermediate result matrices corresponding to the s groups of data one-to-one; and adding the corresponding elements in the s m×n intermediate result matrices to obtain the product result of the first matrix and the second matrix.
  • a computer-readable storage medium on which codes for executing the method described in the fifth aspect are stored.
  • a computer program product including a plurality of instructions, and the instructions implement the method as described in the fifth aspect when executed by a computing device.
  • Figure 1 is an example diagram of matrix multiplication based on vector inner product.
  • FIG. 2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • FIG. 3 is an example diagram of a hardware architecture of a computing device provided by an embodiment of the present application.
  • Fig. 4 is an example diagram of matrix multiplication based on vector outer product provided by the embodiment of the present application.
  • FIG. 5 is an example diagram of a data mapping manner when the method in FIG. 4 is applied to the hardware architecture shown in FIG. 3 .
  • Fig. 6 is an example diagram of convolution operation.
  • FIG. 7 is an example diagram of performing the convolution operation shown in FIG. 6 by using matrix multiplication based on vector outer product provided by the embodiment of the present application.
  • FIG. 8 is an example diagram of performing a fully connected operation using matrix multiplication based on vector outer product provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a device provided by an embodiment of the present application.
  • Fig. 10 is a schematic flowchart of a method for processing data provided by an embodiment of the present application.
  • the inner product operation of two vectors can also be called the dot product operation of two vectors, that is, the elements at the corresponding positions of the two vectors are multiplied one by one and then summed.
  • the result of the outer product operation of two vectors is not a scalar, but a matrix (or two-dimensional vector).
  • the multiplication operation of the matrix can be converted into the vector inner product operation of the row and column vectors of the matrix.
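The distinction between the two vector operations can be sketched in plain Python; the vector names u and v are illustrative and do not appear in the application text:

```python
# Inner vs. outer product of two vectors, using plain Python lists.

def inner_product(u, v):
    # Multiply elements at corresponding positions, then sum: the result is a scalar.
    return sum(a * b for a, b in zip(u, v))

def outer_product(u, v):
    # Every element of u is multiplied by every element of v: the result is a
    # len(u) x len(v) matrix rather than a scalar.
    return [[a * b for b in v] for a in u]

u = [1, 2, 3]
v = [4, 5, 6]
print(inner_product(u, v))   # 1*4 + 2*5 + 3*6 = 32
print(outer_product(u, v))   # [[4, 5, 6], [8, 10, 12], [12, 15, 18]]
```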
  • FIG. 1 illustrates matrix multiplication, taking the m×s first matrix 12 and the s×n second matrix 14 as examples.
  • the first matrix 12 includes m row vectors
  • the second matrix 14 includes n column vectors.
  • the resulting matrix 16 obtained by multiplying the first matrix 12 and the second matrix 14 is an m×n matrix.
  • the multiplication operation of the first matrix 12 and the second matrix 14 can be regarded as a vector inner product operation between the m row vectors of the first matrix and the n column vectors of the second matrix.
  • Each vector inner product operation can obtain the value of an element 161 in the result matrix 16 .
  • the values of all elements in the result matrix 16 can be obtained.
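The inner-product method described above can be sketched as follows; function and variable names are illustrative:

```python
# Matrix multiplication via vector inner products, as in Figure 1: each element
# of the m x n result matrix is the dot product of one row of the first matrix
# with one column of the second matrix (m * n inner products in total).

def matmul_inner(A, B):
    m, s = len(A), len(A[0])
    n = len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):          # row of the first matrix
        for j in range(n):      # column of the second matrix
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(s))
    return C

A = [[1, 2], [3, 4]]            # 2 x 2 first matrix
B = [[5, 6], [7, 8]]            # 2 x 2 second matrix
print(matmul_inner(A, B))       # [[19, 22], [43, 50]]
```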
  • a matrix can be divided into multiple partitions.
  • a multiplication operation between two matrices can be decomposed into multiplication and addition operations between blocks of the two matrices. For example, suppose a matrix A is partitioned into four blocks A11, A12, A21 and A22, and a matrix B is likewise partitioned into four blocks B11, B12, B21 and B22; then each block of the product C = AB is obtained by block-wise multiply-and-add, e.g. C11 = A11·B11 + A12·B21.
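A hypothetical numerical check of this block decomposition, sketched in Python with illustrative names (a 4×4 product assembled from 2×2 blocks):

```python
# 2 x 2 block-wise matrix multiplication: the top-left block of the product of
# two 4 x 4 matrices is assembled from multiply-and-add operations on 2 x 2
# blocks, matching the block decomposition described above.

def matmul(A, B):
    # Plain triple-loop matrix multiply, used for both full and block products.
    m, s, n = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(s)) for j in range(n)]
            for i in range(m)]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block(M, r, c, size=2):
    # Extract the size x size block whose top-left corner is (r, c).
    return [row[c:c + size] for row in M[r:r + size]]

A = [[i * 4 + j for j in range(4)] for i in range(4)]
B = [[(i + j) % 5 for j in range(4)] for i in range(4)]

# C11 = A11*B11 + A12*B21; it equals the top-left block of the full product.
C11 = add(matmul(block(A, 0, 0), block(B, 0, 0)),
          matmul(block(A, 0, 2), block(B, 2, 0)))
assert C11 == block(matmul(A, B), 0, 0)
```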
  • the computing device mentioned in this application may refer to any type of computing device based on hardware architecture and capable of performing the multiplication operation of two matrices.
  • the computing device 20 may include a data storage 22 , a computing module 24 and a scheduling module 26 .
  • the data memory 22 may be used to store matrix data for the first matrix and the second matrix.
  • for the first matrix, m represents the number of rows and s represents the number of columns.
  • m and s are usually positive integers greater than 1.
  • the embodiment of the present application is not limited thereto.
  • one of m and s can also be equal to 1, in this case, the first matrix can be understood as a vector.
  • for the second matrix, s represents the number of rows and n represents the number of columns.
  • s and n are usually positive integers greater than 1.
  • the embodiment of the present application is not limited thereto.
  • one of s and n can also be equal to 1, in this case, the second matrix can be understood as a vector.
  • the embodiment of the present application does not specifically limit the data content of the matrix data of the first matrix and the second matrix, which is related to the calculation tasks performed by the calculation device 20 .
  • the convolution operation is essentially a matrix multiplication
  • the convolution operation is illustrated below in conjunction with the accompanying figures.
  • the data in the first matrix can be the data of the input feature map
  • the data in the second matrix can be the weight data in the convolution kernel.
  • the data memory 22 may include, for example, a high-speed random access memory (RAM).
  • the data storage 22 may be, for example, a data cache of a neural-network processing unit (NPU).
  • the calculation module 24 may also be called a processing element (PE) or an operation circuit.
  • the computing module 24 can be connected to the data storage 22 .
  • the computing module 24 may be responsible for completing some or all of the computing tasks on the computing device 20 .
  • calculation module 24 may be configured to perform a multiplication operation of the first matrix and the second matrix.
  • the calculation module 24 may be implemented in various ways. Since the matrix multiplication operation is essentially a multiply-accumulate operation, in some embodiments, the computing module 24 may be a multiply-accumulator-based computing module.
  • FIG. 3 shows a possible implementation of the calculation module 24 .
  • the basic unit in the calculation module 24 is a multiply-accumulator 241 .
  • the multiply accumulator 241 may include a multiplier 2411 and an adder 2412 .
  • the two inputs of the multiplier 2411 are one element of the first matrix and one element of the second matrix, respectively.
  • the data in the first matrix is the data in the input feature map
  • the data in the second matrix is the weight data in the convolution kernel.
  • the inputs of the multiplier 2411 can be denoted by F and W. Among them, F represents the feature data in the input feature map; W represents the weight data in the convolution kernel.
  • the calculation module 24 may include m×n multiply-accumulate trees 242 (a column of multiply-accumulators in FIG. 3 corresponds to one multiply-accumulate tree 242).
  • the multiply-accumulate tree 242 may include s multiply-accumulators 241 .
  • the s multiply-accumulators 241 are connected end-to-end through a data line 2413 to form a data operation pipeline.
  • the pipeline can accumulate the calculation results of the s multiply-accumulators in the multiply-accumulate tree 242 step by step.
  • the multiply-accumulate tree 242 can use its internal s multiply-accumulators to perform s multiplication operations; starting from the first-stage multiply-accumulator 241a, the multiplication results are passed down stage by stage to the last-stage multiply-accumulator 241s, and the multiplication results of all stages are accumulated during the transfer.
  • the output of the last-stage multiply-accumulator 241s is therefore the accumulated sum of the s multiplication operations performed by the s multiply-accumulators in the multiply-accumulate tree 242.
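The pipeline behavior of one multiply-accumulate tree can be sketched in software; this is a behavioral model only, not the hardware itself, and the class and variable names are illustrative:

```python
# Software sketch of one multiply-accumulate tree from Figure 3: s chained
# multiply-accumulators, each multiplying a feature value F by a weight W and
# adding the product to the partial sum handed down from the previous stage.

class MultiplyAccumulator:
    def step(self, f, w, partial_sum):
        # multiplier 2411 computes f * w; adder 2412 adds the incoming partial sum
        return partial_sum + f * w

def mac_tree(features, weights):
    # Pipeline the s stages: the output of the last stage is the accumulated
    # sum of all s multiplications, i.e. the inner product of the two vectors.
    acc = 0
    stage = MultiplyAccumulator()
    for f, w in zip(features, weights):
        acc = stage.step(f, w, acc)
    return acc

print(mac_tree([1, 2, 3], [4, 5, 6]))  # 32, the same as the vector inner product
```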
  • the scheduling module 26 may be connected to the data store 22 .
  • the scheduling module 26 may be, for example, a controller with a data scheduling function.
  • the scheduling module 26 can control the data memory 22 to input the matrix data of the first matrix and the second matrix to be multiplied to the calculation module 24, so that the calculation module 24 can be used to complete the matrix operation of the first matrix and the second matrix.
  • the scheduling module 26 usually schedules matrix data according to a certain data scheduling strategy.
  • the scheduling module 26 may input a row vector of the first matrix and a column vector of the second matrix to the calculation module 24 according to the matrix multiplication operation method shown in FIG. 1 .
  • after obtaining a row vector of the first matrix and a column vector of the second matrix, the calculation module 24 performs a multiply-accumulate operation according to the vector inner product, thereby obtaining the value of one element in the result matrix; see the related description of Figure 1 for the specific calculation method.
  • taking the computing module 24 shown in Fig. 3 as an example, the scheduling module 26 can read a row vector of the first matrix (comprising s elements) and a column vector of the second matrix (comprising s elements) into the s multiply-accumulators of one multiply-accumulate tree 242 in Fig. 3. The multiply-accumulate tree 242 then performs s multiplications and accumulates the results to obtain the value of one element in the result matrix.
  • the above data scheduling and calculation method is referred to as matrix multiplication based on vector inner product.
  • through one vector inner product operation, the calculation module 24 can obtain the value of one element in the result matrix, and m×n vector inner product operations are performed to obtain the values of all m×n elements in the result matrix.
  • calculating the value of each element requires inputting to the calculation module 24 the data corresponding to 2s elements (a row of the first matrix and a column of the second matrix, 2s elements in total).
  • since each row vector of the first matrix needs to be multiplied by the n column vectors of the second matrix, each row vector of the first matrix is repeatedly scheduled n times; similarly, since each column vector of the second matrix needs to be multiplied by the m row vectors of the first matrix, each column vector of the second matrix is repeatedly scheduled m times.
  • Such large-scale repeated scheduling of data will inevitably increase the amount of data to be transmitted and the transmission power consumption.
  • the embodiment of the present application provides a matrix multiplication based on vector outer product, which can reduce the amount of data to be transmitted and the transmission power consumption during the matrix multiplication operation.
  • the embodiment of this application proposes a matrix multiplication operation based on the vector outer product. Compared with the traditional matrix multiplication based on the vector inner product, it has a higher data reuse rate and can therefore reduce the data transmission volume and transmission power consumption.
  • the matrix data of the first matrix and the second matrix may include s sets of data.
  • Each set of data in the s sets of data may include a column vector in the first matrix and a row vector in the second matrix having the same index as the column vector.
  • the i-th set of data in the s sets of data may include the i-th column vector of the first matrix and the i-th row vector of the second matrix.
  • the scheduling module 26 can control the data storage 22 to input s sets of data to the calculation module 24 .
  • the scheduling module 26 may sequentially input s sets of data to the calculation module 24 .
  • the scheduling module 26 may also input multiple sets of data in the s sets of data at one time.
  • the calculation module 24 can perform a vector outer product operation based on the column vector and the row vector in each of the s groups of data, to obtain s m×n intermediate result matrices corresponding to the s groups of data one-to-one (matrix 162 in FIG. 4 represents one m×n intermediate result matrix), and add the corresponding elements in the s m×n intermediate result matrices to obtain the product result of the first matrix and the second matrix.
  • the calculation module 24 can start from the 0th column of the first matrix 12 and the 0th row of the second matrix 14, and perform a vector outer product based on the column vector 0 of the first matrix 12 and the row vector 0 of the second matrix .
  • next, column vector 1 of the first matrix 12 can be taken along the direction of the arrow in the first matrix 12 in Figure 4, and row vector 1 of the second matrix 14 along the direction of the arrow in the second matrix 14 in Figure 4; a vector outer product is then performed based on column vector 1 of the first matrix 12 and row vector 1 of the second matrix 14.
  • after the calculation module 24 completes s vector outer product operations in this way, it obtains the product result of the first matrix and the second matrix.
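The outer-product scheduling described for Figure 4 can be sketched as follows; names are illustrative, and the result matches the inner-product method:

```python
# Matrix multiplication via vector outer products, following Figure 4: for each
# index i, take column i of the first matrix and row i of the second matrix,
# form their m x n outer product, and add it into the running result matrix.

def matmul_outer(A, B):
    m, s = len(A), len(A[0])
    n = len(B[0])
    C = [[0] * n for _ in range(m)]          # accumulates the s intermediate matrices
    for i in range(s):
        col = [A[r][i] for r in range(m)]    # i-th column vector of the first matrix
        row = B[i]                           # i-th row vector of the second matrix
        for r in range(m):                   # outer product, accumulated in place
            for c in range(n):
                C[r][c] += col[r] * row[c]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer(A, B))  # [[19, 22], [43, 50]], same as the inner-product method
```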
  • the above-mentioned scheduling and calculation method of matrix data is referred to as matrix multiplication based on vector outer product.
  • in matrix multiplication based on the vector outer product, performing one vector outer product operation requires a column vector of the first matrix and a row vector of the second matrix, (m+n) elements in total.
  • the matrix multiplication based on the vector outer product needs to perform the above vector outer product operation s times, so a total of s×(m+n) elements are required.
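Under the simple counting model used in the text (the inner-product method feeds 2s elements to the calculation module for each of the m·n result elements, while the outer-product method feeds (m+n) elements for each of its s operations), the two transfer volumes can be compared directly; the function names and the 16/16/16 sizes are illustrative:

```python
# Element-transfer counts for the two scheduling strategies.

def inner_product_traffic(m, s, n):
    return m * n * 2 * s       # 2s elements per result element, m*n result elements

def outer_product_traffic(m, s, n):
    return s * (m + n)         # (m + n) elements per outer product, s products

m, s, n = 16, 16, 16
print(inner_product_traffic(m, s, n))  # 8192
print(outer_product_traffic(m, s, n))  # 512
```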
  • a possible implementation of the calculation module 24 was given above in conjunction with FIG. 3.
  • the hardware architecture shown in Figure 3 is suitable for both matrix multiplication based on vector inner product and matrix multiplication based on vector outer product. In other words, matrix multiplication based on vector outer product and matrix multiplication based on vector inner product can share the same hardware architecture.
  • the calculation module 24 includes m×n multiply-accumulate trees, and each multiply-accumulate tree includes s multiply-accumulators connected end to end in sequence (corresponding to a column of multiply-accumulators in Fig. 3 or Fig. 5).
  • the multiply-accumulators of the m×n multiply-accumulate trees can be divided into s groups, each group including m×n multiply-accumulators that belong to the m×n multiply-accumulate trees respectively. That is to say, any two multiply-accumulators in each of the s groups come from different multiply-accumulate trees among the m×n multiply-accumulate trees.
  • the multiply-accumulators 243 in each row in FIG. 5 form one group, so the s rows of multiply-accumulators in FIG. 5 form the above s groups of multiply-accumulators.
  • the calculation module 24 can perform the above-mentioned s vector outer product operations based on the s groups of multiply-accumulators, so as to obtain the s m×n intermediate result matrices.
  • the calculation module 24 can control the i-th row of multiply-accumulators among the s rows to perform the vector outer product operation between the i-th column vector of the first matrix and the i-th row vector of the second matrix, to obtain the s m×n intermediate result matrices.
  • the s m×n intermediate result matrices are respectively stored in the s×m×n multiply-accumulators shown in FIG. 5.
  • the calculation module 24 can use the data processing pipelines provided by the m×n multiply-accumulate trees to add the corresponding elements in the s m×n intermediate result matrices, obtaining the product result of the first matrix and the second matrix.
  • the calculation module 24 can start from the first-stage multiply-accumulator of each multiply-accumulate tree, accumulate the multiplication results within each tree, and output the final multiply-accumulate result at the last-stage multiply-accumulator of each tree.
  • the m×n final results output by the m×n multiply-accumulate trees are the same as the m×n final results obtained by matrix multiplication based on the vector inner product.
  • the size of the two matrices to be multiplied may be relatively large, and, limited by hardware processing capability, the computing device 20 may not be able to complete the multiplication of the two matrices at one time. In this situation, the third matrix and/or the fourth matrix can be processed in blocks according to the matrix blocking method introduced above. For example, the third matrix can be divided into A blocks (each of size less than or equal to m×s), and the fourth matrix can be divided into B blocks (each of size less than or equal to s×n), so that the computing device 20 can support the matrix multiplication between the A blocks and the B blocks.
  • the calculation device 20 can multiply the A blocks and the B blocks pairwise according to the matrix multiplication based on the vector outer product introduced above, and obtain the product result of the third matrix and the fourth matrix from the pairwise products of the A blocks and the B blocks.
  • the computing device provided by the embodiments of the present application can be applied to various scenarios or computing tasks requiring matrix multiplication operations.
  • neural networks have been widely used in various industries.
  • the essence of many operators in neural networks is matrix multiplication.
  • convolution operations and fully connected operations that are often used in neural networks are essentially matrix multiplication. Therefore, in some embodiments, the computing device 20 provided by the embodiment of the present application may be used to perform neural network operations, such as convolution operations or full connection operations.
  • the computing device 20 may also be called a convolution device, a convolution processor, a convolution accelerator, a convolution acceleration engine, and the like.
  • Convolution operations are widely used in neural network computing.
  • the essence of the convolution operation is the multiplication of two matrices; accordingly, like matrix multiplication, its core operation is the multiply-accumulate operation.
  • Figure 6 shows an example of a convolution operation, taking a 1 ⁇ 1 convolution kernel as an example.
  • the upper left corner of Figure 6 shows the input feature map.
  • M represents the width of the input feature map
  • S represents the number of channels of the input feature map (the channels can be R, G, B, for example).
  • the lower left corner of FIG. 6 shows a convolution kernel, and the data in the convolution kernel may be called weight data.
  • N represents the number of convolution kernels.
  • the number of channels of the convolution kernel is equal to the number of channels of the input feature map, both being S. Shown on the right side of FIG. 6 is the output feature map. For convolution operations, the number of channels of the output feature map is equal to the number of convolution kernels.
  • the calculation amount of the convolution operation is usually relatively large. Therefore, when the above-mentioned computing device 20 is used to perform the convolution operation, the hardware processing capability of the computing device needs to be considered.
  • s, m, and n in FIG. 6 are the number of input channels, the feature map width, and the number of convolution kernels that the computing device 20 can process at one time. Therefore, in some embodiments, the input feature map and the weight data in the convolution kernels can be arranged as two matrices, forming the third matrix and the fourth matrix whose multiplication is to be performed as shown in FIG. 7.
  • the size of the third matrix is M ⁇ S
  • the size of the fourth matrix is S ⁇ N.
  • the third matrix and/or the fourth matrix may be processed in blocks according to the matrix block method introduced above.
  • the third matrix can be divided into A blocks (each of size less than or equal to m×s), and the fourth matrix can be divided into B blocks (each of size less than or equal to s×n), so that the computing device 20 can support the matrix multiplication between the A blocks and the B blocks.
  • the calculation device 20 can multiply the A blocks and the B blocks pairwise according to the vector-outer-product-based scheduling method introduced above, and obtain the product result of the third matrix and the fourth matrix from the pairwise products (that is, the corresponding elements in the products of the A blocks and the B blocks are accumulated to obtain the product result of the third matrix and the fourth matrix).
  • the result of the product of the third matrix and the fourth matrix can be used as the data in the output feature map.
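For a 1×1 kernel, the convolution of Figures 6 and 7 reduces to exactly this matrix product: an M×S feature matrix (pixel positions × input channels) times an S×N weight matrix (channels × kernels). A minimal sketch, with illustrative names and a tiny 2-pixel, 3-channel, 2-kernel example:

```python
# A 1x1 convolution expressed as matrix multiplication: each of M pixel
# positions holds S channel values (the third matrix, M x S), and each of N
# 1x1 kernels holds S weights (the fourth matrix, S x N). The output feature
# map value at (pixel p, kernel k) is an inner product over the S channels,
# i.e. one element of the M x N matrix product.

def conv1x1_as_matmul(features, kernels):
    # features: M x S (pixels x input channels); kernels: S x N (channels x kernels)
    M, S = len(features), len(features[0])
    N = len(kernels[0])
    return [[sum(features[p][c] * kernels[c][k] for c in range(S))
             for k in range(N)] for p in range(M)]

features = [[1, 2, 3],         # pixel 0, channels (e.g. R, G, B)
            [4, 5, 6]]         # pixel 1
kernels = [[1, 0],             # channel 0 weights for kernels 0 and 1
           [0, 1],
           [1, 1]]
print(conv1x1_as_matmul(features, kernels))  # [[4, 5], [10, 11]]
```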
  • FIG. 6 uses a 1×1 convolution kernel as an example to illustrate the convolution operation provided by the embodiment of the present application
  • the embodiment of the present application is not limited thereto.
  • convolution kernels of other sizes, for example 3×3, 5×5, or 7×7, whose core is also matrix multiplication, can likewise be converted to matrix operations based on vector outer products to reduce the data transmission volume and transmission power consumption.
  • applying the calculation device provided by the embodiments of the present application to the convolution operation can greatly reduce the amount of data that needs to be transmitted during the convolution operation, thereby reducing data transmission power consumption and the required memory line width.
  • the fully connected operation is a special case of matrix multiplication.
  • taking the matrix multiplication of the m×s first matrix and the s×n second matrix mentioned above as an example (FIG. 8), if the value of m is set to 1, the multiplication of the first matrix and the second matrix can be regarded as a fully connected operation.
  • the computing device 20 mentioned in the embodiment of the present application may include various other types of components in addition to the above-mentioned data store 22, computing module 24 and scheduling module 26.
  • the computing device 20 may also include one or more of the following components: computing modules (or computing circuits) capable of performing other operations, such as an activation circuit for performing an activation operation or a pooling circuit for performing a pooling operation, as well as registers, internal memory, and the like.
  • the embodiment of the present application further provides a device 90 .
  • the device 90 may comprise the computing means 20 described above.
  • the device 90 may be a neural network processing device and/or chip.
  • the device 90 may be, for example, a mobile terminal (such as a mobile phone), a computer, a server, and the like.
  • Fig. 10 is a schematic flowchart of a method for processing data provided by an embodiment of the present application.
  • the method of FIG. 10 is applicable to computing devices.
  • the computing device may be the aforementioned computing device 20 .
  • the method in FIG. 10 includes steps S1010-S1030.
  • step S1010: the data store is controlled to input the s sets of data to the calculation module.
  • step S1020: a vector outer product operation is performed on the column vector and row vector in each of the s sets of data to obtain s m×n intermediate result matrices corresponding one-to-one to the s sets of data.
  • step S1030: the corresponding elements of the s m×n intermediate result matrices are added together to obtain the product of the first matrix and the second matrix.
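The steps above can be sketched as follows (illustrative NumPy code; the name `matmul_outer_product` is ours, and the scheduling and storage details of computing device 20 are omitted):

```python
import numpy as np

def matmul_outer_product(A, B):
    """Steps S1010-S1030 as a sum of s vector outer products.

    Each of the s sets of data pairs column k of A (m x s) with row k of
    B (s x n); their outer product is one m x n intermediate result
    matrix, and the s intermediate results are accumulated element-wise.
    """
    m, s = A.shape
    s2, n = B.shape
    assert s == s2, "inner dimensions must match"
    C = np.zeros((m, n))
    for k in range(s):
        # S1010/S1020: one set of data -> one m x n intermediate result
        C += np.outer(A[:, k], B[k, :])
    # S1030: C now holds the element-wise accumulated product
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(matmul_outer_product(A, B), A @ B)
```

Because each set of data (one column of A and one row of B) is read once and contributes to the whole m×n output, the data reuse rate is high, which is the source of the reduced transmission volume.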
  • the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical functional division; in actual implementation there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be read by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)) or a semiconductor medium (for example, a solid state disk (SSD)), etc.


Abstract

A computing apparatus, a neural network processing device, a chip and a data processing method are provided. The computing apparatus comprises: a data store for storing matrix data of a first matrix and a second matrix, the first matrix being an m×s matrix, the second matrix being an s×n matrix, and the matrix data comprising s sets of data, each set of data comprising a column vector of the first matrix and the row vector of the second matrix having the same index as that column vector; a computing module connected to the data store and used to perform the multiplication operation of the first matrix and the second matrix; and a scheduling module connected to the data store and used to control the data store to input the s sets of data into the computing module, the computing module performing the vector outer product operation s times on the basis of the s sets of data so as to obtain the product of the first matrix and the second matrix. The vector-outer-product-based matrix multiplication operation provided by the embodiments of the present application has a high data reuse rate, so that the data transmission volume and transmission power consumption can be reduced.
PCT/CN2022/121442 2021-10-19 2022-09-26 Computing apparatus, neural network processing device, chip and data processing method WO2023065983A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111218718.4 2021-10-19
CN202111218718.4A CN113918120A (zh) 2021-10-19 2021-10-19 Computing device, neural network processing device, chip and data processing method

Publications (1)

Publication Number Publication Date
WO2023065983A1 true WO2023065983A1 (fr) 2023-04-27

Family

ID=79241552

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121442 WO2023065983A1 (fr) 2021-10-19 2022-09-26 Computing apparatus, neural network processing device, chip and data processing method

Country Status (2)

Country Link
CN (1) CN113918120A (fr)
WO (1) WO2023065983A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918120A (zh) * 2021-10-19 2022-01-11 Oppo广东移动通信有限公司 计算装置、神经网络处理设备、芯片及处理数据的方法
CN116795432B (zh) * 2023-08-18 2023-12-05 腾讯科技(深圳)有限公司 运算指令的执行方法、装置、电路、处理器及设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213962A (zh) * 2017-07-07 2019-01-15 华为技术有限公司 Operation accelerator
CN109992743A (zh) * 2017-12-29 2019-07-09 华为技术有限公司 Matrix multiplier
CN110770697A (zh) * 2018-09-25 2020-02-07 深圳市大疆创新科技有限公司 Data processing apparatus and method
US20210124560A1 (en) * 2019-10-25 2021-04-29 Arm Limited Matrix Multiplication System, Apparatus and Method
CN113110822A (zh) * 2021-04-20 2021-07-13 安徽芯纪元科技有限公司 Configurable matrix multiplication apparatus and algorithm
CN113918120A (zh) * 2021-10-19 2022-01-11 Oppo广东移动通信有限公司 Computing device, neural network processing device, chip and data processing method


Also Published As

Publication number Publication date
CN113918120A (zh) 2022-01-11

Similar Documents

Publication Publication Date Title
KR102443546B1 (ko) Matrix multiplier
CN112214726B (zh) Operation accelerator
US10942986B2 (en) Hardware implementation of convolutional layer of deep neural network
US10459876B2 (en) Performing concurrent operations in a processing element
US20210117810A1 (en) On-chip code breakpoint debugging method, on-chip processor, and chip breakpoint debugging system
WO2023065983A1 (fr) Computing apparatus, neural network processing device, chip and data processing method
US9886377B2 (en) Pipelined convolutional operations for processing clusters
CN109522052B (zh) Computing device and board card
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
WO2019205617A1 (fr) Calculation method and apparatus for matrix multiplication
TW202123093A (zh) System and method for performing convolution operations
WO2019084788A1 (fr) Computing apparatus, circuit and related method for neural network
WO2021232422A1 (fr) Neural network arithmetic device and control method therefor
CN112765540A (zh) Data processing method and apparatus, and related products
CN116090518A (zh) Feature map processing method and apparatus based on a systolic operation array, and storage medium
KR102372869B1 (ko) Matrix operator and matrix operation method for artificial neural network
GB2582868A (en) Hardware implementation of convolution layer of deep neural network
CN110929854B (zh) Data processing method and apparatus, and hardware accelerator
CN112784206A (zh) Winograd convolution operation method, apparatus, device and storage medium
Kong et al. A high efficient architecture for convolution neural network accelerator
EP4174644A1 (fr) Computing apparatus, integrated circuit chip, circuit board, electronic device and computing method
TWI798591B (zh) Convolutional neural network operation method and apparatus
JP7368512B2 (ja) Computing device, integrated circuit chip, board card, electronic device and computing method
CN112214727B (zh) Operation accelerator
CN116757913A (zh) Matrix data processing method and apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882594

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE