CN113569193A - Matrix vector operation method, equipment and storage medium for neural network model - Google Patents

Matrix vector operation method, equipment and storage medium for neural network model

Info

Publication number
CN113569193A
Authority
CN
China
Prior art keywords
matrix
vector
data
register
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110096963.6A
Other languages
Chinese (zh)
Inventor
王天舟
周城
李珽光
李世迪
张冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110096963.6A
Publication of CN113569193A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a matrix vector operation method, device, and storage medium for a neural network model. The method includes: acquiring a first matrix, where each matrix data in the first matrix is integer data, the first matrix includes at least two matrix areas, and each matrix area contains at least two matrix data located in the same matrix register; acquiring a first vector; obtaining, based on each matrix area, a vector data subset corresponding to that matrix area; obtaining an operation result corresponding to each matrix area based on the matrix area and its corresponding vector data subset; and obtaining the operation result of the first matrix and the first vector based on the operation results corresponding to the matrix areas. Through this scheme, in vector matrix operations related to a neural network model, parallel operation between multiple matrix data in a single matrix register and the vector data can be realized, improving the operation speed between the matrix and the vector.

Description

Matrix vector operation method, equipment and storage medium for neural network model
Technical Field
The present application relates to the field of computers, and in particular, to a matrix vector operation method, device, and storage medium for a neural network model.
Background
In various computer-based algorithms, there are a large number of matrix-by-vector calculation operations.
In the related art, in order to ensure the accuracy of a neural network model, a computer usually stores the model's matrix data as floating point data. During a calculation operation of the neural network model, matrix data and the corresponding vector data are extracted one by one, operated on, and accumulated into corresponding registers; after the calculation operation of the matrix and the vector is completed, the values stored in the registers corresponding to the matrix are the results of the matrix vector operation.
In this technical scheme, the matrix data are extracted one by one and operated on with the corresponding vector data, so the calculation rate is low.
Disclosure of Invention
The embodiments of the application provide a matrix vector operation method, apparatus, device, and storage medium for a neural network model, which can realize parallel operation of a matrix and a vector and improve the operation speed between the matrix and the vector. The technical scheme is as follows:
in one aspect, a matrix vector operation method for a neural network model is provided, the method including:
acquiring a first matrix; each matrix data in the first matrix is integer data; the first matrix comprises at least two matrix areas; each matrix area comprises at least two matrix data positioned on the same matrix register; each matrix data in the first matrix is a weight parameter in a neural network model;
acquiring a first vector; each vector data in the first vector is integer data;
based on each matrix area, obtaining a vector data subset corresponding to each matrix area; the subset of vector data is a subset of the first vector;
acquiring an operation result corresponding to each matrix area based on each matrix area and the vector data subset corresponding to each matrix area;
and acquiring the operation results of the first matrix and the first vector based on the operation results corresponding to the matrix areas.
In yet another aspect, there is provided a matrix vector operation apparatus for a neural network model, the apparatus including:
the first matrix acquisition module is used for acquiring a first matrix; each matrix data in the first matrix is integer data; the first matrix comprises at least two matrix regions; each matrix area comprises at least two matrix data positioned on the same matrix register; each matrix data in the first matrix is a weight parameter in a neural network model;
the first vector acquisition module is used for acquiring a first vector; each vector data in the first vector is integer data;
a vector data subset obtaining module, configured to obtain, based on each matrix region, a vector data subset corresponding to each matrix region; the subset of vector data is a subset of the first vector;
a region operation result obtaining module, configured to obtain an operation result corresponding to each matrix region based on each matrix region and a vector data subset corresponding to each matrix region;
and the operation result acquisition module is used for acquiring operation results of the first matrix and the first vector based on the operation results corresponding to the matrix areas.
In a possible implementation manner, the regional operation result obtaining module includes:
the first traversal submodule is used for traversing the matrix data of the first matrix area and storing the matrix data of the first matrix area into a first matrix register according to a traversal sequence; the first matrix region is any one of the respective matrix regions; the first matrix register is a matrix register corresponding to the first matrix region;
a first vector subset obtaining submodule, configured to obtain a first vector data subset corresponding to the first matrix region;
a first operation set obtaining sub-module, configured to copy the first vector data subset based on the matrix data of the first matrix region to obtain the first vector operation set, and store the first vector operation set in the first vector register; the first vector register is a vector register corresponding to the first matrix region;
and the first result obtaining submodule is used for obtaining an operation result corresponding to the first matrix area based on the first matrix register and the first vector register.
In one possible implementation manner, the first result obtaining sub-module is further configured to,
reading matrix data in the first matrix register and a first vector operation set in the first vector register;
and acquiring an operation result corresponding to the first matrix area based on the matrix data in the first matrix register and the first vector operation set in the first vector register.
In one possible implementation manner, the first matrix obtaining module includes:
the second matrix acquisition submodule is used for acquiring a second matrix; each matrix data in the second matrix is integer data;
a second partition determining submodule for determining a partition size of the second matrix based on the number of bits of the matrix register;
the second matrix partitioning submodule is used for partitioning the second matrix based on the partition size to obtain each second area of the second matrix and the arrangement sequence of each second area;
the second region traversal submodule is used for respectively performing line-by-line traversal on each second region and respectively storing the matrix data of each second region to the matrix register corresponding to each second region according to the line-by-line traversal sequence;
and the first matrix acquisition submodule is used for acquiring the first matrix based on the matrix register corresponding to each second area and the arrangement sequence of each second area.
In one possible implementation manner, the second partition determining sub-module includes:
a matrix data bit number obtaining unit, configured to obtain a data bit number of each piece of matrix data in the second matrix;
and the second partition determining unit is used for determining the partition size of the second matrix based on the bit number of the matrix register and the data bit number of each matrix data in the second matrix.
In one possible implementation manner, the first vector obtaining module includes:
the second vector acquisition submodule is used for acquiring a second vector; each vector data of the second vector is floating point type data;
the vector quantization layer number obtaining submodule is used for obtaining the vector quantization layer number based on the digit of the vector register;
and the vector data quantization submodule is used for quantizing each vector data in the second vector based on the vector quantization layer number to obtain the first vector.
In a possible implementation manner, the second matrix obtaining sub-module includes:
a third matrix obtaining unit configured to obtain a third matrix; each matrix data of the third matrix is floating point type data;
a matrix quantization layer number obtaining unit, configured to obtain a matrix quantization layer number based on the matrix bit number of the matrix register;
and the matrix quantization unit is used for performing quantization processing on the third matrix based on the number of matrix quantization layers to obtain the second matrix.
In one possible implementation, the third matrix is a matrix corresponding to the neural network weights of the first model;
the third matrix acquisition unit is configured to,
acquiring a training sample set;
and training the first model based on the training sample set to obtain the third matrix.
In one possible implementation manner, the matrix quantization unit includes:
a quantization model obtaining subunit, configured to perform quantization training on the first model based on the training sample set and the number of matrix quantization layers to obtain the quantization model;
a second matrix obtaining subunit, configured to obtain the second matrix based on the quantization weight of the quantization model; the quantization weights are network weights of the quantization model.
In one possible implementation manner, the training sample set includes training samples and target sample values corresponding to the training samples;
the quantization model obtaining subunit is configured to,
obtaining a target predicted value based on the training sample and the first model;
acquiring gradient data corresponding to the first model based on the target predicted value and the target sample value;
quantizing the gradient data based on the number of matrix quantization layers to obtain quantized gradient data;
and training the first model based on the quantitative gradient data to obtain the quantitative model.
In a possible implementation manner, the operation result obtaining module includes:
the operation register determining submodule is used for determining operation result registers corresponding to the matrix areas;
and the operation result accumulation submodule is used for accumulating the operation results corresponding to the matrix areas into the operation result registers corresponding to the matrix areas to obtain the operation results of the first matrix and the first vector.
In a possible implementation manner, the operation result accumulation submodule includes:
the digit expansion module is used for carrying out digit expansion on the operation result corresponding to each matrix area to obtain an expansion result corresponding to each matrix area;
and the extended result accumulation module is used for accumulating the extended results corresponding to the matrix areas into the operation result registers corresponding to the matrix areas to obtain the operation results of the first matrix and the first vector.
In yet another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the above-mentioned matrix vector operation method for a neural network model.
In yet another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and loaded and executed by a processor to implement the above matrix vector operation method for a neural network model.
In yet another aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the above matrix vector operation method for the neural network model.
The technical scheme provided by the application can comprise the following beneficial effects:
in the process of performing the operation of the first matrix and the first vector, the first matrix is divided into a plurality of matrix areas, each of which contains at least two integer data located in the same matrix register; a vector data subset corresponding to each matrix area is obtained, and the operation result of the first matrix and the first vector is obtained from the operation results between the matrix areas and their vector data subsets. Through this scheme, in matrix vector operations related to a neural network model, the first matrix can be divided into matrix areas such that each matrix area has at least two integer data located in the same matrix register; when the processor performs the operation of the first matrix and the first vector, parallel operation between multiple matrix data in a single matrix register and the vector data can be realized, improving the operation speed between the matrix and the vector.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 illustrates a schematic diagram of a computer system provided by an exemplary embodiment of the present application.
Fig. 2 is a schematic flow diagram illustrating a method of matrix vector operations for a neural network model in accordance with an exemplary embodiment.
FIG. 3 is a method flow diagram illustrating a method of matrix vector operations for a neural network model in accordance with an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating a second region arrangement sequence corresponding to the embodiment shown in fig. 3.
Fig. 5 shows a schematic diagram of traversing the second area according to the embodiment shown in fig. 3.
Fig. 6 is a schematic diagram illustrating an arrangement sequence of the first matrix according to the embodiment shown in fig. 3.
Fig. 7 is a schematic operation flow diagram of a matrix vector operation method according to the embodiment shown in fig. 3.
Fig. 8 shows a schematic view of a model file scan according to the embodiment shown in fig. 3.
Fig. 9 is a flow chart illustrating a model accelerated parallel computing according to the embodiment shown in fig. 3.
Fig. 10 is a block diagram illustrating a structure of a matrix vector operation apparatus according to an exemplary embodiment.
FIG. 11 illustrates a block diagram of a computer device, according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms related to embodiments of the present application will be described.
1) Artificial Intelligence (AI)
Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
2) Machine Learning (Machine Learning, ML)
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specially studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
3) Quantification (quantification)
Quantization is the process of approximating continuous values of data (or a large number of possible discrete values) with a finite number of (or fewer) discrete values; it is mainly applied in converting continuous data into discrete data. A discrete signal generally does not require quantization, but a signal whose value range is not discrete may still require it.
The matrix vector operation method for a neural network model provided by the embodiments of the application can be applied to computer devices with certain data processing capability. In one possible implementation, the matrix vector operation method provided in the embodiments of the present application may be applied to a personal computer, a workstation, or a server; that is, the accelerated matrix-times-vector operation performed by the matrix vector operation method may be run on a personal computer, workstation, or server. In one possible implementation, the matrix vector operation method provided by the embodiments of the application can be used in scenarios where the corresponding functions need to be realized through matrix vector calculation in a neural network, such as game development, motion control, and personalized data processing. In an application scenario with many different objects, the features differ from object to object, and each character has its own distinct personalized weight network. The matrix vector operation method provided by the embodiments of the application can accelerate the matrix vector operations in such scenarios, thereby improving the execution efficiency of the neural network on a Central Processing Unit (CPU) with data parallel capability (for example, CPUs with X86 or ARM architectures).
Referring to FIG. 1, a schematic diagram of a computer system provided by an exemplary embodiment of the present application is shown. The computer system 200 includes a terminal 110 and a server 120, where the terminal 110 and the server 120 communicate data through a communication network. Optionally, the communication network may be a wired network or a wireless network, and may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 110 is installed with an application program that needs to perform matrix vector operation, and the application program may be an Artificial Intelligence (AI) application program that needs to implement matrix vector operation, such as a virtual reality application program, a game application program, a payment application program, and the like, which is not limited in this embodiment of the present application.
Optionally, the terminal 110 may be a mobile terminal such as a smartphone, a tablet computer, or a laptop computer, or a terminal such as a desktop computer or a projection computer, or an intelligent terminal having a data processing component, which is not limited in the embodiments of the application.
The server 120 may be implemented as one server, or as a server cluster formed by a group of servers, which may be physical servers or cloud servers. In one possible implementation, the server 120 is a backend server for the application program on the terminal 110.
The terminal and the server each have a final execution unit that can perform information processing and program operations. The final execution unit can be a CPU or an SOC (System on Chip); the CPU and the SOC have arithmetic logic components and registers, and the registers are used for storing the calculation information produced by the arithmetic logic components during or after an operation.
In one possible implementation of this embodiment, the server 120 trains a neural network model through a preset training sample set and stores the weight matrix corresponding to the neural network model into a storage module of the server 120. When a terminal needs to perform its forward neural network prediction operation through the neural network model on the server, it can upload its corresponding feature vector to the server through the communication network, so that the forward prediction operation is carried out through the matrix vector operation method shown in the embodiments of the application.
In another possible implementation of this embodiment, the server 120 trains a neural network model through a preset training sample set and issues the weight matrix corresponding to the neural network model to the terminal. When the terminal needs to perform its forward neural network prediction operation through the neural network model, it can directly obtain its corresponding feature vector and carry out the forward prediction operation with the weight matrix and the feature vector through the matrix vector operation method shown in the embodiments of the application.
Fig. 2 is a schematic flow diagram illustrating a method of matrix vector operations for a neural network model in accordance with an exemplary embodiment. The method may be performed by a computer device, which may be the terminal 110 or the server 120 in the embodiment shown in fig. 1 described above. As shown in fig. 2, the flow of the matrix vector operation method may include the following steps:
step 201, acquiring a first matrix; each matrix data in the first matrix is integer data; the first matrix comprises at least two matrix areas; each matrix area contains at least two matrix data located on the same matrix register.
Wherein each matrix data in the first matrix is a weight parameter in the neural network model.
Integer (INT) data is numerical data that does not contain a fractional part; it represents whole numbers only and is stored in binary form.
Step 202, acquiring a first vector; each vector data in the first vector is integer data.
Step 203, obtaining a vector data subset corresponding to each matrix area based on each matrix area; the subset of vector data is a subset of the first vector.
Step 204, obtaining the operation result corresponding to each matrix area based on each matrix area and the vector data subset corresponding to each matrix area.
Each matrix area contains at least two matrix data located in the same matrix register. When a CPU or an SOC can operate on a plurality of matrix data in the same matrix register together with the corresponding vector data, and the matrix data are integer data, parallel operation over the matrix data in that register can be realized.
Step 205, obtaining the operation result of the first matrix and the first vector based on the operation result corresponding to each matrix area.
In a possible implementation manner, the operation results corresponding to the matrix areas are accumulated according to the position relationship between the matrix areas, so as to obtain the operation results of the first matrix and the first vector.
The operation result of each matrix area may correspond to different vector dimensions; accumulating the operation results of the matrix areas by dimension yields the operation result of the first matrix and the first vector. Because the operation is executed with the matrix area as the minimum operation unit, the matrix data within an area are computed in parallel, which improves the speed and efficiency of the matrix vector operation.
To sum up, in the solution shown in the embodiments of the application, in the process of performing the operation of the first matrix and the first vector, the first matrix is divided into a plurality of matrix areas, each containing at least two integer data located in the same matrix register; a vector data subset corresponding to each matrix area is obtained, and the operation result of the first matrix and the first vector is obtained from the operation results between the matrix areas and their vector data subsets. Through this scheme, in matrix vector operations related to a neural network model, the first matrix is divided into matrix areas, each of which has at least two integer data in the same matrix register; when the processor performs the operation of the first matrix and the first vector, parallel operation between multiple matrix data in a single matrix register and the vector data can be realized, improving the operation speed between the matrix and the vector.
FIG. 3 is a method flow diagram illustrating a method of matrix vector operations for a neural network model in accordance with an exemplary embodiment. The method may be performed by a computer device, which may be the terminal 110 or the server 120 in the embodiment shown in fig. 1 described above. As shown in fig. 3, the flow of the matrix vector operation method may include the following steps:
step 301, a second matrix is obtained.
And each matrix data in the second matrix is integer data. I.e. the respective matrix data values in the second matrix are stored in the computer device in the form of integer data.
In one possible implementation, a third matrix is obtained; each matrix data of the third matrix is floating point type data; acquiring the number of matrix quantization layers based on the number of data bits of the matrix register; and based on the matrix quantization layer number, performing quantization processing on the third matrix to obtain the second matrix.
The third matrix is a matrix corresponding to floating point type data, the number of quantization layers of the matrix can be determined according to the number of data bits of the matrix register, and quantization processing is performed on the third matrix according to the number of quantization layers of the matrix, so that a quantized second matrix is obtained.
The matrix register is used for storing matrix data corresponding to each matrix in the matrix vector operation process.
In the embodiment of the present application, a plurality of matrix data may exist on the same matrix register, so that when the CPU reads the matrix register, the CPU may perform an operation on the plurality of matrix data on the matrix register through the CPU instruction set at the same time, so as to implement parallel processing of the plurality of matrix data on the register.
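As a minimal illustration of this register-level parallelism (a sketch assuming an x86 CPU with SSE2 and 128-bit XMM registers; the embodiment itself is not limited to a particular instruction set), sixteen 8-bit values share one register and a single instruction processes all lanes at once:

    // Sketch: sixteen 8-bit integers in one 128-bit register, processed together.
    #include <emmintrin.h>  // SSE2
    #include <cstdint>

    void add_sixteen(const int8_t a[16], const int8_t b[16], int8_t out[16]) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        __m128i vs = _mm_add_epi8(va, vb);  // one instruction adds all 16 lanes
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vs);
    }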
In one possible implementation, the third matrix is a matrix corresponding to the neural network weights of the first model; acquiring a training sample set; and training the first model based on the training sample set to obtain the third matrix.
Wherein the first model may be a neural network model obtained based on a training sample set, and the third matrix is a matrix of network weights of the first model. When training is completed, the matrix corresponding to the neural network weights of the first model may be stored in the computer device as floating point data.
In a possible implementation manner, the first model is quantized and trained based on the training sample set and the number of matrix quantization layers to obtain the quantization model; and acquiring the second matrix based on the quantization weight of the quantization model.
The network weights of a neural network model are usually scattered over their value range. If the floating point data of the network weights were directly changed into integer data, the offset of the weights would be too large, and the neural network model could no longer achieve its intended data processing effect. Therefore, the first model is first quantization-trained according to the training sample set and the required number of matrix quantization layers, obtaining a quantization model whose network weights are discretely distributed over the value range, so that the network weights do not deviate from the expected effect when their data type is changed from floating point to integer.
In one possible implementation, the training sample set includes training samples and target sample values corresponding to the training samples; obtaining a target predicted value based on the training sample and the first model; acquiring gradient data corresponding to the first model based on the target predicted value and the target sample value; quantizing the gradient data based on the number of matrix quantization layers to obtain quantized gradient data; and training the first model based on the quantization gradient data to obtain the quantization model.
After the first model is obtained by training on the training sample set, it can be quantization-trained with the training sample set and a quantization operator, so that the weights of the first model are discretely distributed over the value range.
In one possible implementation, the training process of the quantization model is as follows:
initialize the network weights
do
    for each training sample:
        compute the output unit error (prediction - actual)
        compute ΔWh for all hidden-to-output layer weights   // back propagation (ΔWh is the hidden-to-output gradient)
        quantize ΔWh to 2^N levels
        compute ΔWi for all input-to-hidden layer weights    // continued back propagation (ΔWi is the input-to-hidden gradient)
        quantize ΔWi to 2^N levels
        update the network weights                           // the input layer is not altered by error estimation
until all samples are correctly classified or another stopping criterion is met
return the network
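As a hedged sketch of the "quantize to 2^N levels" step above (the function name and the symmetric rounding scheme are assumptions, not a formula mandated by the embodiment):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Snap each gradient to the nearest of 2^N uniformly spaced levels.
    std::vector<float> quantize_levels(const std::vector<float>& grad, int n_bits) {
        float max_abs = 0.0f;
        for (float g : grad) max_abs = std::max(max_abs, std::fabs(g));
        if (max_abs == 0.0f) return grad;
        const float levels = static_cast<float>((1 << n_bits) - 1);
        const float step = max_abs / levels;  // quantization width
        std::vector<float> out(grad.size());
        for (size_t i = 0; i < grad.size(); ++i)
            out[i] = std::round(grad[i] / step) * step;
        return out;
    }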
In a possible implementation manner, the quantization weight of the quantization model is floating point data, and the format of the quantization weight of the quantization model is changed to obtain the second matrix.
When a quantization model is obtained after quantization training, it is a model whose weights (i.e., quantization weights) are discretely distributed. At this time, although the quantization weights are still floating point data, because the quantization levels were fully considered during training, the quantization weights have gradually moved close to the regions corresponding to the quantization levels. Changing the quantization weights from floating point to integer therefore incurs only a small loss, and the second matrix still retains most of the features of the quantization weights of the quantization model. Moreover, after the weights of the model are changed from floating point to integer data, they occupy less storage, several integer data can sit in one data register at the same time, and parallel computation of those integer data in the register can be realized through specific CPU instructions, improving the computation efficiency of the model.
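One possible reading of this format change, as a sketch (the names, the 127-level symmetric scheme, and keeping the quantization width Q1 for later rescaling are assumptions):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Round already-quantization-trained float weights to 8-bit integers.
    std::vector<int8_t> weights_to_int8(const std::vector<float>& w, float& q1_out) {
        float max_abs = 0.0f;
        for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
        q1_out = (max_abs == 0.0f) ? 1.0f : max_abs / 127.0f;  // quantization width Q1
        std::vector<int8_t> out(w.size());
        for (size_t i = 0; i < w.size(); ++i)
            out[i] = static_cast<int8_t>(std::lround(w[i] / q1_out));
        return out;
    }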
Step 302, determining the partition size of the second matrix based on the number of bits of the matrix register.
The second matrix can be partitioned according to the number of bits of the matrix register, so that each area partitioned by the second matrix can be stored in the matrix register according to a certain sequence.
In one possible implementation manner, the data bit number of each matrix data in the second matrix is obtained; and determining the partition size of the second matrix based on the bit number of the matrix register and the data bit number of each matrix data in the second matrix.
The number of bits of each matrix data in the second matrix is the same and is determined by the quantization level corresponding to the second matrix, for example, when the quantization level of the second matrix is 255, the number of bits of the matrix data is 8.
Each matrix element in a partition of the second matrix needs to be put into the same matrix register, so the number of bits of the matrix register and the number of bits of each matrix element stored in it must be taken into account. For example, when the matrix register is a 128-bit register and each matrix data is 8-bit, at most 16 pieces of 8-bit binary data can be stored in the matrix register. The partition size of the second matrix may then be a weight region with an area of 8 (taking the area occupied by one weight in the matrix as a unit area); that is, the partition size of the second matrix may be 8, and the partition mode of the second matrix may be 1×8, 2×4, 4×2, and so on, where the first number is the number of horizontal elements of the region and the second number is the number of vertical elements.
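The underlying arithmetic, as a one-line sketch (names assumed):

    // How many quantized elements fit in one matrix register.
    constexpr int elements_per_register(int register_bits, int data_bits) {
        return register_bits / data_bits;  // e.g. 128 / 8 = 16
    }
    static_assert(elements_per_register(128, 8) == 16,
                  "a 128-bit register holds sixteen 8-bit values");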
Step 303, partitioning the second matrix based on the partition size, and obtaining each second area of the second matrix and an arrangement order of the second areas.
After the partition size is determined, the second matrix is partitioned, and each second region of the second matrix and an arrangement order of each second region can be determined, wherein the arrangement order is used for indicating the step of performing the operation on the second region and the first vector.
In one possible implementation, the arrangement order of the second regions is formed according to a first arrangement rule.
Wherein, the first ordering rule may be from top to bottom and from left to right.
In another possible implementation manner, the arrangement order of the second areas is determined according to the first arrangement rule and a first threshold.
The first threshold may be used to indicate the number of regions that the vertical axis can accommodate in the first arrangement rule corresponding to the second region each time the regions are arranged from top to bottom.
In a possible implementation manner, please refer to fig. 4, which shows a schematic diagram of a second region arrangement sequence corresponding to an embodiment of the present application. As shown in fig. 4, taking the partition size of the second matrix as 8 as an example, each of the regions 401, 402, 403, and 404 contains matrix data with a size of 2 × 4, and the regions are sorted from top to bottom and from left to right, that is, the four second regions are sorted from 401 to 402 to 403 to 404.
And step 304, respectively traversing each second area line by line, and storing the matrix data of each second area into the matrix register corresponding to that second area according to the line-by-line traversal order.
Please refer to fig. 5, which illustrates a schematic diagram of traversing the second area according to an embodiment of the present application. As shown in fig. 5, taking one of the second regions 501 in the second matrix as an example, the line-by-line traversal order in the second region 501 is from A1 to B1, B1 to A2, A2 to B2, ..., A4 to B4, which forms the arrangement order of the matrix data in the second region.
Step 305, obtaining the first matrix based on the matrix register corresponding to each second region and the arrangement order of the second regions.
Please refer to fig. 6, which illustrates a schematic diagram of an arrangement sequence of a first matrix according to an embodiment of the present application.
Fig. 6 is a schematic diagram of the arrangement order of the first matrix obtained from the traversal order of the second region shown in fig. 5. For example, portion 601 in fig. 6 is the arrangement obtained after rearranging the data of one second region of fig. 5 according to its traversal order; when the CPU reads the register corresponding to portion 601, it can read directly in the order in which the data are arranged, which improves the data reading efficiency.
When the third matrix corresponding to the first model is floating point data, a quantization model can be obtained from the first model through quantization training; after the data type of the quantization weights of the quantization model is changed, they become integer data and form the second matrix. The first matrix is obtained by scanning and rearranging the second matrix in a specified order, so each matrix data in the first matrix is still data corresponding to a quantization weight, and each matrix data in the rearranged first matrix still serves as a weight parameter in the neural network model.
Step 306, acquiring a second vector; each vector data of the second vector is floating point type data.
In one possible implementation, the second vector is vector data obtained through a neural network model.
When the second vector is used to represent feature information of an object, such as image or language information, the feature information can be processed by a corresponding neural network model to obtain the second vector for that object. In this case, to ensure storage accuracy, each vector value may be stored as floating point data.
Step 307, based on the number of bits of the vector register, the number of vector quantization layers is obtained.
A vector register is a register for storing the vector data of each vector during the matrix vector operation. A vector register can hold a plurality of vector data, so that when the CPU reads the vector register, the CPU instruction set can operate on those vector data simultaneously, implementing parallel processing of the vector data in the register.
In one possible implementation, the number of bits of the vector register is the same as the number of bits of the matrix register.
In order to ensure that the vector data stored in the vector register and the matrix data stored in the matrix register can be directly operated, the bit number of the vector register and the bit number of the matrix register can be ensured to be the same.
Step 308, quantizing each vector data in the second vector based on the vector quantization layer number to obtain the first vector.
In a possible implementation, each vector data in the second vector is quantized to obtain a quantized vector, and data format conversion is performed on each floating point data in the quantized vector to obtain the first vector; each vector data in the first vector is integer data.
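A hedged sketch of these two sub-steps (a linear 255-level scheme with non-negative feature values is assumed, matching the Q2·V range described in S701 below; all names are illustrative):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Quantize float vector values to 255 levels, then store them as uint8.
    std::vector<uint8_t> quantize_vector(const std::vector<float>& x, float& q2_out) {
        float max_val = 0.0f;
        for (float v : x) max_val = std::max(max_val, v);
        q2_out = (max_val == 0.0f) ? 1.0f : max_val / 255.0f;  // quantization width Q2
        std::vector<uint8_t> out(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            long q = std::lround(x[i] / q2_out);
            out[i] = static_cast<uint8_t>(std::clamp(q, 0L, 255L));
        }
        return out;
    }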
309, obtaining a vector data subset corresponding to each matrix area based on each matrix area; the subset of vector data is a subset of the first vector.
Each matrix area sits entirely in its corresponding matrix register. After the vector data subset corresponding to each matrix area is obtained, it can be calculated directly against the matrix data of the matrix area held in the matrix register. That is, the calculation of a matrix area is carried out with its corresponding subset of the first vector, achieving data-parallel processing: there is no need to take out one datum of the matrix and one datum of the vector separately for each calculation, which improves the calculation efficiency.
Step 310, obtaining an operation result corresponding to each matrix area based on each matrix area and the vector data subset corresponding to each matrix area.
Because the operation between a matrix and a vector is linear and has no ordering dependency, it can be divided into operations between multiple matrix blocks and multiple vector sets, realizing parallel processing.
In one possible implementation manner, traversing the matrix data of the first matrix area, and storing the matrix data of the first matrix area into a first matrix register according to a traversal sequence; the first matrix region is any one of the matrix regions; the first matrix register is a matrix register corresponding to the first matrix region; acquiring a first vector data subset corresponding to the first matrix area; copying the first vector data subset based on the matrix data of the first matrix area to obtain the first vector operation set, and storing the first vector operation set in the first vector register; the first vector register is a vector register corresponding to the first matrix area; based on the first matrix register and the first vector register, an operation result corresponding to the first matrix area is obtained.
When the matrix data of the first matrix area is operated on with the first vector data subset, the matrix data of the first matrix area is traversed; since the size of the first matrix area is determined by the number of bits of the matrix register, the matrix data of the first matrix area can be put into the first matrix register. Meanwhile, the first vector data subset corresponding to the first matrix area is obtained. Because each vector data in the first vector data subset generally needs to act on several data of the first matrix at once, the first vector data subset is copied according to the number and structure of the matrix data in the first matrix area, so that the copied first vector operation set corresponds to all the matrix data in the first matrix area.
When the first vector operation set corresponds to all matrix data in the first matrix area, that is, when the number of elements in the first vector operation set equals the number of data in the first matrix area, the first vector operation set is stored in the first vector register, and the CPU operates on the first matrix register and the first vector register through its instruction set to obtain the operation result corresponding to the first matrix area.
In one possible implementation, reading the matrix data in the first matrix register and a first vector operation set in a first vector register; and acquiring an operation result corresponding to the first matrix area based on the matrix data in the first matrix register and the first vector operation set in the first vector register.
When the CPU implements the operation between the matrix data in the first matrix register and the first vector operation set in the first vector register through the instruction set, it needs to first read the matrix data in the first matrix register and the first vector operation set in the first vector register, and perform the operation according to the matrix data and the first vector operation set to obtain the operation result corresponding to the first matrix area.
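A hedged sketch of this per-region kernel (assuming x86 SSSE3, where PMADDUBSW is one concrete "packed multiply-add" that produces exactly the pair sums (WAi×XA + WBi×XB) described in S905 below; the interleaved block layout and names are assumptions):

    #include <tmmintrin.h>  // SSSE3
    #include <cstdint>

    // w_block: 16 interleaved 8-bit weights WA1,WB1,...,WA8,WB8 of one region.
    // xa, xb:  the two quantized vector values of the corresponding subset.
    __m128i region_products(const int8_t w_block[16], uint8_t xa, uint8_t xb) {
        __m128i w = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w_block));
        // Copy the (XA, XB) pair into all eight lane pairs of the vector register.
        __m128i x = _mm_set1_epi16(
            static_cast<int16_t>(static_cast<uint16_t>(xb) << 8 | xa));
        // Eight 16-bit results: (WA1*XA + WB1*XB), ..., (WA8*XA + WB8*XB).
        return _mm_maddubs_epi16(x, w);
    }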
Step 311, obtaining the operation result of the first matrix and the first vector based on the operation result corresponding to each matrix area.
After the matrix vector operation has been completed on each matrix area and the vector data corresponding to each matrix area, the operation result corresponding to each matrix area is obtained, and the operation result of the first matrix and the first vector, that is, the matrix vector operation result required by the scheme shown in the embodiment of the present application, can be obtained according to the operation result corresponding to each matrix area.
In a possible implementation manner, an operation result register corresponding to each matrix region is determined; and accumulating the operation result corresponding to each matrix area into an operation result register corresponding to each matrix area to obtain the operation results of the first matrix and the first vector.
After the operation result corresponding to each matrix region is obtained, the operation result corresponding to each matrix region needs to be put into the operation result register corresponding to each matrix region, so that the operation result corresponding to each matrix region can be accumulated and arranged according to the operation result register.
In a possible implementation manner, performing digit expansion on the operation result corresponding to each matrix region to obtain an expansion result corresponding to each matrix region; and accumulating the extension result corresponding to each matrix area into an operation result register corresponding to each matrix area to obtain the operation results of the first matrix and the first vector.
After the operation results of the matrix areas are obtained, the result of a matrix area could be put directly into the operation result register (for example, eight 16-bit results can be stored in a 128-bit register). Although this saves registers, when the data of other matrix areas are later accumulated into that register, overflow is easily produced, since the sum of the operation results of several matrix data clearly exceeds 16 bits. Therefore, bit expansion needs to be performed on the operation result of each matrix area; that is, high-order bits are reserved to hold the carries generated when several operation results are accumulated.
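A hedged sketch of this bit expansion (assuming x86 SSE2; the unpack-and-arithmetic-shift idiom is one standard way to sign-extend, matching the "shift and bit-expansion instructions" mentioned in S906 below):

    #include <emmintrin.h>  // SSE2

    // Widen eight 16-bit results into two registers of four 32-bit integers
    // each, then accumulate, so later additions cannot overflow 16 bits.
    void widen_and_accumulate(__m128i v128short, __m128i& intval_lo, __m128i& intval_hi) {
        __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(v128short, v128short), 16);
        __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(v128short, v128short), 16);
        intval_lo = _mm_add_epi32(intval_lo, lo);
        intval_hi = _mm_add_epi32(intval_hi, hi);
    }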
To sum up, in the solution shown in the embodiments of the application, in the process of performing the operation of the first matrix and the first vector, the first matrix is divided into a plurality of matrix areas, each containing at least two integer data located in the same matrix register; a vector data subset corresponding to each matrix area is obtained, and the operation result of the first matrix and the first vector is obtained from the operation results between the matrix areas and their vector data subsets. Through this scheme, in matrix vector operations related to a neural network model, the first matrix is divided into matrix areas, each of which has at least two integer data in the same matrix register; when the processor performs the operation of the first matrix and the first vector, parallel operation between the integer data in one matrix register and the vector data corresponding to that matrix area can be realized, improving the operation speed between the matrix and the vector.
Please refer to fig. 7, which is a schematic diagram illustrating an operation flow of a matrix vector operation method according to an embodiment of the present application, the matrix vector operation method being applied to accelerate the computation-intensive vector matrix operation in the neural network deployment process. As shown in fig. 7, the flow of the matrix vector operation method when applied to accelerating neural network deployment is as follows:
and S701, training a model low bit.
Because the model parameters (i.e., the weights of the neural network) obtained by training a neural network model are usually stored in computer memory as floating point data, storage consumption is large, floating point operations consume a large amount of CPU computing resources, and the calculation rate is slow. Therefore, while preserving model accuracy as much as possible, the neural network model may be quantization-trained so that its weights take values in a discrete range; the process of quantization training the neural network model is shown in the embodiment corresponding to fig. 3 and is not repeated here. The weights of the neural network model are then converted from floating point data into integer data, and a model file is generated from the integer data, without an obvious reduction in the effect of the neural network model.
In the embodiment of the present application, for the forward prediction operation Y = W × X + B that the neural network model needs to perform, the matrix may be quantized through the above scheme. To keep data types consistent, while quantizing the matrix parameters (i.e., the model weights), the vector data X also needs to be quantized, converting floating point data into integer data. In the embodiment of the application, the register used in the CPU calculation is a 128-bit register; the quantization level of the matrix may be N = 7, i.e., a 127-level quantization is performed on the matrix, and the quantization level of the vector may be N = 8, i.e., a 255-level quantization is performed on the vector data. For the neural network model, the weight range of the model is (0,1), while the values of the feature vectors of different objects are not limited, so a higher-level quantization can be applied to the vector to preserve the accuracy of the matrix vector operation. At this time, the values of W all fall in Q1 × V, where V ranges over 0 to (2^7 - 1) and Q1 is the quantization width of the matrix; the values of X all fall in Q2 × V, where V ranges over 0 to (2^8 - 1) and Q2 is the quantization width of the vector.
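Since every integer product then carries the combined scale Q1 × Q2, a floating point result can be recovered at the end of the integer pipeline; a one-line sketch (the name is illustrative):

    // Undo both quantization widths on an integer accumulator.
    inline float dequantize_result(long acc, float q1, float q2) {
        return static_cast<float>(acc) * q1 * q2;
    }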
And S702, scanning and rearranging the model data file.
After the weights of the model are updated, the matrix can be rescanned in a certain order to improve the efficiency of subsequent CPU data processing. Please refer to fig. 8, which illustrates a schematic view of a model file scan according to an embodiment of the present application. Fig. 8 shows the matrix corresponding to the weights of the neural network model, stored in the computer device from left to right and from top to bottom. In the embodiment of the application, the scan groups the matrix into blocks of 2×8 elements, as shown in portion 801 of fig. 8; within a group, the scan traverses in a zigzag manner, performing row traversal with a column span of 2 over all blocks until all column blocks have been traversed, and then moving to the next rows of blocks (i.e., element 802 in fig. 8), until all blocks in the matrix are traversed. After the traversal is finished, the data of the visited elements are rearranged according to the traversal order and saved as a file.
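A hedged sketch of this scan-and-rearrange step (the exact block ordering here, 16-row bands of two 8×2 sub-blocks per column pair to match S904 to S911 below, and the divisibility of the matrix dimensions are assumptions):

    #include <cstdint>
    #include <vector>

    // Rearrange a rows x cols int8 matrix so that each 16-element block
    // (8 rows x 2 columns, the two columns interleaved row by row) lands
    // contiguously, exactly as one 128-bit register load expects it.
    // Assumes rows % 16 == 0 and cols % 2 == 0.
    std::vector<int8_t> rearrange(const std::vector<int8_t>& m, int rows, int cols) {
        std::vector<int8_t> out;
        out.reserve(m.size());
        for (int band = 0; band < rows; band += 16)      // 16 output rows per pass
            for (int bc = 0; bc < cols; bc += 2)         // one column pair per group
                for (int half = 0; half < 16; half += 8) // e.g. Block1, then Block2
                    for (int r = 0; r < 8; ++r)
                        for (int c = 0; c < 2; ++c)      // WAi then WBi, row by row
                            out.push_back(m[(band + half + r) * cols + bc + c]);
        return out;
    }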
And S703, loading the model to accelerate parallel computation.
Refer to fig. 9, which illustrates a flow diagram of model-accelerated parallel computing according to an embodiment of the present application. As shown in fig. 9, in S901, 16 pieces of 8-bit X data are read and put into Val128X; that is, 16 elements of the vector data are read and stored in a 128-bit register, where each X data is an 8-bit binary integer. S902, the group count is initialized to 0 (i.e., group 0). S903, two of the 16 8-bit X data corresponding to group 0 are read and copied 8 times, yielding 8 identical copies of the two 8-bit X data, which are stored in the group-0 vector register Val128Cpt. S904, 16 pieces of 8-bit W data are read (i.e., the 16 8-bit W data corresponding to Block1) and stored, in reading order, into the 128-bit matrix register Val128Block1 corresponding to Block1. S905, the intra-group convolution of the group-0 vector register Val128Cpt and the 128-bit matrix register Val128Block1 is calculated: the CPU's packed multiply-add can compute (WA1 × XA + WB1 × XB), (WA2 × XA + WB2 × XB), (WA3 × XA + WB3 × XB), (WA4 × XA + WB4 × XB), (WA5 × XA + WB5 × XB), (WA6 × XA + WB6 × XB), (WA7 × XA + WB7 × XB), (WA8 × XA + WB8 × XB), and the eight 16-bit integers are stored in register V128short1. S906, the lower four 16-bit integers of V128short1 are accumulated into IntVal1, and the upper four into IntVal2. That is, using the CPU's shift and bit-expansion instructions, the eight 16-bit integers of the 128-bit register are split into two 128-bit registers, each containing four 32-bit integers. The number of bits per integer thus grows from 16 to 32, preventing subsequent accumulations into these registers from overflowing.
S907, the same two 8-bit X data corresponding to the current group are read again, copied 8 times to obtain 8 identical copies of the pair, and stored in the group-0 vector register Val128Cpt. S908, 16 pieces of 8-bit W data are read (i.e., the 16 8-bit W data corresponding to Block2) and stored, in reading order, into the 128-bit matrix register Val128Block2 corresponding to Block2. S909, the intra-group convolution of the group-0 vector register Val128Cpt and the 128-bit matrix register Val128Block2 is computed. That is, the Packed Multiply Add instruction of the CPU can compute (WA1 × XA + WB1 × XB) through (WA8 × XA + WB8 × XB) for Block2; these eight 16-bit integers are stored in the register V128short2. S910, the lower four 16-bit values of V128short2 are accumulated into IntVal3, and the upper four 16-bit values of V128short2 are accumulated into IntVal4. Similarly, the shift and bit-expansion instructions of the CPU split the eight 16-bit integers into two 128-bit registers of four 32-bit integers each, widening each integer from 16 bits to 32 bits so that subsequent accumulations cannot overflow.
S911, after the calculation of this group of vector and matrix data is finished, the group count is incremented by 1, so that the next vector read fetches the vector values corresponding to group 1. The group-1 vector values are then operated on with Block3 and Block4 in the same way: the low-four-value result corresponding to Block3 is accumulated into IntVal1 and the high-four-value result into IntVal2; the low-four-value result corresponding to Block4 is accumulated into IntVal3 and the high-four-value result into IntVal4.
S912, the above operations are repeated 8 times until (WO1 × XO + WP1 × XP), (WO2 × XO + WP2 × XP), (WO3 × XO + WP3 × XP), (WO4 × XO + WP4 × XP), (WO5 × XO + WP5 × XP), (WO6 × XO + WP6 × XP), (WO7 × XO + WP7 × XP), (WO8 × XO + WP8 × XP) have been computed, i.e., GroupCount = 8. All columns have now been traversed; the output index is the 16 integers starting at LoopCount × 16 (e.g., when LoopCount is 1, the output index is 16), and the registers IntVal1, IntVal2, IntVal3 and IntVal4 together store a total of 16 32-bit integers. These 16 pieces of 32-bit integer data are the operation results obtained by operating the matrix areas Block1 through Block16 with the vector.
S913, if the CPU has not yet traversed all rows, i.e., the number of rows in the matrix is greater than 16, the operations of S901 to S912 are performed on the remaining matrix regions, and the results continue to be accumulated in the registers IntVal1, IntVal2, IntVal3 and IntVal4.
When the CPU has traversed all rows, the 16 integers in IntVal1, IntVal2, IntVal3 and IntVal4 are the final result data of the matrix and vector operation required in the embodiment of the present application.
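Putting S901 to S912 together for one 16×16 tile might look as follows, reusing blockMacc from the sketch above. The wBlocks layout (eight groups of two 16-byte blocks) mirrors the hypothetical rearrangement sketched earlier and is an assumption, not the patent's stated file format.

```cpp
// Illustrative driver for S901-S912 over one 16x16 tile (same includes as the
// blockMacc sketch). wBlocks holds the rearranged weights (8 groups x 2 blocks
// x 16 bytes); x16 holds the 16 quantized vector values read in S901; out
// receives the 16 32-bit results.
void tileMatVec(const uint8_t* wBlocks, const uint8_t* x16, int32_t out[16]) {
    __m128i intVal1 = _mm_setzero_si128(), intVal2 = _mm_setzero_si128();
    __m128i intVal3 = _mm_setzero_si128(), intVal4 = _mm_setzero_si128();
    for (int group = 0; group < 8; ++group) {          // S902/S911: GroupCount
        const uint8_t xa = x16[2 * group];
        const uint8_t xb = x16[2 * group + 1];
        const uint8_t* blk = wBlocks + group * 32;     // two blocks per group
        blockMacc(blk,      xa, xb, intVal1, intVal2); // rows 1-8  (S904-S906)
        blockMacc(blk + 16, xa, xb, intVal3, intVal4); // rows 9-16 (S908-S910)
    }
    // S912: the four accumulators now hold the 16 32-bit partial results.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 0),  intVal1);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 4),  intVal2);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 8),  intVal3);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 12), intVal4);
}
```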
It should be noted that, in the embodiment of the present application, the quantization levels of W and X and the scan-rearrangement order of the weight matrix may all be changed according to the model and the application requirements, but the quantization levels of W and X must satisfy a certain condition. When the quantization level of W is A and the quantization level of X is B, the product of W and X in the subsequent matrix-vector multiplication occupies A + B bits. In the above scheme the registers are 128 bits wide and one register must hold all the matrix data of one Block, so the matrix data in a Block should occupy fewer than 8 bits. If the matrix data were 8 bits and the vector data also 8 bits, i.e., quantization levels A = 8 and B = 8, the product of W and X would occupy 16 bits, and for an operation such as WA1 × XA + WB1 × XB the final result could exceed 16 bits, so eight results could no longer be held simultaneously in one 128-bit register. In that case some precision can be sacrificed by reducing the quantization level of W, which increases the amount of data a single register can hold and markedly improves the efficiency of the parallel operation.
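The bit-width condition can be checked with a line of arithmetic: under the levels used above (A = 7 for W, B = 8 for X) the worst-case pair sum WAi × XA + WBi × XB is 2 × 127 × 255 = 64770, which still fits in a 16-bit lane, whereas A = 8 gives 2 × 255 × 255 = 130050, which does not. A small sketch making the comparison explicit:

```cpp
#include <cstdio>

// Worst-case pair sum WAi*XA + WBi*XB for matrix quantization level A and
// vector quantization level B, compared against a 16-bit lane; this is the
// A + B rule from the text made concrete.
int main() {
    const int b = 8;                 // vector levels: 0 .. 2^8 - 1
    const int candidates[] = {7, 8}; // matrix levels to compare
    for (int a : candidates) {
        const long wMax = (1L << a) - 1;
        const long xMax = (1L << b) - 1;
        const long pairSum = 2 * wMax * xMax;
        std::printf("A=%d, B=%d: worst pair sum %ld %s 65535\n", a, b, pairSum,
                    pairSum <= 65535 ? "<=" : ">");
    }
    return 0;  // prints 64770 <= 65535 for A=7 and 130050 > 65535 for A=8
}
```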
In a possible implementation manner, the scan-rearrangement order of the weight matrix may instead group blocks in units of 4×4 (still 16 elements per block, matching the register width); scanning traversal is performed within the groups in a zigzag manner, row traversal with a span of 4 is performed over all blocks until all column blocks have been traversed, and the next 4 rows of blocks are then traversed until every block in the matrix has been visited. After the traversal of the model matrix is finished, the visited elements are rearranged according to the traversal order and saved as a file.
When the CPU subsequently executes the Packed Multiply Add, the matrix register stores the 16 data from WA1 to WD1 through WA4 to WD4, and the vector register stores four copies of (XA, XB, XC, XD), so that the Packed Multiply Add computes (WA1 × XA + WB1 × XB + WC1 × XC + WD1 × XD), (WA2 × XA + WB2 × XB + WC2 × XC + WD2 × XD), (WA3 × XA + WB3 × XB + WC3 × XC + WD3 × XD), (WA4 × XA + WB4 × XB + WC4 × XC + WD4 × XD). The four integer results are stored in a 128-bit register and accumulated with the CPU's shift and bit-expansion instructions; repeating these operations completes the matrix-vector calculation.
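For this 4×4 layout the same idea needs two packed multiply-adds per block, since each result is now a four-term dot product. One hedged x86 sketch, with the instruction choice again an assumption not named by the patent:

```cpp
// Illustrative 4x4-block step (same includes as the blockMacc sketch).
// w16 holds WA1,WB1,WC1,WD1, ..., WA4,WB4,WC4,WD4 and x4x4 holds
// (XA,XB,XC,XD) repeated four times. _mm_maddubs_epi16 forms the 16-bit pair
// sums, and _mm_madd_epi16 against a vector of ones adds adjacent pairs,
// yielding four 32-bit dot products WAi*XA + WBi*XB + WCi*XC + WDi*XD.
static inline __m128i block4x4Dot(__m128i x4x4 /* unsigned 8-bit */,
                                  __m128i w16  /* signed 8-bit   */) {
    __m128i pairs = _mm_maddubs_epi16(x4x4, w16);
    return _mm_madd_epi16(pairs, _mm_set1_epi16(1));  // -> 4 x 32-bit results
}
```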
It can be seen that changing the scan method changes the number of computation instructions that must be called: for example, when the scan method changes from 2×8 to 4×4, the order of the matrix and vector operations changes. Under any scan method, however, one register holds the data of one Block (i.e., 16 data) at a time, and those 16 data are operated on simultaneously with the corresponding vector data, realizing parallel computation between matrix and vector and raising the speed of the CPU's matrix-vector calculations. The scheme obtains quantized weights through model training, rearranges the matrix data with a purpose-built scan method to produce a model file favorable to CPU computation, and then loads that file for optimized calculation, exploiting the CPU's data bit width and parallel capability to the fullest to accelerate computation and improve efficiency. In application scenarios with personalized neural network prediction requirements, this reduces the latency of program execution and raises the throughput and concurrency of hosts, servers and mobile phones; in embedded scenarios such as mobile phones it can improve network prediction efficiency and reduce power consumption.
Fig. 10 is a block diagram illustrating a structure of a matrix vector operation apparatus for a neural network model according to an exemplary embodiment. The matrix vector operation device may implement all or part of the steps in the method provided by the embodiment shown in fig. 2 or fig. 3, and the matrix vector operation device includes:
a first matrix obtaining module 1001 configured to obtain a first matrix; each matrix data in the first matrix is integer data; the first matrix comprises at least two matrix regions; each matrix area comprises at least two matrix data positioned on the same matrix register; each matrix data in the first matrix is a weight parameter in a neural network model;
a first vector obtaining module 1002, configured to obtain a first vector; each vector data in the first vector is integer data;
a vector data subset obtaining module 1003, configured to obtain, based on each matrix area, a vector data subset corresponding to each matrix area; the subset of vector data is a subset of the first vector;
a region operation result obtaining module 1004, configured to obtain an operation result corresponding to each matrix region based on each matrix region and the vector data subset corresponding to each matrix region;
an operation result obtaining module 1005, configured to obtain an operation result of the first matrix and the first vector based on the operation result corresponding to each matrix region.
In a possible implementation manner, the region operation result obtaining module 1004 includes:
the first traversal submodule is used for traversing the matrix data of the first matrix area and storing the matrix data of the first matrix area into a first matrix register according to a traversal sequence; the first matrix region is any one of the respective matrix regions; the first matrix register is a matrix register corresponding to the first matrix region;
a first vector subset obtaining submodule, configured to obtain a first vector data subset corresponding to the first matrix region;
a first operation set obtaining sub-module, configured to copy the first vector data subset based on the matrix data of the first matrix region to obtain the first vector operation set, and store the first vector operation set in the first vector register; the first vector register is a vector register corresponding to the first matrix region;
and the first result obtaining submodule is used for obtaining an operation result corresponding to the first matrix area based on the first matrix register and the first vector register.
In one possible implementation manner, the first result obtaining sub-module is further configured to,
reading matrix data in the first matrix register and a first vector operation set in the first vector register;
and acquiring an operation result corresponding to the first matrix area based on the matrix data in the first matrix register and the first vector operation set in the first vector register.
In a possible implementation manner, the first matrix obtaining module 1001 includes:
the second matrix acquisition submodule is used for acquiring a second matrix; each matrix data in the second matrix is integer data;
a second partition determining submodule for determining a partition size of the second matrix based on the number of bits of the matrix register;
the second matrix partitioning submodule is used for partitioning the second matrix based on the partition size to obtain each second area of the second matrix and the arrangement sequence of each second area;
the second region traversal submodule is used for respectively performing line-by-line traversal on each second region and respectively storing the matrix data of each second region to the matrix register corresponding to each second region according to the line-by-line traversal sequence;
and the first matrix acquisition submodule is used for acquiring the first matrix based on the matrix register corresponding to each second area and the arrangement sequence of each second area.
In one possible implementation manner, the first vector obtaining module 1002 includes:
the second vector acquisition submodule is used for acquiring a second vector; each vector data of the second vector is floating point type data;
the vector quantization layer number obtaining submodule is used for obtaining the vector quantization layer number based on the number of bits of the vector register;
and the vector data quantization submodule is used for quantizing each vector data in the second vector based on the vector quantization layer number to obtain the first vector.
In a possible implementation manner, the second matrix obtaining sub-module includes:
a third matrix obtaining unit configured to obtain a third matrix; each matrix data of the third matrix is floating point type data;
a matrix quantization layer number obtaining unit, configured to obtain a matrix quantization layer number based on the number of bits of the matrix register;
and the matrix quantization unit is used for performing quantization processing on the third matrix based on the number of matrix quantization layers to obtain the second matrix.
In one possible implementation, the third matrix is a matrix corresponding to the neural network weights of the first model;
the third matrix acquisition unit is configured to,
acquiring a training sample set;
and training the first model based on the training sample set to obtain the third matrix.
In one possible implementation manner, the matrix quantization unit includes:
a quantization model obtaining subunit, configured to perform quantization training on the first model based on the training sample set and the number of matrix quantization layers to obtain the quantization model;
a second matrix obtaining subunit, configured to obtain the second matrix based on the quantization weight of the quantization model; the quantization weights are network weights of the quantization model.
In one possible implementation manner, the training sample set includes training samples and target sample values corresponding to the training samples;
the quantization model obtaining subunit is configured to,
obtaining a target predicted value based on the training sample and the first model;
acquiring gradient data corresponding to the first model based on the target predicted value and the target sample value;
quantizing the gradient data based on the number of matrix quantization layers to obtain quantized gradient data;
and training the first model based on the quantized gradient data to obtain the quantization model.
In a possible implementation manner, the operation result obtaining module 1005 includes:
the operation register determining submodule is used for determining operation result registers corresponding to the matrix areas;
and the operation result accumulation submodule is used for accumulating the operation results corresponding to the matrix areas into the operation result registers corresponding to the matrix areas to obtain the operation results of the first matrix and the first vector.
In a possible implementation manner, the operation result accumulation submodule includes:
the digit expansion module is used for carrying out digit expansion on the operation result corresponding to each matrix area to obtain an expansion result corresponding to each matrix area;
and the extended result accumulation module is used for accumulating the extended results corresponding to the matrix areas into the operation result registers corresponding to the matrix areas to obtain the operation results of the first matrix and the first vector.
FIG. 11 illustrates a block diagram of a computer device 1100, according to an exemplary embodiment of the present application. The computer device 1100 may be a user terminal or a server in the system shown in fig. 1.
Generally, the computer device 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 1101 may also include a main processor and a coprocessor. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit), and the processor 1101 may further include an AI (Artificial Intelligence) processor for Processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement all or part of the steps of the above-described method embodiments of the present application.
In some embodiments, when the computer device is implemented as a user terminal, the computer device 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Optionally, the peripheral device includes: at least one of a radio frequency circuit 1104, a display screen 1105, an image capture component 1106, an audio circuit 1107, a positioning component 1108, and a power supply 1109.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, etc. The radio frequency circuit 1104 may communicate with other computer devices via at least one wireless communication protocol. In some embodiments, the rf circuit 1104 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1105 is used to display a UI (User Interface). When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105.
The image capture component 1106 is used to capture images or video. In some embodiments, the image acquisition component 1106 may also include a flash.
The audio circuitry 1107 may include a microphone and a speaker. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
The positioning component 1108 is used to determine the current geographic location of the computer device 1100 for navigation or LBS (Location Based Service).
The power supply 1109 is used to provide power to the various components within the computer device 1100.
In some embodiments, the computer device 1100 also includes one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyro sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.
Those skilled in the art will appreciate that the configuration illustrated in FIG. 11 does not constitute a limitation of the computer device 1100, and may include more or fewer components than those illustrated, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as a memory comprising computer programs (instructions), executable by a processor of a computer device to perform the methods shown in the various embodiments of the present application, is also provided. For example, the non-transitory computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method shown in the above embodiments.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes can be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method of matrix vector operation for a neural network model, the method comprising:
acquiring a first matrix; each matrix data in the first matrix is integer data; the first matrix comprises at least two matrix regions; each matrix area comprises at least two matrix data positioned on the same matrix register; each matrix data in the first matrix is a weight parameter in a neural network model;
acquiring a first vector; each vector data in the first vector is integer data;
based on each matrix area, obtaining a vector data subset corresponding to each matrix area; the subset of vector data is a subset of the first vector;
acquiring an operation result corresponding to each matrix area based on each matrix area and the vector data subset corresponding to each matrix area;
and acquiring the operation result of the first matrix and the first vector based on the operation result corresponding to each matrix area.
2. The method according to claim 1, wherein obtaining the operation result corresponding to each matrix region based on each matrix region and the vector data subset corresponding to each matrix region comprises:
traversing the matrix data of the first matrix area, and storing the matrix data of the first matrix area into a first matrix register according to a traversal sequence; the first matrix region is any one of the respective matrix regions; the first matrix register is a matrix register corresponding to the first matrix region;
acquiring a first vector data subset corresponding to the first matrix area;
copying the first vector data subset based on the matrix data of the first matrix region to obtain the first vector operation set, and storing the first vector operation set in the first vector register; the first vector register is a vector register corresponding to the first matrix region;
and acquiring an operation result corresponding to the first matrix area based on the first matrix register and the first vector register.
3. The method according to claim 2, wherein the obtaining the operation result corresponding to the first matrix region based on the first matrix register and the first vector register comprises:
reading matrix data in the first matrix register and a first vector operation set in the first vector register;
and acquiring an operation result corresponding to the first matrix area based on the matrix data in the first matrix register and the first vector operation set in the first vector register.
4. The method of claim 1, wherein obtaining the first matrix comprises:
acquiring a second matrix; each matrix data in the second matrix is integer data;
determining a partition size of the second matrix based on a number of bits of the matrix register;
partitioning the second matrix based on the partition size to obtain each second area of the second matrix and the arrangement sequence of each second area;
respectively traversing each second area line by line, and respectively storing the matrix data of each second area to a matrix register corresponding to each second area according to the sequence of the line by line traversal;
and acquiring the first matrix based on the matrix register corresponding to each second region and the arrangement sequence of each second region.
5. The method of claim 4, wherein determining the partition size of the second matrix based on the number of bits of the matrix register comprises:
acquiring the data bit number of each matrix data in the second matrix;
and determining the partition size of the second matrix based on the bit number of the matrix register and the data bit number of each matrix data in the second matrix.
6. The method of claim 1, wherein the obtaining the first vector comprises:
acquiring a second vector; each vector data of the second vector is floating point type data;
acquiring the number of vector quantization layers based on the number of bits of the vector register;
and quantizing each vector data in the second vector based on the vector quantization layer number to obtain the first vector.
7. The method of claim 4, wherein obtaining the second matrix comprises:
acquiring a third matrix; each matrix data of the third matrix is floating point type data;
acquiring the number of matrix quantization layers based on the number of bits of the matrix register;
and quantizing the third matrix based on the number of matrix quantization layers to obtain the second matrix.
8. The method of claim 7, wherein the third matrix is a matrix corresponding to the neural network weights of the first model;
the obtaining a third matrix includes:
acquiring a training sample set;
and training the first model based on the training sample set to obtain the third matrix.
9. The method of claim 8, wherein the quantizing the third matrix based on the number of matrix quantization layers to obtain the second matrix comprises:
performing quantization training on the first model based on the training sample set and the matrix quantization layer number to obtain the quantization model;
obtaining the second matrix based on the quantization weight of the quantization model; the quantization weights are network weights of the quantization model.
10. The method of claim 9, wherein the set of training samples includes training samples and target sample values corresponding to the training samples;
the performing quantization training on the first model based on a training sample set and the number of matrix quantization layers to obtain the quantization model includes:
obtaining a target predicted value based on the training sample and the first model;
acquiring gradient data corresponding to the quantization model based on the target predicted value and the target sample value;
quantizing the gradient data based on the number of matrix quantization layers to obtain quantized gradient data;
and training the first model based on the quantized gradient data to obtain the quantization model.
11. The method according to claim 1, wherein the obtaining the operation result of the first matrix and the first vector based on the operation result corresponding to each matrix region comprises:
determining an operation result register corresponding to each matrix area;
and accumulating the operation results corresponding to the matrix areas into operation result registers corresponding to the matrix areas to obtain the operation results of the first matrix and the first vector.
12. The method according to claim 11, wherein the accumulating the operation result corresponding to each matrix area into the operation result register corresponding to each matrix area to obtain the operation result of the first matrix and the first vector comprises:
performing digit expansion on the operation result corresponding to each matrix area to obtain an expansion result corresponding to each matrix area;
and accumulating the extension result corresponding to each matrix area to an operation result register corresponding to each matrix area to obtain the operation result of the first matrix and the first vector.
13. An apparatus for matrix vector operation of a neural network model, the apparatus comprising:
the first matrix acquisition module is used for acquiring a first matrix; each matrix data in the first matrix is integer data; the first matrix comprises at least two matrix regions; each matrix area comprises at least two matrix data positioned on the same matrix register;
the first vector acquisition module is used for acquiring a first vector; each vector data in the first vector is integer data;
a vector data subset obtaining module, configured to obtain, based on each matrix region, a vector data subset corresponding to each matrix region; the subset of vector data is a subset of the first vector;
a region operation result obtaining module, configured to obtain an operation result corresponding to each matrix region based on each matrix region and a vector data subset corresponding to each matrix region;
and the operation result acquisition module is used for acquiring the operation results of the first matrix and the first vector based on the operation results corresponding to the matrix areas.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of matrix vector operations for a neural network model as claimed in any one of claims 1 to 12.
15. A computer-readable storage medium, having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a method of matrix vector operations for a neural network model as claimed in any one of claims 1 to 12.
CN202110096963.6A 2021-01-25 2021-01-25 Matrix vector operation method, equipment and storage medium for neural network model Pending CN113569193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110096963.6A CN113569193A (en) 2021-01-25 2021-01-25 Matrix vector operation method, equipment and storage medium for neural network model

Publications (1)

Publication Number Publication Date
CN113569193A true CN113569193A (en) 2021-10-29

Family

ID=78160951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110096963.6A Pending CN113569193A (en) 2021-01-25 2021-01-25 Matrix vector operation method, equipment and storage medium for neural network model

Country Status (1)

Country Link
CN (1) CN113569193A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722669A (en) * 2021-11-03 2021-11-30 海光信息技术股份有限公司 Data processing method, device, equipment and storage medium
CN113722669B (en) * 2021-11-03 2022-01-21 海光信息技术股份有限公司 Data processing method, device, equipment and storage medium
CN117992578A (en) * 2024-04-02 2024-05-07 淘宝(中国)软件有限公司 Method for processing data based on large language model, large language model and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK; Ref legal event code: DE; Ref document number: 40055206; Country of ref document: HK

SE01 Entry into force of request for substantive examination