CN116258185A - Processor, convolution network computing method with variable precision and computing equipment


Info

Publication number: CN116258185A
Authority: CN (China)
Prior art keywords: data, instruction, precision, vector, processor
Legal status: Pending
Application number: CN202310041659.0A
Other languages: Chinese (zh)
Inventors: 薛盛可, 鲁路, 李颖敏
Current Assignee: Alibaba China Co Ltd
Original Assignee: Alibaba China Co Ltd
Application filed by: Alibaba China Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiments of this specification provide a processor, a variable-precision convolutional network computing method, and a computing device. The processor abstracts the most computation-intensive operations in a convolutional neural network into convolutional network instructions by way of instruction set extension, thereby achieving hardware acceleration of convolutional network computation in the processor and improving the computing speed and efficiency of convolutional neural networks. In addition, the convolutional network instructions include a precision parameter, so convolutional network computation on variable-precision data can be supported, improving the applicability of the processor.

Description

Processor, convolution network computing method with variable precision and computing equipment
Technical Field
Embodiments in the present specification relate to the field of machine learning technology, and in particular, to an acceleration technique based on an instruction set in the field of machine learning technology, and more particularly, to a processor, a convolution network computing method with variable precision, and a computing device.
Background
Convolutional neural networks (Convolutional Neural Networks, CNN) are a class of feedforward neural networks (Feedforward Neural Networks) that contain convolutional computations and have a deep structure; they are among the representative algorithms of deep learning (Deep Learning), and in some cases a convolutional neural network may also be referred to simply as a convolutional network. Convolutional neural networks have the capability of representation learning (Representation Learning) and can perform shift-invariant classification (Shift-Invariant Classification) of input information within their hierarchical structure, and are therefore also referred to as "shift-invariant artificial neural networks" (SIANN). Convolutional neural networks excel in research directions such as image recognition, speech detection, and natural language processing, where they achieve high-accuracy results.
However, as convolutional neural networks develop, their range of applications keeps expanding, so it becomes necessary to improve the applicability of processors to the computation of different convolutional neural networks, enabling a processor to meet the computing requirements of neural networks in different application scenarios.
Disclosure of Invention
Various embodiments in the present disclosure provide a processor, a variable-precision convolutional network computing method, and a computing device, in order to improve the applicability of processors to convolutional neural network computation.
In a first aspect, an embodiment of the present specification provides a processor, comprising a memory access unit and a convolutional network computing module; wherein:
the memory access unit is used for storing data to be operated on;
the convolutional network computing module is used for acquiring target-precision data to be operated on from the memory access unit according to a convolutional network instruction, and performing convolutional network computation on the acquired data, where the convolutional network computation includes at least one of convolution, pooling, and activation computation; the convolutional network instruction includes a precision parameter, and the target precision corresponds to the precision parameter.
In a second aspect, an embodiment of the present disclosure provides a variable-precision convolutional network computing method implemented based on a processor, where the processor includes a memory access unit and a convolutional network computing module, and the method includes:
acquiring target-precision data to be operated on from the memory access unit according to a convolutional network instruction, and performing convolutional network computation on the acquired data, where the convolutional network computation includes at least one of convolution, pooling, and activation computation; the convolutional network instruction includes a precision parameter, and the target precision corresponds to the precision parameter.
In a third aspect, an embodiment of the specification provides a computing device comprising a processor as described in any of the embodiments above.
In a fourth aspect, embodiments of the present disclosure provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements a variable precision convolutional network computing method as described above.
In a fifth aspect, embodiments of the present specification provide a computer program product or a computer program; the computer program product comprises a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium, and when executing it, implements the steps of the variable-precision convolutional network computing method described above.
According to the embodiments provided in this specification, the most computation-intensive operations in a convolutional neural network are abstracted into convolutional network instructions by way of instruction set extension, thereby achieving hardware acceleration of convolutional network computation in the processor and improving the computing speed and efficiency of convolutional neural networks. In addition, the convolutional network instructions include a precision parameter and can support convolutional network computation on variable-precision data, so the processor can support convolutional network computation on both low-precision and high-precision data, broadening the scenarios in which the processor is applicable and improving its applicability.
Drawings
FIG. 1 is a schematic diagram of a processor according to one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a convolutional network computing module according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a vector register file according to one embodiment of the present disclosure;
FIG. 4 is a schematic illustration of a convolution window sliding provided by one embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a Wallace tree multiplier according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of valid data within bytes provided in one embodiment of the present disclosure;
FIG. 7 is a schematic flow chart of a variable-precision convolutional network computing method according to an embodiment of the present disclosure.
Detailed Description
Unless defined otherwise, technical or scientific terms used in the embodiments of the present specification have the ordinary meaning understood by one of ordinary skill in the art to which this specification belongs. The terms "first," "second," and the like used in the embodiments of this disclosure do not denote any order, quantity, or importance, but are used only to distinguish components from one another.
Throughout the specification, unless the context requires otherwise, the word "plurality" means "at least two," and the word "comprising" is to be construed as open and inclusive, i.e., as "comprising, but not limited to." In the description of this specification, the terms "one embodiment," "some embodiments," "example embodiments," "examples," "particular examples," "some examples," and the like indicate that a particular feature, structure, material, or characteristic associated with the embodiment or example is included in at least one embodiment or example of this specification. Such schematic representations do not necessarily refer to the same embodiment or example.
The technical solutions of the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
First, terms that may be referred to in the present specification will be described.
Instruction set (Instruction Set) or instruction set architecture: the programming-related part of the computer architecture, including basic data types, the instruction set, registers, addressing modes, interrupts, exception handling, and external I/O. An instruction set architecture contains a series of opcodes (operation codes in machine language) and the basic commands executed by a particular processor. Alternatively, the set of all instructions that a processor is capable of executing may be referred to as its instruction set.
Reduced instruction set computing (Reduced Instruction Set Computer, RISC) is a design model for computer processors. This design approach simplifies the number of instructions and the addressing modes, so that implementation is easier, instructions can be executed with a higher degree of parallelism, and compilers are more efficient.
The RISC-V instruction set is an open-source instruction set architecture based on reduced-instruction-set principles.
Instruction set extension (Instruction Set Extension): higher-level instruction functions built on top of the basic instruction set, divided into standard instruction set extensions of official design and custom instruction set extensions designed by users themselves.
Pipeline (Pipeline) is a method of organizing how a processor executes instructions. The current classic pipeline model includes fetch (Instruction Fetch, IF) - decode (Instruction Decode, ID) - execute (EX) - write back (WB), where the decode stage may be divided into two segments, referred to as ID1 and ID2, so this pipeline model may also be called a 5-stage pipeline model. In this pipeline model, the working logic is roughly: fetch the instruction from cache or memory (IF), translate it into specific functions the processor can understand (ID), perform the operation according to the decode result (EX), and write the operation result back to memory (WB).
Processor Front-end (Processor Front-end) refers to the hardware architecture of the Processor that is responsible for the fetching, decoding, and transmitting functions.
Processor back-end (Processor Back-end) refers to the hardware architecture of the processor responsible for instruction execution and write-back; in some cases it may also be referred to as an execution engine (Execution Engine).
Convolutional neural network (Convolutional Neural Networks, CNN): a feed-forward neural network consisting mainly of one or more convolutional layers and fully connected layers, typically also including pooling layers.
Variable precision (Variable Precision): the use of data of multiple different precisions at different layers of a convolutional neural network, or even in different regions of the same layer.
As described above, convolutional neural networks are a very important neural network structure in deep learning, mainly used in fields such as image processing, audio processing, and natural language processing. As data scales grow, convolutional neural networks place ever higher demands on the computing power and memory access bandwidth of the hardware that carries them, and various hardware acceleration schemes have been proposed for them. Depending on the platform, these hardware acceleration schemes can be divided into four types: GPU (Graphics Processing Unit), TPU (Tensor Processing Unit), FPGA (Field Programmable Gate Array), and ASIC (Application Specific Integrated Circuit). TPU and GPU schemes have the strongest computing power but huge power consumption, and are unfriendly to small-scale CNNs and embedded environments. FPGA and ASIC schemes have low power consumption but generally accelerate only specific network structures or specific network layers, which is not flexible enough.
To provide a more flexible and efficient means of acceleration, the embodiments of this specification aim to provide an instruction-set-based acceleration scheme specially optimized for convolutional neural networks. The most common convolutional network computations (convolution, pooling, and activation) are abstracted into instructions, achieving hardware acceleration by way of instruction set extension without being limited to a specific platform. Because a single convolutional network instruction involves relatively little computation, it can be well integrated with the processor pipeline, making the scheme more flexible and efficient.
In addition, in the processor provided by the embodiments of the present disclosure, the convolutional network computing module used to accelerate convolutional network computation is integrated into the processor, i.e., combined with it directly, so no additional interconnect design needs to be considered, saving hardware resource overhead and power consumption overhead.
The processor provided in the embodiments of the present specification will be exemplarily described with reference to the accompanying drawings.
One embodiment of the present disclosure provides a processor, as shown in FIG. 1, comprising a memory access unit 21 and a convolutional network calculation module 22; wherein:
the memory access unit 21 is configured to store data to be operated on;
the convolutional network calculation module 22 is configured to obtain target-precision data to be operated on from the memory access unit 21 according to a convolutional network instruction, and to perform convolutional network calculation on the obtained data, where the convolutional network calculation includes at least one of convolution, pooling, and activation calculation; the convolutional network instruction includes a precision parameter, and the target precision corresponds to the precision parameter.
The data to be operated may be various data to be input into the convolutional neural network for operation, and may be image data, sound data or text data, which is not limited in this specification.
The memory access unit 21 may be the LSU (Load/Store Unit) in the execution stage (Execution) of the processor 100. In some embodiments, the execution stage may further include an arithmetic logic unit, a multiplication/division unit, a control and status register unit, a memory access ordering unit, and the like, which are relatively independent of the convolutional network calculation module 22; the specific functions of these units are not described in detail here. In the processor 100, the front end 10 is configured to fetch and decode instructions and to issue decoded instructions (e.g., convolutional network instructions) to the back end 20, so that the back end 20 performs the corresponding instruction operations.
As described above, the memory access unit 21 and the convolutional network calculation module 22 may, by functional division, be part of the processor back end 20, and the processor back end 20 may further include a write-back stage 23. After the convolutional network calculation module 22 performs a convolutional network calculation, since the result is scalar data, the calculation result can directly enter the write-back stage 23 along with the pipeline, be written into the general register file by the write-back stage 23, and then be written back into the memory of the processor 100 by a standard store instruction.
The convolutional network instruction may be an extended instruction based on the RISC-V instruction set; that is, the instruction format of the convolutional network instruction may be designed according to the RISC-V instruction set's requirements for extended instructions. Additionally, the instruction format may take into account the format requirements that the architecture of the processor 100 places on instructions.
In this embodiment, by way of instruction set extension, the most computation-intensive operations in the convolutional neural network are abstracted into convolutional network instructions, so that hardware acceleration of convolutional network computation is realized in the processor 100, and the computing speed and efficiency of the convolutional neural network are improved. In addition, the convolutional network instruction includes a precision parameter, so the processor can support convolutional network computation on variable-precision data: it can support convolutional network computation on low-precision data as well as on high-precision data, broadening the scenarios in which the processor is applicable (e.g., the demand for low-precision convolutional network computation in mobile application scenarios and for high-precision convolutional network computation in server application scenarios) and improving the applicability of the processor 100.
To separate the computation and memory access of the convolutional network instructions, so that they conform to the RISC-V reduced instruction set architecture and improve the practicality of the processor 100, in one embodiment of the present specification the convolutional network instructions include a calculation instruction and a memory access instruction, the memory access instruction including the precision parameter;
the convolutional network calculation module 22 is specifically configured to obtain the target-precision data to be operated on from the memory access unit 21 according to the memory access instruction, and to perform convolutional network calculation on the obtained data according to the calculation instruction.
In this embodiment, a dedicated memory access instruction loads the data, after which a separate calculation instruction performs the computation, separating calculation from memory access. In some embodiments, the calculation instruction is further refined into convolution, pooling, and activation instructions, reducing the amount of data each calculation instruction must operate on and hence the instruction granularity, so that the convolutional network instructions fuse better with the pipeline of the processor 100.
In an alternative embodiment, the function of the convolution instruction may be: perform a vector inner product operation on the data within a k × k convolution window and write the result into the rd register (target register).
The function of the pooling instruction is to perform a maximum or average pooling operation on the data within a k × k pooling window according to the specified algorithm and write the result into the rd register. When the maximum window size supported by the processor is K, the value range of k is 0 to K, i.e., 0 ≤ k ≤ K. In some embodiments, the maximum window size K supported by the processor may be less than or equal to 15.
The activation instruction may support the ReLU (Rectified Linear Unit) activation function, whose role is to compare the input data with zero and return the maximum of the two. The amount of data that one activation instruction can process at a time is determined jointly by the processor bit width and the target precision, which can be obtained from the first register numbered 0. For example, if the processor is 64-bit and the data precision is 16-bit, one activation instruction completes the activation of 4 data elements.
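As a rough illustration of this behavior, the following C sketch models one 64-bit activation instruction over four packed 16-bit lanes; the lane layout and the function name are assumptions for illustration, not the patent's encoding.

    #include <stdint.h>

    /* Software model of one activate instruction, assuming a 64-bit
     * processor and 16-bit target precision (four lanes per instruction).
     * The packed lane layout is an assumption. */
    static uint64_t relu_4x16(uint64_t word)
    {
        uint64_t out = 0;
        for (int lane = 0; lane < 4; lane++) {
            int16_t v = (int16_t)(word >> (lane * 16));
            if (v < 0)                       /* ReLU: max(v, 0) */
                v = 0;
            out |= (uint64_t)(uint16_t)v << (lane * 16);
        }
        return out;
    }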
In this embodiment, as described above, the separation of the calculation instruction and the memory access instruction allows the convolutional network instructions to meet the requirements of the RISC-V reduced instruction set architecture. The processor can therefore be obtained by extending the instruction set of an existing RISC-V-based processor, which reduces the modifications to existing processor architectures and makes the processor provided by the embodiments of the present application easy to implement.
In some embodiments of the present description, in addition to refining the calculation instruction, the memory access instruction is refined to include a direct vector load instruction and a precision load instruction, where the precision load instruction includes the precision parameter.
Accordingly, referring to FIG. 2, the convolutional network calculation module 22 may include a vector access unit 221, a vector register file 222, and a calculation unit 223; wherein:
the vector access unit 221 includes a plurality of first registers, and the vector register file 222 includes a plurality of second registers and a plurality of third registers in one-to-one correspondence with the second registers;
the vector access unit 221 is configured to obtain target-precision data to be operated on from the memory access unit 21 according to the precision load instruction, and to write the obtained data and its target precision into the first registers. The vector access unit 221 is further configured, according to the direct vector load instruction, to read the target precision of the data to be operated on from the first register corresponding to that instruction, request the data from the memory access unit 21 according to the precision read, splice the data returned by the memory access unit 21 into a data vector, store the data vector into a second register, and write the target precision of each data element in the data vector into the third register corresponding to that second register;
the vector register file 222 is configured to transmit the data vector stored in the second register and the target precision stored in the third register to the calculation unit 223;
the calculation unit 223 is configured to perform convolution, pooling, or activation calculation on the data vector according to the calculation instruction and the target precision of each data element in the data vector.
In this embodiment, in addition to separating the memory access instruction and the calculation instruction, the vector access unit 221 and the calculation unit 223 responsible for processing them are also designed separately, together with the storage structure (the vector register file 222) for the accessed data. The instruction flows for calculation and memory access are thus clear and well defined, which helps improve the stability of instruction processing, avoids the result errors that crossed instruction processing might cause, guarantees the accuracy of the results the processor provides, and improves the robustness of the processor.
In addition, in the vector access unit 221, the target precision of the data to be operated on is recorded in the first registers, realizing precision recording based on the precision load instruction. Specifically, still referring to FIG. 2, the precision load instruction is issued to the vector access unit 221 as a memory access instruction; the vector access unit 221 reads the target precision of the data to be operated on from the memory access unit 21 according to the precision load instruction and writes it into a first register, thereby completing the write of the target precision.
When the direct vector load instruction is issued to the vector access unit 221, the vector access unit 221 reads the target precision of the data to be operated on from the first register recorded in the instruction. Information such as the vector length and the memory access address can then be calculated from this target precision, and a memory access request containing this information is sent to the memory access unit 21 to request the data. The data returned by the memory access unit 21 may arrive in batches, so the vector access unit 221 must splice it into a data vector and store it in a second register; and, in order to determine the precision of each element in the data vector, the target precision of each data element must also be written into the third register corresponding to that second register.
The second registers and the third registers are in one-to-one correspondence. A second register may be a global register, while a third register may be a private register that is invisible to the user and is only passively updated by direct vector load instructions. Since the third register is used only to record data precision, it can be smaller; for example, the third register may be half the size of the second register (e.g., if the second register is an 8-bit register, the third register may be a 4-bit register).
A possible structure of the vector register file 222 is shown in FIG. 3, where the plurality of second registers is divided into a first register set and a second register set; each of the two register sets comprises K register subgroups, and each register subgroup is composed of K second registers, K being the maximum window size supported by the processor 100.
Each register subgroup corresponds to one row of the largest window supported by the processor 100.
Referring to FIG. 3, to match the characteristic that the data to be operated on by a convolutional neural network is divided into a main matrix and convolution kernels, and to make the structure of the vector register file 222 better suited to storing such data, in this embodiment the second registers are divided into two register sets that store the main matrix and the convolution kernels respectively and serve as the basic access objects of the vector register file 222. The address of a register may consist of 5 bits: the high 1 bit selects the large group, where 0 may represent the main matrix and 1 the convolution kernel, and the low 4 bits select the register subgroup. Because the main matrix and the convolution kernels are stored in different register sets, matching the structure of the data, the convolution kernels can be multiplexed by multiplexing register sets whenever different data share the same kernel, reducing the amount of data the processor must store and thus its hardware requirements.
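The following C sketch decodes such a 5-bit vector register address as just described; the type and field names are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    /* Decode a 5-bit vector register address: the high bit selects the
     * large group (0 = main matrix, 1 = convolution kernel), the low
     * 4 bits select the register subgroup. */
    typedef struct {
        bool    is_kernel;  /* high 1 bit */
        uint8_t subgroup;   /* low 4 bits */
    } vreg_addr_t;

    static vreg_addr_t decode_vreg(uint8_t addr5)
    {
        vreg_addr_t a;
        a.is_kernel = (addr5 >> 4) & 0x1;
        a.subgroup  = addr5 & 0xF;
        return a;
    }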
In order to improve the efficiency of convolution operation, in an embodiment of the present disclosure, the memory access instruction further includes: a partial vector load instruction;
the vector access unit 221 is further configured, according to the partial vector load instruction, to read from the corresponding first register the column of data that has most recently entered the convolution window together with its target precision, splice that column with the other columns of the convolution window into a data vector, store the data vector into a second register, and write the target precision of each data element in the vector into the third register corresponding to that second register.
Referring to FIG. 4, which shows a diagram of convolution window sliding, the data to be operated on that is input to the convolutional neural network is stored in the matrix arrangement shown in FIG. 4; that is, in FIG. 4, a_ij denotes the matrix elements constituting the data to be operated on, with i = 0, 1, 2, ..., N-1 and j = 0, 1, 2, ..., N-1. As the convolution window slides, the convolution result of the data in window 1 is calculated first, then that of window 2. The data of the two windows partially overlap, so if the data of window 1 can be reused, repeated data accesses can be greatly reduced. Therefore, in this embodiment, a partial vector load instruction is designed to load only part of a convolution window while reusing the data of the previous window, so that the large amount of repeated data need not be read again; this greatly reduces repeated data accesses and raises the convolution operation rate.
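A simple software analogue of this reuse, assuming a 5 × 5 window sliding one column to the right, is sketched below; the hardware splices vectors rather than copying arrays, so this only illustrates the data flow.

    #include <stdint.h>

    #define K_WIN 5  /* assumed maximum window size */

    /* When the window slides right by one column, only the newly entered
     * column is fetched; the remaining K_WIN-1 columns are reused from
     * the previous window, as the partial vector load instruction does. */
    static void slide_window(int16_t win[K_WIN][K_WIN],
                             const int16_t new_col[K_WIN])
    {
        for (int r = 0; r < K_WIN; r++) {
            for (int c = 0; c + 1 < K_WIN; c++)
                win[r][c] = win[r][c + 1];   /* reuse old columns */
            win[r][K_WIN - 1] = new_col[r];  /* load only the new column */
        }
    }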
Still referring to FIG. 4, in some embodiments, to obtain a significant performance improvement, the data to be operated on that is input to the convolutional neural network may be stored in a fixed order. The output of each layer of the convolutional neural network is then produced in the same order as its input. Thus, throughout the computation of the whole network, the data serving as the input matrix keeps this order without any additional maintenance mechanism; order transformations are needed only at the very beginning and end of the whole convolutional neural network.
Some embodiments of the present description provide a viable construction of the calculation unit 223, in which the calculation unit 223 comprises a convolution calculation unit, a pooling calculation unit, and an activation calculation unit; wherein:
the convolution calculation unit includes a Wallace tree multiplier configured to perform convolution calculation on the data vector according to the convolution instruction and the target precision of each data element in the data vector;
the pooling calculation unit is configured to perform pooling calculation on the data vector according to the pooling instruction and the target precision of each data element in the data vector;
the activation calculation unit is configured to perform activation calculation on the data vector according to the activation instruction and the target precision of each data element in the data vector.
The structure of the Wallace tree multiplier is shown in FIG. 5. The Wallace tree multiplier (Wallace Tree Multiplier) is composed of a Booth encoder (Encoder), a selector (Switch), Wallace tree arrays (Wallace Tree Arrays), a register, and an adder. The size of the Wallace tree multiplier depends on the maximum window size K supported by the processor 100; in some embodiments K may be 5, since convolution kernel sizes typically do not exceed 5 × 5.
The convolution instruction must calculate the inner product of the main matrix vector and the convolution kernel vector within the convolution window, i.e., the multiplication and summation of up to 25 data pairs (x0·y0, x1·y1, ..., x24·y24). In the Wallace tree multiplier architecture, the Booth encoder splits each multiplication into multiple additions, producing a series of partial products. The selector, which may also be called a transposed array, rearranges the partial products into bit columns (e.g., converting the partial products into 32 columns of 25 bits each), and the Wallace tree array performs compression addition, in which a series of full adders reduces the many addends produced by the selector to an addition of 2 numbers.
Rather than using 25 independent multipliers, the partial products of all multiplications are combined and sent into one larger Wallace tree array, completing the multiplications and the summation together. At maximum precision, each multiplication multiplies 16-bit data by 8-bit convolution kernel data. With a 2-bit Booth encoder, each multiplication produces 4 partial products, for 100 partial products in total. All partial products are divided into 4 groups and fed into 4 25-bit Wallace tree arrays, yielding 8 addends. These 8 numbers (S0, C0, S1, C1, S2, C2, S3, C3) are then summed by an adder to obtain the final result. Even with this computing structure, the whole computation path is long and must be divided into multiple cycles. Therefore a register is inserted after the Wallace tree arrays: the 8 addends are stored in the register, and the adder sums them in the next cycle.
The number of partial products depends on the convolution kernel precision; fewer partial products are produced as precision decreases. A 4-bit precision convolution kernel generates 50 partial products, requiring 2 groups of 25-bit Wallace tree arrays; 2-bit and 1-bit precision multiplications generate 25 partial products, requiring 1 group. In the latter case only 2 addends remain after the Wallace tree, and the final addition can be completed within the same cycle. Meanwhile, the low-precision Wallace trees reuse the 4 groups of high-precision Wallace trees, reducing hardware cost.
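This bookkeeping can be summarized in a few lines of C; the counts follow directly from the 2-bit Booth encoding and the 25 window multiplications described above.

    /* Partial-product counts for a full 5x5 window with 2-bit Booth
     * encoding: ceil(kernel_bits / 2) partial products per multiplication,
     * 25 multiplications per window. Each group of 25 partial products
     * occupies one 25-bit Wallace tree array. */
    static int partial_products(int kernel_bits)  /* 8, 4, 2 or 1 */
    {
        int per_mul = (kernel_bits + 1) / 2;
        return 25 * per_mul;  /* 8b -> 100, 4b -> 50, 2b/1b -> 25 */
    }

    static int wallace_arrays(int kernel_bits)
    {
        return partial_products(kernel_bits) / 25;  /* 4, 2 or 1 groups */
    }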
For the case where the actual convolution window is smaller than 5, there are several possible implementations. For example, a separate computing structure could be designed: the data volume of a small window is smaller and could be handled by a Wallace tree with fewer bits, so the computation would not need multiple cycles. However, a separate computing structure brings additional hardware overhead. Therefore, to reuse the existing structure as much as possible and save resources, a computing structure is chosen in which all window sizes use the largest window, and for small windows the vacant positions are filled with 0.
Adopting the Wallace tree multiplier as the convolution calculation unit can meet the convolution requirements of data of different precisions, realizing variable-precision convolution calculation. Moreover, the computation flow of the Wallace tree multiplier is highly compatible with the variable-precision convolution characteristics provided by the embodiments of this specification: the number of Wallace tree arrays participating in a calculation can be adjusted to meet the computing requirements of different precisions, improving the convolution speed of the processor provided by the embodiments of this specification.
As for the pooling calculation unit, in one embodiment of the present specification it performs pooling on the data vector by using a binary search comparison tree for maximum pooling, or a summation-and-average method for average pooling;
the summation-and-average method consists of calculating the sum of all the data in the pooling window and dividing it by the number of data elements in the window.
In this embodiment, since maximum pooling calculation and average pooling calculation have little influence on data precision, the same computing structure (i.e., the pooling calculation unit) is multiplexed for both, which can simplify the architecture of the processor.
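As an illustration, the following C sketch computes a window maximum with a tournament-style pairwise comparison tree, which is one straightforward reading of the "binary search comparison tree" mentioned above; it is not taken from the patent.

    #include <stdint.h>

    /* Pairwise comparison tree for the maximum of n values: each round
     * halves the number of candidates, giving a log2(n)-depth tree of
     * comparators. Modifies v in place. */
    static int16_t window_max(int16_t v[], int n)
    {
        while (n > 1) {
            int half = (n + 1) / 2;
            for (int i = 0; i < n / 2; i++)
                v[i] = (v[2 * i] > v[2 * i + 1]) ? v[2 * i] : v[2 * i + 1];
            if (n & 1)                 /* odd element advances unchanged */
                v[n / 2] = v[n - 1];
            n = half;
        }
        return v[0];
    }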
Optionally, when the size of the pooling window is 2 or 4, the division by the number of data elements in the pooling window is implemented by a register shift;
when the size of the pooling window is 3 or 5 (i.e., the divisor is 9 or 25), the divisor is converted to 2^N, N being an integer greater than 1, by multiplying the divisor and the dividend simultaneously by a non-zero integer, after which the division by 2^N is implemented by a register shift.
Specifically, the pooling instruction must handle two pooling algorithms: max pooling and mean pooling. The former calculates the maximum value within the pooling window, the latter the average value.
The implementation of maximum pooling is simple: a binary search comparison tree is used. A binary search comparison tree, also called a binary search tree (Binary Search Tree), satisfies the following conditions: if the left subtree is not empty, the values of all nodes in the left subtree are smaller than the root node; if the right subtree is not empty, the values of all nodes in the right subtree are larger than the root node; and its left and right subtrees are themselves binary search comparison trees. The binary search comparison tree supports efficient insertion, deletion, and query operations, which helps improve the execution efficiency of the pooling instruction. The focus here is the implementation of mean pooling. Averaging can be divided into two steps: first calculate the sum of all data in the window, then divide by the number of data elements. Summation is easily achieved with an adder tree, but division is relatively complex; a conventional multi-cycle divider would make the instruction latency too large. Note, however, that the division here is integer division and the divisor has only a limited number of cases: 1, 4, 9, 16, 25. Division by 1 need not be calculated, and divisions by 4 and 16 can be implemented directly by register shifts. The goal is therefore to implement division by 9 and by 25 through some simple operations using approximation.
By the nature of division, multiplying the dividend and the divisor simultaneously by the same non-zero integer leaves the result unchanged. Furthermore, when the divisor is large, the difference between dividing by the divisor and dividing by the divisor plus or minus 1 may be less than 1, in which case the rounded-down results are identical. Based on these properties, take a pooling window of 3, i.e., division by 9, as an example. Assuming sum is the summation result, the division result may be written as:

    sum / 9 ≈ (sum × M) >> 24

where M is an integer approximation of 2^24 / 9, and >> 24 denotes a register shift to the right by 24 bits. The result is completely accurate for sum less than or equal to 2^28, matching the precision range of at most 16-bit input data.
For the case where the pooling window is 5, i.e., division by 25, a similar approach is used. It should be noted, however, that division by 25 must be split into two successive divisions by 5, otherwise the accuracy range cannot be satisfied.
By converting division by 9 and division by 25 into register shifts, the pooling calculation can be implemented with register shifts, which better fits the operating mode of the processor 100 and improves computing efficiency.
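The multiply-and-shift trick can be checked in software. In the sketch below we assume M = ceil(2^24 / 9) = 1864136 as the multiplier, since the patent's exact constant sits in an unreproduced figure; the loop reports how far the identity holds, which comfortably covers a 3 × 3 window of 16-bit data (sum ≤ 9 × 65535 = 589815).

    #include <stdint.h>
    #include <stdio.h>

    /* Verify (sum * M) >> 24 == sum / 9 for increasing sum, with
     * M = ceil(2^24 / 9). The exact multiplier used by the patent is an
     * assumption here; any M close to 2^24 / 9 behaves similarly. */
    int main(void)
    {
        const uint64_t M = ((1ull << 24) + 8) / 9;   /* 1864136 */
        uint64_t sum;
        for (sum = 0; (sum * M) >> 24 == sum / 9; sum++)
            ;
        printf("identity holds for all sum < %llu\n",
               (unsigned long long)sum);
        return 0;
    }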
Since variable-precision neural network computation on the data to be operated on is introduced, the position of valid data within low-precision data must be clearly marked. In some embodiments, the memory access address of a memory access instruction therefore consists of a byte address and an intra-byte address, where the byte address indicates the storage address of the data to be read and the intra-byte address indicates the position of the valid data within it.
Specifically, if the target precision of the data to be read is greater than or equal to 8 bits, the intra-byte address is empty;
if the target precision of the data to be read is 4 bits, the intra-byte address is 1 bit, and its value indicates the position within the byte of the valid 4-bit data;
if the target precision of the data to be read is 2 bits, the intra-byte address is 2 bits, and its value indicates the position within the byte of the valid 2-bit data;
if the target precision of the data to be read is 1 bit, the intra-byte address is 3 bits, and its value indicates the position within the byte of the valid 1-bit data.
In this embodiment, the address format of the extended vector memory access instruction is changed from the original byte addressing to a new format consisting of two parts: {byte address, intra-byte address}. When the target precision of the data to be operated on is greater than or equal to 8 bits, the intra-byte address part does not exist and the format degenerates to the original byte-by-byte addressing. When the target precision is 4 bits, the intra-byte address occupies 1 bit; when it is 2 bits, the intra-byte address occupies 2 bits; and so on.
For example, within one byte of data, when the target precision is greater than or equal to 8 bits, the whole byte can be considered valid data, and no intra-byte address is needed to mark it. When the target precision is 4 bits, the valid data may lie in either the first or the second half of the byte, so 1 bit with value 0 or 1 is needed to say which half. When the target precision is 2 bits, there are 4 possible positions of the valid data within the byte, so the 4 values of a 2-bit intra-byte address are needed to mark the position. When the target precision is 1 bit, there are 8 possible positions, so the 8 values of a 3-bit intra-byte address are needed.
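A small C sketch of this addressing scheme follows; the helper name and its arguments are illustrative, not taken from the patent.

    #include <stdint.h>

    /* Extract the valid slice of a byte under the {byte address,
     * intra-byte address} scheme: the intra-byte address selects which
     * precision-sized slice of the byte holds valid data (0 intra-byte
     * bits at >= 8b precision, 1 bit at 4b, 2 bits at 2b, 3 bits at 1b). */
    static uint8_t extract_valid(uint8_t byte, int precision_bits,
                                 uint8_t intra_addr)
    {
        if (precision_bits >= 8)
            return byte;                  /* whole byte is valid data */
        uint8_t mask = (uint8_t)((1u << precision_bits) - 1);
        return (uint8_t)((byte >> (intra_addr * precision_bits)) & mask);
    }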
Thus, by structuring the memory access address as a byte address plus an intra-byte address, the address can accurately record the position of valid data for every target precision. On this basis, the processor can filter out the invalid data, reducing the amount of data it must process and improving its data processing efficiency.
One embodiment of the present application provides a way to use the custom instructions (convolutional network instructions) in software: specifically, the convolutional network instructions may be encapsulated as inline assembly functions. Once encapsulated this way, a convolutional network instruction can be called like a general-purpose function and inserted directly into its calling context, meeting the convolutional neural network's need to invoke convolutional network instructions.
Specifically, taking a processor 100 with the RISC-V architecture as an example, there are generally 3 software methods for implementing instruction set extension on RISC-V: (1) use the custom instructions directly through assembly; (2) modify the compiler to support the custom instructions; (3) reuse standard instructions that exist but are unused. Modifying the compiler is complex and may also require modifying the assembler and linker; it is also inflexible, since the compiler must be modified again whenever a custom instruction's function is adjusted. Method 3 is the most convenient but can only serve as a temporary solution: the extended instruction set has no encoding format of its own and must depend on existing instructions, so conflicts arise whenever those unused standard instructions are needed again. In summary, the embodiments of the present disclosure adopt the first method and implement the custom instructions simply and generally through assembly.
In the actual encapsulation, the volatile keyword can be used to guarantee that the convolutional network instruction is not altered by the compiler, and the inline keyword can be used to guarantee that the compiler does not compile the wrapper into a function call but inserts it directly into the context.
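A minimal sketch of such a wrapper is shown below, using the GNU assembler's generic ".insn r" encoder for RISC-V; the opcode and funct values (custom-0 = 0x0B, funct3 = 0, funct7 = 0) are placeholder assumptions, not the patent's actual encoding.

    #include <stdint.h>

    /* Hypothetical wrapper for one convolutional network instruction.
     * ".insn r opcode, funct3, funct7, rd, rs1, rs2" emits a generic
     * R-type instruction; the encoding constants here are placeholders. */
    static inline int64_t cnn_conv(int64_t rs1, int64_t rs2)
    {
        int64_t rd;
        __asm__ volatile (".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                          : "=r"(rd)
                          : "r"(rs1), "r"(rs2));
        return rd;
    }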
After the convolutional network instructions are realized as inline functions, a higher-level convolutional neural network programming library is needed to conveniently construct and run CNN network models. The overall composition of the CNN programming library mainly includes: basic data structures, input/output APIs (Application Programming Interface), operator APIs, and utility APIs.
The basic data structures include: an image data type (image_t) supporting input images and intermediate network images, a convolution kernel data type (kernel_t) supporting convolution kernels, and a fully connected data type (fc_filter_t) supporting fully connected parameters. Input APIs fall into two categories: randomly generated, and imported from existing data. The operator APIs can also be divided into two kinds: one kind of operator supports different columns of the same image having different precisions; the other supports a single data precision for the whole image, with that precision variable among 16, 8, 4, and 2 bits. With the CNN programming library, these operators can easily be mapped to the operators of mainstream NN frameworks, making it convenient to port existing network models. It should be noted that when constructing the CNN programming library, usability was the main consideration and no special software-level optimization was performed. Meanwhile, the two kinds of operators share the same operation structure and differ only in the instructions used for the core computation, which makes it easy to exclude irrelevant factors and compare performance.
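To make the library shape concrete, the sketch below gives hypothetical C definitions around the type names mentioned above (image_t, kernel_t, fc_filter_t); only the type names come from the text, and all fields and the operator signature are assumptions.

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed layout for the library's basic data structures. */
    typedef struct {
        size_t   height, width, channels;
        uint8_t  precision_bits;   /* 16, 8, 4 or 2 */
        uint8_t *data;             /* packed at the given precision */
    } image_t;

    typedef struct {
        size_t   k;                /* kernel is k x k */
        uint8_t  precision_bits;
        int8_t  *weights;
    } kernel_t;

    typedef struct {
        size_t   in_len, out_len;
        uint8_t  precision_bits;
        int8_t  *weights;
    } fc_filter_t;

    /* Illustrative operator API of the second kind described above:
     * one (variable) precision for the whole image. */
    void conv2d(const image_t *in, const kernel_t *kern, image_t *out);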
Exemplary method
The embodiments of the specification also provide a variable-precision convolutional network computing method, shown in FIG. 7, implemented based on a processor that includes a memory access unit and a convolutional network computing module. The method comprises:
S101: according to a convolutional network instruction, acquiring target-precision data to be operated on from the memory access unit, and performing convolutional network computation on the acquired data, where the convolutional network computation includes at least one of convolution, pooling, and activation computation; the convolutional network instruction includes a precision parameter, and the target precision corresponds to the precision parameter.
The variable-precision convolutional network computing method provided in this embodiment belongs to the same inventive concept as the processor provided in the foregoing embodiments of this disclosure; for technical details not described in this embodiment, reference may be made to the specific processing of that processor, which is not repeated here.
Exemplary apparatus, electronic device, storage medium, and software
One embodiment of the present specification also provides a variable precision convolutional network computing device. The variable precision convolutional network computing device may include:
a convolution calculation module, configured to acquire target-precision data to be operated on from the memory access unit according to a convolutional network instruction and to perform convolutional network computation on the acquired data, where the convolutional network computation includes at least one of convolution, pooling, and activation computation; the convolutional network instruction includes a precision parameter, and the target precision corresponds to the precision parameter.
The present description also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a computer, causes the computer to perform the variable precision convolutional network calculation method of any one of the above embodiments.
The present description also provides a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the variable precision convolutional network calculation method of any one of the above embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in this specification are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related country and region, and are provided with corresponding operation entries for the user to select authorization or rejection.
It will be appreciated that the specific examples herein are intended only to assist those skilled in the art in better understanding the embodiments of the present description and are not intended to limit the scope of the present description.
It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this disclosure.
It will be appreciated that the various embodiments described in this specification may be implemented either alone or in combination, and are not limited in this regard.
Unless defined otherwise, all technical and scientific terms used in the embodiments of this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this specification belongs. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to limit the scope of the description. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be appreciated that the processor of the embodiments of the present description may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in a processor or by instructions in software form. The processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of this specification. A general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the embodiments of this specification may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
It will be appreciated that the memory in the embodiments of this specification may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), or a flash memory, among others. The volatile memory may be Random Access Memory (RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present specification.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and unit may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments.
In addition, each functional unit in each embodiment of the present specification may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this specification, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of this specification. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific embodiments of this specification, but the protection scope of this specification is not limited thereto. Any person skilled in the art can readily conceive of variations or substitutions within the technical scope disclosed in this specification, and such variations or substitutions shall fall within the protection scope of this specification. Therefore, the protection scope of this specification shall be subject to the protection scope of the claims.

Claims (13)

1. A processor, comprising a memory access unit and a convolutional network computing module; wherein:
the memory access unit is configured to store data to be operated on;
the convolutional network computing module is configured to acquire data to be operated on at a target precision from the memory access unit according to a convolutional network instruction, and to perform convolutional network computation on the acquired data, wherein the convolutional network computation comprises at least one of convolution, pooling, and activation computation; the convolutional network instruction includes a precision parameter, and the target precision corresponds to the precision parameter.
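(Illustrative sketch only, not part of the claims: the following C model shows one way a precision-parameterized operation of this kind could behave in software. The names cnn_prec_t, unpack, and cnn_dot are hypothetical, and unsigned data is assumed to be packed least-significant-element first within each byte.)

#include <stdint.h>
#include <stddef.h>

typedef enum { PREC_2 = 2, PREC_4 = 4, PREC_8 = 8 } cnn_prec_t;

/* Extract the idx-th element from a packed byte buffer at precision p. */
static unsigned unpack(const uint8_t *buf, size_t idx, cnn_prec_t p) {
    unsigned per_byte = 8u / p;               /* elements per byte   */
    uint8_t  byte     = buf[idx / per_byte];  /* byte holding it     */
    unsigned shift    = (idx % per_byte) * p; /* intra-byte position */
    return (byte >> shift) & ((1u << p) - 1u);
}

/* One opcode, many widths: the precision parameter, not the opcode,
 * decides how operand bytes are decoded before multiply-accumulate. */
unsigned cnn_dot(const uint8_t *a, const uint8_t *b, size_t n, cnn_prec_t p) {
    unsigned acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += unpack(a, i, p) * unpack(b, i, p);
    return acc;
}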
2. The processor of claim 1, wherein the convolutional network instruction comprises a compute instruction and a memory access instruction, the memory access instruction including the precision parameter;
the convolutional network computing module is specifically configured to acquire the data to be operated on at the target precision from the memory access unit according to the memory access instruction, and to perform convolutional network computation on the acquired data according to the compute instruction.
3. The processor of claim 2, wherein the compute instruction comprises at least one of a convolution instruction, a pooling instruction, and an activate instruction, the memory access instruction comprises a direct vector load instruction and a precision load instruction, the precision load instruction comprising the precision parameter;
the convolutional network computing module comprises a vector access unit, a vector register file, and a computing unit; wherein:
the vector access unit comprises a plurality of first registers, and the vector register file comprises a plurality of second registers and a plurality of third registers which are in one-to-one correspondence with the second registers;
the vector access unit is configured to acquire data to be operated on at the target precision from the memory access unit according to the precision load instruction, and to write the acquired data and its target precision into the first registers; the vector access unit is further configured to read, according to the direct vector load instruction, the target precision of the data to be operated on from the first register corresponding to the direct vector load instruction, request the data from the memory access unit according to the read target precision, splice the data returned by the memory access unit into a data vector, store the data vector into a second register, and write the target precision of each piece of data to be operated on in the data vector into the third register corresponding to that second register;
the vector register file is used for transmitting the data vector stored in the second register and the target precision stored in the third register to the computing unit;
the computing unit is configured to perform convolution, pooling, or activation computation on the data vector according to the compute instruction and the target precision of each piece of data to be operated on in the data vector.
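(Illustrative sketch only: a C model of the claim-3 register pairing, in which each data-vector register is tagged element by element by its companion precision register. The vector length VLEN and the field names are assumptions of this sketch, not values from this specification.)

#include <stdint.h>

#define VLEN 16                    /* assumed number of elements per vector */

typedef struct {
    uint16_t data[VLEN];           /* "second register": spliced data vector  */
    uint8_t  prec[VLEN];           /* "third register": per-element precision */
} vreg_pair_t;

/* The computing unit sees both halves of the pair, so one instruction can
 * mask each element to its own width without a global mode switch. */
static uint32_t masked_sum(const vreg_pair_t *v) {
    uint32_t acc = 0;
    for (int i = 0; i < VLEN; i++)
        acc += v->data[i] & ((1u << v->prec[i]) - 1u);
    return acc;
}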
4. The processor of claim 3, wherein the plurality of second registers are divided into a first register set and a second register set, each of the first register set and the second register set comprising K register subsets, each register subset consisting of K second registers, where K is the maximum window size supported by the processor;
each register subset corresponds to one row of the maximum window supported by the processor.
5. The processor of claim 4, wherein the memory access instruction further comprises: a partial vector load instruction;
the vector access unit is further configured to read, according to the partial vector load instruction, the column of data that most recently entered the convolution window, together with its target precision, from the first register corresponding to the partial vector load instruction, splice the read column with the other columns of data of the convolution window into a data vector, store the data vector into a second register, and write the target precision of each piece of data to be operated on in the data vector into the third register corresponding to that second register.
6. The processor of claim 3, wherein the computing unit comprises a convolution computing unit, a pooling computing unit, and an activation computing unit; wherein:
the convolution computing unit comprises a Wallace tree multiplier, the Wallace tree multiplier being configured to perform convolution computation on the data vector according to the convolution instruction and the target precision of each piece of data to be operated on in the data vector;
the pooling computing unit is configured to perform pooling computation on the data vector according to the pooling instruction and the target precision of each piece of data to be operated on in the data vector;
the activation computing unit is configured to perform activation computation on the data vector according to the activation instruction and the target precision of each piece of data to be operated on in the data vector.
7. The processor of claim 6, wherein, when performing pooling computation on the data vector, the pooling computing unit is specifically configured to perform maximum pooling computation on the data vector using a binary-search comparison tree method, or average pooling computation on the data vector using a summation-and-averaging method;
the summation-and-averaging method comprises: calculating the sum of all data in the pooling window and dividing it by the number of data in the pooling window.
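(Illustrative sketch only: C versions of the two claimed pooling strategies, assuming unsigned window data. max_pool modifies its input array in place; its loop structure mirrors a log2-depth pairwise comparison tree.)

#include <stddef.h>

/* Maximum pooling via a binary comparison tree: each round compares
 * survivors pairwise, halving the candidate count (modifies w in place). */
static unsigned max_pool(unsigned w[], size_t n) {
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            if (w[i + stride] > w[i])
                w[i] = w[i + stride];
    return w[0];
}

/* Average pooling by summation and averaging: sum every element of the
 * pooling window, then divide by the element count. */
static unsigned avg_pool(const unsigned *w, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += w[i];
    return sum / (unsigned)n;
}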
8. The processor of claim 7, wherein, when the size of the pooling window is 2 or 4, the division by the number of data in the pooling window is performed by a register shift;
when the size of the pooling window is 9 or 25, the divisor is converted to 2^N, where N is an integer greater than 1, by multiplying both the divisor and the dividend by the same non-zero integer, after which the division by 2^N is implemented by a register shift.
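(Illustrative sketch only: one way to realize the claim-8 shift-based averages in C. The scaling constants 57/2^9 for window 9 and 41/2^10 for window 25 are this sketch's own choice, since 9 * 57 = 513 is close to 512 and 25 * 41 = 1025 is close to 1024; they are not constants taken from this specification, and they assume sums small enough not to overflow after scaling.)

#include <stdint.h>

static uint32_t avg_by_shift(uint32_t sum, unsigned window) {
    switch (window) {
    case 2:  return sum >> 1;            /* exact divide by 2              */
    case 4:  return sum >> 2;            /* exact divide by 4              */
    case 9:  return (sum * 57u) >> 9;    /* ~ sum/9  (57/512  ~= 1/9)      */
    case 25: return (sum * 41u) >> 10;   /* ~ sum/25 (41/1024 ~= 1/25)     */
    default: return sum / window;        /* fallback: true division        */
    }
}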
9. The processor of any one of claims 2 to 8, wherein a memory address of the memory access instruction comprises a byte address and an intra-byte address, the byte address representing the storage address of the data to be operated on to be read, and the intra-byte address representing the position of the valid data within the data to be read.
10. The processor of claim 9, wherein, if the target precision of the data to be read is greater than or equal to 8 bits, the intra-byte address is null;
if the target precision of the data to be read is 4 bits, the intra-byte address is 1 bit wide, and its value indicates the position of the valid 4-bit data within the byte to be read;
if the target precision of the data to be read is 2 bits, the intra-byte address is 2 bits wide, and its value indicates the position of the valid 2-bit data within the byte to be read;
if the target precision of the data to be read is 1 bit, the intra-byte address is 3 bits wide, and its value indicates the position of the valid 1-bit data within the byte to be read.
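(Illustrative sketch only: a C model of the claim-9/10 address split. split_addr and its fields are hypothetical names; the intra-byte field widths of 1, 2, or 3 bits for 4-, 2-, and 1-bit data follow directly from how many elements fit in one byte.)

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t byte_addr;   /* storage address of the byte to read        */
    uint32_t intra_addr;  /* position of the valid element in that byte */
    unsigned intra_bits;  /* width of the intra-byte address field      */
} mem_addr_t;

static mem_addr_t split_addr(uint32_t base, uint32_t idx, unsigned prec) {
    mem_addr_t a;
    if (prec >= 8) {                      /* intra-byte address is null */
        a.byte_addr  = base + idx * (prec / 8);
        a.intra_addr = 0;
        a.intra_bits = 0;
    } else {
        unsigned per_byte = 8 / prec;     /* 2, 4, or 8 elements/byte   */
        a.byte_addr  = base + idx / per_byte;
        a.intra_addr = idx % per_byte;
        a.intra_bits = (prec == 4) ? 1 : (prec == 2) ? 2 : 3;
    }
    return a;
}

int main(void) {
    /* 2-bit data, element 5: lands in byte base+1, slot 1, 2-bit field. */
    mem_addr_t a = split_addr(0x1000, 5, 2);
    printf("byte 0x%X, slot %u, %u-bit field\n",
           a.byte_addr, a.intra_addr, a.intra_bits);
    return 0;
}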
11. The processor of any one of claims 1 to 8, wherein the convolutional network instruction is encapsulated as an inline assembly function.
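(Illustrative sketch only: how a custom instruction of this kind is typically wrapped as an inline assembly function in C. The mnemonic "vconv" and its operand layout are hypothetical stand-ins; a real wrapper would use the mnemonics and encodings defined by the processor's instruction set extension.)

#include <stdint.h>

/* GCC-style extended inline assembly wrapping a hypothetical "vconv"
 * instruction: the compiler allocates the registers, and each call to
 * vconv() compiles down to the single custom opcode. */
static inline uint32_t vconv(uint32_t vs1, uint32_t vs2) {
    uint32_t vd;
    __asm__ volatile ("vconv %0, %1, %2"
                      : "=r"(vd)              /* destination register */
                      : "r"(vs1), "r"(vs2));  /* source registers     */
    return vd;
}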
12. A variable-precision convolutional network computing method, implemented on a processor comprising a memory access unit and a convolutional network computing module, the method comprising:
acquiring data to be operated on at a target precision from the memory access unit according to a convolutional network instruction, and performing convolutional network computation on the acquired data, wherein the convolutional network computation comprises at least one of convolution, pooling, and activation computation; the convolutional network instruction includes a precision parameter, and the target precision corresponds to the precision parameter.
13. A computing device, comprising: a processor as claimed in any one of claims 1 to 11.
CN202310041659.0A 2023-01-11 2023-01-11 Processor, convolution network computing method with variable precision and computing equipment Pending CN116258185A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310041659.0A CN116258185A (en) 2023-01-11 2023-01-11 Processor, convolution network computing method with variable precision and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310041659.0A CN116258185A (en) 2023-01-11 2023-01-11 Processor, convolution network computing method with variable precision and computing equipment

Publications (1)

Publication Number Publication Date
CN116258185A (en) 2023-06-13

Family

ID=86687247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310041659.0A Pending CN116258185A (en) 2023-01-11 2023-01-11 Processor, convolution network computing method with variable precision and computing equipment

Country Status (1)

Country Link
CN (1) CN116258185A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination