CN109358900B - Artificial neural network forward operation device and method supporting discrete data representation


Info

Publication number: CN109358900B
Application number: CN201811233426.6A
Authority: CN (China)
Prior art keywords: data, unit, module, slave, discrete
Other versions: CN109358900A (Chinese)
Inventors: 刘少礼, 于涌, 陈云霁, 陈天石
Original and current assignee: Cambricon Technologies Corp Ltd
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/30007 - Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 - Arithmetic instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 - Adding; Subtracting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 - Multiplying; Dividing
    • G06F7/523 - Multiplying only
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a device that supports discrete data representation and executes the artificial neural network forward operation. With this device, the forward operation of a multilayer artificial neural network supporting discrete data representation can be realized: data such as weights and neurons in the forward operation can be represented in discrete form, e.g., as the non-continuous values -1, -1/2, 0, 1/2, 1. Modules supporting discrete data operations are provided; depending on the value of the discrete data, basic operations on continuous data such as multiplication and addition are replaced by bit operations such as exclusive-or and NOT. A module for converting continuous data into discrete data is also provided, and batch normalization calculations are supported using the above device.

Description

Artificial neural network forward operation device and method supporting discrete data representation
Technical Field
The present invention relates generally to artificial neural networks, and more particularly to an apparatus and method for performing the artificial neural network forward operation in which the data supports discrete representation, and in which bitwise operations such as exclusive-or and NOT are substituted for basic operations on continuous data such as multiplication.
Background
The multilayer artificial neural network is widely applied in fields such as pattern recognition, image processing, function approximation, and optimization calculation. In recent years it has attracted increasing attention from academia and industry owing to its high recognition accuracy and good parallelism.
One known method to support the multilayer artificial neural network forward operation is to use a general-purpose processor, which supports the above algorithm by executing general instructions using a general-purpose register file and general-purpose functional units. Another known method is to use a graphics processing unit (GPU), which supports the above algorithm by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units.
Both of these devices use continuous data for storage and operation. Storing continuous data requires more resources; for example, storing one 32-bit floating point datum requires 32 bits. Operating on continuous data requires functional components, such as adders and multipliers, that are complicated to implement.
Discrete data representation refers to a storage manner in which specific numbers stand in for continuous data. For example, the four values -1, -1/8, 1/8, 1 may be represented by the four codes 00, 01, 10, 11, respectively. This storage approach differs from continuous storage, in which the binary numbers 00/01/10/11 represent the four consecutive decimal values 0/1/2/3. With this index-like representation, formally continuous codes stand in for non-continuous, discretized real data; since the numbers represented are not continuous, this is called a discrete data representation.
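As an illustration only (not part of the patent text), the index-like mapping above can be modeled as a small codebook in Python; the values follow the -1, -1/8, 1/8, 1 example:

```python
# A minimal sketch of a 2-bit discrete data representation: each 2-bit
# code indexes one of four non-continuous real values (the example
# codebook above; actual codebooks are user-defined).
CODE_TO_VALUE = {0b00: -1.0, 0b01: -0.125, 0b10: 0.125, 0b11: 1.0}
VALUE_TO_CODE = {v: c for c, v in CODE_TO_VALUE.items()}

def decode(code: int) -> float:
    """Return the real value that a 2-bit discrete code stands for."""
    return CODE_TO_VALUE[code & 0b11]

print(decode(0b01))  # -0.125
```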
In conventional arithmetic devices for computing multilayer artificial neural networks, data is represented as continuous data such as floating point or fixed point numbers. Because the weights of a multilayer neural network are high-precision and numerous, representing them as continuous data incurs large overhead in both operation and storage. With a discrete data representation, operations such as multiplication of continuous data can be replaced by operations such as bitwise exclusive-or and shift, greatly reducing the number of multiplier components; and discretized data of a few bits has obvious storage advantages over conventional 32-bit floating point storage.
Disclosure of Invention
One aspect of the present invention provides an apparatus for performing artificial neural network forward operation supporting discrete data representation, including an instruction cache unit, a controller unit, a data access unit, an interconnection module, a master operation module, and a plurality of slave operation modules, wherein:
the instruction cache unit is used for reading in the instruction through the data access unit and caching the read instruction;
the controller unit is used for reading an instruction from the instruction cache unit and decoding the instruction into a microinstruction for controlling the behaviors of the interconnection module, the main operation module and the slave operation module;
the data access unit is used for writing discrete data or continuous data into corresponding data cache units of the main operation module and each slave operation module from an external address space or reading the discrete data or the continuous data from the data cache units to the external address space;
at the stage where the forward calculation of each layer of the neural network starts, the master operation module transmits the discrete or continuous input neuron vector of the layer to all slave operation modules through the interconnection module; after the calculation of the slave operation modules is completed, the interconnection module splices the discrete or continuous output neuron value of each slave operation module stage by stage into an intermediate result vector; when the input data is mixed data of discrete and continuous data, the slave operation modules adopt preset calculation modes corresponding to the different discrete data;
and the main operation module is used for finishing subsequent calculation by utilizing the intermediate result vector, and when the input data is mixed data of discrete data and continuous data, the main operation module adopts a preset corresponding calculation mode aiming at different discrete data.
Optionally, discrete data representation refers to a representation in which particular discrete numbers are substituted for the actual continuous data.
Optionally, the plurality of slave operation modules calculate respective discrete or continuous output neuron values in parallel by using the same discrete or continuous input neuron vector and respective different discrete or continuous weight vectors.
Optionally, the main operation module performs any one of the following operations on the intermediate result vector:
an offset operation, adding an offset to the intermediate result vector;
an activation operation, applying an activation function to the intermediate result vector, where the activation function is any one of sigmoid, tanh, relu, and softmax;
sampling operation, comparing the intermediate result vector with a random number, outputting 1 if the intermediate result vector is larger than the random number, and outputting 0 if the intermediate result vector is smaller than the random number; or
Pooling operations, including maximum pooling or average pooling.
Optionally, the slave operation module includes an input neuron buffering unit for buffering discrete or continuous input neuron vectors.
Optionally, the interconnection module forms a data path for continuous or discretized data between the master operation module and the plurality of slave operation modules, and may be implemented in different interconnection topologies. In one embodiment, it has an H-tree structure: a binary tree path composed of a plurality of nodes, where each node sends upstream discrete or continuous data identically to its two downstream nodes, and combines the discrete or continuous data returned by its two downstream nodes and returns it to the upstream node.
Optionally, the main operation module includes an operation unit, a data dependency relationship determination unit, and a neuron cache unit, where:
the neuron cache unit is used for caching discrete or continuous expressed input data and output data used by the main operation module in the calculation process;
the operation unit completes various operation functions of the main operation module, and when the input data is mixed data of discrete data and continuous data, a preset corresponding calculation mode is adopted for different discrete data;
the data dependency relationship judgment unit is the port through which the operation unit reads and writes the neuron cache unit; it ensures that there is no consistency conflict in the reading and writing of continuous or discrete data in the neuron cache unit, and it is responsible for reading the input discrete or continuous neuron vector from the neuron cache unit and sending it to the slave operation modules through the interconnection module; and
the intermediate result vectors from the interconnect module are sent to the arithmetic unit.
Optionally, each slave operation module includes an operation unit, a data dependency relationship determination unit, a neuron buffer unit, and a weight buffer unit, where:
the arithmetic unit receives the microinstruction sent by the controller unit and carries out arithmetic logic operation, and when the input data is mixed data of discrete data and continuous data, a preset corresponding calculation mode is adopted for different discrete data;
the data dependency relationship judgment unit is responsible for the read and write operations on the neuron cache unit supporting discrete data representation and the weight cache unit supporting discrete data representation during the calculation process, and guarantees that there is no consistency conflict between reads and writes of these cache units;
the neuron caching unit caches data of input neuron vectors and output neuron values obtained by calculation of the slave operation module; and
the weight buffer unit buffers the weight vector which is required by the slave operation module in the calculation process and is expressed discretely or continuously.
Optionally, the data dependency relationship judgment unit ensures that there is no consistency conflict between reads and writes as follows: it judges whether a dependency exists between the data of a microinstruction that has not yet issued and a microinstruction that is still executing; if not, the microinstruction is allowed to issue immediately; otherwise, it is allowed to issue only after all microinstructions it depends on have finished executing.
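As a minimal illustration (an assumption, not the patent's mechanism), the overlap test behind this rule can be sketched with microinstruction data ranges modeled as inclusive address intervals:

```python
# A microinstruction may issue only if its read range does not overlap the
# write range of any microinstruction still executing; otherwise it waits.
def may_issue(read_range, in_flight_writes):
    lo, hi = read_range
    return all(hi < w_lo or lo > w_hi for (w_lo, w_hi) in in_flight_writes)

assert may_issue((0, 15), [(16, 31)])      # disjoint ranges: issue now
assert not may_issue((0, 15), [(8, 23)])   # overlap: wait for the write
```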
Optionally, the operation unit in the master operation module or the slave operation module includes an operation decision unit and a mixed data operation unit, when the input data is mixed data, the operation decision unit decides what operation should be performed on the mixed data according to discrete data therein, and then the mixed data operation unit performs the corresponding operation according to the decision result of the operation decision unit.
Optionally, the operation unit in the master operation module or the slave operation module further includes at least one of a discrete data operation unit and a continuous data operation unit, and a data type determination unit, when all the input data are discrete data, the discrete data operation unit performs corresponding operations according to the input discrete data through table lookup, and when all the input data are continuous data, the continuous data operation unit performs corresponding operations.
Optionally, the apparatus further includes a continuous-discrete conversion unit, which includes a preprocessing module, a distance calculation module, and a judgment module. Suppose M discrete data are used, where M = 2^m and m >= 1, and the discrete data correspond to M values in a predetermined interval [-zone, zone], where:
the preprocessing module preprocesses the input continuous data x with a clip(-zone, zone) operation to obtain preprocessed data y in the interval [-zone, zone], where y = -zone if x <= -zone, y = zone if x >= zone, and y = x if -zone < x < zone;
the distance calculation module calculates the distance between the preprocessed data y and each numerical value; and
the judgment module calculates and outputs discrete data based on the distance.
Optionally, the predetermined interval [-zone, zone] is [-1, 1] or [-2, 2]; and/or the absolute values of the M values are reciprocals of powers of 2; and/or the judgment module performs either of the following: it outputs the discrete data corresponding to the value closest to the preprocessed data y (if two values are equally close, it outputs the discrete data corresponding to either one of them); or it calculates the normalized probability of the preprocessed data y with respect to each of the two closest values, compares the normalized probability of one of the two values with a random number z in (0, 1) generated by a random number generation module, and outputs that discrete data if z is smaller than the probability, otherwise outputs the other discrete data.
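A minimal Python sketch of this continuous-discrete conversion unit, assuming zone = 1 and an illustrative codebook of M = 4 values (-1, -1/2, 1/2, 1); both judgment rules described above are shown:

```python
import random

VALUES = [-1.0, -0.5, 0.5, 1.0]  # illustrative codebook; |v| is 1/2^k
ZONE = 1.0

def to_discrete(x: float, stochastic: bool = False) -> int:
    # Preprocessing module: clip(-zone, zone).
    y = max(-ZONE, min(ZONE, x))
    # Distance calculation module: distance from y to every value.
    dists = [abs(y - v) for v in VALUES]
    order = sorted(range(len(VALUES)), key=lambda i: dists[i])
    if not stochastic:
        return order[0]  # judgment module: nearest value's code
    # Stochastic judgment: choose between the two closest values with
    # normalized probability, using a random number z in (0, 1).
    i, j = order[0], order[1]
    total = dists[i] + dists[j]
    p_i = 1.0 if total == 0 else dists[j] / total
    return i if random.random() < p_i else j

print(to_discrete(0.3))  # code 2, i.e. the value 1/2
```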
Another aspect of the present invention provides a method of performing the single-layer artificial neural network forward operation using the above apparatus. Through the provided instruction set, the controller controls the reading in of the neurons, weights, constants, and other data required by the operation; these data may or may not be represented as discrete data. The master operation module, slave operation modules, and interconnection module then complete the multiply, add, bias, and activation process on the neuron data and weight data. In particular, for multiplication on data represented discretely, the multiplication is replaced by bit operations on the related data according to the value of the discrete data. For example, if the weight data is represented by 1-bit discrete data, with 0 representing +1 and 1 representing -1, multiplication by the weight is realized by an exclusive-or on the sign bit of the data multiplied by the weight.
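A minimal sketch of this sign-bit trick, assuming IEEE-754 single-precision continuous data; the hardware XORs the sign bit directly, which the struct round-trip below merely emulates:

```python
import struct

# Weight code 0 stands for +1, code 1 for -1 (as in the example above);
# multiplying by the weight reduces to XOR-ing the float's sign bit.
def multiply_by_1bit_weight(x: float, weight_code: int) -> float:
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits ^= (weight_code & 1) << 31  # XOR the sign bit with the code
    return struct.unpack('<f', struct.pack('<I', bits))[0]

assert multiply_by_1bit_weight(3.5, 0) == 3.5    # weight +1
assert multiply_by_1bit_weight(3.5, 1) == -3.5   # weight -1
```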
Another aspect of the present invention provides a method of supporting artificial neural network batch normalization using the above apparatus. Through the provided instruction set, the controller controls the data access unit to read in the input data, then controls the master and slave operation modules to compute the mean and variance at each position according to the batch size, or to use preset mean and variance values. The controller then controls the device to subtract the mean from the input data at the corresponding position and divide by the square root of the variance. Finally, the controller multiplies the processed data by one learning parameter and adds the other learning parameter.
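A minimal NumPy sketch of this batch normalization flow; the epsilon term and the parameter names gamma and beta are conventional assumptions, not taken from the patent:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-position mean over the batch
    var = x.var(axis=0)                      # per-position variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # subtract mean, divide by std
    return gamma * x_hat + beta              # scale by one learning parameter, add the other

x = np.random.randn(8, 4)                    # batch of 8 samples, 4 positions
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```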
Another aspect of the present invention provides a method for performing a forward operation of a multi-layered artificial neural network using the above apparatus. The implementation process is similar to that of a single-layer neural network, and after the execution of the artificial neural network of the previous layer is finished, the operation instruction of the next layer takes the output neuron address of the previous layer stored in the main operation unit as the input neuron address of the current layer. Similarly, the weight address and the offset address in the instruction are also changed to the corresponding address of the current layer.
The invention may be applied in the following (including but not limited to) scenarios: the system comprises various electronic products such as a data processing device, a robot, a computer, a printer, a scanner, a telephone, a tablet computer, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device and a wearable device; various vehicles such as airplanes, ships, vehicles, and the like; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and the like; and various medical devices including nuclear magnetic resonance apparatuses, B-ultrasonic apparatuses, electrocardiographs and the like.
Drawings
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing artificial neural network forward operations in support of discrete data representations, according to an embodiment of the present invention.
Fig. 2 schematically shows the structure of an H-tree module (one implementation of an interconnection module) in an apparatus for performing artificial neural network forward operations supporting discrete data representation according to an embodiment of the present invention.
FIG. 3 illustrates an example block diagram of a structure of a main operation module in an apparatus for performing artificial neural network forward operations supporting discrete data representation in accordance with an embodiment of the present invention.
FIG. 4 illustrates an example block diagram of a slave operation module structure in an apparatus for performing artificial neural network forward operations in support of discrete data representations in accordance with an embodiment of the present invention.
FIG. 5 shows an example block diagram of a neural network forward operation process, according to an embodiment of the present invention.
FIG. 6 illustrates an example block diagram of a neural network reverse training process that supports discrete data representation in accordance with an embodiment of the present invention.
FIG. 7 is a flow chart illustrating operation of a single-layer artificial neural network in accordance with an embodiment of the present invention.
Fig. 8 shows an exemplary structure of an arithmetic unit according to an embodiment of the present invention.
FIG. 9 illustrates an example structure of a continuous discrete translation module for continuous data and discrete data translation, according to an embodiment of the invention.
Like devices, components, units, etc. are designated with like reference numerals throughout the drawings.
Detailed Description
Other aspects, advantages and salient features of the invention will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the invention.
In the present invention, the terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The forward operation of the multilayer artificial neural network supporting discrete data representation according to embodiments of the invention covers networks of two or more layers of neurons. For each layer, a dot product operation is first performed on the input neuron vector and the weight vector, and the result passes through an activation function to obtain the output neuron. The activation function can be sigmoid, tanh, relu, softmax, and the like, and both discrete and continuous representation of the activated output neurons are supported.
For the dot product operation on an input neuron vector or a weight vector represented by discrete data, the device supports converting the dot product into shift, NOT, exclusive-or, and similar bit operations on the data. For the data representation, the device supports both discrete and non-discrete representation: the user can customize which data of which layer adopt the discrete form, and can customize the number of bits of the discrete data according to specific needs, i.e., how many real values are represented; for example, discrete data of 1, 2, or 3 bits can represent 2, 4, or 8 real values, respectively.
FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing artificial neural network forward operations in support of discrete data representations, according to an embodiment of the present invention. As shown in fig. 1, the apparatus includes an instruction cache unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a master operation module 5, a plurality of slave operation modules 6, and optionally a continuous discrete conversion module 7. The instruction cache unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the master operation module 5, the slave operation module 6, and the continuous discrete transformation module 7 may be implemented by hardware circuits (for example, including but not limited to an FPGA, a CGRA, an application specific integrated circuit ASIC, an analog circuit, a memristor, etc.). In particular, the apparatus may provide storage and operational support for discrete data.
The instruction cache unit 1 reads in instructions through the data access unit 3 and caches the read instructions.
The controller unit 2 reads instructions from the instruction cache unit 1 and translates the instructions into micro-instructions that control the behavior of other modules, such as the data access unit 3, the master operation module 5, and the slave operation module 6.
The data access unit 3 can access the external address space and directly read and write data to each cache unit inside the device to complete the loading and storing of data. The data is represented discretely or non-discretely, and the unit is designed to read data in discrete representation.
The interconnection module 4 is used for connecting the master operation module and the slave operation modules, and can be implemented in different interconnection topologies (such as a tree structure, a ring structure, a grid structure, hierarchical interconnection, a bus structure, etc.).
Fig. 2 schematically shows an embodiment of the interconnection module 4: an H-tree module. The H-tree module 4 constitutes the data path between the master operation module 5 and the plurality of slave operation modules 6, and has an H-tree structure. The H-tree is a binary tree path formed by a plurality of nodes: each node sends upstream data identically to its two downstream nodes, and combines the data returned by its two downstream nodes and returns it to the upstream node. For example, at the start of the calculation stage of each layer of the artificial neural network, the neuron data in the master operation module 5, which may be in discrete or non-discrete representation, is sent to each slave operation module 6 through the H-tree module 4; after the calculation of the slave operation modules 6 is completed, the neuron values output by the slave operation modules are pieced together stage by stage in the H-tree into a complete vector of neurons, the intermediate result vector. For operations on discrete data representations, the master and slave operation modules contain operation modules dedicated to discrete data, see Fig. 7. Taking a fully connected layer of the neural network as an example: suppose the device has N slave operation modules in total; then the intermediate result vector is segmented by N, each segment has N elements, and the ith slave operation module calculates the ith element of each segment. The N elements are spliced by the H-tree module into a vector of length N and returned to the master operation module. Thus, if the network has only N output neurons, each slave operation unit only needs to output the value of a single neuron; if the network has m × N output neurons, each slave operation unit needs to output m neuron values. The H-tree module supports discrete data representation during both storage and transmission of data.
Fig. 3 is a block diagram illustrating an example of the structure of the main operation module 5 in the apparatus for performing artificial neural network forward operation according to the embodiment of the present invention. As shown in fig. 3, the main operation module 5 includes an operation unit 51, a data dependency relationship determination unit 52, and a neuron buffer unit 53 supporting discrete data representation.
The neuron buffer unit 53 supporting discrete data representation is used for buffering input data and output data used in the calculation process of the main operation module 5.
The operation unit 51 performs the various operation functions of the master operation module 5. For the case where all operation factors are discrete data, addition, subtraction, multiplication, and division between discrete data can be realized by table lookup. For example, 2-bit discrete data can represent 4 continuous data values, giving 4 × 4 = 16 combinations of operands. For each of the addition, subtraction, multiplication, and division operations, a 4 × 4 index table can be created and maintained, and the computed value is found through the index table; 4 index tables of size 4 × 4 are needed for the 4 operations.
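A minimal sketch of this table-lookup scheme, reusing the illustrative 2-bit codebook from earlier; one precomputed 4 × 4 table per operation replaces the arithmetic circuit:

```python
VALUES = [-1.0, -0.125, 0.125, 1.0]  # the illustrative 2-bit codebook

# One 4 x 4 table per operation; results are kept as real numbers here.
MUL_TABLE = [[a * b for b in VALUES] for a in VALUES]
ADD_TABLE = [[a + b for b in VALUES] for a in VALUES]

def discrete_mul(code_a: int, code_b: int) -> float:
    # Index the precomputed table instead of running a multiplier.
    return MUL_TABLE[code_a][code_b]

print(discrete_mul(0b00, 0b11))  # (-1) * 1 = -1.0
```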
For the case where the operation factors include both discrete data and continuous data, corresponding bit operations can be preset for the addition, subtraction, multiplication, and division associated with each discrete value. For example, the dot product of discrete data and continuous data may be replaced by a bitwise exclusive-or followed by multiplication by the corresponding power of 2 and cumulative summation. For multiplication, if a multiplication factor is represented discretely, the multiplication with the continuous data it represents can be replaced by operations indexed by the discrete code (e.g., bitwise exclusive-or, NOT, and shift on the corresponding data), thereby reducing the number of multiplier components. Consider, for instance, multiplying the continuous value 16 by the discrete value -1/2. A conventional multiplier component would multiply -1/2 by 16 directly. In the operation unit 51, since the number of possible discrete values is small, the function of a multiplier can instead be replaced by switch-style judgments such as an index lookup. For example, the discrete representation of -1/2 may be specified as 01; if an operation factor is -1/2, the discrete code received by the operation unit 51 is 01, and the operation unit 51 performs the operation corresponding to the code 01: 16 is represented by the 8-bit fixed-point number 00010000; shifting it right by 1 bit gives 00001000, and inverting the sign bit gives 10001000, i.e., -8 in decimal. For division, consider 16 divided by -2, where 16 is continuous data and -2 is discrete data whose binary representation is specified as 10. The operation unit performs the division corresponding to the code 10: the 8-bit fixed-point representation 00010000 of 16 is shifted right by 1 bit to give 00001000, and the sign bit is inverted to give 10001000, i.e., -8 in decimal. Addition and subtraction are similar: using the binary code of the discrete data as an index, bitwise left-shift, right-shift, exclusive-or, and similar operations are indexed, realizing addition or subtraction with the real data represented by the discrete data.
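A minimal sketch of the mixed discrete-continuous multiplication above, treating the continuous operand as a sign-magnitude fixed-point integer; the rule for code 01 (multiply by -1/2) follows the example in the text, while the other entry is an illustrative assumption:

```python
# Each discrete code maps to (right-shift amount, sign flip); the shift
# and sign inversion replace a full multiplier.
MUL_RULES = {
    0b01: (1, True),   # code 01 = -1/2: shift magnitude right by 1, flip sign
    0b11: (0, False),  # assumed code 11 = +1: leave the value unchanged
}

def mixed_multiply(x: int, code: int) -> int:
    shift, flip = MUL_RULES[code]
    negative = x < 0
    magnitude = abs(x) >> shift            # the shift replaces the multiply
    return -magnitude if (negative ^ flip) else magnitude

print(mixed_multiply(16, 0b01))  # 16 * (-1/2) = -8
```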
The dependency relationship determination unit 52 is a port of the operation unit 51 for reading and writing the neuron buffer unit 53, and can ensure the consistency of reading and writing data in the neuron buffer unit. Meanwhile, the data dependency relationship determination unit 52 is also responsible for sending the read data to the slave operation module through the interconnection module 4, and the output data of the slave operation module 6 is directly sent to the operation unit 51 through the interconnection module 4. The instruction output by the controller unit 2 is sent to the calculation unit 51 and the data dependency relationship judgment unit 52 to control the behavior thereof.
Fig. 4 shows an example block diagram of the structure of the slave operation module 6 in the apparatus for performing artificial neural network forward operation supporting discrete data representation according to the embodiment of the present invention. As shown in fig. 4, each slave operation module 6 includes an operation unit 61, a data dependency relationship determination unit 62, a neuron buffering unit 63 supporting discrete data representation, and a weight value buffering unit 64 supporting discrete data representation.
The operation unit 61 receives the microinstruction issued by the controller unit 2 and performs arithmetic logic operations. It supports the same discrete-data schemes as described above for the operation unit 51: when all operation factors are discrete data, addition, subtraction, multiplication, and division are realized by table lookup (for 2-bit discrete data, one 4 × 4 index table per operation); when the operation factors mix discrete and continuous data, preset bit operations indexed by the discrete code (bitwise exclusive-or, NOT, shift, and the like) replace the corresponding multiplication, division, addition, or subtraction, as in the 16 × (-1/2) example above.
The data dependency relationship determination unit 62 is responsible for reading and writing operations on the neuron cache unit in the calculation process. Before the data dependency judgment unit 62 performs the read/write operation, it is first ensured that there is no read/write consistency conflict for the data used between the instructions. For example, all microinstructions destined for the data dependency unit 62 are stored in an instruction queue within the data dependency unit 62, in which queue a read instruction must wait until the dependent write instruction is executed if the read data range of the read instruction conflicts with the write data range of the write instruction located earlier in the queue.
The neuron buffer unit 63 supporting discrete data representation buffers the input neuron vector data and the output neuron value data of the slave operation module 6. The data may be stored and transmitted in the form of discrete data.
The weight buffer unit 64 supporting discrete data representation buffers the weight data required by the slave operation module 6 in the calculation process. The data may or may not be discretely represented according to user definition. For each slave operation module 6, only the weights between all input neurons and part of the output neurons are stored. Taking the fully-connected layer as an example, the output neurons are segmented according to the number N of the slave operation units, and the weight corresponding to the nth output neuron of each segment is stored in the nth slave operation unit.
The slave operation modules 6 realize the first half of the forward operation of each layer of the artificial neural network, which can be carried out in parallel. Data storage and operation within the module support discrete data representation. Taking an artificial neural network fully connected layer (MLP) as an example, the process is y = f(wx + b), where the multiplication of the weight matrix w and the input neuron vector x can be divided into unrelated parallel computing subtasks; y and x are column vectors, and each slave operation module 6 computes only the products of the corresponding part of the scalar elements of x with the corresponding columns of the weight matrix w. Each obtained output vector is a partial sum of the final result to be accumulated, and these partial sums are added pairwise in the interconnection module 4 to obtain the final result, which may be represented as discrete data. The calculation process thus becomes a parallel partial-sum calculation followed by accumulation. Each slave operation module 6 calculates output neuron values, and all output neuron values are spliced in the interconnection module 4 into an intermediate result vector. Each slave operation module 6 only needs to calculate the output neuron values of the intermediate result vector y that correspond to this module. The interconnection module 4 combines all neuron values output by the slave operation modules 6 to obtain the final intermediate result vector y. The master operation module 5 performs subsequent calculations based on the intermediate result vector y, such as biasing, pooling (e.g., max pooling (MAXPOOLING) or mean pooling (AVGPOOLING)), activation, and sampling.
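A minimal NumPy sketch of this parallel decomposition, following the output-neuron segmentation described for the weight cache unit above; N, the layer shape, and relu as the activation f are illustrative assumptions:

```python
import numpy as np

N = 4                          # number of slave operation modules (assumed)
W = np.random.randn(8, 16)     # 8 output neurons, 16 input neurons
x = np.random.randn(16)
b = np.random.randn(8)

# Slave module i holds the weights of output neurons i, i+N, i+2N, ...
# and computes their results from the broadcast input vector x.
partials = {i: W[i::N] @ x for i in range(N)}

# The interconnection module splices the slave outputs back into the
# intermediate result vector in the original neuron order.
y = np.empty(8)
for i in range(N):
    y[i::N] = partials[i]

# The master operation module then biases and activates: out = f(Wx + b).
out = np.maximum(y + b, 0)
assert np.allclose(out, np.maximum(W @ x + b, 0))
```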
Fig. 8 shows a block diagram of the arithmetic unit, which can be used for the arithmetic unit 51 in the master arithmetic block or the arithmetic unit 61 in the slave arithmetic block. The input data during the operation can be discrete data or continuous data. The data type determination unit 71 determines whether the input data is all continuous data, all discrete data, or mixed data containing both continuous data and discrete data. When the input data are all continuous data, the continuous data operation unit 72 performs the corresponding operation.
When the input data are all discrete data, the discrete data operation unit 73 performs the corresponding operation by table lookup, using the scheme described above: for 2-bit discrete data representing 4 continuous values there are 4 × 4 = 16 operand combinations, and one 4 × 4 index table is created and maintained for each of the addition, subtraction, multiplication, and division operations.
When the input data is mixed data, the operation decision unit 74 decides, according to the discrete data within it, which operation should be performed; the corresponding operations may be preset for each discrete value. The mixed data operation unit 75 then performs the operation according to the decision result of the operation decision unit 74. As described above for the operation units 51 and 61, multiplication, division, addition, and subtraction with a discretely represented factor are replaced by bitwise exclusive-or, NOT, shift, and similar operations indexed by the discrete code; for example, 16 × (-1/2), with -1/2 coded as 01, is computed by shifting the 8-bit fixed-point value 00010000 right by 1 bit and inverting the sign bit, giving 10001000, i.e., -8 in decimal.
Fig. 9 shows the continuous-discrete conversion unit. The user may define whether to employ this module to convert continuous data into discrete data. Continuous data is input and discrete data is output. The unit comprises a random number generation module, a judgment module, and an operation module. The judgment module compares a generated random number with the result of the operation module applied to the input continuous data, judging which interval the random number falls in, and thereby determines the specific value of the output discrete data. For example, suppose the user defines that binary discrete data is to be produced. For any input continuous data x, the operation module calculates the result y = abs(clip(x, -1, 1)). The judgment module then outputs the discrete data 1 if the random number is larger than y, and 0 otherwise; the discrete data 1 and 0 represent the continuous data -1 and +1, respectively. The obtained discrete data is stored back into memory, where it waits to be used by the operation units in the master and slave operation modules to generate the corresponding operations.
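A minimal sketch of this binarization example exactly as stated in the text (code 1 standing for -1 and code 0 for +1); the function name is an illustrative assumption:

```python
import random

def stochastic_binarize(x: float) -> int:
    y = abs(max(-1.0, min(1.0, x)))         # operation module: abs(clip(x, -1, 1))
    return 1 if random.random() > y else 0  # judgment module, per the text

codes = [stochastic_binarize(0.8) for _ in range(10)]  # mostly 0, i.e. +1
```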
The weight data and the output and input data in the forward process may or may not be represented by discrete data. Multiplication on continuous data can be replaced by exclusive-or, NOT, shift, and similar operations based on the discrete data. For example, if the weight is represented by 1-bit discrete data, with 0 representing +1 and 1 representing -1, multiplication by the weight is realized by an exclusive-or on the sign bit of the data multiplied by the weight.
According to the embodiment of the invention, an instruction set for executing the forward operation of the artificial neural network on the device is also provided. The instruction set comprises a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, a MOVE instruction and the like, wherein:
the CONFIG instruction configures, before the calculation of each layer of the artificial neural network starts, the various constants required by the calculation of the current layer;
the COMPUTE instruction completes the arithmetic logic calculation of each layer of artificial neural network;
the IO instruction realizes reading input data required by calculation from an external address space and storing the data back to the external space after the calculation is finished, and the data support discretization representation;
the NOP instruction is responsible for emptying all microinstructions in all microinstruction cache queues in the current device, and all instructions before the NOP instruction are guaranteed to be finished. NOP instructions do not contain any operations themselves;
the JUMP instruction is responsible for the JUMP of the next instruction address to be read from the instruction cache unit by the controller and is used for realizing the JUMP of a control flow;
the MOVE instruction is responsible for carrying data at one address in the internal address space of the device to another address in the internal address space of the device, and the process is independent of the arithmetic unit and does not occupy the resources of the arithmetic unit in the execution process.
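As an illustrative sketch only, the instruction set above can be pictured as a controller-side decode table; the opcode names come from the list above, while the dispatch structure and target descriptions are assumptions:

```python
from enum import Enum, auto

class Op(Enum):
    CONFIG = auto()   # set per-layer constants
    COMPUTE = auto()  # run the layer's arithmetic logic
    IO = auto()       # move data to/from the external address space
    NOP = auto()      # drain the microinstruction queues
    JUMP = auto()     # redirect the next instruction address
    MOVE = auto()     # copy data within the internal address space

def decode(op: Op, operands: tuple) -> tuple:
    # Map each opcode to the module it drives (per the descriptions above).
    targets = {
        Op.CONFIG: "all modules (configuration)",
        Op.COMPUTE: "master and slave operation modules",
        Op.IO: "data access unit",
        Op.NOP: "microinstruction cache queues",
        Op.JUMP: "controller (next instruction address)",
        Op.MOVE: "internal data path (no operation unit involved)",
    }
    return (targets[op], operands)

print(decode(Op.IO, ("ext_addr", "neuron_cache")))
```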
FIG. 5 shows an example block diagram of the neural network forward operation process according to an embodiment of the present invention. In each slave operation module 6, a dot product operation is performed on the input neuron vector and the weight vector of that slave operation module 6 to obtain the corresponding output neuron values; all of these output neuron values compose the intermediate result vector, which, after an offset-vector addition and an activation operation, yields the final output neuron vector of this layer of the neural network. The formula is out = f(w × in + b), where out is the output neuron vector, in is the input neuron vector, b is the offset vector, w is the weight matrix, and f is the activation function. The weight vector of each slave operation module 6 is the column vector of the weight matrix corresponding to that slave operation module 6. The interconnection module sends the input neuron vector [in0, ..., inN] to all slave operation units, where it is temporarily stored in the neuron cache units. The ith slave operation unit calculates the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector. The results output by the slave operation units are spliced through the interconnection module into a complete output vector and returned to the master operation unit, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
FIG. 6 is a block diagram illustrating one implementation of artificial neural network forward computation with a single layer supporting discrete data representations, according to one embodiment. The flow chart describes the process of implementing an artificial neural network forward operation of a single-layer discrete data representation as shown in figure 5 using the apparatus and instruction set of the present invention.
Step S1.1, storing the initial instruction into an instruction storage unit 1;
step S1.2, reading an instruction from the instruction storage unit 1;
step S1.3, decoding the instruction;
Step S1.4, performing the corresponding operation according to the control signal obtained by decoding;
step S1.5, the operation result is written back to the corresponding storage.
In step S1.1, an initialization IO command may be stored for carrying subsequent commands.
In step S1.2, the readable instructions include, but are not limited to, the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, the MOVE instruction, and the like.
In step S1.3, the control signal of the corresponding module is obtained by decoding according to the operation type of the instruction (CONFIG, COMPUTE, IO, NOP, JUMP, MOVE, etc.). For the CONFIG instruction, the configuration information for configuring the remaining modules is obtained by decoding. For the COMPUTE instruction, the control signals of the master and slave operation modules are obtained by decoding, controlling the corresponding operations adopted for different discrete data. For the IO instruction, the control signal of the data access module is obtained by decoding. For the NOP instruction, no actual control signal is generated; it is only used to empty the control signals in all control signal cache queues in the current device, guaranteeing that all instructions before the NOP instruction have finished executing. For the JUMP instruction, the control signal for the jump of the instruction stream is obtained. For the MOVE instruction, the control signals for data transfer inside the device are obtained.
In step S1.4, the modules 2-6 perform the corresponding operations according to the control signals. Taking as an example the execution of a COMPUTE instruction supporting the forward operation of a neural network represented by discrete data: the interconnection module sends the input neuron vector [in0, ..., inN] to all slave operation modules, where it is temporarily stored in the neuron cache units. The ith slave operation module calculates the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector. The results output by the slave operation modules are spliced through the interconnection module into a complete output vector and returned to the master operation module, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
In step S1.5, each module writes back the operation result to the corresponding cache. Taking the forward operation of the neural network represented by discrete data as an example, the output neuron vector obtained by the main operation module is written back to the storage unit.
FIG. 7 is another more detailed implementation showing a single-layer artificial neural network forward operation, in accordance with one embodiment. The flow chart describes the process of implementing a single layer neural network forward operation as shown in figure 4 using the apparatus and instruction set of the present invention.
In step S1, an IO instruction is pre-stored at the first address of instruction cache unit 1.
In step S2, the operation starts, the controller unit 2 reads the IO instruction from the first address of the instruction cache unit 1, and according to the translated micro instruction, the data access unit 3 reads all corresponding artificial neural network operation instructions from the external address space and caches them in the instruction cache unit 1.
In step S3, the controller unit 2 then reads in the next IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit 3 reads all the data (e.g., including input neuron vectors, interpolation tables, constant tables, offsets, etc.) required by the main operation module 5 from the external address space to the neuron cache unit 53 of the main operation module 5, which data supports a discrete representation, which may be wholly or partially discrete.
In step S4, the controller unit 2 then reads in the next IO instruction from the instruction cache unit; according to the translated microinstruction, the data access unit 3 reads the weight matrix data required by the slave operation modules 6 from the external address space. The data supports discrete representation and may be wholly or partially discrete.
In step S5, the controller unit 2 then reads the next CONFIG instruction from the instruction cache unit, and based on the translated microinstruction, the device configures the various constants needed for this layer's neural network computation. For example, the operation units 51 and 61 configure the values of their internal registers according to the parameters in the microinstruction, where the parameters include, for example, the precision setting of this layer's computation and the data of the activation function (e.g., the precision bits of this layer's computation, the range parameter of the LRN layer algorithm, the reciprocal of the window size of the AveragePooling layer algorithm, etc.).
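For illustration only, the per-layer constants that such a CONFIG instruction loads can be pictured as a small register file; the field names and values below are hypothetical examples drawn from the parameters just listed:

    from dataclasses import dataclass

    @dataclass
    class LayerConfig:
        precision_bits: int            # computation precision of this layer
        lrn_range: int                 # 'range' parameter of the LRN algorithm
        avg_pool_window_recip: float   # reciprocal of average-pooling window size

    cfg = LayerConfig(precision_bits=16, lrn_range=5, avg_pool_window_recip=1 / 9)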
In step S6, the controller unit 2 then reads the next COMPUTE instruction from the instruction cache unit, and according to the translated microinstruction, the master operation module 5 first sends the input neuron vector to each slave operation module 6 through the interconnection module 4, where it is saved in the neuron cache unit 63 of the slave operation module 6.
In step S7, according to the microinstruction decoded from the COMPUTE instruction, the operation unit 61 of the slave operation module 6 reads the weight vector (the column vector of the weight matrix corresponding to this slave operation module 6) from the weight cache unit 64, reads the input neuron vector from the neuron cache unit, completes the dot product operation of the weight vector and the input neuron vector, and returns the intermediate result through the interconnection. For discrete data, user-defined bitwise operations such as exclusive-OR may replace the multiplications within the dot product operation. For example, for a 1-bit discrete data representation in which 0 represents +1 and 1 represents -1, multiplication by a weight is implemented by XORing the sign bit of the data with the sign bit of the weight.
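The 1-bit example just given can be checked with a short sketch: since 0 encodes +1 and 1 encodes -1, the sign of each product is simply the XOR of the two sign bits, and in hardware the final sum reduces to a population count:

    def discrete_mul(w_bit, x_bit):
        # 0 encodes +1, 1 encodes -1; the product's sign bit is the XOR.
        return w_bit ^ x_bit

    def discrete_dot(w_bits, x_bits):
        # Dot product over 1-bit data: XOR sign bits, then count +1s minus -1s.
        products = [discrete_mul(w, x) for w, x in zip(w_bits, x_bits)]
        return sum(1 if p == 0 else -1 for p in products)

    # [+1, -1, +1] . [+1, +1, -1] = 1 - 1 - 1 = -1
    assert discrete_dot([0, 1, 0], [0, 0, 1]) == -1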
In step S8, in the interconnection module 4, the intermediate results returned by the slave operation modules 6 are spliced step by step into a complete intermediate result vector.
In step S9, the master operation module 5 obtains the return value of the interconnection module 4, and, according to the microinstruction decoded from the COMPUTE instruction, reads the offset vector from the neuron cache unit 53, adds it to the vector returned by the interconnection module 4, activates the addition result, and writes the final output neuron vector back to the neuron cache unit 53.
In step S10, the controller unit then reads the next IO instruction from the instruction cache unit, and based on the translated microinstruction, the data access unit 3 stores the output neuron vector in neuron cache unit 53 to the specified address in the external address space, and the operation ends.
The operation steps of batch normalization (Batch Normalization) for the artificial neural network are similar to the above process. With the provided instruction set, the controller completes the following process: the controller controls the data access unit to read in the input data, and then controls the master and slave operation modules to compute the mean and variance at each position according to the batch size, or to use preset mean and variance values. The controller then controls the modules to subtract the mean from the input data at the corresponding position and divide by the standard deviation. Finally, the controller controls the modules to multiply the processed data by one learned parameter and add another learned parameter.
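As a software sketch of this batch-normalization flow (assuming the standard normalization with a small epsilon for numerical stability; the parameter names gamma and beta for the two learned parameters are illustrative):

    import numpy as np

    def batch_normalize(x, gamma, beta, mean=None, var=None, eps=1e-5):
        """Compute (or reuse preset) per-position mean and variance over the
        batch, normalize, then scale and shift by the two learned parameters."""
        if mean is None:
            mean = x.mean(axis=0)   # per-position mean over the batch
        if var is None:
            var = x.var(axis=0)     # per-position variance over the batch
        normalized = (x - mean) / np.sqrt(var + eps)
        return gamma * normalized + beta

    batch = np.random.randn(8, 4)   # batch of 8 vectors, 4 positions each
    out = batch_normalize(batch, gamma=np.ones(4), beta=np.zeros(4))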
For a multilayer artificial neural network, the implementation process is similar to that of a single-layer neural network: after execution of the upper layer finishes, the operation instruction of the next layer takes the upper layer's output neuron address stored in the master operation unit as this layer's input neuron address. Similarly, the weight address and offset address in the instruction are changed to the corresponding addresses of the current layer.
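A minimal sketch of this layer-chaining scheme, with arrays standing in for the device's neuron addresses and np.tanh standing in for the configured activation function:

    import numpy as np

    def layer_forward(x, W, b):
        return np.tanh(W @ x + b)

    def run_multilayer(x, layers):
        # Each layer takes the previous layer's output as its input, with the
        # weight and offset switched to the current layer's values.
        for W, b in layers:
            x = layer_forward(x, W, b)
        return x

    layers = [(np.random.randn(5, 3), np.zeros(5)),
              (np.random.randn(2, 5), np.zeros(2))]
    out = run_multilayer(np.random.randn(3), layers)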
By adopting the device and the instruction set for executing the artificial neural network forward operation, the problems of insufficient CPU and GPU operation performance and high front-end decoding overhead are solved, and support for the forward operation of multilayer artificial neural networks is effectively improved.
By adopting a dedicated on-chip cache for the forward operation of the multilayer artificial neural network, the reusability of input neurons and weight data is fully exploited, repeated reading of data from memory is avoided, the memory access bandwidth is reduced, and the memory bandwidth is prevented from becoming a bottleneck for the forward operation performance of the multilayer artificial neural network.
By adopting the discrete data representation, compared with representations such as floating-point and fixed-point numbers, the storage energy consumption and other overheads of the device are greatly reduced. The structural layout can be optimized within a limited area, improving metrics such as operation speed and the performance-to-energy ratio.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods are described above in terms of some sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially. As for the representation of discrete data, it should be understood that one may select which data are represented discretely and which continuously; this choice of discrete versus continuous representation runs throughout the operation.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (21)

1. An apparatus for performing artificial neural network forward operations supporting discrete data representation, comprising an instruction cache unit, a controller unit, a data access unit, a computation unit, and a continuous discrete transformation module, wherein:
the data access unit is used for accessing the external address space, directly reading and writing data to each cache unit in the device, and completing the loading and storing of data;
the instruction cache unit is used for reading in the instruction;
the continuous discrete conversion module is used for converting data between discrete data types and continuous data types;
the controller unit is used for reading instructions from the instruction cache unit and decoding them into microinstructions that control the behavior of the computation unit;
the calculation unit is used for executing the operation between the input neuron vector of the layer and the weight vector to obtain a calculation result, wherein the input neuron vector is all discrete data or mixed data; the computing unit specifically includes: the system comprises an interconnection module, a master operation module and a slave operation module; the interconnection module is used for connecting the master operation module and the slave operation module;
the master operation module is used for transmitting the input neuron vectors of the layer to all the slave operation modules through the interconnection module;
the slave operation module is used for performing calculation on the input neuron vector and the weight vector corresponding to the slave operation module to obtain an output neuron value and sending the output neuron value to the master operation module through the interconnection module; at least one of the input neuron vector and the weight vector is discrete data;
the interconnection module is used for obtaining an intermediate result vector according to the output neuron value and sending the intermediate result vector to the main operation module;
the main operation module is used for executing subsequent calculation on the intermediate result to obtain a calculation result;
if the operation factors are all discrete data, a discrete data operation unit executes the corresponding operation; if the operation factors are mixed data, corresponding bit operations are preset for the addition, subtraction, multiplication, and division operations according to the different discrete data;
the mixed data is data containing both continuous data and discrete data.
2. The apparatus of claim 1, wherein there are a plurality of slave operation modules, and
the interconnection module is specifically used for splicing the output neuron values of the slave operation modules step by step into an intermediate result vector.
3. The apparatus of claim 2, wherein,
the plurality of slave operation modules are specifically configured to perform parallel computation on the same input neuron vector and different weight vectors corresponding to the respective slave operation modules to obtain a plurality of output neuron values.
4. The apparatus of claim 2, wherein the subsequent calculation is any one of:
a biasing operation, an activating operation, a pooling operation, or a sampling operation;
the activation function of the activation operation is any one of the nonlinear functions sigmoid, tanh, relu, and softmax;
the pooling operation includes maximum pooling or average pooling.
5. The apparatus of claim 2, wherein the slave operation module comprises an input neuron buffering unit for buffering an input neuron vector, the input neuron vector being discrete data or mixed data.
6. The apparatus of claim 2, wherein the master operation module comprises: the device comprises an operation unit, a data dependency relation judgment unit and a neuron cache unit, wherein:
the neuron cache unit is used for caching input data and output data of the main operation module in the calculation process;
an arithmetic unit for executing various operations of the main arithmetic module;
and the data dependency relationship judging unit is used for reading the input neuron vector from the neuron cache unit, sending the input neuron vector to the slave operation module through the interconnection module, receiving the intermediate result vector of the interconnection module and sending the intermediate result vector to the operation unit.
7. The apparatus of claim 6,
the operation unit is specifically used for realizing addition, subtraction, multiplication and division operations of the discrete data and the discrete data through table lookup when all the operation factors are the discrete data;
or the operation unit is specifically configured to preset a corresponding bit operation for addition, subtraction, multiplication and division operations for different discrete data when the operation factor includes discrete data and continuous data.
8. The apparatus of claim 2, wherein each slave operation module comprises: the device comprises a slave operation unit, a slave data dependency relation judgment unit, a slave neuron caching unit and a weight caching unit, wherein:
the slave operation unit is used for receiving the microinstruction sent by the controller unit and performing arithmetic logic operation;
the slave data dependency relationship judging unit is used for executing the read-write operation of the neuron caching unit and the weight caching unit;
the slave neuron caching unit is used for caching data of input neuron vectors and output neuron values obtained by calculation of the slave operation module;
and the weight caching unit is used for caching the weight vector required by the slave operation module in the calculation process.
9. The apparatus of claim 8,
the slave operation unit is specifically used for realizing addition, subtraction, multiplication and division operation of the discrete data and the discrete data by looking up a table when all the operation factors are the discrete data;
or the slave operation unit is specifically configured to preset a corresponding bit operation for addition, subtraction, multiplication and division operations for different discrete data when the operation factor includes discrete data and continuous data.
10. The apparatus according to claim 6 or 8, wherein the data dependency relationship judgment unit or the slave data dependency relationship judgment unit is specifically configured to judge whether there is a dependency relationship between the first data of an unexecuted control signal and the second data of a control signal in the process of being executed; if there is no dependency relationship, the unexecuted control signal is allowed to be executed immediately; if there is a dependency relationship, the unexecuted control signal is allowed to be executed only after all the control signals on which it depends have been completely executed.
11. The apparatus of claim 6 or 8, wherein the operation unit or slave operation unit comprises: an operation decision unit and a mixed data operation unit;
an operation decision unit for deciding, when the input data is mixed data, that a first operation should be performed on the mixed data according to discrete data in the mixed data;
and the mixed data operation unit is used for executing the first operation according to the decision result of the operation decision unit.
12. A method for performing artificial neural network forward operations supporting discrete data representation, the method being applied to an artificial neural network forward operation device, the forward operation device comprising an instruction cache unit, a controller unit, a data access unit, a computation unit, and a continuous discrete transformation module, wherein the method comprises the following steps:
the data access unit accesses the external address space and directly reads and writes data to each cache unit in the device to finish the loading and storage of the data;
the continuous discrete conversion module executes interchange between discrete data types and continuous data types on data;
the controller unit reads the instruction from the instruction cache unit and decodes the instruction into a microinstruction which controls the behavior of the computing unit;
the calculation unit executes the operation between the input neuron vector of the layer and the weight vector to obtain a calculation result, wherein the input neuron vector is all discrete data or mixed data; the computing unit specifically includes: the system comprises an interconnection module, a master operation module and a slave operation module; the calculation unit executes the operation between the input neuron vector and the weight vector of the layer to obtain a calculation result, specifically including:
the master operation module transmits the input neuron vectors of the layer to all the slave operation modules through the interconnection module; the slave operation module performs calculation on the input neuron vector and the weight vector corresponding to the slave operation module to obtain an output neuron value, and sends the output neuron value to the master operation module through the interconnection module; at least one of the input neuron vector and the weight vector is discrete data;
the interconnection module obtains an intermediate result vector according to the output neuron value and sends the intermediate result vector to the main operation module; the main operation module executes subsequent calculation on the intermediate result;
if the operation factors are all discrete data, a discrete data operation unit executes the corresponding operation; if the operation factors are mixed data, corresponding bit operations are preset for the addition, subtraction, multiplication, and division operations according to the different discrete data;
the mixed data is data containing both continuous data and discrete data.
13. The method of claim 12, wherein there are a plurality of slave operation modules, and the interconnection module obtaining the intermediate result vector according to the output neuron values specifically comprises:
and the interconnection module gradually splices the output neuron values of the slave operation modules into an intermediate result vector.
14. The method according to claim 13, wherein the slave operation module performing calculation on the input neuron vector and the weight vector corresponding to the slave operation module to obtain the output neuron value specifically comprises:
and the plurality of slave operation modules execute parallel calculation on the same input neuron vector and different weight vectors corresponding to the slave operation modules to obtain a plurality of output neuron values.
15. The method of claim 13, wherein the subsequent calculation is any one of:
a biasing operation, an activating operation, a pooling operation, or a sampling operation;
the activation function of the activation operation is any one of the nonlinear functions sigmoid, tanh, relu, and softmax;
the pooling operation includes maximum pooling or average pooling.
16. The method of claim 13, wherein the slave operation module comprises an input neuron buffering unit for buffering input neuron vectors, the input neuron vectors being either discrete data or mixed data.
17. The method of claim 13, wherein the master operation module comprises an operation unit, a data dependency relationship judgment unit, and a neuron cache unit, and the method specifically comprises the following steps:
the neuron cache unit caches input data and output data of the main operation module in the calculation process;
the operation unit executes various operations of the main operation module;
the data dependency relation judging unit reads an input neuron vector from the neuron cache unit and sends the input neuron vector to the slave operation module through the interconnection module; and receiving the intermediate result vector of the interconnection module and sending the intermediate result vector to the arithmetic unit.
18. The method according to claim 17, wherein the operation unit performing the various operations of the main operation module specifically comprises:
when all the operation factors are discrete data, the operation unit realizes the addition, subtraction, multiplication and division operation of the discrete data and the discrete data through table lookup;
or when the operation factor comprises discrete data and continuous data, the operation unit presets corresponding bit operation for addition, subtraction, multiplication and division operation aiming at different discrete data.
19. The method of claim 13, wherein each slave operation module comprises a slave operation unit, a slave data dependency relationship judgment unit, a slave neuron cache unit, and a weight cache unit, and the method specifically comprises the following steps:
the slave operation unit receives the microinstruction sent by the controller unit and performs arithmetic logic operations;
the slave data dependency relationship judgment unit executes the read-write operations on the neuron cache unit and the weight cache unit; the slave neuron cache unit caches the data of the input neuron vectors and the output neuron values calculated by the slave operation module; and the weight cache unit caches the weight vector required by the slave operation module in the calculation process.
20. The method of claim 19, wherein the slave operation unit receiving the microinstruction issued by the controller unit and performing arithmetic logic operations specifically comprises:
when all the operation factors are discrete data, the slave operation unit realizes addition, subtraction, multiplication and division operation of the discrete data and the discrete data through table lookup;
or when the operation factor comprises discrete data and continuous data, the slave operation unit presets corresponding bit operation for addition, subtraction, multiplication and division operation aiming at different discrete data.
21. The method of claim 17 or 19, wherein the operation unit or slave operation unit comprises an operation decision unit and a mixed data operation unit, and the method specifically comprises the following steps:
when the input data is mixed data, the operation decision unit decides to execute a first operation on the mixed data according to discrete data in the mixed data; the mixed data operation unit executes a first operation according to the decision result of the operation decision unit.
CN201811233426.6A 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation Active CN109358900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811233426.6A CN109358900B (en) 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610236955.6A CN107301453B (en) 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation
CN201811233426.6A CN109358900B (en) 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610236955.6A Division CN107301453B (en) 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation

Publications (2)

Publication Number Publication Date
CN109358900A CN109358900A (en) 2019-02-19
CN109358900B true CN109358900B (en) 2020-07-03

Family

ID=60136734

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201811233426.6A Active CN109358900B (en) 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation
CN201610236955.6A Active CN107301453B (en) 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201610236955.6A Active CN107301453B (en) 2016-04-15 2016-04-15 Artificial neural network forward operation device and method supporting discrete data representation

Country Status (1)

Country Link
CN (2) CN109358900B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315571B (en) * 2016-04-27 2020-07-31 中科寒武纪科技股份有限公司 Device and method for executing forward operation of full-connection layer neural network
CN109726809B (en) * 2017-10-30 2020-12-08 赛灵思公司 Hardware implementation circuit of deep learning softmax classifier and control method thereof
CN109725936B (en) * 2017-10-30 2022-08-26 上海寒武纪信息科技有限公司 Method for implementing extended computing instruction and related product
CN109961136B (en) * 2017-12-14 2020-05-19 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN109978157B (en) * 2017-12-28 2020-06-02 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN109993276B (en) * 2017-12-29 2021-10-26 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network reverse training
CN110163355B (en) * 2018-02-13 2020-10-09 上海寒武纪信息科技有限公司 Computing device and method
CN110196734A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 A kind of computing device and Related product
CN110472734B (en) * 2018-05-11 2024-03-29 上海寒武纪信息科技有限公司 Computing device and related product
CN110728364A (en) * 2018-07-17 2020-01-24 上海寒武纪信息科技有限公司 Arithmetic device and arithmetic method
US20210098001A1 (en) 2018-09-13 2021-04-01 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111124497B (en) * 2018-10-11 2022-03-29 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111026440B (en) * 2018-10-09 2022-03-29 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN110096309B (en) * 2018-11-14 2020-04-14 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN110096310B (en) * 2018-11-14 2021-09-03 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111047021B (en) * 2018-10-12 2023-11-24 中科寒武纪科技股份有限公司 Computing device and related product
CN111047022B (en) * 2018-10-12 2023-11-24 中科寒武纪科技股份有限公司 Computing device and related product
CN112348177B (en) * 2019-07-05 2024-01-09 安徽寒武纪信息科技有限公司 Neural network model verification method, device, computer equipment and storage medium
CN111104513B (en) * 2019-12-13 2023-05-02 中山大学 Short text classification method for question and answer service of game platform user
CN113626080B (en) * 2020-05-08 2023-10-03 安徽寒武纪信息科技有限公司 Data processing device and related product
CN113626079A (en) * 2020-05-08 2021-11-09 安徽寒武纪信息科技有限公司 Data processing method and device and related product
CN113626083B (en) * 2020-05-08 2023-10-13 安徽寒武纪信息科技有限公司 Data processing device and related product
CN114139693A (en) * 2021-12-03 2022-03-04 安谋科技(中国)有限公司 Data processing method, medium, and electronic device for neural network model
CN114492789B (en) * 2022-01-25 2024-05-14 天津工业大学 Neural network model construction method and device for data samples

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6055524A (en) * 1997-10-06 2000-04-25 General Cybernation Group, Inc. Model-free adaptive process control
CN101596338A (en) * 2009-04-29 2009-12-09 天津大学 Functional electric stimulation precision control method based on BP neural network tuned proportion integration differentiation PID
CN101625735A (en) * 2009-08-13 2010-01-13 西安理工大学 FPGA implementation method based on LS-SVM classification and recurrence learning recurrence neural network
CN102129013B (en) * 2011-01-21 2013-11-27 昆明理工大学 Distribution network fault location method utilizing natural frequency and artificial neural network
CN102497156B (en) * 2011-12-27 2015-04-29 东南大学 Neural-network self-correcting control method of permanent magnet synchronous motor speed loop
US20140310218A1 (en) * 2013-04-11 2014-10-16 Nec Laboratories America, Inc. High-Order Semi-RBMs and Deep Gated Neural Networks for Feature Interaction Identification and Non-Linear Semantic Indexing
CN103399486B (en) * 2013-07-05 2016-04-06 杭州电子科技大学 Plastic drying machine temperature optimization energy-saving control method
CN103619021A (en) * 2013-12-10 2014-03-05 天津工业大学 Neural network-based intrusion detection algorithm for wireless sensor network
CN104021420B (en) * 2014-05-23 2017-07-04 电子科技大学 Programmable discrete hopfield network circuit
CN105354198B (en) * 2014-08-19 2019-07-02 中国移动通信集团湖北有限公司 A kind of data processing method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101963983A (en) * 2010-09-28 2011-02-02 江苏瑞蚨通软件科技有限公司(中外合资) Data mining method of rough set and optimization neural network
CN102542335A (en) * 2011-06-16 2012-07-04 广州市龙泰信息技术有限公司 Mixed data mining method
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DaDianNao: A Machine-Learning Supercomputer; Yunji Chen et al.; 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture; 2014-12-31; pp. 609-622 *

Also Published As

Publication number Publication date
CN107301453B (en) 2021-04-20
CN107301453A (en) 2017-10-27
CN109358900A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN109358900B (en) Artificial neural network forward operation device and method supporting discrete data representation
CN107301454B (en) Artificial neural network reverse training device and method supporting discrete data representation
CN107807819B (en) Device and method for executing artificial neural network forward operation supporting discrete data representation
CN109375951B (en) Device and method for executing forward operation of full-connection layer neural network
CN107832843B (en) Information processing method and related product
CN111860813B (en) Device and method for performing forward operation of convolutional neural network
CN111310904B (en) Apparatus and method for performing convolutional neural network training
CN109376861B (en) Apparatus and method for performing full connectivity layer neural network training
CN110188870B (en) Apparatus and method for performing artificial neural network self-learning operation
EP3614259A1 (en) Processing apparatus and processing method
CN106991476B (en) Apparatus and method for performing artificial neural network forward operations
EP3444757B1 (en) Discrete data representation supported device and method for forward operation of artificial neural network
CN107886166B (en) Device and method for executing artificial neural network operation
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2017177446A1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
CN107329733B (en) Apparatus and method for performing posing operations
CN109993276B (en) Apparatus and method for performing artificial neural network reverse training
CN110097181B (en) Apparatus and method for performing artificial neural network forward operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant