WO2017177442A1 - Artificial neural network forward operation apparatus and method supporting discrete data representation - Google Patents

Artificial neural network forward operation apparatus and method supporting discrete data representation

Info

Publication number
WO2017177442A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
module
discrete
unit
neuron
Prior art date
Application number
PCT/CN2016/079431
Other languages
English (en)
French (fr)
Inventor
刘少礼
于涌
陈云霁
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司 filed Critical 北京中科寒武纪科技有限公司
Priority to US16/093,956 priority Critical patent/US20190138922A1/en
Priority to PCT/CN2016/079431 priority patent/WO2017177442A1/zh
Priority to EP16898260.1A priority patent/EP3444757B1/en
Publication of WO2017177442A1 publication Critical patent/WO2017177442A1/zh
Priority to US16/182,420 priority patent/US20190073584A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • G06N 5/046 - Forward inferencing; Production systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 - Computing arrangements based on specific mathematical models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/01 - Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • the present invention relates generally to artificial neural networks, and more particularly to an apparatus and method for performing an artificial neural network forward operation, in which the data supports discrete data representation.
  • for discrete data, basic operations on continuous data, such as multiplication, are replaced by bitwise operations such as XOR and NOT.
  • Multi-layer artificial neural networks are widely used in fields such as pattern recognition, image processing, function approximation, and optimization computation.
  • In recent years, multi-layer artificial networks have received increasing attention from academia and industry due to their high recognition accuracy and good parallelizability.
  • One known method of supporting the multi-layer artificial neural network forward operation is to use a general-purpose processor.
  • the method supports the above algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units.
  • Another known method of supporting multi-layer artificial neural network forward training is to use a graphics processing unit (GPU).
  • the method supports the above algorithm by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units.
  • Discrete data representation refers to a storage method in which particular numbers replace continuous data. For example, the four numbers 00, 01, 10, and 11 can represent the four values -1, -1/8, 1/8, and 1, respectively. This storage method differs from continuous storage: in continuous storage, the binary numbers 00/01/10/11 represent the four consecutive decimal numbers 0/1/2/3. Through this index-like representation, formally consecutive numbers replace real data that is discontinuous and discretized. The stored numbers are not continuous, hence the name discrete data representation.
  • in current computing devices for multi-layer artificial neural network operations, the known method of data representation is continuous data such as floating-point or fixed-point numbers. Because the weights of a multi-layer neural network are high-precision and numerous, continuous data representation incurs greater overhead in both computation and storage. With discrete data representation, multiplications of continuous data can be replaced by operations such as bitwise XOR and shifts of the data, which greatly reduces the number of multiplier components; and the storage advantage of a few bits of discretized data over traditional 32-bit floating-point storage is also obvious.
  • An aspect of the present invention provides an apparatus for performing an artificial neural network forward operation supporting discrete data representation, including an instruction cache unit, a controller unit, a data access unit, an interconnect module, a main operation module, and a plurality of slave operation modules, wherein:
  • the instruction cache unit is configured to read in an instruction through the data access unit and cache the read instruction
  • the controller unit is configured to read an instruction from the instruction cache unit and decode the instruction into a micro-instruction that controls the interconnect module, the main operation module, and the slave operation module;
  • the data access unit is configured to write discrete data or continuous data from the external address space to the main data processing unit and the corresponding data buffer unit of each of the slave operation modules or read the discrete data or the continuous data from the data buffer unit to the external address space;
  • at the stage when each layer of the neural network begins its forward computation, the main operation module transmits the discrete or continuous input neuron vector of this layer to all slave operation modules through the interconnect module; after the computation process of the slave operation modules is completed, the interconnect module assembles, stage by stage, the discrete or continuous output neuron values of each slave operation module into an intermediate result vector, wherein when the input data is mixed data of discrete data and continuous data, the slave operation modules adopt corresponding calculation manners preset for different discrete data;
  • the main operation module is configured to complete subsequent calculation using the intermediate result vector; when the input data is mixed data of discrete data and continuous data, the main operation module adopts a corresponding calculation manner preset for different discrete data.
  • discrete data representation refers to the representation of replacing real continuous data with a particular discrete number.
  • the plurality of slave arithmetic modules calculate the respective discrete or continuous output neuron values in parallel using the same discrete or continuous input neuron vector and respective different discrete or continuous weight vectors.
  • the main operation module performs any of the following operations on the intermediate result vector:
  • a bias operation, adding a bias to the intermediate result vector;
  • activating the intermediate result vector, where the activation function active is any one of the nonlinear functions sigmoid, tanh, relu, and softmax, or a linear function;
  • a sampling operation, comparing the intermediate result vector with a random number and outputting 1 if it is greater than the random number and 0 if it is less; or
  • a pooling operation, including max pooling or average pooling.
  • the slave arithmetic module includes an input neuron cache unit for buffering discrete or continuous input neuron vectors.
  • the interconnect module constitutes the data path for continuous or discretized data between the main operation module and the plurality of slave operation modules, and can be implemented in different interconnection topologies.
  • In one embodiment, the interconnect has an H-tree structure: the H-tree is a binary tree path composed of multiple nodes, each node sending upstream discrete or continuous data identically to its two downstream nodes, merging the discrete or continuous data returned by the two downstream nodes, and returning the result to the upstream node.
  • the main operation module includes an operation unit, a data dependency determination unit, and a neuron cache unit, wherein:
  • the neuron cache unit is configured to buffer the input data and the output data of the discrete or continuous representation used by the main operation module in the calculation process;
  • the operation unit completes the various operation functions of the main operation module; when the input data is mixed data of discrete data and continuous data, a corresponding calculation manner preset for different discrete data is adopted;
  • the data dependency determination unit is the port through which the operation unit reads and writes the neuron cache unit, guarantees that there is no consistency conflict in reading and writing continuous or discrete data in the neuron cache unit, and is responsible for reading the input discrete or continuous neuron vector from the neuron cache unit and sending it to the slave operation modules through the interconnect module; and
  • the intermediate result vector from the interconnect module is sent to the operation unit.
  • each slave operation module includes an operation unit, a data dependency determination unit, a neuron cache unit, and a weight buffer unit, wherein:
  • the operation unit receives micro-instructions issued by the controller unit and performs arithmetic-logic operations; when the input data is mixed data of discrete data and continuous data, a corresponding calculation manner preset for different discrete data is adopted;
  • the data dependency determination unit is responsible for the read and write operations during computation on the neuron cache unit supporting discrete data representation and the weight cache unit supporting discrete data representation, guaranteeing that there is no consistency conflict in reading and writing the neuron cache unit supporting discrete data representation and the weight cache unit supporting discrete data representation;
  • the neuron cache unit caches the input neuron vector data and the output neuron values computed by this slave operation module; and
  • the weight cache unit caches the discretely or continuously represented weight vectors that this slave operation module requires during computation.
  • the data dependency determination unit guarantees that there is no consistency conflict in reading and writing as follows: it determines whether a dependency exists between a micro-instruction that has not yet been issued and the data of a micro-instruction currently being executed; if not, the micro-instruction is allowed to issue immediately; otherwise, the micro-instruction is allowed to issue only after all micro-instructions on which it depends have completed execution.
  • the operation unit in the main operation module or the slave operation module includes an operation decision unit and a mixed data operation unit.
  • the operation decision unit determines, according to the discrete data therein, what operation should be performed on the mixed data. Then, the mixed data operation unit performs a corresponding operation based on the determination result of the operation decision unit.
  • the operation unit in the main operation module or a slave operation module further includes at least one of a discrete data operation unit and a continuous data operation unit, as well as a data type determination unit; when the input data is all discrete data, the discrete data operation unit performs the corresponding operation by table lookup according to the input discrete data, and when the input data is all continuous data, the continuous data operation unit performs the corresponding operation.
  • the apparatus optionally further includes a continuous-discrete conversion unit comprising a preprocessing module, a distance calculation module, and a judgment module; supposing M discrete data are used, M=2^m, m≥1, these discrete data correspond to M values within a predetermined interval [-zone, zone], wherein:
  • the preprocessing module preprocesses the input continuous data x using a clip(-zone, zone) operation to obtain preprocessed data y within the interval [-zone, zone], where y=-zone if x≤-zone, y=zone if x≥zone, and y=x if -zone<x<zone;
  • the distance calculation module calculates the distances between the preprocessed data y and each of the above values; and
  • the judgment module calculates and outputs the discrete data based on these distances.
  • the predetermined interval [-zone, zone] is [-1, 1] or [-2, 2]; and/or the absolute values of the M values are reciprocals of powers of 2; and/or the judgment module performs: outputting the discrete data corresponding to the value closest to the preprocessed data y, and, if two values are equally distant from the preprocessed data, outputting the discrete data corresponding to either of the two; or calculating the normalized probability of the preprocessed data y to each of the two closest values, comparing the normalized probability corresponding to either of the two values with a random number z in (0, 1) generated by a random number generation module, and outputting the discrete data corresponding to that value if z is less than the probability, otherwise outputting the other discrete data.
  • Another aspect of the present invention provides a method of performing a single layer artificial neural network forward operation using the above apparatus.
  • through the provided instruction set, the controller controls reading in the data required by the operation, such as neurons, weights, and constants. These data may or may not be represented as discrete data.
  • the main operation module, the slave operation module, and the interconnection module complete the process of multiplying the neuron data and the weight data by the offset activation.
  • in particular, for data represented as discrete data, the multiplication operation is replaced by bit operations on the related data according to the value of the discrete data.
  • for example, if the weight data is represented by 1-bit discrete data, with 0 representing +1 and 1 representing -1, multiplication by the weight is realized by XORing the sign bit of the data multiplied by the weight.
  • Another aspect of the present invention provides a method of supporting artificial neural network batch normalization using the above apparatus.
  • the controller controls the data access unit to read in the input data, and then controls the master and slave operation modules to compute the mean and variance at each position according to the batch size, or to use preset mean and variance values.
  • the controller then controls subtracting the mean from the input data at the corresponding position and dividing by the variance.
  • finally, the controller controls multiplying the processed data by a learning parameter and adding another learning parameter.
  • Another aspect of the present invention provides a method of performing a multi-layer artificial neural network forward operation using the above apparatus.
  • the implementation process is similar to that of a single-layer neural network: when the previous layer of the artificial neural network has finished executing, the operation instruction of the next layer uses the output neuron address of the previous layer stored in the main operation unit as the input neuron address of this layer.
  • likewise, the weight address and bias address in the instruction are also changed to the addresses corresponding to this layer.
  • FIG. 1 shows an example block diagram of an overall structure of an apparatus for performing artificial neural network forward operations supporting discrete data representations in accordance with an embodiment of the present invention.
  • FIG. 2 is a diagram schematically showing the structure of an H-tree module (an embodiment of an interconnection module) in an apparatus for performing artificial neural network forward operation supporting discrete data representation according to an embodiment of the present invention.
  • FIG. 3 illustrates an example block diagram of a main operational module structure in an apparatus for performing artificial neural network forward operations that support discrete data representations in accordance with an embodiment of the present invention.
  • FIG. 4 illustrates an example block diagram of a slave arithmetic module structure in an apparatus for performing artificial neural network forward operations that support discrete data representations in accordance with an embodiment of the present invention.
  • FIG. 5 illustrates an example block diagram of a neural network forward operation process in accordance with an embodiment of the present invention.
  • FIG. 6 shows an example block diagram of a neural network reverse training process that supports discrete data representations in accordance with an embodiment of the present invention.
  • Figure 7 shows a flow chart of a single layer artificial neural network operation in accordance with an embodiment of the present invention.
  • FIG. 8 shows an example structure of an arithmetic unit according to an embodiment of the present invention.
  • FIG. 9 illustrates an example structure of a continuous discrete conversion module for continuous data and discrete data conversion in accordance with an embodiment of the present invention.
  • the forward operation of a multi-layer artificial neural network supporting discrete data representation includes multiple neurons in two or more layers.
  • for each layer, the input neuron vector first undergoes a dot product operation with the weight vector, and the result passes through an activation function to give the output neuron.
  • the activation function may be a sigmoid function or a nonlinear function such as tanh, relu, or softmax, and discretized or continuous representation of the activated output neuron is supported.
  • the activation function can also be a linear function.
  • the apparatus supports converting the dot product operation into bitwise operations on the data such as shift, NOT, and XOR.
  • the apparatus supports discrete or non-discrete representation of the data; the user can customize which data of which layer use the discrete or non-discrete representation form, and can customize the number of bits of the discrete data according to specific needs as a substitute for the number of real data represented; for example, discrete data set to 1, 2, or 3 bits can represent 2, 4, or 8 real data, respectively.
  • the apparatus includes an instruction cache unit 1, a controller unit 2, a data access unit 3, an interconnect module 4, a main operation module 5, and a plurality of slave operation modules 6, and optionally a continuous-discrete conversion module 7.
  • the instruction cache unit 1, controller unit 2, data access unit 3, interconnect module 4, main operation module 5, slave operation modules 6, and continuous-discrete conversion module 7 may all be implemented by hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
  • the device can provide storage and computational support for discrete data.
  • the instruction cache unit 1 reads in an instruction through the data access unit 3 and caches the read instruction.
  • the controller unit 2 reads instructions from the instruction cache unit 1 and translates the instructions into micro-instructions that control the behavior of other modules, such as the data access unit 3, the main arithmetic module 5, and the slave arithmetic module 6.
  • the data access unit 3 can access the external address space, read and write data directly to each cache unit inside the apparatus, and complete the loading and storage of data.
  • the data is either discretely or non-discretely represented, and this unit is designed to be able to read discretely represented data.
  • the interconnect module 4 is used to connect the main operation module and the slave operation modules, and can be implemented in different interconnection topologies (such as tree, ring, grid, hierarchical interconnect, and bus structures).
  • FIG. 2 schematically shows an embodiment of the interconnection module 4: an H-tree module.
  • the H-tree module 4 constitutes a data path between the main arithmetic module 5 and the plurality of slave arithmetic modules 6, and has an H-tree structure.
  • the H-tree is a binary tree path composed of multiple nodes. Each node sends the upstream data to the downstream two nodes in the same way, and the data returned by the two downstream nodes are combined and returned to the upstream node.
  • the neuron data in the main operation module 5, which may be discretely or non-discretely represented, is sent to each slave operation module 6 through the H-tree module 4; after the computation process of the slave operation modules is completed, the neuron values output by each slave operation module are assembled stage by stage in the H-tree into a complete vector of neurons, which serves as the intermediate result vector.
  • for operations on discretely represented data, the operation module dedicated to discrete data operations inside the master and slave operation modules is shown in Figure 7.
  • illustrated with a fully connected layer of the neural network: assuming the apparatus has N slave operation modules, the intermediate result vector is segmented by N, each segment having N elements, and the i-th slave operation module computes the i-th element of each segment.
  • the N elements are assembled through the H-tree module into a vector of length N and returned to the main operation module. Thus if the network has only N output neurons, each slave operation unit need only output the value of a single neuron; if the network has m*N output neurons, each slave operation unit outputs m neuron values.
  • the H-tree module supports discrete data representation in the process of storing and transmitting data.
  • FIG. 3 shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing artificial neural network forward operation according to an embodiment of the present invention.
  • the main operation module 5 includes an operation unit 51, a data dependency determination unit 52, and a neuron buffer unit 53 that supports discrete data representation.
  • the neuron buffer unit 53 supporting the discrete data representation is used to buffer the input data and the output data used by the main operation module 5 in the calculation process.
  • the arithmetic unit 51 performs various arithmetic functions of the main arithmetic module 5.
  • for the case where the operands are all discrete data, addition, subtraction, multiplication, and division of discrete data with discrete data can be realized by table lookup.
  • 2-bit discrete data can represent 4 continuous data values,
  • and there are 4*4=16 combinations of the 4 continuous data values.
  • for each of the addition, subtraction, multiplication, and division operations, a 4*4 index table can be created and maintained, and the corresponding computed value is found through the index table;
  • the 4 operations require 4 index tables of size 4*4 in total.
  • the corresponding bit operations may be preset for the addition, subtraction, multiplication, and division operations for different discrete data.
  • a dot product operation of discrete data with continuous data may be replaced by cumulative summation after bitwise XOR followed by multiplication by the corresponding power of 2.
  • for a multiplication operation, if a multiplication factor is discretely represented, the multiplication of the continuous data represented by that discrete data may be replaced by an operation indexed by the discrete data (e.g., bitwise XOR, NOT, shift, etc. of the corresponding data), which reduces the number of multiplier components.
  • since the possible values of discrete data are few, the function of the operation unit can be replaced by a switch-judgment method such as index lookup. For example, the discrete data representation of -1/2 may be specified as 01; if an operation factor is -1/2, the discrete data received by the operation unit 51 is 01, and the operation unit 51 adopts the operation corresponding to the discrete data 01: inverting the sign bit of the 8-bit fixed-point representation of 16, 00010000, and shifting it right by 1 bit yields 10001000, which is -8 in decimal.
  • for a division operation, consider 16 divided by -2, where 16 is continuous data and -2 is discrete data, and the binary representation of the discrete data -2 is specified as 10.
  • the operation unit then adopts the division operation corresponding to the discrete data 10:
  • the 8-bit fixed-point representation of 16, 00010000, is shifted right by 1 bit and its sign bit is then inverted to obtain 10001000, which is -8 in decimal, giving the result.
  • addition and subtraction operations are similar to the above process: the binary of the discrete data serves as an index to operations such as bitwise left shift, right shift, and XOR; after this operation, addition or subtraction with the real data represented by the discrete data is realized.
  • the data dependency determination unit 52 is the port through which the operation unit 51 reads and writes the neuron cache unit 53, and at the same time guarantees read-write consistency of the data in the neuron cache unit.
  • the data dependency determination unit 52 is also responsible for sending the read data to the slave operation modules through the interconnect module 4, while the output data of the slave operation modules 6 is sent directly to the operation unit 51 through the interconnect module 4.
  • the instructions output by the controller unit 2 are sent to the computation unit 51 and the data dependency determination unit 52 to control their behavior.
  • each slave arithmetic module 6 includes an arithmetic unit 61, a data dependency determining unit 62, a neuron buffer unit 63 supporting discrete data representation, and a weight buffer unit 64 supporting discrete data representation.
  • the arithmetic unit 61 receives the microinstructions issued by the controller unit 2 and performs an arithmetic logic operation.
  • for the case where the operands are all discrete data, addition, subtraction, multiplication, and division of discrete data with discrete data can be realized by table lookup.
  • 2-bit discrete data can represent 4 continuous data values,
  • and there are 4*4=16 combinations of the 4 continuous data values.
  • for each of the addition, subtraction, multiplication, and division operations, a 4*4 index table can be created and maintained, and the corresponding computed value is found through the index table; the 4 operations require 4 index tables of size 4*4 in total.
  • for operands including both discrete and continuous data, corresponding bit operations may be preset for the addition, subtraction, multiplication, and division operations for different discrete data.
  • a dot product operation of discrete data with continuous data may be replaced by cumulative summation after bitwise XOR followed by multiplication by the corresponding power of 2.
  • for a multiplication operation, if a multiplication factor is discretely represented, the multiplication of the continuous data represented by that discrete data may be replaced by an operation indexed by the discrete data (e.g., bitwise XOR, NOT, shift, etc. of the corresponding data), which reduces the number of multiplier components.
  • for example, for a multiplication operation of continuous data with discrete data, such as -1/2 multiplied by 16, a traditional multiplier component would multiply -1/2 and 16 directly. In the operation unit 51, since the possible values of discrete data are few,
  • the function of the operation unit can be replaced by a switch-judgment method such as index lookup.
  • for example, the discrete data representation of -1/2 may be specified as 01; if an operation factor is -1/2, the discrete data received by the operation unit 51 is 01, and the operation unit 51 adopts the operation corresponding to the discrete data 01: inverting the sign bit of the 8-bit fixed-point representation of 16, 00010000, and shifting it right by 1 bit yields 10001000, which is -8 in decimal.
  • for division, 16 is divided by -2, where 16 is continuous data and -2 is discrete data, and the binary representation of the discrete data -2 is specified as 10.
  • the operation unit then adopts the division operation corresponding to the discrete data 10: the 8-bit fixed-point representation of 16, 00010000, is shifted right by 1 bit and its sign bit is then inverted to obtain 10001000, which is -8 in decimal, giving the result.
  • addition and subtraction operations are similar to the above process: the binary of the discrete data serves as an index to operations such as bitwise left shift, right shift, and XOR; after this operation, addition or subtraction with the real data represented by the discrete data is realized.
  • the data dependency determination unit 62 is responsible for the read and write operations on the neuron cache unit during computation. Before performing a read/write operation, the data dependency determination unit 62 first guarantees that there is no read-write consistency conflict among the data used by the instructions. For example, all micro-instructions sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the instruction must wait until the write instruction on which it depends has been executed.
  • the neuron buffer unit 63 supporting the discrete data representation buffers the input neuron vector data and the output neuron value data of the slave arithmetic module 6. This data can be stored and transmitted in the form of discrete data.
  • the weight cache unit 64 supporting discrete data representation caches the weight data required by the slave operation module 6 during computation. Depending on the user definition, this data may or may not be discretely represented. Each slave operation module 6 stores only the weights between all input neurons and a subset of the output neurons. Taking the fully connected layer as an example, the output neurons are segmented according to the number N of slave operation units, and the weight corresponding to the n-th output neuron of each segment is stored in the n-th slave operation unit.
  • the slave operation modules 6 implement the first half of the forward operation process of each layer of the artificial neural network, which can be performed in parallel.
  • data storage and operations in this module support discrete data representation.
  • each slave operation module computes a partial sum of the final result, and these partial sums are added pairwise, stage by stage, in the interconnect module 4 to obtain the final result.
  • this result can be represented as discrete data.
  • Each of the slave arithmetic modules 6 calculates an output neuron value, and all of the output neuron values are combined in the interconnect module 4 to obtain an intermediate result vector.
  • Each slave arithmetic module 6 only needs to calculate the output neuron value corresponding to the module in the intermediate result vector y.
  • the interconnection module 4 sums all the neuron values output from the arithmetic module 6 to obtain a final intermediate result vector y.
  • the main operation module 5 performs subsequent calculations based on the intermediate result vector y, such as adding offset, pooling (for example, MAXPOOLING or AVGPOOLING, etc.), performing activation and sampling.
  • FIG. 8 shows a structural block diagram of an arithmetic unit which can be used for the arithmetic unit 51 in the main arithmetic module or the arithmetic unit 61 in the slave arithmetic module.
  • the input data during the operation can be discrete data or continuous data.
  • the data type judging unit 71 judges that the input data is all continuous data, all discrete data, or mixed data including both continuous data and discrete data.
  • when the input data is all continuous data, the continuous data operation unit 72 performs the corresponding operation.
  • when the input data is all discrete data, the discrete data operation unit 73 performs the corresponding operation.
  • the addition, subtraction, multiplication and division of discrete data and discrete data can be realized by looking up the table.
  • 2-bit discrete data can represent 4 consecutive data values.
  • there are 4*4=16 combinations of the 4 continuous data values.
  • the operation decision unit 74 decides which operation to perform according to the discrete data therein.
  • the corresponding operations can be preset for different discrete data.
  • the mixed data operation unit 75 performs a corresponding operation based on the determination result of the arithmetic decision unit 74.
  • the corresponding bit operations may be preset for the addition, subtraction, multiplication, and division operations for different discrete data.
  • a dot product operation of discrete data with continuous data may be replaced by cumulative summation after bitwise XOR followed by multiplication by the corresponding power of 2.
  • for a multiplication operation, if a multiplication factor is discretely represented, the multiplication of the continuous data represented by that discrete data may be replaced by an operation indexed by the discrete data (e.g., bitwise XOR, NOT, shift, etc. of the corresponding data), which reduces the number of multiplier components.
  • for example, for a multiplication operation of continuous data with discrete data, such as -1/2 multiplied by 16, a traditional multiplier component would multiply -1/2 and 16 directly. Since the possible values of discrete data are few, the function of the operation unit can be replaced by a switch-judgment method such as index lookup. For example, the discrete data representation of -1/2 may be specified as 01; if an operation factor is -1/2, the discrete data received by the operation unit 51 is 01, and the operation unit 51 adopts the operation corresponding to the discrete data 01: inverting the sign bit of the 8-bit fixed-point representation of 16, 00010000, and shifting it right by 1 bit yields 10001000, which is -8 in decimal.
  • for division, 16 is divided by -2, where 16 is continuous data and -2 is discrete data, and the binary representation of the discrete data -2 is specified as 10. The operation unit then adopts the division operation corresponding to the discrete data 10: the 8-bit fixed-point representation of 16, 00010000, is shifted right by 1 bit and its sign bit is then inverted to obtain 10001000, which is -8 in decimal, giving the result.
  • addition and subtraction operations are similar to the above process: the binary of the discrete data serves as an index to operations such as bitwise left shift, right shift, and XOR; after this operation, addition or subtraction with the real data represented by the discrete data is realized.
  • Figure 9 shows the continuous-discrete conversion unit. The user can define whether or not to use this module to convert continuous data into discrete data. Continuous data is input and discrete data is output.
  • the unit includes a random number generation module, a judgment module, and an operation module.
  • the input continuous data is processed by the operation module to obtain a computed result, and the judgment module compares a random number with the computed result to determine which interval the random number falls into, thereby determining the specific value of the output discrete data. For example, the user defines the generation of binary discrete data: for any input continuous data x, the result y = abs(clip(x, -1, 1)) is computed by the operation module; then, in the judgment module, if the random number is greater than y, the output discrete data is 1, otherwise the output discrete data is 0.
  • discrete data 1 and 0 represent -1 and +1 of the continuous data, respectively. The resulting discrete data is stored back into memory, waiting for the operation units in the master and slave operation modules to use it and generate the corresponding operations.
  • the weight data and the output and input data in the forward process may or may not be represented as discrete data.
  • the multiplication operation on continuous data can be replaced by XOR, NOT, shift, and similar operations based on the discrete data.
  • for example, if the weight is represented by 1-bit discrete data, with 0 representing +1 and 1 representing -1, multiplication by the weight is realized by XORing the sign bit of the data multiplied by the weight.
  • an instruction set for performing an artificial neural network forward operation on the aforementioned apparatus includes a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction, among which:
  • the CONFIG command configures various constants required for current layer calculation before each layer of artificial neural network calculation begins;
  • the COMPUTE instruction completes the arithmetic logic calculation of each layer of artificial neural network
  • the IO instruction reads in from the external address space the input data required by the computation and stores the data back to the external space after the computation is completed; this data supports discretized representation;
  • the NOP instruction is responsible for clearing the micro-instructions in all micro-instruction cache queues of the current apparatus, guaranteeing that all instructions before the NOP instruction have completed; the NOP instruction itself contains no operation;
  • the JUMP instruction is responsible for the jump of the next instruction address that the controller will read from the instruction cache unit to implement the control flow jump;
  • the MOVE instruction is responsible for moving the data at one address in the apparatus's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no resources of the operation unit during execution.
  • FIG. 5 illustrates an example block diagram of a neural network forward operation process in accordance with an embodiment of the present invention.
  • in different slave operation modules 6, the input neuron vector undergoes a dot product operation with the weight vector of that slave operation module 6 to obtain the corresponding output neuron value; all these output neuron values compose the intermediate result vector; after a bias vector is added and an activation operation is applied, the intermediate result vector yields the final output neuron vector of this layer of the neural network, described by out = f(w*in + b), where out is the output neuron vector, in is the input neuron vector, b is the bias vector, w is the weight matrix, and f is the activation function.
  • the weight vector of each slave arithmetic module 6 is a column vector corresponding to the slave arithmetic module 6 in the weight matrix.
  • the interconnect module sends the input neuron vector [in0, ..., inN] to all slave operation units, to be temporarily stored in the neuron cache unit.
  • for the i-th slave operation unit, the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector is computed.
  • the results output by the slave operation units are assembled into the complete output vector through the interconnect module and returned to the main operation unit, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
  • FIG. 6 is a flowchart of one implementation of a single-layer artificial neural network forward operation supporting discrete data representation, according to one embodiment.
  • the flowchart depicts the process of implementing the single-layer, discrete-data-representation artificial neural network forward operation shown in FIG. 5 using the apparatus and instruction set of the present invention.
  • Step S1.1: store the initial instructions into the instruction storage unit 1;
  • Step S1.2: read an instruction from the instruction storage unit 1;
  • Step S1.3: decode the instruction;
  • Step S1.4: perform the corresponding operation according to the decoded control signal;
  • Step S1.5: write the operation result back to the corresponding storage.
  • in step S1.1, an initialization IO instruction may be stored, used for loading subsequent instructions.
  • the readable instructions include, but are not limited to, a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • step S1.3 the control signal of the corresponding module is obtained according to the operation type of the instruction (CONFIG, COMPUTE, IO, NOP, JUMP, MOVE, etc.).
  • for the CONFIG instruction, decoding yields the configuration information of the remaining modules.
  • for the COMPUTE instruction, decoding yields the control signals of the master and slave operation modules, which control the corresponding operations taken for different discrete data.
  • for the IO instruction, decoding yields the control signal of the data access module.
  • for the NOP instruction, no actual control signal is generated; it only clears the control signals in all control-signal cache queues of the current apparatus, guaranteeing that all instructions before the NOP instruction have been executed.
  • for the JUMP instruction, a control signal for jumping the instruction stream is obtained.
  • for the MOVE instruction, a control signal for moving data inside the apparatus is obtained.
  • in step S1.4, the above modules 2-6 perform the corresponding operations according to the control signals.
  • the interconnect module sends the input neuron vector [in0, ..., inN] to all slave operation modules, to be temporarily stored in the neuron cache unit.
  • the i-th slave operation module computes the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector.
  • the results output by the slave operation modules are assembled into the complete output vector through the interconnect module and returned to the main operation module, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
  • each module writes the result of the operation back to the corresponding cache.
  • the output neuron vector obtained by the main operation module is written back to the storage unit.
  • FIG. 7 shows another, more detailed implementation of the single-layer artificial neural network forward operation according to one embodiment.
  • the flowchart depicts the process of implementing the single-layer neural network forward operation shown in FIG. 4 using the apparatus and instruction set of the present invention.
  • in step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 1.
  • in step S2, the operation starts: the controller unit 2 reads the IO instruction from the first address of the instruction cache unit 1, and according to the decoded micro-instruction, the data access unit 3 reads all corresponding artificial neural network operation instructions from the external address space and caches them in the instruction cache unit 1.
  • in step S3, the controller unit 2 then reads in the next IO instruction from the instruction cache unit, and according to the decoded micro-instruction, the data access unit 3 reads all data required by the main operation module 5 (for example, including input neuron vectors, interpolation tables, constant tables, biases, etc.) from the external address space into the neuron cache unit 53 of the main operation module 5; this data supports discrete representation and may be wholly or partially discrete.
  • in step S4, the controller unit 2 then reads in the next IO instruction from the instruction cache unit, and according to the decoded micro-instruction, the data access unit 3 reads the weight matrix data required by the slave operation modules 6 from the external address space; this data supports discrete representation and may be wholly or partially discrete.
  • in step S5, the controller unit 2 then reads the next CONFIG instruction from the instruction cache unit, and according to the decoded micro-instruction, the apparatus configures the various constants required by the computation of this layer of the neural network.
  • for example, the operation units 51 and 61 configure the values of their internal registers according to the parameters in the micro-instruction; the parameters include, for example, the precision setting of this layer's computation and the data of the activation function (for example, the precision bits of this layer's computation, the rang parameter of the LRN layer algorithm, the reciprocal of the window size of the AveragePooling layer algorithm, etc.).
  • in step S6, the controller unit 2 then reads the next COMPUTE instruction from the instruction cache unit, and according to the decoded micro-instruction,
  • the main operation module 5 first sends the input neuron vector to each slave operation module 6 through the interconnect module 4 and saves it to the neuron cache unit 63 of the slave operation module 6.
  • in step S7, according to the micro-instruction decoded from the COMPUTE instruction, the operation unit 61 of the slave operation module 6 reads the weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the weight cache unit 64,
  • reads the input neuron vector from the neuron cache unit, completes the dot product operation of the weight vector and the input neuron vector, and returns the intermediate result through the interconnect.
  • for discrete data, the user can customize whether to use XOR and similar operations in place of the dot product operation. For example, for a 1-bit discrete data representation where 0 represents +1 and 1 represents -1, multiplication by the weights is realized by XORing the sign bits of the data multiplied by the weights.
  • in step S8, in the interconnect module 4, the intermediate results returned by each slave operation module 6 are assembled stage by stage into the complete intermediate result vector.
  • in step S9, the main operation module 5 obtains the return value of the interconnect module 4; according to the micro-instruction decoded from the COMPUTE instruction, it reads the bias vector from the neuron cache unit 53, adds it to the vector returned by the interconnect module 4, and then activates the addition result (the apparatus supports user definition of whether to discretize the result after activation); finally, the output neuron vector is written back to the neuron cache unit 53.
  • in step S10, the controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded micro-instruction, the data access unit 3 stores the output neuron vector in the neuron cache unit 53 to the specified address in the external address space; the operation then ends.
  • the artificial neural network batch normalization operation steps are similar to the above process.
  • the controller completes the following process.
  • the controller controls the data access unit to read in the input data, and then controls the master and slave operation modules to compute the mean and variance at each position according to the batch size, or to use preset mean and variance values.
  • the controller then controls subtracting the mean from the input data at the corresponding position and dividing by the variance.
  • finally, the controller controls multiplying the processed data by a learning parameter and adding another learning parameter.
  • the implementation process is similar to that of a single-layer neural network: when the previous layer of the artificial neural network has finished executing, the operation instruction of the next layer uses the output neuron address of the previous layer stored in the main operation unit as the input neuron address of this layer; similarly, the weight address and bias address in the instruction are also changed to the addresses corresponding to this layer.
  • the processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of the two.
  • although the processes or methods are described above in a certain order, it should be understood that certain operations described may be performed in a different order; moreover, some operations may be performed in parallel rather than sequentially.
  • with regard to the representation of discrete data, it should be understood that one may select which data are represented in discretized form and which in continuous form; the choice of whether data is discretely represented runs through the entire operation process.

Abstract

The present invention provides an apparatus for performing an artificial neural network forward operation supporting discrete data representation, comprising an instruction cache unit, a controller unit, a data access unit, an interconnect module, a main operation module, a plurality of slave operation modules, a discrete data operation module, and a continuous-discrete conversion module. Using this apparatus, the forward operation of a multi-layer artificial neural network supporting discrete data representation can be implemented. Data such as weights and neurons in the forward operation can be represented in discrete form, for example -1, -1/2, 0, 1/2, 1, which are not continuous data. A module supporting discrete data operations is provided: according to the value of the discrete data, different bit operations, such as XOR and NOT of the data, replace basic operations on continuous data such as multiplication and addition. A module for converting continuous data into discrete data is provided. Support for batch normalization computation using the above apparatus is also provided.

Description

Artificial neural network forward operation apparatus and method supporting discrete data representation
Technical Field
The present invention relates generally to artificial neural networks, and in particular to an apparatus and method for performing an artificial neural network forward operation, in which the data supports discrete data representation; for discrete data, bitwise operations such as XOR and NOT replace basic operations on continuous data such as multiplication.
Background Art
Multi-layer artificial neural networks are widely used in fields such as pattern recognition, image processing, function approximation, and optimization computation. In recent years, multi-layer artificial networks have received increasingly broad attention from academia and industry due to their high recognition accuracy and good parallelizability.
One known method of supporting the multi-layer artificial neural network forward operation is to use a general-purpose processor. This method supports the above algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units. Another known method of supporting multi-layer artificial neural network forward training is to use a graphics processing unit (GPU). This method supports the above algorithm by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units.
Both of these devices use continuous data for data storage and operations. Storing continuous data requires considerable resources; for example, a 32-bit floating-point datum requires 32 bits to store. Implementing the functional units required for operations on continuous data, such as adders and multipliers, is also relatively complex.
Discrete data representation refers to a storage method in which particular numbers replace continuous data. For example, the four numbers 00, 01, 10, and 11 can represent the four values -1, -1/8, 1/8, and 1, respectively. This storage method differs from continuous storage: in continuous storage, the binary numbers 00/01/10/11 represent the four consecutive decimal numbers 0/1/2/3. Through this index-like representation, formally consecutive numbers replace real data that is discontinuous and discretized. The stored numbers are not continuous, hence the name discrete data representation.
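As a concrete illustration of this index-like encoding, the following sketch shows the 2-bit code table from the example above in Python; the function names are ours, and the code-to-value mapping is the one given in the text (00, 01, 10, 11 standing for -1, -1/8, 1/8, 1):

```python
# Minimal sketch of a 2-bit discrete data representation.
# The code-to-value table follows the example above: 00, 01, 10, 11
# stand for the non-contiguous real values -1, -1/8, 1/8, 1.
CODE_TO_VALUE = {0b00: -1.0, 0b01: -1/8, 0b10: 1/8, 0b11: 1.0}
VALUE_TO_CODE = {v: c for c, v in CODE_TO_VALUE.items()}

def decode(code: int) -> float:
    """Look up the real value represented by a 2-bit discrete code."""
    return CODE_TO_VALUE[code & 0b11]

def encode(value: float) -> int:
    """Map a representable real value to its 2-bit discrete code."""
    return VALUE_TO_CODE[value]

assert decode(encode(-1/8)) == -1/8   # round-trips exactly
```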
In current computing devices for multi-layer artificial neural network operations, the known method of data representation is continuous data such as floating-point or fixed-point numbers. Because the weights of a multi-layer neural network are high-precision and numerous, continuous data representation incurs greater overhead in both computation and storage. With discrete data representation, multiplications of continuous data can be replaced by operations such as bitwise XOR and shifts of the data, which greatly reduces the number of multiplier components; and the storage advantage of a few bits of discretized data over traditional 32-bit floating-point numbers is also obvious.
Summary of the Invention
One aspect of the present invention provides an apparatus for performing an artificial neural network forward operation supporting discrete data representation, comprising an instruction cache unit, a controller unit, a data access unit, an interconnect module, a main operation module, and a plurality of slave operation modules, wherein:
the instruction cache unit is configured to read in instructions through the data access unit and cache the read instructions;
the controller unit is configured to read instructions from the instruction cache unit and decode them into micro-instructions that control the behavior of the interconnect module, the main operation module, and the slave operation modules;
the data access unit is configured to write discrete or continuous data from an external address space into the corresponding data cache units of the main operation module and each slave operation module, or to read discrete or continuous data from said data cache units to the external address space;
at the stage when each layer of the neural network begins its forward computation, the main operation module transmits the discrete or continuous input neuron vector of this layer to all slave operation modules through the interconnect module; after the computation process of the slave operation modules is completed, the interconnect module assembles, stage by stage, the discrete or continuous output neuron values of the slave operation modules into an intermediate result vector, wherein when the input data is mixed data of discrete and continuous data, the slave operation modules adopt corresponding calculation manners preset for different discrete data;
the main operation module is configured to complete subsequent calculations using the intermediate result vector; when the input data is mixed data of discrete and continuous data, the main operation module adopts a corresponding calculation manner preset for different discrete data.
Optionally, discrete data representation refers to a representation in which particular discrete numbers replace real continuous data.
Optionally, the plurality of slave operation modules compute their respective discrete or continuous output neuron values in parallel, using the same discrete or continuous input neuron vector and their respective different discrete or continuous weight vectors.
Optionally, the main operation module performs any of the following operations on the intermediate result vector:
a bias operation, adding a bias to the intermediate result vector;
activating the intermediate result vector, where the activation function active is any one of the nonlinear functions sigmoid, tanh, relu, and softmax, or a linear function;
a sampling operation, comparing the intermediate result vector with a random number, outputting 1 if it is greater than the random number and 0 if it is less than the random number; or
a pooling operation, including max pooling or average pooling.
Optionally, the slave operation module includes an input neuron cache unit for caching discrete or continuous input neuron vectors.
Optionally, the interconnect module constitutes the data path for continuous or discretized data between the main operation module and the plurality of slave operation modules and can be implemented in different interconnection topologies. In one embodiment it has an H-tree structure: the H-tree is a binary tree path composed of multiple nodes, with each node sending upstream discrete or continuous data identically to its two downstream nodes, merging the discrete or continuous data returned by the two downstream nodes, and returning the result to the upstream node.
Optionally, the main operation module includes an operation unit, a data dependency determination unit, and a neuron cache unit, wherein:
the neuron cache unit is used to cache the discretely or continuously represented input and output data used by the main operation module during computation;
the operation unit completes the various operation functions of the main operation module; when the input data is mixed data of discrete and continuous data, a corresponding calculation manner preset for different discrete data is adopted;
the data dependency determination unit is the port through which the operation unit reads and writes the neuron cache unit, guarantees that there is no consistency conflict in reading and writing continuous or discrete data in the neuron cache unit, and is responsible for reading the input discrete or continuous neuron vector from the neuron cache unit and sending it to the slave operation modules through the interconnect module; and
the intermediate result vector from the interconnect module is sent to the operation unit.
Optionally, each slave operation module includes an operation unit, a data dependency determination unit, a neuron cache unit, and a weight cache unit, wherein:
the operation unit receives micro-instructions issued by the controller unit and performs arithmetic-logic operations; when the input data is mixed data of discrete and continuous data, a corresponding calculation manner preset for different discrete data is adopted;
the data dependency determination unit is responsible for the read and write operations during computation on the neuron cache unit supporting discrete data representation and the weight cache unit supporting discrete data representation, guaranteeing that there is no consistency conflict in reading and writing the neuron cache unit supporting discrete data representation and the weight cache unit supporting discrete data representation;
the neuron cache unit caches the input neuron vector data and the output neuron values computed by this slave operation module; and
the weight cache unit caches the discretely or continuously represented weight vectors that this slave operation module needs during computation.
Optionally, the data dependency determination unit guarantees that there is no consistency conflict in reading and writing as follows: it determines whether a dependency exists between a micro-instruction that has not yet been issued and the data of a micro-instruction currently being executed; if not, the micro-instruction is allowed to issue immediately; otherwise, the micro-instruction is allowed to issue only after all micro-instructions on which it depends have completed execution.
Optionally, the operation unit in the main operation module or a slave operation module includes an operation decision unit and a mixed data operation unit; when the input data is mixed data, the operation decision unit decides, according to the discrete data therein, which operation should be performed on the mixed data, and the mixed data operation unit then performs the corresponding operation according to the decision result of the operation decision unit.
Optionally, the operation unit in the main operation module or a slave operation module further includes at least one of a discrete data operation unit and a continuous data operation unit, as well as a data type determination unit; when the input data is all discrete data, the discrete data operation unit performs the corresponding operation by table lookup according to the input discrete data, and when the input data is all continuous data, the continuous data operation unit performs the corresponding operation.
Optionally, the apparatus further includes a continuous-discrete conversion unit, which comprises a preprocessing module, a distance calculation module, and a judgment module. Suppose M discrete data are used, M=2^m, m≥1, and let these discrete data correspond to M values within a predetermined interval [-zone, zone], wherein:
the preprocessing module preprocesses the input continuous data x using a clip(-zone, zone) operation to obtain preprocessed data y within the interval [-zone, zone], where y=-zone if x≤-zone, y=zone if x≥zone, and y=x if -zone<x<zone;
the distance calculation module calculates the distances between the preprocessed data y and each of the above values; and
the judgment module calculates and outputs the discrete data based on these distances.
Optionally, the predetermined interval [-zone, zone] is [-1, 1] or [-2, 2]; and/or the absolute values of the M values are reciprocals of powers of 2; and/or the judgment module performs: outputting the discrete data corresponding to the value closest to the preprocessed data y, and, if two values are equally distant from the preprocessed data, outputting the discrete data corresponding to either of the two; or calculating the normalized probability of the preprocessed data y to each of the two closest values, comparing the normalized probability corresponding to either of the two values with a random number z in (0, 1) generated by a random number generation module, and outputting the discrete data corresponding to that value if z is less than the probability, otherwise outputting the other discrete data.
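A minimal sketch of how the preprocessing, distance calculation, and judgment modules could cooperate, assuming M=4 levels in [-1, 1] whose absolute values are reciprocals of powers of 2; the level set, function names, and the use of the level index as the discrete code are illustrative assumptions, not taken from the patent:

```python
import random

# Illustrative level set for M = 4 within [-zone, zone] = [-1, 1]:
# absolute values are reciprocals of powers of 2, as the text suggests.
LEVELS = [-1.0, -0.5, 0.5, 1.0]

def discretize(x: float, zone: float = 1.0, stochastic: bool = False) -> int:
    # Preprocessing module: clip x into [-zone, zone].
    y = max(-zone, min(zone, x))
    # Distance calculation module: distance from y to every level.
    dists = [abs(y - v) for v in LEVELS]
    order = sorted(range(len(LEVELS)), key=lambda i: dists[i])
    a, b = order[0], order[1]           # indices of the two nearest levels
    if not stochastic or dists[a] + dists[b] == 0:
        return a                        # nearest level; index doubles as the code
    # Stochastic judgment: normalized probability of the nearer level,
    # compared against a random number z in (0, 1).
    p_a = dists[b] / (dists[a] + dists[b])
    z = random.random()
    return a if z < p_a else b

print(LEVELS[discretize(0.3)])          # nearest level: 0.5
```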
Another aspect of the present invention provides a method of performing a single-layer artificial neural network forward operation using the above apparatus. Through the provided instruction set, the controller controls reading in the data required by the operation, such as neurons, weights, and constants; these data may or may not be represented as discrete data. The main operation module, the slave operation modules, and the interconnect module then complete the process of multiplying neuron data by weight data, adding a bias, and applying activation. In particular, for data represented as discrete data, during a multiplication operation the multiplication is replaced by bit operations on the related data according to the value of the discrete data. For example, if weight data is represented by 1-bit discrete data, with 0 representing +1 and 1 representing -1, multiplication by the weight is realized by XORing the sign bit of the data multiplied by the weight.
Another aspect of the present invention provides a method of supporting an artificial neural network batch normalization operation (Batch Normalization) using the above apparatus. Through the provided instruction set, the controller controls the data access unit to read in the input data, and then controls the master and slave operation modules to compute the mean and variance of each position according to the batch size, or to use preset mean and variance values. The controller then controls subtracting the mean from the input data at the corresponding position and dividing by the variance. Finally, the controller controls multiplying the processed data by a learning parameter and adding another learning parameter.
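The following sketch illustrates the batch normalization sequence the controller orchestrates, in plain NumPy rather than device micro-instructions. Note that the text above literally says "dividing by the variance", while this sketch follows the common convention of dividing by the standard deviation sqrt(var + eps); the function and parameter names are ours:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). gamma and beta are the two learning parameters."""
    mean = x.mean(axis=0)                      # per-position mean over the batch
    var = x.var(axis=0)                        # per-position variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # subtract mean, then normalize
    return gamma * x_hat + beta                # multiply by one learning parameter, add the other

x = np.random.randn(8, 4).astype(np.float32)
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```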
Another aspect of the present invention provides a method of performing a multi-layer artificial neural network forward operation using the above apparatus. Its implementation process is similar to that of the single-layer neural network: when the previous layer of the artificial neural network has finished executing, the operation instruction of the next layer uses the output neuron address of the previous layer stored in the main operation unit as the input neuron address of this layer. Likewise, the weight address and bias address in the instruction are also changed to the addresses corresponding to this layer.
Brief Description of the Drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows an example block diagram of the overall structure of an apparatus for performing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present invention.
Fig. 2 schematically shows the structure of the H-tree module (one embodiment of the interconnect module) in an apparatus for performing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present invention.
Fig. 3 shows an example block diagram of the structure of the main operation module in an apparatus for performing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present invention.
Fig. 4 shows an example block diagram of the structure of a slave operation module in an apparatus for performing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present invention.
Fig. 5 shows an example block diagram of a neural network forward operation process according to an embodiment of the present invention.
Fig. 6 shows an example block diagram of a neural network reverse training process supporting discrete data representation according to an embodiment of the present invention.
Fig. 7 shows a flowchart of a single-layer artificial neural network operation according to an embodiment of the present invention.
Fig. 8 shows an example structure of an operation unit according to an embodiment of the present invention.
Fig. 9 shows an example structure of a continuous-discrete conversion module for converting between continuous data and discrete data according to an embodiment of the present invention.
Throughout the drawings, the same devices, components, units, etc. are denoted by the same reference numerals.
Detailed Description
Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the invention taken in conjunction with the accompanying drawings.
In the present invention, the terms "comprise" and "contain" and their derivatives are intended to be inclusive rather than limiting; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below for explaining the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes numerous specific details to assist understanding, but these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, the same reference numerals are used for similar functions and operations throughout the drawings.
The forward operation of a multi-layer artificial neural network supporting discrete data representation according to an embodiment of the present invention comprises multiple neurons in two or more layers. For each layer, a dot product operation is first performed on the input neuron vector and the weight vector, and the result passes through an activation function to yield the output neuron. The activation function may be a sigmoid function or a nonlinear function such as tanh, relu, or softmax, and discretized or continuous representation of the activated output neuron is supported. In another embodiment, the activation function may also be a linear function.
For dot product operations on input neuron vectors represented as discrete data or weight vectors represented as discrete data, the apparatus supports converting the dot product operation into bitwise operations on the data such as shift, NOT, and XOR. For the data representation, the apparatus supports discrete or non-discrete representation; the user can customize which data of which layer use the discrete or non-discrete representation form, and can customize the number of bits of the discrete data according to specific needs as a substitute for the number of real data represented; for example, discrete data set to 1, 2, or 3 bits can represent 2, 4, or 8 real data, respectively.
Fig. 1 shows an example block diagram of the overall structure of the apparatus for performing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present invention. As shown in Fig. 1, the apparatus comprises an instruction cache unit 1, a controller unit 2, a data access unit 3, an interconnect module 4, a main operation module 5, and a plurality of slave operation modules 6, and optionally a continuous-discrete conversion module 7. The instruction cache unit 1, controller unit 2, data access unit 3, interconnect module 4, main operation module 5, slave operation modules 6, and continuous-discrete conversion module 7 may all be implemented by hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors). In particular, the apparatus can provide storage and operation support for discrete data.
The instruction cache unit 1 reads in instructions through the data access unit 3 and caches the read instructions.
The controller unit 2 reads instructions from the instruction cache unit 1 and decodes them into micro-instructions controlling the behavior of other modules, such as the data access unit 3, the main operation module 5, and the slave operation modules 6.
The data access unit 3 can access the external address space, read and write data directly to each cache unit inside the apparatus, and complete the loading and storage of data. The data is discretely or non-discretely represented, and this unit is designed to be able to read discretely represented data.
The interconnect module 4 is used to connect the main operation module and the slave operation modules, and can be implemented in different interconnection topologies (such as tree, ring, grid, hierarchical interconnect, and bus structures).
Fig. 2 schematically shows one embodiment of the interconnect module 4: an H-tree module. The H-tree module 4 constitutes the data path between the main operation module 5 and the plurality of slave operation modules 6, and has an H-tree structure. The H-tree is a binary tree path composed of multiple nodes; each node sends upstream data identically to its two downstream nodes, merges the data returned by the two downstream nodes, and returns the result to the upstream node. For example, at the start of the computation of each layer of the artificial neural network, the neuron data in the main operation module 5, which may be discretely or non-discretely represented, is sent to each slave operation module 6 through the H-tree module 4; after the computation process of the slave operation modules 6 is completed, the neuron values output by each slave operation module are assembled stage by stage in the H-tree into a complete vector of neurons, which serves as the intermediate result vector. For operations on discretely represented data, we particularly note the operation module dedicated to discrete data operations inside the master and slave operation modules, see Fig. 7. Illustrated with a fully connected layer of the neural network: assuming the apparatus has N slave operation modules, the intermediate result vector is segmented by N, each segment having N elements, and the i-th slave operation module computes the i-th element of each segment. The N elements are assembled through the H-tree module into a vector of length N and returned to the main operation module. Thus if the network has only N output neurons, each slave operation unit need only output the value of a single neuron; if the network has m*N output neurons, each slave operation unit outputs m neuron values. The H-tree module supports discrete data representation in both storing and transmitting data.
Fig. 3 shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing an artificial neural network forward operation according to an embodiment of the present invention. As shown in Fig. 3, the main operation module 5 comprises an operation unit 51, a data dependency determination unit 52, and a neuron cache unit 53 supporting discrete data representation.
The neuron cache unit 53 supporting discrete data representation is used to cache the input data and output data used by the main operation module 5 during computation.
The operation unit 51 performs the various operation functions of the main operation module 5. For the case where the operands are all discrete data, addition, subtraction, multiplication, and division of discrete data with discrete data can be realized by table lookup. For example, 2-bit discrete data can represent 4 continuous data values, with 4*4=16 combinations of the 4 continuous data. For each of the addition, subtraction, multiplication, and division operations, a 4*4 index table can be created and maintained, and the corresponding computed value is found through the index table; the 4 operations require 4 index tables of size 4*4 in total.
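A sketch of the table-lookup scheme just described: the four 4*4 index tables for 2-bit codes are precomputed once, after which every discrete-discrete operation is a pure lookup. The value set is the example mapping from the background section; a hardware table would hold codes or fixed-point values rather than Python floats:

```python
# Build the four 4*4 index tables (add, sub, mul, div) for 2-bit discrete
# codes. Values follow the earlier example: 00->-1, 01->-1/8, 10->1/8, 11->1.
VALUES = [-1.0, -1/8, 1/8, 1.0]

def build_table(op):
    return [[op(VALUES[i], VALUES[j]) for j in range(4)] for i in range(4)]

TABLES = {
    "add": build_table(lambda a, b: a + b),
    "sub": build_table(lambda a, b: a - b),
    "mul": build_table(lambda a, b: a * b),
    "div": build_table(lambda a, b: a / b),
}

# Discrete * discrete by pure lookup: codes 0b00 (-1) and 0b10 (1/8).
assert TABLES["mul"][0b00][0b10] == -1/8
```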
For the case where the operands include both discrete and continuous data, corresponding bit operations can be preset for the addition, subtraction, multiplication, and division operations for different discrete data. For example, the dot product of discrete and continuous data can be replaced by cumulative summation after bitwise XOR followed by multiplication by the corresponding power of 2. For example, for multiplication operations, if some multiplication factor data is discretely represented, the multiplication of the continuous data represented by that discrete data can be replaced by operations indexed by the discrete data (e.g., bitwise XOR, NOT, shift, etc. on the corresponding data), thereby reducing the number of multiplier components. For example, for the multiplication of continuous data by discrete data, -1/2 multiplied by 16: a traditional multiplier component would multiply -1/2 and 16 directly. In the operation unit 51, since the possible values of discrete data are few, the function of the operation unit can be replaced by a switch-judgment method such as index lookup. For example, the discrete data representation of -1/2 may be specified as 01; if an operation factor is -1/2, the discrete data received by the operation unit 51 is 01, and the operation unit 51 adopts the operation corresponding to the discrete data 01: inverting the sign bit of the 8-bit fixed-point representation of 16, 00010000, and shifting it right by 1 bit yields 10001000, which is -8 in decimal. For a division operation, 16 divided by -2, where 16 is continuous data and -2 is discrete data: if the binary discrete data representation of -2 is specified as 10, the operation unit adopts the division operation corresponding to the discrete data 10, shifting the 8-bit fixed-point representation of 16, 00010000, right by 1 bit and then inverting the sign bit to obtain 10001000, which is -8 in decimal, giving the result. Addition and subtraction operations are similar to the above process: the binary of the discrete data serves as an index to operations such as bitwise left shift, right shift, and XOR, after which addition or subtraction with the real data represented by the discrete data is realized.
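The worked example above can be reproduced in a few lines. This sketch models a mixed operation unit that, on receiving the discrete code 01 (defined, as in the text, to mean -1/2) or 10 (meaning -2), dispatches to a shift-plus-sign-inversion routine instead of using a multiplier or divider; results are shown as Python integers rather than the 8-bit sign-magnitude pattern 10001000 used in the text:

```python
# Dispatch table: discrete code -> operation applied to the continuous
# operand. Code 01 stands for -1/2 (a multiplication factor) and code 10
# stands for -2 (a division factor), matching the example in the text.
def times_neg_half(x: int) -> int:
    return -(x >> 1)        # shift right by 1 bit, then invert the sign

def div_by_neg_two(x: int) -> int:
    return -(x >> 1)        # 16 / -2 is likewise a shift plus a sign flip

DISPATCH = {0b01: times_neg_half, 0b10: div_by_neg_two}

assert DISPATCH[0b01](16) == -8     # -1/2 * 16, no multiplier component used
assert DISPATCH[0b10](16) == -8     # 16 / -2, no divider component used
```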
The data dependency determination unit 52 is the port through which the operation unit 51 reads and writes the neuron cache unit 53, and at the same time guarantees read-write consistency of the data in the neuron cache unit. The data dependency determination unit 52 is also responsible for sending the read data to the slave operation modules through the interconnect module 4, while the output data of a slave operation module 6 is sent directly to the operation unit 51 through the interconnect module 4. The instructions output by the controller unit 2 are sent to the computation unit 51 and the data dependency determination unit 52 to control their behavior.
Fig. 4 shows an example block diagram of the structure of a slave operation module 6 in the apparatus for performing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present invention. As shown in Fig. 4, each slave operation module 6 comprises an operation unit 61, a data dependency determination unit 62, a neuron cache unit 63 supporting discrete data representation, and a weight cache unit 64 supporting discrete data representation.
The operation unit 61 receives micro-instructions issued by the controller unit 2 and performs arithmetic-logic operations. For the case where the operands are all discrete data, addition, subtraction, multiplication, and division of discrete data with discrete data can be realized by table lookup. For example, 2-bit discrete data can represent 4 continuous data values, with 4*4=16 combinations of the 4 continuous data. For each of the addition, subtraction, multiplication, and division operations, a 4*4 index table can be created and maintained, and the corresponding computed value is found through the index table; the 4 operations require 4 index tables of size 4*4 in total.
For the case where the operands include both discrete and continuous data, corresponding bit operations can be preset for the addition, subtraction, multiplication, and division operations for different discrete data. For example, the dot product of discrete and continuous data can be replaced by cumulative summation after bitwise XOR followed by multiplication by the corresponding power of 2. For example, for multiplication operations, if some multiplication factor data is discretely represented, the multiplication of the continuous data represented by that discrete data can be replaced by operations indexed by the discrete data (e.g., bitwise XOR, NOT, shift, etc. on the corresponding data), thereby reducing the number of multiplier components. For example, for the multiplication of continuous data by discrete data, -1/2 multiplied by 16: a traditional multiplier component would multiply -1/2 and 16 directly. In the operation unit 51, since the possible values of discrete data are few, the function of the operation unit can be replaced by a switch-judgment method such as index lookup. For example, the discrete data representation of -1/2 may be specified as 01; if an operation factor is -1/2, the discrete data received by the operation unit 51 is 01, and the operation unit 51 adopts the operation corresponding to the discrete data 01: inverting the sign bit of the 8-bit fixed-point representation of 16, 00010000, and shifting it right by 1 bit yields 10001000, which is -8 in decimal. For a division operation, 16 divided by -2, where 16 is continuous data and -2 is discrete data: if the binary discrete data representation of -2 is specified as 10, the operation unit adopts the division operation corresponding to the discrete data 10, shifting the 8-bit fixed-point representation of 16, 00010000, right by 1 bit and then inverting the sign bit to obtain 10001000, which is -8 in decimal, giving the result. Addition and subtraction operations are similar to the above process: the binary of the discrete data serves as an index to operations such as bitwise left shift, right shift, and XOR, after which addition or subtraction with the real data represented by the discrete data is realized.
The data dependency judgment unit 62 is responsible for the read and write operations on the neuron cache unit during computation. Before performing a read or write, the data dependency judgment unit 62 first guarantees that there is no read/write consistency conflict among the data used by the instructions. For example, all microinstructions sent to the dependency unit 62 are stored in an instruction queue inside the unit 62; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction may execute only after the write instruction it depends on has executed.
The neuron cache unit 63 supporting discrete data representation caches the input neuron vector data and output neuron value data of the slave operation module 6. The data may be stored and transmitted in discrete form.
The weight cache unit 64 supporting discrete data representation caches the weight data needed by the slave operation module 6 during computation; depending on the user's definition, this data may or may not be discretely represented. Each slave operation module 6 stores only the weights between all input neurons and a subset of the output neurons. Taking a fully connected layer as an example, the output neurons are segmented by the number N of slave operation units, and the weights corresponding to the n-th output neuron of each segment are stored in the n-th slave operation unit.
The slave operation modules 6 implement the first half of the forward operation of each artificial neural network layer, the part that can be computed in parallel. Both the data storage and the operations in this module support discrete data representation. Taking a fully connected artificial neural network layer (MLP) as an example, the process is y = f(wx + b), where the multiplication of the weight matrix w with the input neuron vector x can be divided into unrelated parallel computation sub-tasks; the output and input are column vectors, and each slave operation module 6 computes only the products of the corresponding partial scalar elements of the input vector with the corresponding columns of the weight matrix w. Each output vector obtained is a partial sum of the final result awaiting accumulation, and these partial sums are added pairwise, stage by stage, in the interconnection module 4 to obtain the final result, which may be discretely represented. The computation thus becomes a parallel partial-sum computation followed by accumulation. Each slave operation module 6 computes output neuron values, and all the output neuron values are assembled in the interconnection module 4 into the intermediate result vector. Each slave operation module 6 need compute only the output neuron values of the intermediate result vector y corresponding to that module. The interconnection module 4 sums the neuron values output by all the slave operation modules 6 to obtain the final intermediate result vector y. The master operation module 5 performs the subsequent computation based on the intermediate result vector y, such as adding a bias, pooling (e.g., max pooling (MAXPOOLING) or average pooling (AVGPOOLING)), activation, and sampling.
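The division into parallel partial sums with a staged pairwise accumulation can be pictured with the following sketch; the segmentation scheme follows the fully connected example above, while the function names are illustrative and the pairwise reduction assumes a power-of-two number of slave modules.

```python
import numpy as np

def slave_partial_sum(w, x, n_slaves, i):
    """Partial product computed by slave i: its share of the columns of w
    times the corresponding elements of x (a partial sum of w @ x)."""
    cols = range(i, w.shape[1], n_slaves)   # the i-th column of each segment
    return sum(w[:, j] * x[j] for j in cols)

def pairwise_reduce(parts):
    """Pairwise, stage-by-stage accumulation, as in the interconnection."""
    while len(parts) > 1:
        parts = [parts[k] + parts[k + 1] for k in range(0, len(parts), 2)]
    return parts[0]

w = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
parts = [slave_partial_sum(w, x, 4, i) for i in range(4)]
assert np.allclose(pairwise_reduce(parts), w @ x)  # matches the full product
```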
FIG. 8 shows a structural block diagram of the operation unit, which can be used as the operation unit 51 in the master operation module or the operation unit 61 in a slave operation module. The input data during operation may be discrete or continuous. A data type judgment unit 71 determines whether the input data is all continuous data, all discrete data, or mixed data containing both continuous and discrete data. When the input data is all continuous data, a continuous data operation unit 72 performs the corresponding operation.
When the input data is all discrete data, a discrete data operation unit 73 performs the corresponding operation. As described above, addition, subtraction, multiplication, and division between discrete data can be implemented by table lookup; for 2-bit discrete data representing 4 continuous values, the four operations require four 4*4 index tables in total.
When the input data is mixed data, an operation decision unit 74 decides, according to the discrete data contained therein, which operation should be performed; the corresponding operations may be preset separately for the different discrete values. A mixed data operation unit 75 then performs the operation according to the decision of the operation decision unit 74. As described above for the operation unit 51, when the operands include both discrete and continuous data, preset bit operations indexed by the discrete data (bitwise XOR, NOT, shifts, and accumulation of terms weighted by powers of 2) replace the addition, subtraction, multiplication, or division with the real data that the discrete data represents, reducing the number of multiplier components; the -1/2 × 16 and 16 ÷ -2 examples given above apply here unchanged.
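The type-based dispatch of FIG. 8 can be expressed as a small self-contained sketch; the tagging scheme, the class of operations supported, and the function names are assumptions, and the discrete and mixed paths simply decode the assumed codebook in place of hardware lookup tables and bit operations.

```python
# Assumed 2-bit codebook, shared with the earlier sketches.
VALUES = {0b00: 0.5, 0b01: -0.5, 0b10: 2.0, 0b11: -2.0}

def is_discrete(operand):
    """Assumed tagging: discrete operands travel as (code, 'discrete')."""
    return isinstance(operand, tuple) and operand[1] == "discrete"

def operate(a, b, op):
    """Dispatch as in FIG. 8: classify (unit 71), then route to the
    continuous (72), discrete (73), or mixed (75) path."""
    ops = {"add": lambda p, q: p + q, "mul": lambda p, q: p * q}
    if not is_discrete(a) and not is_discrete(b):
        return ops[op](a, b)                          # continuous unit 72
    if is_discrete(a) and is_discrete(b):             # discrete unit 73:
        return ops[op](VALUES[a[0]], VALUES[b[0]])    # stands in for lookup
    code, x = (a[0], b) if is_discrete(a) else (b[0], a)
    return ops[op](VALUES[code], x)                   # mixed unit 75

print(operate(3.0, 4.0, "mul"))                  # continuous path -> 12.0
print(operate((0b01, "discrete"), 16.0, "mul"))  # mixed path -> -8.0
```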
FIG. 9 shows the continuous-discrete conversion unit. The user may define whether this module is used to convert continuous data into discrete data. Continuous data is input and discrete data is output. The unit comprises a random number generation module, a judgment module, and an operation module. For input continuous data, the operation module produces a computed result; the judgment module then compares a random number with the computed result and judges which interval the random number falls into, thereby determining the specific value of the output discrete data. For example, suppose the user defines that binary discrete data is to be produced. For any input continuous data x, the operation module computes the result y = abs(clip(x, -1, 1)). The judgment module then outputs discrete data 1 if the random number is greater than y, and discrete data 0 otherwise; the discrete data 1 and 0 represent the continuous values -1 and +1, respectively. The resulting discrete data is stored back into memory, where it awaits use by the operation units in the master and slave operation modules, which produce the corresponding operations.
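A minimal sketch of this conversion follows the binary example just given (clip to [-1, 1], take the absolute value, and compare with a uniform random number); the interpretation of the codes 0 and 1 follows the text above, and the function name is illustrative.

```python
import random

def continuous_to_discrete(x):
    """Stochastically convert continuous x to a 1-bit code, as in FIG. 9.
    Code 1 stands for -1 and code 0 for +1, per the example above."""
    y = abs(max(-1.0, min(1.0, x)))   # y = abs(clip(x, -1, 1))
    z = random.random()               # uniform random number in (0, 1)
    return 1 if z > y else 0

random.seed(0)
codes = [continuous_to_discrete(0.3) for _ in range(10000)]
print(sum(codes) / len(codes))        # ~0.7, i.e. P(code = 1) = 1 - |x|
```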
The weight data and the input/output data in the forward process may or may not be represented as discrete data. The multiplication operation on continuous data can be replaced by XOR, NOT, shift, and similar operations based on the discrete data. For example, if the weights are represented by 1-bit discrete data, with 0 standing for +1 and 1 standing for -1, multiplication by a weight is implemented by XORing the sign bit of the data being multiplied by the weight.
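The 1-bit weight multiply thus reduces to a single XOR on the sign bit, as the following sketch shows for an assumed sign-magnitude 8-bit format; the function name is illustrative.

```python
def multiply_by_1bit_weight(x_fixed, w_bit):
    """Multiply an 8-bit sign-magnitude value by a 1-bit weight:
    w_bit = 0 means +1, w_bit = 1 means -1 (sign-bit XOR)."""
    return x_fixed ^ (w_bit << 7)     # flip the sign bit iff the weight is -1

x = 0b00010000                        # +16 in sign-magnitude
print(bin(multiply_by_1bit_weight(x, 1)))  # 0b10010000, i.e. -16
print(bin(multiply_by_1bit_weight(x, 0)))  # unchanged, +16
```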
According to an embodiment of the present invention, an instruction set for performing the forward operation of an artificial neural network on the aforementioned apparatus is also provided. The instruction set includes the CONFIG, COMPUTE, IO, NOP, JUMP, and MOVE instructions, among others, where:
the CONFIG instruction configures, before the computation of each artificial neural network layer begins, the various constants needed by the computation of the current layer;
the COMPUTE instruction performs the arithmetic and logic computation of each artificial neural network layer;
the IO instruction reads in from the external address space the input data needed by the computation and stores the data back to the external space after the computation completes; this data supports discrete representation;
the NOP instruction is responsible for clearing the microinstructions currently in all internal microinstruction cache queues of the apparatus, guaranteeing that all instructions before the NOP instruction have finished executing; the NOP instruction itself contains no operation;
the JUMP instruction is responsible for the jump of the address of the next instruction to be read by the controller from the instruction cache unit, and is used to implement jumps in the control flow;
the MOVE instruction is responsible for moving data at one address of the apparatus's internal address space to another address of the internal address space; this process is independent of the operation unit and occupies no operation unit resources during execution.
FIG. 5 shows an example block diagram of the neural network forward operation process according to an embodiment of the present invention. In the different slave operation modules 6, the input neuron vector is dot-multiplied with the weight vector of each slave operation module 6 to obtain the corresponding output neuron values; all these output neuron values form the intermediate result vector, which, after a bias vector is added and an activation operation is applied, yields the final output neuron vector of the layer. The formula is out = f(w*in + b), where out is the output neuron vector, in is the input neuron vector, b is the bias vector, w is the weight matrix, and f is the activation function. The weight vector of each slave operation module 6 is the column vector of the weight matrix corresponding to that slave operation module 6. The interconnection module sends the input neuron vector [in0, ..., inN] to all the slave operation units, where it is temporarily stored in the neuron cache units. The i-th slave operation unit computes the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector. The results output by the slave operation units are assembled through the interconnection module into the complete output vector and returned to the master operation unit, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
FIG. 6 shows one implementation method of the forward computation of a single-layer artificial neural network supporting discrete data representation according to one embodiment. The flowchart describes the process of implementing the forward operation of a single-layer artificial neural network with discrete data representation, shown in FIG. 5, using the apparatus and instruction set of the present invention.
Step S1.1: store the initial instructions into the instruction storage unit 1;
Step S1.2: read one instruction from the instruction storage unit 1;
Step S1.3: decode the above instruction;
Step S1.4: perform the corresponding operation according to the control signals obtained from decoding;
Step S1.5: write the operation result back into the corresponding storage.
In step S1.1, an initialization IO instruction may be stored, used to move the subsequent instructions.
In step S1.2, the readable instructions include but are not limited to the CONFIG, COMPUTE, IO, NOP, JUMP, and MOVE instructions.
In step S1.3, the control signals of the corresponding modules are obtained by decoding according to the operation type of the instruction (CONFIG, COMPUTE, IO, NOP, JUMP, MOVE, etc.). A CONFIG instruction is decoded into the configuration information for the other modules. A COMPUTE instruction is decoded into the control signals of the master and slave operation modules, controlling the corresponding operations taken for the different discrete data. An IO instruction is decoded into the control signals of the data access module. A NOP instruction produces no actual control signal and is only used to clear the control signals currently in all internal control signal cache queues of the apparatus, guaranteeing that all instructions before the NOP instruction have finished executing. A JUMP instruction yields the control signal for jumping in the instruction stream. A MOVE instruction yields the control signal for moving data inside the apparatus.
In step S1.4, the above modules 2-6 perform the corresponding operations according to the control signals. Taking the execution of a COMPUTE instruction for the forward pass of a neural network supporting discrete data representation as an example, the interconnection module sends the input neuron vector [in0, ..., inN] to all the slave operation modules, where it is temporarily stored in the neuron cache units. The i-th slave operation module computes the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector. The results output by the slave operation modules are assembled through the interconnection module into the complete output vector and returned to the master operation module, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
In step S1.5, each module writes its operation result back into the corresponding cache. Taking the forward operation of a neural network with discrete data representation as an example, the output neuron vector obtained by the master operation module is written back to the storage unit.
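Steps S1.1-S1.5 amount to a fetch-decode-execute loop over the six instruction types; the following sketch models only that control loop, with assumed handler behavior rather than the actual microinstruction encoding.

```python
# A minimal sketch of the S1.1-S1.5 loop, with assumed handler behavior.
OPCODES = ("CONFIG", "COMPUTE", "IO", "NOP", "JUMP", "MOVE")

def run(program):
    pc = 0                                   # S1.2: fetch from storage
    while pc < len(program):
        op, *args = program[pc]              # S1.3: decode
        assert op in OPCODES
        if op == "JUMP":                     # control-flow jump
            pc = args[0]
            continue
        if op != "NOP":                      # NOP only drains the queues
            print(f"execute {op} {args}")    # S1.4: dispatch to modules
        pc += 1                              # S1.5 happens inside handlers

run([("IO", "load"), ("CONFIG", "layer0"), ("COMPUTE", "fc"),
     ("IO", "store"), ("NOP",)])
```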
FIG. 7 shows another, more detailed implementation method of the forward operation of a single-layer artificial neural network according to one embodiment. The flowchart describes the process of implementing the forward operation of the single-layer neural network shown in FIG. 5 using the apparatus and instruction set of the present invention.
In step S1, an IO instruction is stored in advance at the head address of the instruction cache unit 1.
In step S2, the operation starts: the controller unit 2 reads this IO instruction from the head address of the instruction cache unit 1, and according to the decoded microinstructions, the data access unit 3 reads all the corresponding artificial neural network operation instructions from the external address space and caches them in the instruction cache unit 1.
In step S3, the controller unit 2 then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the data access unit 3 reads all the data needed by the master operation module 5 (e.g., including the input neuron vector, the interpolation table, the constant table, and the bias) from the external address space into the neuron cache unit 53 of the master operation module 5. This data supports discrete representation and may be wholly or partly discrete.
In step S4, the controller unit 2 then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the data access unit 3 reads the weight matrix data needed by the slave operation modules 6 from the external address space. This data supports discrete representation and may be wholly or partly discrete.
In step S5, the controller unit 2 then reads the next CONFIG instruction from the instruction cache unit, and according to the decoded microinstructions, the apparatus configures the various constants needed by the computation of this layer of the neural network. For example, the operation units 51 and 61 configure the values of their internal registers according to the parameters in the microinstructions; the parameters include, for example, the precision setting of this layer's computation and the data of the activation function (e.g., the precision bits of this layer's computation, the range parameter of the LRN layer algorithm, the reciprocal of the window size of the AveragePooling layer algorithm, etc.).
In step S6, the controller unit 2 then reads the next COMPUTE instruction from the instruction cache unit, and according to the decoded microinstructions, the master operation module 5 first sends the input neuron vector through the interconnection module 4 to each slave operation module 6, where it is saved in the neuron cache unit 63 of the slave operation module 6.
In step S7, according to the microinstructions decoded from the COMPUTE instruction, the operation unit 61 of a slave operation module 6 reads the weight vector (the column vector of the weight matrix corresponding to this slave operation module 6) from the weight cache unit 64, reads the input neuron vector from the neuron cache unit, completes the dot product of the weight vector and the input neuron vector, and returns the intermediate result through the interconnection. For discrete data, the user may define whether bit operations such as XOR replace the dot product. For example, for a 1-bit discrete data representation with 0 standing for +1 and 1 standing for -1, multiplication by the weight is implemented by XORing the sign bit of the data being multiplied by the weight.
In step S8, in the interconnection module 4, the intermediate results returned by the slave operation modules 6 are assembled stage by stage into the complete intermediate result vector.
In step S9, the master operation module 5 obtains the value returned by the interconnection module 4; according to the microinstructions decoded from the COMPUTE instruction, it reads the bias vector from the neuron cache unit 53, adds it to the vector returned by the interconnection module 4, and then applies the activation to the sum. The apparatus supports a user definition of whether the activated result is represented discretely. The final output neuron vector is written back to the neuron cache unit 53.
In step S10, the controller unit then reads the next IO instruction from the instruction cache unit, and according to the decoded microinstructions, the data access unit 3 stores the output neuron vector in the neuron cache unit 53 to the specified address of the external address space; the operation then ends.
For the artificial neural network batch normalization (Batch Normalization) operation, the operation steps are similar to the above process. Through the provided instruction set, the controller completes the following process: the controller controls the data access unit to read in the input data, and then controls the master and slave operation modules to compute the mean and variance at each position according to the batch size, or to use preset mean and variance values; the controller then controls the subtraction of the mean from the input data at the corresponding position and the division of the difference by the standard deviation (the square root of the variance); finally, the controller controls the multiplication of the processed data by one learned parameter and the addition of another learned parameter.
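A minimal numerical sketch of this batch normalization sequence is given below, assuming the usual per-position statistics over the batch and a small epsilon for numerical stability (the epsilon and the parameter names gamma and beta are assumptions; the text itself does not name them).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization as the controller sequences it: statistics at
    each position over the batch, normalize, then scale and shift by the
    two learned parameters."""
    mean = x.mean(axis=0)                      # mean at each position
    var = x.var(axis=0)                        # variance at each position
    x_hat = (x - mean) / np.sqrt(var + eps)    # subtract mean, divide by std
    return gamma * x_hat + beta                # multiply, then add

x = np.random.randn(8, 4)                      # batch of 8, 4 positions
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))       # ~0 mean, ~1 std per position
```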
For a multi-layer artificial neural network, the implementation process is similar to that of a single-layer neural network: after the execution of the previous artificial neural network layer completes, the operation instruction of the next layer takes the output neuron address of the previous layer stored in the master operation unit as the input neuron address of this layer. Likewise, the weight address and the bias address in the instruction are changed to the addresses corresponding to this layer.
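The address hand-off between layers can be pictured as the following loop, where the output buffer of layer k simply becomes the input buffer of layer k+1; the buffer and function names are illustrative, and relu is assumed as the activation.

```python
import numpy as np

def forward_layer(x, w, b):
    """One layer: dot product, bias, activation (relu assumed here)."""
    return np.maximum(w @ x + b, 0.0)

def forward_network(x, layers):
    """Chain layers by reusing each output as the next layer's input,
    mirroring the address hand-off described above."""
    buf = x                               # 'input neuron address' of layer 0
    for w, b in layers:                   # per-layer weight and bias
        buf = forward_layer(buf, w, b)    # output becomes the next input
    return buf

layers = [(np.eye(3), np.zeros(3)), (2 * np.eye(3), np.ones(3))]
print(forward_network(np.array([1.0, -2.0, 3.0]), layers))  # [3. 1. 7.]
```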
By employing the apparatus and instruction set for performing the forward operation of artificial neural networks, the problems of insufficient CPU and GPU computing performance and the large overhead of front-end decoding are solved, effectively improving the support for the forward operation of multi-layer artificial neural networks.
By employing dedicated on-chip caches for the forward operation of multi-layer artificial neural networks, the reusability of the input neuron and weight data is fully exploited, avoiding repeated reads of this data from memory, reducing the memory access bandwidth, and preventing memory bandwidth from becoming a performance bottleneck for the forward operation of multi-layer artificial neural networks.
By employing the discrete data representation method, compared with representations such as floating-point and fixed-point numbers, the storage and energy overhead of the apparatus is greatly reduced, and the structural layout can be optimized within a limited area, improving metrics such as the operation speed and the performance-to-power ratio.
The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuits, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the described operations can be performed in a different order; moreover, some operations may be performed in parallel rather than sequentially. As to the question of discrete data representation, it should be understood that one can choose which data to represent discretely and which continuously; the choice of whether data is discretely represented runs through the entire operation process.
In the foregoing specification, embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (16)

  1. An apparatus for performing a forward operation of an artificial neural network supporting discrete data representation, comprising an instruction cache unit, a controller unit, a data access unit, an interconnection module, a master operation module, and a plurality of slave operation modules, wherein:
    the instruction cache unit is configured to read instructions in through the data access unit and cache the read instructions;
    the controller unit is configured to read instructions from the instruction cache unit and decode the instructions into microinstructions controlling the behavior of the interconnection module, the master operation module, and the slave operation modules;
    the data access unit is configured to write discrete data or continuous data from the external address space into the corresponding data cache units of the master operation module and each slave operation module, or to read discrete data or continuous data from said data cache units to the external address space;
    at the stage where the forward computation of each neural network layer starts, the master operation module transmits the discrete or continuous input neuron vector of this layer to all the slave operation modules through the interconnection module, and after the computation of the slave operation modules completes, the interconnection module assembles, stage by stage, the discrete or continuous output neuron values of the slave operation modules into an intermediate result vector, wherein, when the input data is mixed data of discrete data and continuous data, the slave operation modules adopt the preset corresponding computation methods for the different discrete data;
    the master operation module is configured to complete subsequent computation using the intermediate result vector, and when the input data is mixed data of discrete data and continuous data, the master operation module adopts the preset corresponding computation methods for the different discrete data.
  2. The apparatus of claim 1, wherein discrete data representation refers to a representation in which specific discrete numbers stand in for the real continuous data.
  3. The apparatus of claim 1, wherein the plurality of slave operation modules compute, in parallel, their respective discrete or continuous output neuron values using the same discrete or continuous input neuron vector and their respective different discrete or continuous weight vectors.
  4. The apparatus of claim 1, wherein the master operation module performs any one of the following operations on the intermediate result vector:
    a bias operation, adding a bias to the intermediate result vector;
    activation of the intermediate result vector, the activation function active being any one of the nonlinear functions sigmoid, tanh, relu, and softmax, or a linear function;
    a sampling operation, comparing the intermediate result vector with a random number, outputting 1 if it is greater than the random number and 0 if it is less than the random number; or
    a pooling operation, including max pooling or average pooling.
  5. The apparatus of claim 1, wherein the slave operation module comprises an input neuron cache unit configured to cache the discrete or continuous input neuron vector.
  6. The apparatus of claim 1, wherein the interconnection module forms a data path for the continuous or discretized data between the master operation module and the plurality of slave operation modules.
  7. The apparatus of claim 1, wherein the master operation module comprises an operation unit, a data dependency judgment unit, and a neuron cache unit, wherein:
    the neuron cache unit is configured to cache the discretely or continuously represented input data and output data used by the master operation module during computation;
    the operation unit completes the various operation functions of the master operation module, and when the input data is mixed data of discrete data and continuous data, adopts the preset corresponding computation methods for the different discrete data;
    the data dependency judgment unit is the port through which the operation unit reads and writes the neuron cache unit, guarantees that there is no consistency conflict in reading and writing the continuous or discrete data in the neuron cache unit, and is responsible for reading the input discrete or continuous neuron vector from the neuron cache unit and sending it to the slave operation modules through the interconnection module; and
    the intermediate result vector from the interconnection module is sent to the operation unit.
  8. The apparatus of claim 1, wherein each slave operation module comprises an operation unit, a data dependency judgment unit, a neuron cache unit, and a weight cache unit, wherein:
    the operation unit receives the microinstructions issued by the controller unit and performs arithmetic and logic operations, and when the input data is mixed data of discrete data and continuous data, adopts the preset corresponding computation methods for the different discrete data;
    the data dependency judgment unit is responsible for the read and write operations during computation on the neuron cache unit supporting discrete data representation and the weight cache unit supporting discrete data representation, guaranteeing that there is no consistency conflict in reading and writing the neuron cache unit supporting discrete data representation and the weight cache unit supporting discrete data representation;
    the neuron cache unit caches the data of the input neuron vector and the output neuron values computed by the slave operation module; and
    the weight cache unit caches the discretely or continuously represented weight vector needed by the slave operation module during computation.
  9. The apparatus of claim 7 or 8, wherein the data dependency judgment unit guarantees that there is no consistency conflict in reading and writing by: judging whether a dependency exists between the data of a microinstruction that has not yet executed and the data of a microinstruction in the course of execution, and if not, allowing the microinstruction to issue immediately; otherwise the microinstruction is allowed to issue only after all the microinstructions it depends on have completely finished executing.
  10. The apparatus of claim 7 or 8, wherein the operation unit in the master operation module or the slave operation module comprises an operation decision unit and a mixed data operation unit; when the input data is mixed data, the operation decision unit decides, according to the discrete data contained therein, which operation should be performed on the mixed data, and the mixed data operation unit then performs the corresponding operation according to the decision result of the operation decision unit.
  11. The apparatus of claim 9, wherein said operation unit in the master operation module or the slave operation module further comprises at least one of a discrete data operation unit and a continuous data operation unit, as well as a data type judgment unit; when the input data is all discrete data, the discrete data operation unit performs the corresponding operation by table lookup according to the input discrete data, and when the input data is all continuous data, the continuous data operation unit performs the corresponding operation.
  12. The apparatus of claim 1, further comprising a continuous-discrete conversion unit, the continuous-discrete conversion unit comprising a preprocessing module, a distance computation module, and a judgment module; suppose M discrete data are used, M = 2^m, m ≥ 1, and let these discrete data correspond respectively to M values in the predetermined interval [-zone, zone], wherein:
    the preprocessing module preprocesses the input continuous data x using the clip(-zone, zone) operation to obtain preprocessed data y in the interval [-zone, zone], where y = -zone if x ≤ -zone, y = zone if x ≥ zone, and y = x if -zone < x < zone;
    the distance computation module computes the distances between the preprocessed data y and each of the above values; and
    the judgment module computes and outputs the discrete data based on these distances.
  13. The apparatus of claim 12, characterized by any one or more of the following:
    the predetermined interval [-zone, zone] is [-1, 1] or [-2, 2];
    the absolute values of the M values are reciprocals of powers of 2; or
    the judgment module performs:
    outputting the discrete data corresponding to the value nearest to the preprocessed data y, and, if two values are equidistant from the preprocessed data, outputting the discrete data corresponding to either of the two; or
    computing the normalized probability of the preprocessed data y with respect to either of the two nearest values, comparing the normalized probability corresponding to either of the two values with a random number z in (0, 1) generated by a random number generation module, and outputting that discrete data if z is less than that probability, and otherwise outputting the other discrete data.
  14. A method of performing a forward operation of a single-layer artificial neural network using the apparatus of any one of claims 1-13, comprising:
    reading, by the data access unit, all artificial neural network operation instructions related to the forward operation of this layer of the artificial neural network from the external address space, and caching them in the instruction cache unit;
    reading, by the continuous-discrete conversion module, the continuous data of this layer of the neural network that needs conversion from the external address space, converting it into discrete data, and storing it back to the external address space;
    reading, by the data access unit, all discrete or continuous data needed by the master operation module and related to the forward operation of this layer of the artificial neural network from the external address space into the neuron cache unit of the master operation module;
    reading, by the data access unit, the discretely or continuously represented weight matrix data needed by the slave operation modules from the external address space;
    configuring the various discretely or continuously represented constants needed by the forward operation of this layer of the neural network;
    sending, by the master operation module, the input neuron vector first through the interconnection module to each slave operation module, where it is saved in the neuron cache unit, supporting discrete data representation, of the slave operation module;
    reading, by the operation unit of a slave operation module, the weight vector from the weight cache unit and the input neuron vector from the neuron cache unit of the slave operation module; completing the dot product of the weight vector and the input neuron vector for vectors without discrete data representation and, for vectors with discrete data representation, judging, through the discrete data operation module and according to the values of the discrete data, the corresponding bit operations that replace the dot product; and returning the obtained neuron values through the interconnection module;
    in the interconnection module, assembling the neuron values returned by the slave operation modules stage by stage into the complete intermediate result vector;
    reading, by the master operation module, the discretely or continuously represented bias vector from the neuron cache unit of the master operation module, adding it to the intermediate result vector returned by the interconnection module, then applying activation to the sum to obtain the output neuron vector, and writing it back to the neuron cache unit of the master operation module; and
    storing, by the data access unit, the output neuron vector in the neuron cache unit of the master operation module to the specified address of the external address space.
  15. A method of performing a batch normalization operation using the apparatus of any one of claims 1-13, comprising:
    reading, by the data access unit, all artificial neural network operation instructions related to this batch normalization forward operation from the external address space, and caching them in the instruction cache unit;
    reading, by the continuous-discrete conversion module, the continuous data of this layer of the neural network that needs conversion from the external address space, converting it into discrete data, and storing it back to the external address space;
    reading, by the data access unit, all discrete or continuous data needed by the master operation module and related to this layer's batch normalization forward operation from the external address space into the neuron cache unit of the master operation module;
    configuring the various discretely or continuously represented constants needed by this layer's batch normalization forward operation;
    sending, by the master operation module, the input neuron vector first through the interconnection module to each slave operation module, where it is saved in the neuron cache unit, supporting discrete data representation, of the slave operation module;
    reading, by the operation unit of a slave operation module, the weight vector from the weight cache unit and the input neuron vector from the neuron cache unit of the slave operation module, computing the mean and standard deviation of the input vector at the scale of each batch, and returning the obtained neuron values through the interconnection module;
    in the interconnection module, assembling the neuron values returned by the slave operation modules stage by stage into the complete intermediate result vector;
    reading, by the master operation module, the discretely or continuously represented input neuron vector from the neuron cache unit of the master operation module, subtracting the mean result vector returned by the interconnection module, then dividing the difference by the standard deviation result to obtain the output neuron vector, and writing it back to the neuron cache unit of the master operation module; and
    storing, by the data access unit, the output neuron vector in the neuron cache unit of the master operation module to the specified address of the external address space.
  16. A method of performing a forward operation of a multi-layer artificial neural network, comprising:
    performing, for each layer, the method of claim 14 or 15, wherein:
    after the execution for the previous layer of the artificial neural network completes, the output neuron address of the previous layer stored in the master operation module is taken as the input neuron address of this layer, and the method of claim 14 or 15 is performed again for said layer.