CN109359736A - Network processor and network computing method - Google Patents

Network processor and network computing method

Info

Publication number
CN109359736A
CN109359736A (application No. CN201811423421.XA)
Authority
CN
China
Prior art keywords
neural network
data
kernel
scratchpad
network computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811423421.XA
Other languages
Chinese (zh)
Inventor
Inventor not announced
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201710222232.5A (published as CN108694181B)
Priority claimed from CN201710227493.6A (published as CN108694441B)
Priority claimed from CN201710256444.5A (published as CN108733625B)
Priority claimed from CN201710266052.7A (published as CN108734280A)
Priority claimed from CN201710312415.6A (published as CN108805271B)
Application filed by Shanghai Cambricon Information Technology Co Ltd
Publication of CN109359736A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30083 Power or thermal control instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802 Instruction prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30025 Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Algebra (AREA)
  • Human Computer Interaction (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)
  • Advance Control (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Power Sources (AREA)
  • Complex Calculations (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Tests Of Electronic Circuits (AREA)

Abstract

The present disclosure provides a network processor and a network computing method. The network processor includes a memory, a scratchpad memory, and a heterogeneous core. The memory stores the data and instructions of a neural network operation. The scratchpad memory is connected to the memory through a memory bus. The heterogeneous core is connected to the scratchpad memory through a scratchpad bus; it reads the data and instructions of the neural network operation through the scratchpad memory, completes the neural network operation, sends the operation result back to the scratchpad memory, and controls the scratchpad memory to write the result back to the memory. The disclosed network processor and network computing method reduce the power overhead of network computation and make full use of the parallelism of the network, thereby lowering the cost and improving the efficiency of network operations.

Description

Network processor and network computing method
This disclosure is a divisional application of Chinese patent application No. 201880001242.9; the entire content of the parent application is incorporated herein by reference.
Technical field
The present disclosure relates to the field of artificial intelligence, and more particularly to a network processor and a network computing method.
Background art
An artificial neural network (ANN) abstracts the neural networks of the human brain from an information-processing perspective, builds simple models of them, and forms different networks by using different connection patterns. Artificial neural networks have made great progress in many fields and are widely used to solve practical problems in pattern recognition, intelligent robotics, automatic control, prediction and estimation, biology, medicine, economics, and other areas.
A single-core neural network processor, as a new type of dedicated processor, uses special instructions and makes full use of the parallelism of neural network operations to perform them. However, because a single-core neural network processor must be compatible with most neural network models and must support neural network operations of different types and different scales, its structure is complex and its cost is high. For small-scale, structurally simple neural network operations, and for simple neural network models such as spiking neural networks (SNN), it also wastes hardware resources and incurs excessive power overhead. Moreover, a single-core neural network processor does not exploit the parallelism between different layers during a neural network operation, so considerable room for optimization remains.
Therefore, for neural network models and neural network operations of different scales, completing the operation while fully exploiting both the intra-layer parallelism and the inter-layer parallelism of the neural network operation, making full use of the neural network computing device, and reducing the redundancy of its functional components have become directions for improving neural network computing devices.
Summary of the invention
(1) Technical problems to be solved
The present disclosure provides a network processor and a network computing method, intended to at least partially solve the technical problems set forth above.
(2) Technical solutions
According to one aspect of the disclosure, a neural network processor is provided, comprising a memory, a scratchpad memory, and a heterogeneous core, wherein:
the memory is configured to store the data and instructions of a neural network operation;
the scratchpad memory is connected to the memory through a memory bus; and
the heterogeneous core is connected to the scratchpad memory through a scratchpad bus, reads the data and instructions of the neural network operation through the scratchpad memory, completes the neural network operation, sends the operation result back to the scratchpad memory, and controls the scratchpad memory to write the operation result back to the memory.
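To make this data flow easier to follow, a minimal Python sketch of the memory-scratchpad-core path described above is given below. The class and method names (Memory, Scratchpad, HeterogeneousCore, load, store, write_back) are illustrative assumptions for this sketch only and are not structures defined by the disclosure.

class Memory:
    def __init__(self):
        self.cells = {}                                # address -> value (data or instruction)
    def read(self, addr):
        return self.cells[addr]
    def write(self, addr, value):
        self.cells[addr] = value

class Scratchpad:
    """Sits between the memory bus and the scratchpad bus."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
    def load(self, addr):                              # read; fetch from memory on a miss
        if addr not in self.lines:
            self.lines[addr] = self.memory.read(addr)
        return self.lines[addr]
    def store(self, addr, value):                      # core -> scratchpad
        self.lines[addr] = value
    def write_back(self, addr):                        # scratchpad -> memory, under core control
        self.memory.write(addr, self.lines[addr])

class HeterogeneousCore:
    def __init__(self, scratchpad):
        self.scratchpad = scratchpad
    def run(self, data_addr, insn_addr, result_addr):
        data = self.scratchpad.load(data_addr)         # read operands through the scratchpad
        insn = self.scratchpad.load(insn_addr)         # read the instruction through the scratchpad
        result = insn(data)                            # complete the neural network operation
        self.scratchpad.store(result_addr, result)     # send the result back to the scratchpad
        self.scratchpad.write_back(result_addr)        # control the write-back to memory
        return result

For example, storing a toy reduction as the "instruction" at address 1 and its operands at address 0, core.run(0, 1, 2) would leave the result at address 2 of the memory.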
In some embodiments, the heterogeneous core includes:
a plurality of computing cores, of at least two different types, configured to perform neural network operations or neural network layer operations; and
one or more logic control cores, configured to determine, based on the data of the neural network operation, the type of computing core that will perform the neural network operation or the neural network layer operation.
In some embodiments, the plurality of computing cores includes x general-purpose cores and v dedicated cores, wherein the dedicated cores are dedicated to performing specified neural network or neural network layer operations, and the general-purpose cores can perform arbitrary neural network or neural network layer operations.
The logic control core determines, based on the data of the neural network operation, whether the dedicated cores and/or the general-purpose cores will perform the neural network operation or the neural network layer operation.
In some embodiments, the general-purpose core is a CPU and the dedicated core is an NPU.
In some embodiments, the scratchpad memory includes a shared scratchpad memory and/or a non-shared scratchpad memory, wherein one shared scratchpad memory is correspondingly connected, through the scratchpad bus, with at least two of the cores in the heterogeneous core, and one non-shared scratchpad memory is correspondingly connected, through the scratchpad bus, with one core in the heterogeneous core.
In some embodiments, the logic control core is connected to the scratchpad memory through the scratchpad bus, reads the data of the neural network operation through the scratchpad memory, and determines, based on the type and parameters of the neural network model in the data of the neural network operation, that a dedicated core and/or a general-purpose core will act as the target core to perform the neural network operation and/or the neural network layer operation.
In some embodiments, the logic control core sends a signal to the target core directly through a control bus, or sends a signal to the target core via the scratchpad memory, so as to control the target core to perform the neural network operation and/or the neural network layer operation.
In some embodiments, the logic control core reads the data and instructions of the neural network operation from the cache and, based on the type and parameters of the neural network model in the data, judges whether there is a dedicated core that supports the neural network operation and can complete it at the required scale. If there is, the neural network operation is completed by that dedicated core; otherwise it is completed by a general-purpose core.
In some embodiments, the neural network processor is provided with a dedicated/general-purpose core information table. The table contains the type, number, address, and idle-state information of each core and is used to judge whether there is a dedicated core that supports the neural network operation and can complete it at the required scale.
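As a rough illustration of such a table and the check it supports, a Python sketch follows. The field names, the max_scale field, and the matching rule are assumptions added for clarity; the disclosure only specifies that the table records type, number, address, and idle state.

from dataclasses import dataclass

@dataclass
class CoreInfo:
    core_type: str            # e.g. "dedicated-SNN", "dedicated-CNN", "general"
    number: int               # core number
    address: int              # physical address of the core
    idle: bool                # idle-state information
    max_scale: int            # largest operation scale the core can complete (assumed field)

def pick_target_core(core_table, op_type, op_scale):
    """Prefer an idle dedicated core that supports the operation and its scale;
    otherwise fall back to an idle general-purpose core."""
    for core in core_table:
        if core.core_type == "dedicated-" + op_type and core.idle and core.max_scale >= op_scale:
            return core
    for core in core_table:
        if core.core_type == "general" and core.idle:
            return core
    raise RuntimeError("no idle core available")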
In some embodiments, the logic control core includes a decoder, the decoder being configured to judge the type of the instruction and, according to the instruction, the type of the network layer.
In some embodiments, the logic control core includes a content-addressable memory, configured to determine, according to the number and scale of the network layer, the type and number of the cores to be used.
In some embodiments, the dedicated cores and the general-purpose cores are numbered uniformly or numbered independently; if the number m of a dedicated core is the same as the number n of a general-purpose core, addressing is performed by physical address.
In some embodiments, the scratchpad memory is a non-shared scratchpad memory, and the neural network processor further includes a data exchange network connected simultaneously with the scratchpad memory, the logic control core, the general-purpose cores, and the dedicated cores.
In some embodiments, the scratchpad memory and the memory are connected by a bus, or the scratchpad memory and the memory are connected through a data exchange network.
According to another aspect of the present disclosure, a neural network computing method is provided, comprising:
reading, by a scratchpad memory, the data and instructions of a neural network operation from a memory; and
receiving, by a heterogeneous core, the data and instructions of the neural network operation sent by the scratchpad memory, and performing the neural network operation.
In some embodiments, after completing the neural network operation, the heterogeneous core sends the operation result back to the scratchpad memory and controls the scratchpad memory to write the operation result back to the memory.
In some embodiments, the heterogeneous core receiving the data and instructions of the neural network operation sent by the scratchpad memory and performing the neural network operation comprises:
a logic control core in the heterogeneous core receiving the data and instructions of the neural network operation sent by the scratchpad memory; and
the logic control core in the heterogeneous core determining, based on the type and parameters of the neural network model in the data of the neural network operation, that a dedicated core and/or a general-purpose core will perform the neural network operation or a neural network layer operation.
In some embodiments, the logic control core in the heterogeneous core determining, based on the type and parameters of the neural network model in the data of the neural network operation, that a dedicated core and/or a general-purpose core will perform a neural network layer operation comprises:
the logic control core in the heterogeneous core judging, based on the type and parameters of the neural network model in the data of the neural network operation, whether there is a qualified dedicated core;
if a dedicated core m is qualified, taking dedicated core m as the target core, the logic control core in the heterogeneous core sending a signal to the target core and sending the addresses of the data and instructions of the neural network operation to the target core;
the target core obtaining, according to the addresses and through the shared or non-shared scratchpad memory, the data and instructions of the neural network operation from the memory, performing the neural network operation, and outputting the operation result to the memory through the shared or non-shared scratchpad memory, whereupon the operation is completed;
if there is no qualified dedicated core, the logic control core in the heterogeneous core sending a signal to a general-purpose core and sending the addresses of the data and instructions of the neural network operation to the general-purpose core; and
the general-purpose core obtaining, according to the addresses and through the shared or non-shared scratchpad memory, the data and instructions of the neural network operation from the memory, performing the neural network operation, and outputting the operation result to the memory through the shared or non-shared scratchpad memory, whereupon the operation is completed.
In some embodiments, a qualified dedicated core is a dedicated core that supports the specified neural network operation and can complete the specified neural network operation at its scale.
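Under the assumptions of the earlier sketches (the Scratchpad object and the pick_target_core helper), the dispatch flow just described can be sketched as follows; the core_impls mapping from core number to a callable stands in for the physical cores and is purely illustrative.

def dispatch_whole_network(core_table, core_impls, net_type, net_scale,
                           scratchpad, data_addr, insn_addr, result_addr):
    # Logic control core: inspect the model type/parameters and choose a qualified
    # dedicated core if one exists, otherwise a general-purpose core.
    target = pick_target_core(core_table, net_type, net_scale)

    # Logic control core sends the data/instruction addresses to the target core
    # (over the control bus or via the scratchpad); here that is a direct call.
    run = core_impls[target.number]

    data = scratchpad.load(data_addr)        # target core fetches operands via the scratchpad
    insn = scratchpad.load(insn_addr)        # and the instructions
    result = run(insn, data)                 # target core completes the operation
    scratchpad.store(result_addr, result)
    scratchpad.write_back(result_addr)       # result is written back to memory; operation done
    return result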
In some embodiments, the logic control core in the heterogeneous core determining, based on the type and parameters of the neural network model in the data of the neural network operation, that a dedicated core and/or a general-purpose core will perform the neural network operation comprises:
the logic control core in the heterogeneous core parsing the type and parameters of the neural network model in the data, judging for each neural network layer whether there is a qualified dedicated core, allocating a corresponding general-purpose core or dedicated core to each neural network layer, and obtaining a core sequence corresponding to the neural network layers;
the logic control core in the heterogeneous core sending the addresses of the data and instructions of each neural network layer operation to the dedicated core or general-purpose core corresponding to that layer, and sending to that core the number of the next dedicated core or general-purpose core in the core sequence;
the dedicated cores and general-purpose cores corresponding to the neural network layers reading the data and instructions of the layer operations from the addresses, performing the layer operations, and transferring the operation results to designated addresses in the shared and/or non-shared scratchpad memory; and
the logic control core controlling the shared and/or non-shared scratchpad memory to write the operation results of the neural network layers back to the memory, whereupon the operation is completed.
In some embodiments, a qualified dedicated core is a dedicated core that supports the specified neural network layer operation and can complete the specified neural network layer operation at its scale.
In some embodiments, the neural network operation includes a spiking neural network operation, and the neural network layer operation includes a convolution operation of a neural network layer, a fully connected layer operation, a concatenation operation, an element-wise addition/multiplication, a ReLU operation, a pooling operation, and/or a Batch Norm operation.
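A per-layer sketch of the assignment and core sequence described above, reusing the assumed pick_target_core helper and Scratchpad object from the earlier sketches, might look like the following; representing each layer as a (type, scale) pair is an assumption for illustration.

def build_core_sequence(core_table, layers):
    """layers: list of (layer_type, layer_scale) pairs, e.g. ("conv", 64)."""
    return [pick_target_core(core_table, layer_type, layer_scale)
            for layer_type, layer_scale in layers]

def run_layer_pipeline(core_sequence, core_impls, scratchpad, input_addr, layer_out_addrs):
    addr = input_addr
    for core, out_addr in zip(core_sequence, layer_out_addrs):
        run = core_impls[core.number]
        x = scratchpad.load(addr)            # read this layer's input
        y = run(x)                           # dedicated or general-purpose core runs the layer
        scratchpad.store(out_addr, y)        # result goes to the agreed scratchpad address,
        addr = out_addr                      # which is the next core's input address
    scratchpad.write_back(addr)              # final layer's result is written back to memory
    return scratchpad.lines[addr]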
(3) Beneficial effects
It can be seen from the above technical solutions that the computing device and method of the disclosure have at least one of the following beneficial effects:
(1) Storing neuron data in a power data representation reduces the storage space needed for network data; at the same time, this data representation simplifies the multiplication of neurons and weight data, lowers the design requirements on the computing device, and speeds up the neural network operation.
(2) Converting the neuron data obtained after an operation into power neuron data reduces the overhead of neural network storage resources and computing resources, which helps improve the operation speed of the neural network.
(3) Non-power neuron data can first undergo power conversion before being input to the neural network computing device, further reducing the overhead of neural network storage resources and computing resources and improving the operation speed of the neural network.
(4) The data screening device and method of the disclosure temporarily store the data and instructions involved in a screening operation in a dedicated cache, so data screening can be performed more efficiently on data of different storage structures and different sizes.
(5) Using a heterogeneous core to perform neural network operations makes it possible to select different cores for the operation according to the type and scale of the actual neural network, making full use of the actual computing capability of the hardware, reducing cost, and reducing power overhead.
(6) Having different cores perform the operations of different layers, with concurrent operation between layers, makes full use of the parallelism of the neural network and improves the efficiency of neural network operations.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the disclosure and constitute a part of the specification; together with the following detailed description, they serve to explain the disclosure but do not limit it. In the drawings:
Figure 1A is a structural schematic diagram of a neural network computing device according to an embodiment of the disclosure.
Figure 1B is a structural schematic diagram of a neural network computing device according to another embodiment of the disclosure.
Fig. 1C is a functional schematic diagram of the operation unit according to an embodiment of the disclosure.
Fig. 1D is another functional schematic diagram of the operation unit according to an embodiment of the disclosure.
Fig. 1E is a functional schematic diagram of the main processing circuit according to an embodiment of the disclosure.
Fig. 1F is another structural schematic diagram of a neural network computing device according to an embodiment of the disclosure.
Fig. 1G is yet another structural schematic diagram of a neural network computing device according to an embodiment of the disclosure.
Fig. 1H is a flowchart of a neural network computing method according to an embodiment of the disclosure.
Fig. 1I is a schematic diagram of a coding table according to an embodiment of the disclosure.
Fig. 1J is another schematic diagram of a coding table according to an embodiment of the disclosure.
Fig. 1K is yet another schematic diagram of a coding table according to an embodiment of the disclosure.
Fig. 1L is still another schematic diagram of a coding table according to an embodiment of the disclosure.
Fig. 1M is a schematic diagram of the representation of power data according to an embodiment of the disclosure.
Fig. 1N is a schematic diagram of the multiplication of a weight and a power neuron according to an embodiment of the disclosure.
Fig. 1O is another schematic diagram of the multiplication of a weight and a power neuron according to an embodiment of the disclosure.
Fig. 2A is a structural schematic diagram of a neural network computing device according to an embodiment of the disclosure.
Fig. 2B is a flowchart of a neural network computing method according to an embodiment of the disclosure.
Fig. 2C is a schematic diagram of the representation of power data according to an embodiment of the disclosure.
Fig. 2D is a schematic diagram of the multiplication of a neuron and a power weight according to an embodiment of the disclosure.
Fig. 2E is another schematic diagram of the multiplication of a neuron and a power weight according to an embodiment of the disclosure.
Fig. 2F is a flowchart of a neural network computing method according to an embodiment of the disclosure.
Fig. 2G is a schematic diagram of the representation of power data according to an embodiment of the disclosure.
Fig. 2H is a schematic diagram of the multiplication of a power neuron and a power weight according to an embodiment of the disclosure.
Fig. 3A is a structural schematic diagram of a computing device proposed by the disclosure.
Fig. 3B is an information-flow diagram of the computing device proposed by the disclosure.
Fig. 3C is a structural schematic diagram of the computing module in the computing device proposed by the disclosure.
Fig. 3D is a schematic diagram of the computing module proposed by the disclosure performing a matrix operation.
Fig. 3E is a structural schematic diagram of the operation control module in the computing device proposed by the disclosure.
Fig. 3F is a detailed structural schematic diagram of a computing device proposed by an embodiment of the disclosure.
Fig. 3G is a flowchart of an operation method proposed by another embodiment of the disclosure.
Fig. 4A is an overall structural schematic diagram of a data screening device according to an embodiment of the disclosure.
Fig. 4B is a functional schematic diagram of a data screening unit according to an embodiment of the disclosure.
Fig. 4C is a detailed structural schematic diagram of a data screening device according to an embodiment of the disclosure.
Fig. 4D is another detailed structural schematic diagram of a data screening device according to an embodiment of the disclosure.
Fig. 4E is a flowchart of a data screening method according to an embodiment of the disclosure.
Fig. 5A schematically illustrates a heterogeneous multi-core neural network processor according to an embodiment of the disclosure.
Fig. 5B schematically illustrates a heterogeneous multi-core neural network processor according to another embodiment of the disclosure.
Fig. 5C is a flowchart of a neural network computing method according to another embodiment of the disclosure.
Fig. 5D is a flowchart of a neural network computing method according to yet another embodiment of the disclosure.
Fig. 5E schematically illustrates a heterogeneous multi-core neural network processor according to yet another embodiment of the disclosure.
Fig. 5F schematically illustrates a heterogeneous multi-core neural network processor according to still another embodiment of the disclosure.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the disclosure clearer, the disclosure is further described below with reference to specific embodiments and the accompanying drawings.
It should be noted that similar or identical parts use the same reference numerals in the drawings and in the description of the specification. Implementations not shown or described in the drawings are forms known to those of ordinary skill in the art. In addition, although examples of parameters with particular values may be provided herein, it should be understood that a parameter need not exactly equal the corresponding value, but may approximate it within acceptable error margins or design constraints. The directional terms mentioned in the embodiments, such as "upper", "lower", "front", "rear", "left", and "right", refer only to the directions in the drawings; they are used for description and not to limit the scope of protection of the disclosure.
In one embodiment of the disclosure, as shown in Figure 1A, a computing device comprises: a computing module 1-1, configured to perform a neural network operation; and a power conversion module 1-2, connected to the computing module and configured to convert the input neuron data and/or the output neuron data of the neural network operation into power neuron data.
In another embodiment, as shown in Figure 1B, a computing device comprises:
a storage module 1-4, configured to store data and operation instructions;
a control module 1-3, connected to the storage module and configured to control the interaction of data and operation instructions, to receive the data and operation instructions sent by the storage module, and to decode the operation instructions into operation micro-instructions;
a computing module 1-1, connected to the control module, configured to receive the data and operation micro-instructions sent by the control module and to perform a neural network operation on the received weight data and neuron data according to the operation micro-instructions; and
a power conversion module 1-2, connected to the computing module and configured to convert the input neuron data and/or the output neuron data of the neural network operation into power neuron data.
Those skilled in the art will understand that the storage module may be integrated inside the computing device, or may be arranged outside the computing device as an off-chip memory.
Specifically, still referring to Figure 1B, the storage module includes a storage unit 1-41, configured to store data and operation instructions.
The control module includes:
an operation instruction cache unit 1-32, connected to the data control unit and configured to receive the operation instructions sent by the data control unit;
a decoding unit 1-33, connected to the operation instruction cache unit and configured to read the operation instructions from the operation instruction cache unit and decode them into operation micro-instructions;
an input neuron cache unit 1-34, connected to the data control unit and configured to receive the neuron data sent by the data control unit;
a weight cache unit 1-35, connected to the data control unit and configured to receive the weight data sent by the data control unit; and
a data control unit 1-31, connected to the storage module and configured to handle the interaction of data and operation instructions between the storage module and, respectively, the operation instruction cache unit, the weight cache unit, and the input neuron cache unit.
The computing module includes an operation unit 1-11, connected to the decoding unit, the input neuron cache unit, and the weight cache unit, respectively, which receives the operation micro-instructions, the neuron data, and the weight data and performs the corresponding operations on the received neuron data and weight data according to the operation micro-instructions.
In an optional embodiment, the operation unit includes, but is not limited to: a first part with one or more multipliers; a second part with one or more adders (more specifically, the adders of the second part may form an adder tree); a third part with an activation function unit; and/or a fourth part with a vector processing unit. More specifically, the vector processing unit can handle vector operations and/or pooling operations. The first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the output (out): out = in1 * in2. The second part adds input data in1 through the adders to obtain the output data (out). More specifically, when the second part is an adder tree, input data in1 is added stage by stage through the adder tree to obtain the output data (out), where in1 is a vector of length N and N is greater than 1: out = in1[1] + in1[2] + ... + in1[N]; and/or the accumulated input data (in1) is added to input data (in2) to obtain the output data (out): out = in1[1] + in1[2] + ... + in1[N] + in2; or input data (in1) is added to input data (in2) to obtain the output data (out): out = in1 + in2. The third part applies an activation function (active) to input data (in) to obtain the activation output data (out): out = active(in); the activation function active may be sigmoid, tanh, relu, softmax, and so on. In addition to the activation operation, the third part can implement other non-linear functions, obtaining the output data (out) from input data (in) through an operation (f): out = f(in). The vector processing unit applies a pooling operation to input data (in) to obtain the output data (out) after pooling: out = pool(in), where pool is the pooling operation, which includes but is not limited to average pooling, max pooling, and median pooling, and the input data in is the data in the pooling kernel associated with the output out.
The operations performed by the operation unit thus include: a first part that multiplies the input data 1 by input data 2 to obtain the multiplied data; and/or a second part that performs addition (more specifically, an adder-tree operation, adding input data 1 stage by stage through the adder tree) or adds input data 1 and input data 2 to obtain the output data; and/or a third part that performs an activation function operation, applying the activation function (active) to the input data to obtain the output data; and/or a fourth part that performs a pooling operation, out = pool(in), where pool is a pooling operation that includes but is not limited to average pooling, max pooling, and median pooling, and the input data in is the data in the pooling kernel associated with the output out. The operations of the above parts can be freely combined, selecting one or more parts in different orders, to realize operations with various functions. The computing unit accordingly forms a two-, three-, or four-stage pipeline architecture.
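A NumPy sketch of the four parts just described (multiplier, adder/adder tree, activation unit, vector/pooling unit) is given below; it mirrors the formulas in the text and is illustrative of the arithmetic only, not of the hardware pipeline.

import numpy as np

def part1_multiply(in1, in2):               # out = in1 * in2
    return in1 * in2

def part2_adder_tree(in1, in2=None):        # out = in1[1] + ... + in1[N] (+ in2)
    out = np.sum(in1)
    return out + in2 if in2 is not None else out

def part3_activate(x, active="relu"):       # out = active(in)
    if active == "relu":
        return np.maximum(x, 0.0)
    if active == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))
    if active == "tanh":
        return np.tanh(x)
    raise ValueError("unsupported activation: " + active)

def part4_pool(x, size=2, mode="max"):      # out = pool(in), over non-overlapping windows
    x = np.asarray(x, dtype=float)
    x = x[: len(x) - len(x) % size].reshape(-1, size)
    return x.max(axis=1) if mode == "max" else x.mean(axis=1)

Chaining the parts in different orders, for example part3_activate(part2_adder_tree(part1_multiply(a, b))), corresponds to the two-, three-, or four-stage pipelines mentioned above.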
In another optional embodiment, the operation unit may include one main processing circuit and a plurality of slave processing circuits.
The main processing circuit is configured to split one piece of input data into a plurality of data blocks and to send at least one of the data blocks and at least one of a plurality of operation instructions to the slave processing circuits;
the plurality of slave processing circuits are configured to perform operations on the received data blocks according to the operation instructions to obtain intermediate results, and to transfer the intermediate results to the main processing circuit; and
the main processing circuit is configured to process the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction, and to send the result of the operation instruction to the data control unit.
In an alternative embodiment, the operation unit, as shown in Fig. 1C, may include a branch processing circuit, wherein:
the main processing circuit is connected to the branch processing circuit, and the branch processing circuit is connected to a plurality of slave processing circuits; and
the branch processing circuit is configured to forward data or instructions between the main processing circuit and the slave processing circuits.
In another optional embodiment, the operation unit, as shown in Fig. 1D, may include one main processing circuit and a plurality of slave processing circuits. Optionally, the plurality of slave processing circuits are arranged in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k of the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row, and the m slave processing circuits of the 1st column.
The k slave processing circuits are configured to forward data and instructions between the main processing circuit and the plurality of slave processing circuits.
Optionally, as shown in Fig. 1E, the main processing circuit may further include one of, or any combination of, a conversion processing circuit, an activation processing circuit, and an addition processing circuit, wherein:
the conversion processing circuit is configured to perform, on the data blocks or intermediate results received by the main processing circuit, an exchange between a first data structure and a second data structure (for example, conversion between continuous data and discrete data), or an exchange between a first data type and a second data type (for example, conversion between a fixed-point type and a floating-point type);
the activation processing circuit is configured to perform the activation operation on data in the main processing circuit; and
the addition processing circuit is configured to perform addition or accumulation operations.
The slave processing circuit includes:
a multiplication processing circuit, configured to perform a product operation on the received data block to obtain a product result;
a forwarding processing circuit (optional), configured to forward the received data block or the product result; and
an accumulation processing circuit, configured to perform an accumulation operation on the product result to obtain the intermediate result.
In another optional embodiment, the operation instruction may be an operation instruction such as a matrix-multiply-matrix instruction, an accumulation instruction, or an activation instruction.
The output module 1-5 includes an output neuron cache unit 1-51, connected to the operation unit and configured to receive the neuron data output by the operation unit.
The power conversion module includes:
a first power conversion unit 1-21, connected to the output neuron cache unit and configured to convert the neuron data output by the output neuron cache unit into power neuron data; and
a second power conversion unit 1-22, connected to the storage module and configured to convert the neuron data input to the storage module into power neuron data. Power neuron data in the input data of the neural network is stored directly into the storage module.
If the neural network computing device implements data input/output through an I/O module, the first and second power conversion units may also be arranged between the I/O module and the computing module, to convert the input neuron data and/or the output neuron data of the neural network operation into power neuron data.
Optionally, the computing device may include a third power conversion unit 1-23, configured to convert power neuron data into non-power neuron data. Non-power neuron data is converted into power neuron data by the second power conversion unit and then input to the operation unit for the operation; during the operation, to improve precision, the third power conversion unit may optionally be provided to convert power neuron data back into non-power neuron data. The third power conversion unit may be located outside the computing module (as shown in Fig. 1F) or inside the computing module (as shown in Fig. 1G). The non-power neuron data output after the operation can be converted into power neuron data by the first power conversion unit and then fed back to the data control unit to participate in subsequent operations, thereby forming a closed loop that speeds up the operation.
Of course, the data output by the computing module may also be sent directly to the output neuron cache unit and then sent by the output neuron cache unit to the data control unit without passing through a power conversion unit.
The storage module can receive data and operation instructions from an external address space; the data includes neural network weight data, neural network input data, and so on.
In addition, there are many optional ways to perform the power conversion. The three power conversion operations used in this embodiment are listed below:
The first power conversion method:

s_out = s_in
d_out+ = floor(log2(d_in+))

where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data, d_in+ = d_in × s_in, d_out+ is the positive part of the output data, d_out+ = d_out × s_out, and floor(x) denotes rounding x down to an integer.

The second power conversion method:

s_out = s_in
d_out+ = ceil(log2(d_in+))

where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data, d_in+ = d_in × s_in, d_out+ is the positive part of the output data, d_out+ = d_out × s_out, and ceil(x) denotes rounding x up to an integer.

The third power conversion method:

s_out = s_in
d_out+ = round(log2(d_in+))

where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data, d_in+ = d_in × s_in, d_out+ is the positive part of the output data, d_out+ = d_out × s_out, and round(x) denotes rounding x to the nearest integer.
It should be noted that, in addition to rounding to the nearest integer, rounding up, and rounding down, the power conversion of the disclosure may also round toward odd, round toward even, round toward zero, or round stochastically. Among these, rounding to the nearest integer, rounding toward zero, and stochastic rounding are preferred, as they reduce the loss of precision.
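The three conversion rules, together with the sign handling defined above, can be summarized in the following Python sketch; the function name and the handling of a zero input (returning None for the power field, to be encoded as the zero-setting power value) are assumptions for illustration.

import math

def power_convert(d_in, mode="floor"):
    s_in = -1 if d_in < 0 else 1                  # sign of the input data
    d_in_pos = d_in * s_in                        # positive part d_in+
    if d_in_pos == 0:
        return s_in, None                         # to be encoded with the zero-setting power value
    log2 = math.log2(d_in_pos)
    if mode == "floor":                           # first method:  d_out+ = floor(log2(d_in+))
        d_out_pos = math.floor(log2)
    elif mode == "ceil":                          # second method: d_out+ = ceil(log2(d_in+))
        d_out_pos = math.ceil(log2)
    else:                                         # third method:  d_out+ = round(log2(d_in+))
        d_out_pos = round(log2)
    return s_in, d_out_pos                        # (sign s_out = s_in, power exponent)

For example, power_convert(0.30, "floor") returns (1, -2), i.e. 0.30 is approximated by +2^-2.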
In addition, an embodiment of the present disclosure further provides a neural network computing method, comprising: performing a neural network operation; and, before performing the neural network operation, converting the input neuron data of the neural network operation into power neuron data, and/or, after performing the neural network operation, converting the output neuron data of the neural network operation into power neuron data.
Optionally, the step of converting the input neuron data of the neural network operation into power neuron data before performing the neural network operation includes: converting the non-power neuron data in the input data into power neuron data; and receiving and storing the operation instructions, the power neuron data, and the weight data.
Optionally, between the step of receiving and storing the operation instructions, the power neuron data, and the weight data and the step of performing the neural network operation, the method further includes: reading the operation instructions and decoding them into operation micro-instructions.
Optionally, in the step of performing the neural network operation, the neural network operation is performed on the weight data and the power neuron data according to the operation micro-instructions.
Optionally, the step of converting the output neuron data of the neural network operation into power neuron data after performing the neural network operation includes: outputting the neuron data obtained after the neural network operation; and converting the non-power neuron data among the neuron data obtained after the neural network operation into power neuron data.
Optionally, the non-power neuron data among the neuron data obtained after the neural network operation is converted into power neuron data and sent to the data control unit to serve as the input power neurons of the next layer of the neural network operation; the neural network operation step and the step of converting non-power neuron data into power neuron data are repeated until the operation of the last layer of the neural network ends.
Specifically, the neural network of the embodiment of the present disclosure is a multi-layer neural network. In some embodiments, each layer of the neural network can be operated on by the operation method shown in Fig. 1H. The input power neuron data of the first layer of the neural network can be read in from an external address by the storage module; if the data read from the external address is already power data, it is passed directly to the storage module, otherwise it is first converted into power data by the power conversion unit. Thereafter, the input power neuron data of each layer of the neural network can be provided by the output power neuron data of one or more layers preceding that layer. Referring to Fig. 1H, a single-layer neural network operation method of this embodiment comprises:
Step S1-1: obtaining the operation instructions, the weight data, and the neuron data.
The step S1-1 includes the following sub-steps:
S1-11: inputting the operation instructions, the neuron data, and the weight data into the storage module, wherein power neuron data is input to the storage module directly, and non-power neuron data is input to the storage module after conversion by the second power conversion unit;
S1-12: the data control unit receiving the operation instructions, the power neuron data, and the weight data sent by the storage module;
S1-13: the operation instruction cache unit, the input neuron cache unit, and the weight cache unit respectively receiving the operation instructions, the power neuron data, and the weight data sent by the data control unit and distributing them to the decoding unit or the operation unit.
The power neuron data represents the value of the neuron data in the form of its power exponent. Specifically, the power neuron data includes a sign bit and power bits: the sign bit represents the sign of the power neuron data with one or more bits, and the power bits represent the power data of the power neuron data with m bits, m being a positive integer greater than 1. The storage unit of the storage module pre-stores a coding table that provides the exponent value corresponding to each power-bit value of the power neuron data. The coding table sets one or more power-bit values (that is, zero-setting power-bit values) to indicate that the corresponding power neuron data is 0. In other words, when the power bits of a power neuron datum equal a zero-setting power-bit value in the coding table, the power neuron datum is 0. The coding table can be stored flexibly: it may be stored in the form of a table, or it may be a mapping given by a functional relation.
The correspondence in the coding table can be arbitrary.
For example, the correspondence in the coding table can be unordered. Fig. 1I shows part of a coding table with m equal to 5: when the power bits are 00000, the corresponding exponent value is 0; when the power bits are 00001, the corresponding exponent value is 3; when the power bits are 00010, the corresponding exponent value is 4; when the power bits are 00011, the corresponding exponent value is 1; and when the power bits are 00100, the corresponding power neuron datum is 0.
The correspondence in the coding table can also be positively correlated. The storage module pre-stores an integer value x and a positive integer value y: the smallest power-bit value corresponds to exponent value x, and any other one or more power-bit values correspond to power neuron data equal to 0; x denotes an offset and y denotes a step size. In one embodiment, the smallest power-bit value corresponds to exponent value x, the largest power-bit value corresponds to power neuron data equal to 0, and the other power-bit values, apart from the smallest and largest, correspond to exponent values of (power-bit value + x) * y. By presetting different x and y, and by changing the values of x and y, the representable range of powers can be adjusted to match different application scenarios requiring different numerical ranges. This gives the neural network computing device a wider range of applications and more flexible use, since it can be adjusted to user needs.
In one embodiment, y is 1 and x equals -2^(m-1). The exponent range of the values represented by the power neuron data is then -2^(m-1) to 2^(m-1) - 1.
In one embodiment, Fig. 1J shows part of a coding table with m equal to 5, x equal to 0, and y equal to 1: when the power bits are 00000, the corresponding exponent value is 0; when the power bits are 00001, the corresponding exponent value is 1; when the power bits are 00010, the corresponding exponent value is 2; when the power bits are 00011, the corresponding exponent value is 3; and when the power bits are 11111, the corresponding power neuron datum is 0. Fig. 1K shows part of another coding table with m equal to 5, x equal to 0, and y equal to 2: when the power bits are 00000, the corresponding exponent value is 0; when the power bits are 00001, the corresponding exponent value is 2; when the power bits are 00010, the corresponding exponent value is 4; when the power bits are 00011, the corresponding exponent value is 6; and when the power bits are 11111, the corresponding power neuron datum is 0.
The correspondence in the coding table can be negatively correlated. The storage module pre-stores an integer value x and a positive integer value y: the largest power-bit value corresponds to exponent value x, and any other one or more power-bit values correspond to power neuron data equal to 0; x denotes an offset and y denotes a step size. In one embodiment, the largest power-bit value corresponds to exponent value x, the smallest power-bit value corresponds to power neuron data equal to 0, and the other power-bit values, apart from the smallest and largest, correspond to exponent values of (power-bit value - x) * y. By presetting different x and y, and by changing the values of x and y, the representable range of powers can be adjusted to match different application scenarios requiring different numerical ranges. This gives the neural network computing device a wider range of applications and more flexible use, since it can be adjusted to user needs.
In one embodiment, y is 1 and x equals 2^(m-1). The exponent range of the values represented by the power neuron data is then -2^(m-1) - 1 to 2^(m-1).
Fig. 1L shows part of a coding table with m equal to 5: when the power bits are 11111, the corresponding exponent value is 0; when the power bits are 11110, the corresponding exponent value is 1; when the power bits are 11101, the corresponding exponent value is 2; when the power bits are 11100, the corresponding exponent value is 3; and when the power bits are 00000, the corresponding power neuron datum is 0.
The correspondence in the coding table can be that the highest bit of the power bits represents a zero flag and the other m-1 power bits correspond to the exponent value. When the highest power bit is 0, the corresponding power neuron datum is 0; when the highest power bit is 1, the corresponding power neuron datum is not 0. The reverse is also possible: when the highest power bit is 1, the corresponding power neuron datum is 0, and when the highest power bit is 0, the corresponding power neuron datum is not 0. In other words, one bit is separated out of the power bits of the power neuron data to indicate whether the power neuron datum is 0.
In a specific example, shown in Fig. 1M, the sign bit is 1 bit and the power bits are 7 bits, that is, m is 7. The coding table specifies that when the power bits are 1111111 the corresponding power neuron datum is 0, and when the power bits are any other value they correspond to the corresponding two's-complement value. When the sign bit of a power neuron datum is 0 and its power bits are 0001001, it represents the specific value 2^9, that is, 512; when the sign bit of a power neuron datum is 1 and its power bits are 1111101, it represents the specific value -2^(-3), that is, -0.125. Compared with floating-point data, power data retains only the exponent of the data, which greatly reduces the storage space required to store the data.
By using the power data representation, the storage space required to store neuron data can be reduced. In the example provided in this embodiment, the power data is 8-bit data; it should be understood that this data length is not fixed, and different data lengths are used on different occasions according to the data range of the neuron data.
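For the representation used in the example above (a 1-bit sign, m = 7 power bits interpreted as a two's-complement exponent, and an all-ones power field meaning zero), a small decoding sketch follows; the helper name and bit layout are assumptions for illustration.

def decode_power_neuron(sign_bit, power_bits, m=7):
    if power_bits == (1 << m) - 1:               # zero-setting power value (all ones)
        return 0.0
    # interpret the m power bits as a two's-complement exponent
    exponent = power_bits - (1 << m) if power_bits >= (1 << (m - 1)) else power_bits
    return (-1.0 if sign_bit else 1.0) * 2.0 ** exponent

The two values from the text decode as decode_power_neuron(0, 0b0001001) == 512.0 and decode_power_neuron(1, 0b1111101) == -0.125.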
Step S1-2: performing the neural network operation on the weight data and the neuron data according to the operation micro-instructions. The step S1-2 includes the following sub-steps:
S1-21: the decoding unit reading the operation instructions from the operation instruction cache unit and decoding them into operation micro-instructions;
S1-22: the operation unit respectively receiving the operation micro-instructions, the power neuron data, and the weight data sent by the decoding unit, the input neuron cache unit, and the weight cache unit, and performing the neural network operation on the weight data and the power neuron data according to the operation micro-instructions.
The multiplication of a power neuron by a weight is performed as follows: the sign bit of the power neuron datum and the sign bit of the weight datum are XOR-ed; when the correspondence in the coding table is unordered, the coding table is searched to find the exponent value corresponding to the power bits of the power neuron datum; when the correspondence in the coding table is positively correlated, the minimum exponent value of the coding table is recorded and an addition is performed to find the exponent value corresponding to the power bits of the power neuron datum; when the correspondence in the coding table is negatively correlated, the maximum value of the coding table is recorded and a subtraction is performed to find the exponent value corresponding to the power bits of the power neuron datum; the exponent value is then added to the exponent bits of the weight datum, and the significand of the weight datum remains unchanged.
A specific example is shown in Fig. 1N. The weight datum is a 16-bit floating-point number with sign bit 0, exponent bits 10101, and significand bits 0110100000, so the actual value it represents is 1.40625 * 2^6. For the power neuron datum, the sign bit is 1 bit and the power bits are 5 bits, that is, m is 5. The coding table specifies that when the power bits are 11111 the corresponding power neuron datum is 0, and when the power bits are any other value they correspond to the corresponding two's-complement value. The power neuron is 000110, so the actual value it represents is 64, that is, 2^6. The exponent bits of the weight plus the power bits of the power neuron give 11011, so the actual value of the result is 1.40625 * 2^12, which is the product of the neuron and the weight. Through this operation, the multiplication becomes an addition, which reduces the amount of computation required.
Specific example two is as shown in Fig. 1 O, and weight data is 32 floating datas, and sign bit 1, power position is 10000011, significance bit 10010010000000000000000, then its actual numerical value indicated is -1.5703125*24.Power Secondary neuron number is 1 according to sign bit, and power position data bit is 5, i.e. m is 5.Coding schedule is that power position data are 11111 When correspond to power neuron number according to for 0, when power position data are other numerical value power position data correspond to corresponding two into Complement code processed.Power neuron is 111100, then its actual numerical value indicated is -2-4.(the power position of weight is plus power nerve The power position result of member is 01111111, then the actual numerical value of result is 1.5703125*20, as neuron and weight multiply Product result.
Step S1-3: the first power conversion unit converts the neuron data obtained after the neural network operation into power neuron data.
Step S1-3 includes the following sub-steps:
S1-31: the output neuron cache unit receives the neuron data obtained after the neural network operation and sent by the operation unit;
S1-32: the first power conversion unit receives the neuron data sent by the output neuron cache unit and converts the non-power neuron data among them into power neuron data.
Several optional power conversion operations are available and can be selected according to the actual application requirements. Three of them are listed in this embodiment:
The first power conversion method:

s_out = s_in
d_out+ = ⌊log2(d_in+)⌋

where d_in is the input data of the power conversion unit, d_out is its output data, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data, d_in+ = d_in × s_in, d_out+ is the positive part of the output data, d_out+ = d_out × s_out, and ⌊x⌋ denotes the floor (round-down) operation on x.
The second power conversion method:

s_out = s_in
d_out+ = ⌈log2(d_in+)⌉

where d_in, d_out, s_in, s_out, d_in+ and d_out+ are defined as in the first method, and ⌈x⌉ denotes the ceiling (round-up) operation on x.
The third power conversion method:

s_out = s_in
d_out+ = [log2(d_in+)]

where d_in, d_out, s_in, s_out, d_in+ and d_out+ are defined as in the first method, and [x] denotes rounding x to the nearest integer.
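The following sketch implements the three conversion rules (floor, ceiling and round-to-nearest of log2 of the magnitude) on a plain Python float; encoding the resulting sign and exponent into the sign-plus-power-bits format is the same packing step shown earlier and is omitted here. It is an illustrative sketch, not the device's fixed behaviour.

```python
import math

def power_convert(d_in, mode="floor"):
    """Convert a non-power datum d_in to (sign, exponent) in power form.

    mode selects among the three conversion rules in the text:
    "floor" -> exponent = floor(log2(|d_in|))
    "ceil"  -> exponent = ceil(log2(|d_in|))
    "round" -> exponent = round(log2(|d_in|))
    Zero has no finite exponent; it maps to the zero-setting code (None here).
    """
    if d_in == 0:
        return 0, None                      # encoded as the all-ones power bits
    s_out = -1 if d_in < 0 else 1           # s_out = s_in
    d_in_pos = abs(d_in)                    # d_in+ = d_in * s_in
    log = math.log2(d_in_pos)
    if mode == "floor":
        exponent = math.floor(log)
    elif mode == "ceil":
        exponent = math.ceil(log)
    else:
        exponent = round(log)
    return s_out, exponent                  # value represented: s_out * 2**exponent

# 100.0 lies between 2^6 and 2^7, so the three rules give exponents 6, 7 and 7.
assert power_convert(100.0, "floor") == (1, 6)
assert power_convert(100.0, "ceil") == (1, 7)
assert power_convert(100.0, "round") == (1, 7)
```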
In addition, the power neuron data obtained by the power conversion unit can serve as the input power neurons for the next layer of the neural network operation, and steps 1 to 3 are repeated until the operation of the last layer of the neural network is finished. By changing the integer value x and the positive integer value y pre-stored in the memory module, the range of power neuron data that the neural network computing device can represent is adjustable.
In another embodiment, the disclosure further provides a method of using the neural network computing device, in which the integer value x and the positive integer value y pre-stored in the memory module are changed so as to adjust the range of power neuron data that the neural network computing device can represent.
In other embodiments of the disclosure, unlike the previous embodiments, the power conversion module of the computing device is connected to the computing module and is used to convert the input data and/or the output data of the neural network operation into power data.
Specifically, the input data include input neuron data and input weight data, the output data include output neuron data and output weight data, and the power data include power neuron data and power weight data.
In other words, on the basis of the previous embodiments, the power conversion module here can perform power conversion not only on neuron data but also on weight data. Furthermore, after the weight data in an operation result have been converted into power weight data, they can be sent directly to the data control unit to take part in subsequent operations. The remaining modules and units of the computing device, their composition, functions, usage and connection relationships are similar to those of the previous embodiments.
As shown in Fig. 2A, the neural network computing device of this embodiment includes a memory module 2-4, a control module 2-3, a computing module 2-1, an output module 2-5 and a power conversion module 2-2.
The memory module includes a storage unit 2-41 for storing data and instructions.
The control module includes:
a data control unit 2-31, connected to the storage unit, for the interaction of data and instructions between the storage unit and each cache unit;
an operation-instruction cache unit 2-32, connected to the data control unit, for receiving the instructions sent by the data control unit;
a decoding unit 2-33, connected to the instruction cache unit, for reading instructions from the instruction cache unit and decoding them into the individual operation instructions;
an input neuron cache unit 2-34, connected to the data control unit, for receiving the neuron data sent by the data control unit;
a weight cache unit 2-35, connected to the data control unit, for receiving the weight data sent from the data control unit.
The computing module includes an operation unit 2-11, connected to the control module, which receives the data and operation instructions sent by the control module and performs the neural network operation on the received neuron data and weight data according to the operation instructions.
The output module includes an output neuron cache unit 2-51, connected to the operation unit, for receiving the neuron data output by the operation unit and sending them to the data control unit, where they can serve as the input data for the next layer of the neural network operation.
The power conversion module may include:
a first power conversion unit 2-21, connected to the output neuron cache unit and the operation unit, for converting the neuron data output by the output neuron cache unit into power neuron data and converting the weight data output by the operation unit into power weight data; and/or
a second power conversion unit 2-22, connected to the memory module, for converting the neuron data and weight data input to the memory module into power neuron data and power weight data, respectively.
Optionally, the computing device further includes a third power conversion unit 2-23, connected to the operation unit, for converting power neuron data and power weight data into non-power neuron data and non-power weight data, respectively.
It should be noted that the description here uses, merely as an example, a power conversion module that simultaneously contains the first power conversion unit, the second power conversion unit and the third power conversion unit; in fact, the power conversion module may include any of the first, second and third power conversion units, as in the embodiments shown in the aforementioned Figs. 1B, 1F and 1G.
Non-power neuron data and weight data are converted by the second power conversion unit into power neuron data and power weight data before being input to the operation unit for execution. During the operation, to improve precision, a third power conversion unit may be provided to convert the power neuron data and power weight data back into non-power neuron data and non-power weight data; the third power conversion unit may be located outside the computing module or inside it. The non-power neuron data output after the operation are converted into power neuron data by the first power conversion unit and then fed back to the data control unit to take part in subsequent operations, thereby forming a closed loop that accelerates the operation.
In addition, the concrete method of the power conversion of the weight data is the same as in the previous embodiments and is not repeated here.
In some embodiments the neural network is a multilayer neural network, and each layer can be operated on by the method shown in Fig. 2B. The input power weight data of the first layer of the neural network can be read in by the storage unit from an external address; if the weight data read from the external address are already power weight data they are passed directly into the storage unit, otherwise they are first converted into power weight data by the power conversion unit. Referring to Fig. 2B, the single-layer neural network operation method of this embodiment includes:
Step S2-1: obtain the instructions, neuron data and power weight data.
Step S2-1 includes the following sub-steps:
S2-11: input the instructions, neuron data and weight data into the storage unit; power weight data are input to the storage unit directly, while non-power weight data are input to the storage unit after conversion by the power conversion unit;
S2-12: the data control unit receives the instructions, neuron data and power weight data sent by the storage unit;
S2-13: the instruction cache unit, the input neuron cache unit and the weight cache unit respectively receive the instructions, neuron data and power weight data sent by the data control unit and distribute them to the decoding unit or the operation unit.
Power weight data represent the value of a weight in the form of its power exponent. Specifically, a power weight datum includes a sign bit and power bits: the sign bit represents the sign of the weight datum with one or more bits, and the power bits represent the power-bit data of the weight datum with m bits, m being a positive integer greater than 1. The storage unit pre-stores a coding table that provides the exponent value corresponding to each power-bit pattern of the power weight data. The coding table sets aside one or more power-bit patterns (the zero-setting power-bit data) to specify that the corresponding power weight datum is 0; that is, when the power bits of a power weight datum equal the zero-setting power-bit data in the coding table, the power weight datum is 0. The correspondence relationships of the coding table are similar to those of the previous embodiments and are not repeated here.
In a specific example shown in Fig. 2C, the sign bit occupies 1 bit and the power-bit field occupies 7 bits, i.e. m is 7. The coding table specifies that when the power bits are 11111111 the corresponding power weight datum is 0, and that any other power-bit value corresponds to its two's-complement value. Thus, a power weight datum with sign bit 0 and power bits 0001001 represents 2^9, i.e. 512; a power weight datum with sign bit 1 and power bits 1111101 represents -2^-3, i.e. -0.125. Compared with floating-point data, power data retain only the power bits, which significantly reduces the storage space required.
Representing data in power form therefore reduces the storage space required for weight data. In the example provided in this embodiment the power data are 8-bit data; it should be understood that this data length is not fixed, and different data lengths may be used on different occasions according to the numerical range of the weight data.
Step S2-2: perform the neural network operation on the neuron data and the power weight data according to the operation instructions. Step S2-2 includes the following sub-steps:
S2-21: the decoding unit reads instructions from the instruction cache unit and decodes them into the individual operation instructions;
S2-22: the operation unit receives, respectively, the operation instructions, power weight data and neuron data sent by the decoding unit, the input neuron cache unit and the weight cache unit, and performs the neural network operation on the neuron data and the power weight data according to the operation instructions.
The multiplication of a neuron by a power weight proceeds as follows. The sign bit of the neuron datum and the sign bit of the power weight datum are XORed. If the correspondence recorded in the coding table is out of order, the coding table is searched to find the exponent value corresponding to the power bits of the power weight datum; if the correspondence is positively correlated, the minimum exponent value of the coding table is recorded and an addition is performed to obtain the exponent value corresponding to the power bits of the power weight datum; if the correspondence is negatively correlated, the maximum value of the coding table is recorded and a subtraction is performed to obtain that exponent value. The exponent value is then added to the exponent bits of the neuron datum, while the significand bits of the neuron datum remain unchanged.
In a specific example shown in Fig. 2D, the neuron datum is a 16-bit floating-point number with sign bit 0, exponent bits 10101 and significand 0110100000, so the actual value it represents is 1.40625×2^6. The power weight datum has a 1-bit sign and 5 power bits, i.e. m is 5. The coding table specifies that when the power bits are 11111 the corresponding power weight datum is 0, and that any other power-bit value corresponds to its two's-complement value. The power weight is 000110, so the actual value it represents is 64, i.e. 2^6. Adding the power weight's power bits to the neuron's exponent bits gives 11011, so the actual value of the result is 1.40625×2^12, which is exactly the product of the neuron and the power weight. This operation turns the multiplication into an addition, reducing the amount of computation required.
In a second specific example, shown in Fig. 2E, the neuron datum is a 32-bit floating-point number with sign bit 1, exponent bits 10000011 and significand 10010010000000000000000, so the actual value it represents is -1.5703125×2^4. The power weight datum has a 1-bit sign and 5 power bits, i.e. m is 5. The coding table specifies that when the power bits are 11111 the corresponding power weight datum is 0, and that any other power-bit value corresponds to its two's-complement value. The power weight is 111100, so the actual value it represents is -2^-4. Adding the power weight's power bits to the neuron's exponent bits gives 01111111, so the actual value of the result is 1.5703125×2^0, which is exactly the product of the neuron and the power weight.
Optionally, the method further includes a step S2-3 of outputting the neuron data obtained after the neural network operation as the input data for the next layer of the neural network operation.
Step S2-3 may include the following sub-steps:
S2-31: the output neuron cache unit receives the neuron data obtained after the neural network operation and sent by the operation unit;
S2-32: the neuron data received by the output neuron cache unit are transferred to the data control unit; the neuron data obtained through the output neuron cache unit can serve as the input neurons of the next layer of the neural network operation, and steps S2-1 to S2-3 are repeated until the operation of the last layer of the neural network is finished.
In addition, the power neuron data obtained by the power conversion unit can serve as the input power neurons for the next layer of the neural network operation, and steps S2-1 to S2-3 are repeated until the operation of the last layer of the neural network is finished. By changing the integer value x and the positive integer value y pre-stored in the storage unit, the range of power neuron data that the neural network computing device can represent is adjustable.
In some embodiments the neural network is a multilayer neural network, and each layer can be operated on by the method shown in Fig. 2F. The input power weight data of the first layer of the neural network can be read in by the storage unit from an external address; if the data read from the external address are already power weight data they are passed directly into the storage unit, otherwise they are first converted into power weight data by the power conversion unit. Likewise, the input power neuron data of the first layer can be read in by the storage unit from an external address; if the data read from the external address are already power data they are passed directly into the storage unit, otherwise they are first converted into power neuron data by the power conversion unit. Thereafter, the input neuron data of each layer of the neural network can be provided by the output power neuron data of one or more preceding layers. Referring to Fig. 2F, the single-layer neural network operation method of this embodiment includes:
Step S2-4: obtain the instructions, power neuron data and power weight data.
Step S2-4 includes the following sub-steps:
S2-41: input the instructions, neuron data and weight data into the storage unit; power neuron data and power weight data are input to the storage unit directly, while non-power neuron data and non-power weight data are input to the storage unit after being converted into power neuron data and power weight data by the first power conversion unit;
S2-42: the data control unit receives the instructions, power neuron data and power weight data sent by the storage unit;
S2-43: the instruction cache unit, the input neuron cache unit and the weight cache unit respectively receive the instructions, power neuron data and power weight data sent by the data control unit and distribute them to the decoding unit or the operation unit.
Power neuron data and power weight data represent the values of neuron data and weight data in the form of their power exponents. Specifically, both include a sign bit and power bits: the sign bit represents the sign of the neuron datum or weight datum with one or more bits, and the power bits represent its power-bit data with m bits, m being a positive integer greater than 1. The storage unit pre-stores a coding table that provides the exponent value corresponding to each power-bit pattern of the power neuron data and power weight data. The coding table sets aside one or more power-bit patterns (the zero-setting power-bit data) to specify that the corresponding power neuron datum or power weight datum is 0; that is, when the power bits of a power neuron datum or power weight datum equal the zero-setting power-bit data in the coding table, the datum is 0.
In a specific example shown in Fig. 2G, the sign bit occupies 1 bit and the power-bit field occupies 7 bits, i.e. m is 7. The coding table specifies that when the power bits are 11111111 the corresponding power neuron datum or power weight datum is 0, and that any other power-bit value corresponds to its two's-complement value. Thus, with sign bit 0 and power bits 0001001 the datum represents 2^9, i.e. 512; with sign bit 1 and power bits 1111101 it represents -2^-3, i.e. -0.125. Compared with floating-point data, power data retain only the power bits, which significantly reduces the storage space required.
Representing data in power form therefore reduces the storage space required for neuron data and weight data. In the example provided in this embodiment the power data are 8-bit data; it should be understood that this data length is not fixed, and different data lengths may be used on different occasions according to the numerical range of the neuron data and weight data.
Step S2-5: perform the neural network operation on the power neuron data and the power weight data according to the operation instructions. Step S2-5 includes the following sub-steps:
S2-51: the decoding unit reads instructions from the instruction cache unit and decodes them into the individual operation instructions;
S2-52: the operation unit receives, respectively, the operation instructions, power neuron data and power weight data sent by the decoding unit, the input neuron cache unit and the weight cache unit, and performs the neural network operation on the power neuron data and the power weight data according to the operation instructions.
The multiplication of a power neuron by a power weight proceeds as follows. The sign bit of the power neuron datum and the sign bit of the power weight datum are XORed. If the correspondence recorded in the coding table is out of order, the coding table is searched to find the exponent values corresponding to the power bits of the power neuron datum and of the power weight datum; if the correspondence is positively correlated, the minimum exponent value of the coding table is recorded and additions are performed to obtain the exponent values corresponding to the power bits of the power neuron datum and of the power weight datum; if the correspondence is negatively correlated, the maximum value of the coding table is recorded and subtractions are performed to obtain those exponent values. The exponent value corresponding to the power neuron datum and the exponent value corresponding to the power weight datum are then added.
In a specific example shown in Fig. 2H, the power neuron data and the power weight data each have a 1-bit sign and 4 power bits, i.e. m is 4. The coding table specifies that when the power bits are 1111 the corresponding power weight datum is 0, and that any other power-bit value corresponds to its two's-complement value. The power neuron is 00010, so the actual value it represents is 2^2. The power weight is 00110, so the actual value it represents is 64, i.e. 2^6. The product of the power neuron datum and the power weight datum is 01000, and the actual value it represents is 2^8.
It can be seen that the multiplication of a power neuron datum by a power weight is simpler and more convenient than the multiplication of floating-point data by floating-point data, or of floating-point data by power data.
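The following sketch carries out the power-times-power multiplication just described (sign XOR plus exponent addition) for the 1-sign-bit, 4-power-bit format of Fig. 2H; the all-ones zero-setting code and the field widths are assumptions taken from that example.

```python
def power_times_power(a_bits, b_bits, m=4):
    """Multiply two power-form data (1 sign bit + m two's-complement power bits).

    Signs are XORed and exponents are added; the all-ones power field is the
    zero-setting code, so a zero operand yields a zero result.
    (A sketch under the assumptions stated in the lead-in.)"""
    zero_code = (1 << m) - 1

    def split(bits):
        return (bits >> m) & 0x1, bits & ((1 << m) - 1)

    a_sign, a_pow = split(a_bits)
    b_sign, b_pow = split(b_bits)
    if a_pow == zero_code or b_pow == zero_code:
        return (0 << m) | zero_code                  # product is 0

    def to_signed(p):
        return p - (1 << m) if p >= (1 << (m - 1)) else p

    out_sign = a_sign ^ b_sign
    out_pow = (to_signed(a_pow) + to_signed(b_pow)) & ((1 << m) - 1)
    return (out_sign << m) | out_pow

# Example of Fig. 2H: 00010 (2^2) times 00110 (2^6) gives 01000 (2^8).
assert power_times_power(0b0_0010, 0b0_0110) == 0b0_1000
```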
The method of this embodiment may further include a step S2-6 of outputting the neuron data obtained after the neural network operation as the input data for the next layer of the neural network operation.
Step S2-6 includes the following sub-steps:
S2-61: the output neuron cache unit receives the neuron data obtained after the neural network operation and sent by the operation unit;
S2-62: the neuron data received by the output neuron cache unit are transferred to the data control unit; the neuron data obtained through the output neuron cache unit can serve as the input neurons of the next layer of the neural network operation, and steps S2-4 to S2-6 are repeated until the operation of the last layer of the neural network is finished.
Since the neuron data obtained after the neural network operation are also power data, the bandwidth required to transmit them to the data control unit is greatly reduced compared with the bandwidth required by floating-point data, which further lowers the overhead of neural network storage and computing resources and increases the operation speed of the neural network.
In addition, the concrete method of the power conversion is the same as in the previous embodiments and is not repeated here.
All the units of the disclosed embodiments may be hardware structures; physical implementations of a hardware structure include, but are not limited to, physical devices, and physical devices include, but are not limited to, transistors, memristors and DNA computers.
One embodiment of the disclosure provides a computing device, including:
an operation control module 3-2, for determining blocking information; and
a computing module 3-3, for performing blocking, transposition and merging operations on an operation matrix according to the blocking information so as to obtain the transposed matrix of the operation matrix.
Specifically, the blocking information may include at least one of block size information, blocking mode information and block merging information. The block size information indicates the size of each block matrix obtained after the operation matrix is partitioned. The blocking mode information indicates the manner in which the operation matrix is partitioned. The block merging information indicates the manner in which, after each block matrix has been transposed, the transposed blocks are re-merged to obtain the transposed matrix of the operation matrix.
Since the computing device of the disclosure can partition the operation matrix into blocks, obtain the transposed matrices of the multiple block matrices by transposing them separately, and finally merge the transposed matrices of the block matrices to obtain the transposed matrix of the operation matrix, the transposition of a matrix of arbitrary size can be completed with a single instruction within constant time complexity. Compared with conventional implementations of the matrix transposition operation, the disclosure makes the use of the matrix transposition operation simpler and more efficient while reducing its time complexity.
As shown in Figs. 3A-3B, in some embodiments of the disclosure the computing device further includes:
an address memory module 3-1, for storing the address information of the operation matrix; and
a data memory module 3-4, for storing the original matrix data, including the operation matrix, and for storing the transposed matrix after the operation;
wherein the operation control module is used to extract the address information of the operation matrix from the address memory module and analyze it to obtain the blocking information; the computing module is used to obtain the address information and blocking information of the operation matrix from the operation control module, extract the operation matrix from the data memory module according to its address information, perform blocking, transposition and merging operations on the operation matrix according to the blocking information to obtain the transposed matrix of the operation matrix, and feed the transposed matrix of the operation matrix back to the data memory module.
As shown in Fig. 3C, in some embodiments of the disclosure the computing module includes a matrix partitioning unit, a matrix operation unit and a matrix merging unit, in which:
the matrix partitioning unit 3-31 obtains the address information and blocking information of the operation matrix from the operation control module, extracts the operation matrix from the data memory module according to its address information, and partitions the operation matrix according to the blocking information to obtain n block matrices;
the matrix operation unit 3-32 obtains the n block matrices and transposes each of them, obtaining the transposed matrices of the n block matrices;
the matrix merging unit 3-33 obtains and merges the transposed matrices of the n block matrices to obtain the transposed matrix of the operation matrix, where n is a natural number.
For example, as shown in Fig. 3D, for an operation matrix X stored in the data memory module, the matrix partitioning unit of the computing module extracts the operation matrix X from the data memory module and, according to the blocking information, partitions it into 4 block matrices X1, X2, X3 and X4, which it outputs to the matrix operation unit; the matrix operation unit obtains these 4 block matrices from the matrix partitioning unit, transposes each of them to obtain the 4 transposed block matrices X1^T, X2^T, X3^T and X4^T, and outputs them to the matrix merging unit; the matrix merging unit obtains the transposed matrices of these 4 block matrices from the matrix operation unit and merges them to obtain the transposed matrix X^T of the operation matrix, which may further be output to the data memory module.
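A minimal host-side sketch of this partition-transpose-merge flow, using NumPy and a simple split into four row blocks; the block count and splitting rule are assumptions chosen for illustration and are not the device's actual blocking policy, which is decided from the blocking information.

```python
import numpy as np

def blocked_transpose(x, n_blocks=4):
    """Transpose matrix x by splitting it into n_blocks row blocks, transposing
    each block, and concatenating the transposed blocks along the columns.
    (An illustrative sketch; the device's real blocking policy is produced by
    its matrix judging unit and is not modelled here.)"""
    blocks = np.array_split(x, n_blocks, axis=0)          # X1 ... Xn (row blocks)
    transposed_blocks = [b.T for b in blocks]             # X1^T ... Xn^T
    return np.concatenate(transposed_blocks, axis=1)      # merge into X^T

x = np.arange(24).reshape(6, 4)
assert np.array_equal(blocked_transpose(x), x.T)
```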
In some embodiments of the disclosure, the computing module further includes a cache unit 3-34 for caching the n block matrices for the matrix operation unit to obtain.
In some embodiments of the disclosure, the matrix merging unit may further include a memory for temporarily storing the transposed block matrices it has obtained; after the matrix operation unit has completed the operation on all block matrices, the matrix merging unit can obtain the transposed matrices of all block matrices, perform the merging operation on the transposed matrices of the n block matrices to obtain the transposed matrix, and write the output result back into the data memory module.
Those skilled in the art should understand that the matrix partitioning unit, the matrix operation unit and the matrix merging unit described above may be implemented either in the form of hardware or in the form of software program modules. The matrix partitioning unit and the matrix merging unit may include one or more control elements, and the matrix operation unit may include one or more control elements and computing elements.
As shown in Fig. 3E, in some embodiments of the disclosure the operation control module includes an instruction processing unit 3-22, an instruction cache unit 3-21 and a matrix judging unit 3-23, in which:
the instruction cache unit is used to store matrix operation instructions to be executed;
the instruction processing unit is used to obtain a matrix operation instruction from the instruction cache unit, decode it, and extract the address information of the operation matrix from the address memory module according to the decoded matrix operation instruction;
the matrix judging unit is used to judge, according to the address information of the operation matrix, whether blocking is needed, and to obtain the blocking information according to the judgment result.
In some embodiments of the disclosure, the operation control module further includes a dependency processing unit 3-24 for judging whether the decoded matrix operation instruction and the address information of the operation matrix conflict with a previous operation; if a conflict exists, the decoded matrix operation instruction and the address information of the operation matrix are temporarily held; if no conflict exists, the decoded matrix operation instruction and the address information of the operation matrix are issued to the matrix judging unit.
In some embodiments of the disclosure, the operation control module further includes an instruction queue memory 3-25 for caching decoded matrix operation instructions and operation-matrix address information for which a conflict exists; after the conflict is resolved, the cached decoded matrix operation instruction and operation-matrix address information are issued to the matrix judging unit.
Specifically, when a matrix operation instruction accesses the data memory module, consecutive instructions may access the same storage space. To guarantee the correctness of the instruction execution result, if the current instruction is detected to have a data dependency on a previous instruction, it must wait in the instruction queue memory until the dependency is eliminated.
In some embodiments of the disclosure, the instruction processing unit includes an instruction fetch unit 3-221 and a decoding unit 3-222, in which:
the instruction fetch unit is used to obtain a matrix operation instruction from the instruction cache unit and transmit it to the decoding unit;
the decoding unit is used to decode the matrix operation instruction, extract the address information of the operation matrix from the address memory module according to the decoded matrix operation instruction, and transmit the decoded matrix operation instruction and the extracted address information of the operation matrix to the dependency processing unit.
In some embodiments of the disclosure, the computing device further includes an input/output module for inputting the operation matrix to the data memory module, and also for obtaining the transposed matrix after the operation from the data memory module and outputting it.
In some embodiments of the disclosure, the address information of the operation matrix is the start address of the matrix and the matrix size information.
In some embodiments of the disclosure, the address information of the operation matrix is the storage address of the matrix in the data memory module.
In some embodiments of the disclosure, the address memory module is a scalar register file or a general memory unit, and the data memory module is a scratchpad memory or a general memory unit.
In some embodiments of the disclosure, the address memory module may be a scalar register file that provides the scalar registers needed during the operation; the scalar registers store not only matrix addresses but also scalar data. When a large-scale matrix is transposed after having been partitioned, the scalar data in the scalar registers can be used to record the number of matrix blocks.
In some embodiments of the disclosure, the data memory module may be a scratchpad memory capable of supporting matrix data of different sizes.
In some embodiments of the disclosure, the matrix judging unit is used to judge the matrix size; if it exceeds the prescribed maximum scale M, the matrix needs to be partitioned, and the matrix judging unit analyzes this judgment result to obtain the blocking information.
In some embodiments of the disclosure, the instruction cache unit is used to store matrix operation instructions to be executed. During execution, an instruction is also buffered in the instruction cache unit; after an instruction has finished executing, if it is also the earliest uncommitted instruction in the instruction cache unit, it will be committed; once committed, the changes that the instruction makes to the device state cannot be undone. The instruction cache unit may be a reorder buffer.
In some embodiments of the disclosure, the matrix operation instruction is a matrix transposition operation instruction, which includes an opcode and an operand field. The opcode indicates the function of the matrix transposition operation instruction, and the matrix operation control module confirms that a matrix transposition operation is to be performed by recognizing the opcode. The operand field indicates the data information of the matrix transposition operation instruction, where the data information may be an immediate value or a register number; for example, to obtain a matrix, the matrix start address and matrix size can be obtained from the corresponding register according to the register number, and then the matrix stored at that address is obtained from the data memory module according to the matrix start address and matrix size.
The disclosure uses a new operation structure to implement the transposition of a matrix simply and efficiently, reducing the time complexity of this operation.
The disclosure also discloses an operation method, including the following steps:
Step 1: the operation control module extracts the address information of the operation matrix from the address memory module;
Step 2: the operation control module obtains the blocking information according to the address information of the operation matrix, and transmits the address information and the blocking information of the operation matrix to the computing module;
Step 3: the computing module extracts the operation matrix from the data memory module according to its address information, and divides it into n block matrices according to the blocking information;
Step 4: the computing module transposes the n block matrices separately, obtaining the transposed matrices of the n block matrices;
Step 5: the computing module merges the transposed matrices of the n block matrices to obtain the transposed matrix of the operation matrix, and feeds it back to the data memory module;
where n is a natural number.
The computing device and method proposed by the disclosure are described in detail below through specific embodiments.
In some embodiments, as shown in Fig. 3F, this embodiment proposes a computing device including an address memory module, an operation control module, a computing module, a data memory module and an input/output module 3-5, wherein:
optionally, the operation control module includes an instruction cache unit, an instruction processing unit, a dependency processing unit, an instruction queue memory and a matrix judging unit, and the instruction processing unit in turn includes an instruction fetch unit and a decoding unit;
optionally, the computing module includes a matrix partitioning unit, a matrix cache unit, a matrix operation unit and a matrix merging unit;
optionally, the address memory module is a scalar register file;
optionally, the data memory module is a scratchpad memory, and the input/output module is an IO direct memory access module.
Each component of the computing device is described in detail below:
The instruction fetch unit is responsible for taking the next operation instruction to be executed out of the instruction cache unit and transmitting it to the decoding unit.
The decoding unit is responsible for decoding the operation instruction, sending the decoded operation instruction to the scalar register file, obtaining the address information of the operation matrix fed back by the scalar register file, and transmitting the decoded operation instruction and the obtained address information of the operation matrix to the dependency processing unit.
The dependency processing unit handles storage dependencies that may exist between an operation instruction and the preceding instruction. A matrix operation instruction accesses the scratchpad memory, and consecutive instructions may access the same storage space. To guarantee the correctness of the instruction execution result, if the current operation instruction is detected to have a data dependency on a preceding operation instruction, it must be cached in the instruction queue memory until the dependency is eliminated; if no dependency exists between the current operation instruction and the preceding operation instructions, the dependency processing unit transmits the address information of the operation matrix and the decoded operation instruction directly to the matrix judging unit.
The instruction queue memory, considering that dependencies may exist on the scalar registers corresponding to or specified by different operation instructions, is used to cache decoded operation instructions for which a conflict exists together with the address information of the corresponding operation matrix; when the dependency is satisfied, it transmits the decoded operation instruction and the address information of the corresponding operation matrix to the matrix judging unit.
The matrix judging unit judges the matrix size according to the address information of the operation matrix; if it exceeds the prescribed maximum scale M, the matrix needs to be partitioned, the matrix judging unit analyzes this judgment result to obtain the blocking information, and it transmits the address information of the operation matrix together with the obtained blocking information to the matrix partitioning unit.
The matrix partitioning unit is responsible for extracting from the scratchpad memory, according to the address information of the operation matrix, the operation matrix to be transposed, and for partitioning the operation matrix into n block matrices according to the blocking information. The matrix cache unit is used to cache the n block matrices after partitioning and to transmit them in turn to the matrix operation unit for transposition.
The matrix operation unit is responsible for extracting the block matrices from the matrix cache unit in turn, transposing them, and transmitting the transposed block matrices to the matrix merging unit.
The matrix merging unit is responsible for receiving and temporarily caching the transposed block matrices; after all block matrices have been transposed, it performs the merging operation on the transposed matrices of the n block matrices to obtain the transposed matrix of the operation matrix.
The scalar register file provides the scalar registers that the device needs during the operation and provides the address information of the operation matrix for the operation.
The scratchpad memory is a temporary storage device dedicated to matrix data and can support matrix data of different sizes.
The IO memory access module is responsible for directly accessing the scratchpad memory to read data from it or write data to it.
In some embodiments, as shown in Fig. 3G, this embodiment proposes an operation method for performing the transposition of a large-scale matrix, which specifically includes the following steps:
Step 1: the operation control module extracts the address information of the operation matrix from the address memory module; this specifically includes the following steps:
Step 1-1: the instruction fetch unit extracts the operation instruction and sends it to the decoding unit;
Step 1-2: the decoding unit decodes the operation instruction, obtains the address information of the operation matrix from the address memory module according to the decoded operation instruction, and sends the decoded operation instruction and the address information of the operation matrix to the dependency processing unit;
Step 1-3: the dependency processing unit analyzes whether the decoded operation instruction has a data dependency on preceding instructions whose execution has not yet finished. Specifically, the dependency processing unit can judge, according to the address of the register that the operation instruction needs to read, whether that register has a pending write; if so, a dependency exists, and the operation instruction can only be executed after the data have been written back.
If a dependency exists, the decoded operation instruction, together with the address information of the corresponding operation matrix, must wait in the instruction queue memory until it no longer has a data dependency on the preceding instructions whose execution has not finished.
Step 2: the operation control module obtains the blocking information according to the address information of the operation matrix.
Specifically, once no dependency exists, the instruction queue memory issues the decoded operation instruction and the address information of the corresponding operation matrix to the matrix judging unit, which judges whether the matrix needs to be partitioned; the matrix judging unit obtains the blocking information according to the judgment result and transmits the blocking information and the address information of the operation matrix to the matrix partitioning unit.
Step 3: the computing module extracts the operation matrix from the data memory module according to its address information and divides it into n block matrices according to the blocking information.
Specifically, the matrix partitioning unit takes the needed operation matrix out of the data memory module according to the incoming address information of the operation matrix, divides the operation matrix into n block matrices according to the incoming blocking information, and, after the partitioning is completed, passes each block matrix in turn to the matrix cache unit.
Step 4: the computing module transposes the n block matrices separately, obtaining the transposed matrices of the n block matrices.
Specifically, the matrix operation unit extracts the block matrices from the matrix cache unit in turn, transposes each extracted block matrix, and passes the resulting transposed block matrices to the matrix merging unit.
Step 5: the computing module merges the transposed matrices of the n block matrices to obtain the transposed matrix of the operation matrix and feeds it back to the data memory module; this specifically includes the following steps:
Step 5-1: the matrix merging unit receives the transposed matrix of each block matrix; when the number of received transposed block matrices reaches the total number of blocks, it performs the matrix merging operation on all blocks to obtain the transposed matrix of the operation matrix, and feeds the transposed matrix back to the specified address in the data memory module;
Step 5-2: the input/output module directly accesses the data memory module and reads from it the transposed matrix of the operation matrix obtained by the operation.
The vectors mentioned in the disclosure may be 0-dimensional vectors, 1-dimensional vectors, 2-dimensional vectors or multi-dimensional vectors, where a 0-dimensional vector may also be called a scalar and a 2-dimensional vector may also be called a matrix.
One embodiment of the disclosure proposes a data screening device, referring to Fig. 4A, including: a storage unit 4-3 for storing data and instructions, where the data include data to be screened and position information data;
a register unit 4-2 for storing the data addresses in the storage unit;
a data screening module 4-1 including a data screening unit 4-11; the data screening module obtains the data addresses from the register unit according to the instruction, obtains the corresponding data from the storage unit according to the data addresses, and performs the screening operation on the obtained data to obtain the data screening result.
A functional schematic of the data screening unit is shown in Fig. 4B: the input data are the data to be screened and the position information data, and the output data may contain only the screened data or may also contain related information about the screened data, such as the vector length, the array size or the space occupied.
Further, referring to Fig. 4C, the data screening device of this embodiment specifically includes:
the storage unit 4-3, for storing the data to be screened, the position information data and the instructions;
the register unit 4-2, for storing the data addresses in the storage unit;
the data screening module 4-1, which includes:
an instruction cache unit 4-12, for storing instructions;
a control unit 4-13, which reads instructions from the instruction cache unit and decodes them into concrete operation micro-instructions;
an I/O unit 4-16, which transfers the instructions in the storage unit to the instruction cache unit, transfers the data in the storage unit to the input data cache unit and the output data cache unit, and can also transfer the output data in the output data cache unit to the storage unit;
an input data cache unit 4-14, which stores the data transferred by the I/O unit, including the data to be screened and the position information data;
the data screening unit 4-11, which receives the micro-instructions transmitted by the control unit, obtains the data addresses from the register unit, takes the data to be screened and the position information data transmitted by the input data cache unit as input data, performs the screening operation on the input data, and after completion transmits the screened data to the output data cache unit;
an output data cache unit 4-15, which stores the output data; the output data may contain only the screened data or may also contain related information about the screened data, such as the vector length, the array size or the space occupied.
The data screening device of this embodiment is suitable for a variety of screening objects. The data to be screened may be vectors, high-dimensional arrays, or the like; the position information data may be binary codes, or vectors, high-dimensional arrays, or the like, each component of which is 0 or 1. The components of the data to be screened and the components of the position information data may be in one-to-one correspondence. Those skilled in the art should understand that using 1 or 0 for each component of the position information data is one exemplary representation of position information, and the representation of position information is not limited to this one.
Optionally, when each component of the position information data is represented by 0 or 1, the screening operation performed by the data screening unit on the input data is specifically as follows: the data screening unit scans each component of the position information data; if a component is 0, the corresponding component of the data to be screened is deleted, and if it is 1, the corresponding component of the data to be screened is retained; alternatively, if a component of the position information data is 1, the corresponding component of the data to be screened is deleted, and if it is 0, the corresponding component is retained. When the scan is finished, the screening is complete and the screened data are obtained and output. In addition, while the screening operation is performed, related information about the screened data, such as the vector length, the array size or the space occupied, can also be recorded; whether this related information is recorded and output synchronously is decided according to the specific situation. It should be noted that when the components of the position information data are represented in other ways, the data screening unit can also be configured with a screening operation corresponding to that representation.
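A minimal host-side sketch of the screening rule just described (keep a component when its position flag is 1, drop it when the flag is 0), which also records the length of the screened result as the optional related information; the data used match Example one below. The device performs this scan in hardware, so this is only an illustration.

```python
def screen(data, position_info, keep_flag=1):
    """Keep data[i] when position_info[i] equals keep_flag, drop it otherwise.

    Returns the screened data together with its length, mirroring the optional
    related information (vector length) mentioned in the text.
    (An illustrative sketch of the scan performed by the data screening unit.)"""
    screened = [d for d, p in zip(data, position_info) if p == keep_flag]
    return screened, len(screened)

# Data of Example one: keep the components smaller than 100.
data_to_screen = [1, 0, 101, 34, 243]
position_vector = [1, 1, 0, 1, 0]
assert screen(data_to_screen, position_vector) == ([1, 0, 34], 3)
```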
The data screening process is illustrated below by examples.
Example one:
Suppose the data to be screened are the vector (1 0 101 34 243) and the components smaller than 100 are to be selected. The input position information data are then also a vector, namely the vector (1 1 0 1 0). The screened data can still keep the vector structure, and the vector length of the screened data can be output at the same time.
The position information vector may be input from outside or generated internally. Optionally, the device of the disclosure may also include a position information generation module, which can be used to generate the position information vector and which is connected to the data screening unit. Specifically, the position information generation module can generate the position information vector by a vector operation; the vector operation may be a vector comparison, i.e. comparing the components of the vector to be screened one by one with a preset value. It should be noted that the position information generation module may also select other vector operations to generate the position information vector according to preset conditions. In this example it is specified that a component of the position information data equal to 1 retains the corresponding component of the data to be screened, and a component equal to 0 deletes the corresponding component of the data to be screened.
The data screening unit initializes a variable length = 0 to record the vector length of the screened data.
The data screening unit reads the data in the input data cache unit and scans the 1st component of the position information vector; finding its value to be 1, it retains the value 1 of the 1st component of the vector to be screened, and length = length + 1.
It scans the 2nd component of the position information vector; finding its value to be 1, it retains the value 0 of the 2nd component of the vector to be screened, and length = length + 1.
It scans the 3rd component of the position information vector; finding its value to be 0, it deletes the value 101 of the 3rd component of the vector to be screened, and length remains unchanged.
It scans the 4th component of the position information vector; finding its value to be 1, it retains the value 34 of the 4th component of the vector to be screened, and length = length + 1.
It scans the 5th component of the position information vector; finding its value to be 0, it deletes the value 243 of the 5th component of the vector to be screened, and length remains unchanged.
The retained values form the screened data, the vector (1 0 34), with vector length length = 3, and these are stored by the output data cache unit.
In the data screening device of this embodiment, the data screening module may also include a structure deformation unit 4-17, which can deform the storage structure of the input data in the input data cache unit and of the output data in the output data cache unit, for example expanding a high-dimensional array into a vector or turning a vector into a high-dimensional array. Optionally, the high-dimensional data may be expanded in row-major order or in column-major order, and other expansion orders may be chosen according to the specific situation.
Example two:
Suppose the data to be screened are the 2×2 array [[1, 4], [61, 22]] and the even values are to be selected; the input position information array is then [[0, 1], [0, 1]]. The screened data take a vector structure and no related information is output. In this example it is specified that a component of the position information data equal to 1 retains the corresponding component of the data to be screened, and a component equal to 0 deletes the corresponding component of the data to be screened.
The data screening unit reads the data in the input data cache unit and scans the (1,1) component of the position information array; finding its value to be 0, it deletes the value 1 of the (1,1) component of the array to be screened.
It scans the (1,2) component of the position information array; finding its value to be 1, it retains the value 4 of the (1,2) component of the array to be screened.
It scans the (2,1) component of the position information array; finding its value to be 0, it deletes the value 61 of the (2,1) component of the array to be screened.
It scans the (2,2) component of the position information array; finding its value to be 1, it retains the value 22 of the (2,2) component of the array to be screened.
The structure deformation unit turns the retained values into a vector, i.e. the screened data are the vector (4 22), which is stored by the output data cache unit.
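The following sketch reproduces Example two with the structure-deformation step made explicit: the 2×2 arrays are flattened in row-major order (one of the expansion orders mentioned above), screened component by component, and the result is kept as a vector. The flattening order is an assumption chosen for illustration.

```python
def screen_array_to_vector(array_2d, position_2d, keep_flag=1):
    """Screen a 2-D array against a 2-D position array and return a vector.

    Both arrays are expanded in row-major order (the structure deformation
    step), then components whose position flag equals keep_flag are retained.
    (An illustrative sketch of the data path of Example two.)"""
    flat_data = [v for row in array_2d for v in row]        # row-major expansion
    flat_pos = [p for row in position_2d for p in row]
    return [v for v, p in zip(flat_data, flat_pos) if p == keep_flag]

# Example two: keep the even values 4 and 22.
assert screen_array_to_vector([[1, 4], [61, 22]], [[0, 1], [0, 1]]) == [4, 22]
```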
In some embodiments, as shown in Fig. 4D, the data screening module may further include a computing unit 4-18. The device of the disclosure can thereby perform data screening and processing at the same time, yielding a combined data screening and processing device. The specific structure of the computing unit is the same as in the previous embodiments and is not repeated here.
The disclosure provides a method of performing data screening with the data screening device, including:
the data screening module obtains the data addresses from the register unit;
obtains the corresponding data from the storage unit according to the data addresses; and
performs the screening operation on the obtained data to obtain the data screening result.
In some embodiments, the step in which the data screening module obtains the data address from the register unit includes: the data screening unit obtains the address of the data to be screened and the address of the position information data from the register unit.
In some embodiments, the step of obtaining the corresponding data from the storage unit according to the data address includes the following sub-steps:
the input/output unit passes the data to be screened and the position information data from the storage unit to the input data cache unit; and
the input data cache unit passes the data to be screened and the position information data to the data screening unit.
Optionally, between the sub-step in which the input/output unit passes the data to be screened and the position information data from the storage unit to the input data cache unit and the sub-step in which the input data cache unit passes the data to be screened and the position information data to the data screening unit, the method further includes: judging whether a storage structure deformation is to be performed.
If a storage structure deformation is to be performed, the input data cache unit transmits the data to be screened to the structural deformation unit, the structural deformation unit performs the storage structure deformation and passes the deformed data to be screened back to the input data cache unit, and the sub-step in which the input data cache unit passes the data to be screened and the position information data to the data screening unit is then executed; if not, that sub-step is executed directly.
In some embodiments, the step of performing the screening operation on the obtained data and obtaining the data screening result includes: the data screening unit performs the screening operation on the data to be screened according to the position information data, and passes the output data to the output data cache unit.
As shown in Fig. 4E, in a specific embodiment of the present disclosure, the steps of the data screening method are as follows:
Step S4-1: the control unit reads a data screening instruction from the instruction cache unit, decodes it into specific operation micro-instructions, and passes them to the data screening unit;
Step S4-2: the data screening unit obtains the addresses of the data to be screened and of the position information data from the register unit;
Step S4-3: the control unit reads an input/output instruction from the instruction cache unit, decodes it into specific operation micro-instructions, and passes them to the input/output unit;
Step S4-4: the input/output unit passes the data to be screened and the position information data from the storage unit to the input data cache unit;
whether a storage structure deformation is to be performed is judged; if so, step S4-5 is executed, and if not, step S4-6 is executed directly.
Step S4-5: the input data cache unit transmits the data to the structural deformation unit, which performs the corresponding storage structure deformation and passes the deformed data back to the input data cache unit; the method then proceeds to step S4-6;
Step S4-6: the input data cache unit passes the data to the data screening unit, and the data screening unit performs the screening operation on the data to be screened according to the position information data;
Step S4-7: the output data are passed to the output data cache unit, where the output data may contain only the screened data, or may also contain auxiliary information about the screened data, such as vector length, array size or occupied space.
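The overall flow of steps S4-2 to S4-7 can be summarized with a minimal sketch; the dictionary standing in for the storage unit, the addresses, and the row-major flattening are illustrative assumptions rather than the device's actual interfaces.

```python
def data_screening_flow(storage, data_addr, pos_addr, deform=False):
    """Illustrative flow of steps S4-2 to S4-7; 'storage' is a plain dict
    standing in for the storage unit, and the caches are local objects."""
    data = storage[data_addr]            # S4-4: storage unit -> input data cache
    position = storage[pos_addr]
    if deform:                           # S4-5: expand a 2-D array row by row
        data = [v for row in data for v in row]
        position = [v for row in position for v in row]
    screened = [v for v, keep in zip(data, position) if keep == 1]   # S4-6
    output_cache = {"data": screened, "length": len(screened)}       # S4-7
    return output_cache

storage = {0x10: [[1, 4], [61, 22]], 0x20: [[0, 1], [0, 1]]}
print(data_screening_flow(storage, 0x10, 0x20, deform=True))  # {'data': [4, 22], 'length': 2}
```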
The embodiments of the present disclosure have thus been described in detail with reference to the accompanying drawings. From the above description, those skilled in the art should have a clear understanding of the data screening device and method of the present disclosure.
One embodiment of the present disclosure provides a neural network processor, comprising: a memory, a scratchpad memory (whose capacity is generally smaller than that of the memory but whose read speed is faster; it is also called a "cache" and serves for temporary data storage) and a heterogeneous kernel. The memory is configured to store the data and instructions of a neural network operation; the scratchpad memory is connected to the memory through a memory bus; and the heterogeneous kernel is connected to the scratchpad memory through a scratchpad memory bus, reads the data and instructions of the neural network operation through the scratchpad memory, completes the neural network operation, sends the operation result back to the scratchpad memory, and controls the scratchpad memory to write the operation result back to the memory.
By using the scratchpad memory, the actual computing capability of the hardware is fully utilized, data access efficiency is improved, and time cost is reduced.
Here, the heterogeneous kernel refers to a kernel that includes at least two different types of cores, that is, cores of two different structures.
In some embodiments, the heterogeneous kernel includes: multiple computing cores, of at least two different types, for executing neural network operations or neural network layer operations; and one or more logic control cores, for determining, according to the data of the neural network operation, whether the neural network operation or neural network layer operation is to be executed by the dedicated cores and/or the generic cores.
Further, the multiple computing cores include m generic cores and n dedicated cores, where a dedicated core is dedicated to executing a specified neural network/neural network layer operation and a generic core can execute any neural network/neural network layer operation. Optionally, a generic core may be a CPU and a dedicated core may be an NPU.
In some embodiments, the scratchpad memory includes a shared scratchpad memory and/or a non-shared scratchpad memory, where a shared scratchpad memory is correspondingly connected, through the scratchpad memory bus, to at least two cores of the heterogeneous kernel, and a non-shared scratchpad memory is correspondingly connected, through the scratchpad memory bus, to one core of the heterogeneous kernel.
Specifically, the scratchpad memory may include only one or more shared scratchpad memories, each connected to multiple cores of the heterogeneous kernel (logic control cores, dedicated cores or generic cores). It may also include only one or more non-shared scratchpad memories, each connected to one core of the heterogeneous kernel (a logic control core, a dedicated core or a generic core). It may also include both one or more shared scratchpad memories and one or more non-shared scratchpad memories, where each shared scratchpad memory is connected to multiple cores of the heterogeneous kernel and each non-shared scratchpad memory is connected to one core of the heterogeneous kernel.
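A minimal sketch of how such a mixed configuration could be described is given below; the core names and the particular assignment of cores to scratchpad memories are hypothetical and only illustrate the shared versus non-shared connection patterns.

```python
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    name: str
    shared: bool
    cores: list = field(default_factory=list)   # cores reachable over the scratchpad bus

# Hypothetical mixed configuration: one shared scratchpad serves the logic
# control core and two generic cores, while each dedicated core has its own.
scratchpads = [
    Scratchpad("spad_shared", shared=True,  cores=["logic_ctrl", "generic_0", "generic_1"]),
    Scratchpad("spad_0",      shared=False, cores=["dedicated_0"]),
    Scratchpad("spad_1",      shared=False, cores=["dedicated_1"]),
]
```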
In some embodiments, the logic control core is connected to the scratchpad memory through the scratchpad memory bus, reads the data of the neural network operation through the scratchpad memory, and, according to the type and parameters of the neural network model in the data of the neural network operation, determines that a dedicated core and/or a generic core executes the neural network operation and/or neural network layer operation as the target core. A path may be added between cores, so that the logic control core can send a signal to the target core either directly through a control bus or via the scratchpad memory, thereby controlling the target core to execute the neural network operation and/or neural network layer operation.
One embodiment of the present disclosure proposes a heterogeneous multi-core neural network processor, referring to Fig. 5A, comprising: a memory 11, a non-shared scratchpad memory 12 and a heterogeneous kernel 13.
The memory 11 is configured to store the data and instructions of the neural network operation, where the data include biases, weights, input data, output data, and the type and parameters of the neural network model, etc.; of course, the output data may also not be stored in the memory. The instructions include the various instructions corresponding to the neural network operation, such as the CONFIG instruction, COMPUTE instruction, I/O instruction, NOP instruction, JUMP instruction, MOVE instruction, etc. The data and instructions stored in the memory 11 can be transmitted to the heterogeneous kernel 13 through the non-shared scratchpad memory 12.
The non-shared scratchpad memory 12 includes multiple scratchpad memories 121, each connected to the memory 11 through the memory bus and to the heterogeneous kernel 13 through the scratchpad memory bus, realizing data exchange between the heterogeneous kernel 13 and the non-shared scratchpad memory 12 and between the non-shared scratchpad memory 12 and the memory 11. When the neural network operation data or instructions needed by the heterogeneous kernel 13 are not stored in the non-shared scratchpad memory 12, the non-shared scratchpad memory 12 first reads the required data or instructions from the memory 11 through the memory bus and then sends them to the heterogeneous kernel 13 through the scratchpad memory bus.
The heterogeneous kernel 13 includes a logic control core 131, a generic core 132 and multiple dedicated cores 133; the logic control core 131, the generic core 132 and each dedicated core 133 are each correspondingly connected to one scratchpad memory 121 through the scratchpad memory bus.
The heterogeneous kernel 13 reads the instructions and data of the neural network operation from the non-shared scratchpad memory 12, completes the neural network operation, sends the operation result back to the non-shared scratchpad memory 12, and controls the non-shared scratchpad memory 12 to write the operation result back to the memory 11.
The logic control core 131 reads the neural network operation data and instructions from the non-shared scratchpad memory 12 and, according to the type and parameters of the neural network model in the data, judges whether there is a dedicated core 133 that supports the neural network operation and can complete the scale of the neural network operation. If there is, the neural network operation is handed to the corresponding dedicated core 133; if not, it is handed to the generic core 132.
To determine the positions of the dedicated cores and whether they are idle, a table (called the dedicated/generic core information table) can be set up for each class of cores (dedicated cores supporting the same layer belong to one class, and generic cores belong to one class). The table records the numbers (or addresses) of the cores of the same class and whether each is currently idle; all cores are initially idle, and subsequent changes of the idle state are maintained through direct or indirect communication between the logic control core and the cores. The core numbers in the table can be obtained by scanning at initialization of the network processor, which supports a dynamically configurable heterogeneous kernel (the types and number of dedicated processors in the heterogeneous kernel can be changed at any time, and the core information table is updated by a scan afterwards). Optionally, dynamic configuration of the heterogeneous kernel may not be supported, in which case the core numbers in the table are fixed and no repeated scanning and updating are needed. Optionally, if the numbers of each class of dedicated cores are always contiguous, a base address can be recorded, and these dedicated cores can then be represented by consecutive bits, with a bit value of 0 or 1 indicating whether a core is idle.
To determine the type and parameters of the network model, a decoder may be provided in the logic control core; it judges the type of the network layer from the instructions, can judge whether an instruction is for a generic core or a dedicated core, and can also parse parameters, data addresses, etc. from the instructions. Optionally, the data may be specified to include a data header containing the number and scale of each network layer and the addresses of the corresponding computation data and instructions, with a special parser (software or hardware) parsing this information; optionally, the parsed information is stored in a specified region.
To determine which core to use according to the parsed network layer number and scale, a content addressable memory (CAM) may be provided in the logic control core. Its content can be made configurable, in which case the logic control core provides some instructions to configure/write this CAM. A CAM entry contains the network layer number, the maximum scale each dimension can support, the address of the dedicated core information table supporting this layer, and the address of the generic core information table. Under this scheme, the corresponding entry is found with the parsed layer number and the scale limit is compared; if it is satisfied, the address of the dedicated core information table is taken and an idle dedicated core is sought there, a control signal is sent according to its number, and the computation task is assigned to it. If the corresponding layer is not found in the CAM, or the scale limit is exceeded, or there is no idle core in the dedicated core information table, an idle generic core is sought in the generic core information table, a control signal is sent according to its number, and the computation task is assigned to it. If no idle core is found in either table, the task is added to a waiting queue together with some necessary information, and as soon as an idle core able to compute the task becomes available, the task is assigned to it for computation.
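The dispatch logic built around the core information tables and the CAM can be illustrated with a minimal Python sketch; the table contents, layer types, scale values and core numbers are hypothetical placeholders, not the actual table format of the processor.

```python
from collections import deque

# Hypothetical core information tables: one entry per core of a class,
# recording its number and whether it is currently idle (all idle initially).
dedicated_tables = {"conv": [{"id": 1, "idle": True}, {"id": 2, "idle": True}]}
generic_table = [{"id": 10, "idle": True}]

# Hypothetical CAM: layer type -> maximum supported scale and the dedicated table to use.
cam = {"conv": {"max_scale": 1024, "table": "conv"}}
wait_queue = deque()

def dispatch(layer_type, scale):
    """Pick an idle dedicated core if the CAM entry matches and the scale fits,
    otherwise fall back to an idle generic core, otherwise queue the task."""
    entry = cam.get(layer_type)
    if entry and scale <= entry["max_scale"]:
        for core in dedicated_tables[entry["table"]]:
            if core["idle"]:
                core["idle"] = False
                return ("dedicated", core["id"])
    for core in generic_table:
        if core["idle"]:
            core["idle"] = False
            return ("generic", core["id"])
    wait_queue.append((layer_type, scale))   # wait until some core becomes idle
    return ("queued", None)

print(dispatch("conv", 512))   # ('dedicated', 1)
```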
Of course, the positions of the dedicated cores and whether they are idle can be determined in many ways; the above way is only an illustration. Each dedicated core 133 can independently complete one kind of specified neural network operation, such as a spiking neural network (SNN) operation, writes the operation result back to the scratchpad memory 121 to which it is correspondingly connected, and controls that scratchpad memory 121 to write the operation result back to the memory 11.
The generic core 132 can independently complete a neural network operation whose scale exceeds what a dedicated core can support or that no dedicated core 133 supports, writes the operation result back to the scratchpad memory 121 to which it is correspondingly connected, and controls that scratchpad memory 121 to write the operation result back to the memory 11.
One embodiment of the present disclosure proposes a heterogeneous multi-core neural network processor, referring to Fig. 5B, comprising: a memory 21, a shared scratchpad memory 22 and a heterogeneous kernel 23.
The memory 21 is configured to store the data and instructions of the neural network operation, where the data include biases, weights, input data, output data, and the type and parameters of the neural network model, and the instructions include the various instructions corresponding to the neural network operation. The data and instructions stored in the memory are transmitted to the heterogeneous kernel 23 through the shared scratchpad memory 22.
The shared scratchpad memory 22 is connected to the memory 21 through the memory bus and to the heterogeneous kernel 23 through the shared scratchpad memory bus, realizing data exchange between the heterogeneous kernel 23 and the shared scratchpad memory 22 and between the shared scratchpad memory 22 and the memory 21.
When the neural network operation data or instructions needed by the heterogeneous kernel 23 are not stored in the shared scratchpad memory 22, the shared scratchpad memory 22 first reads the required data or instructions from the memory 21 through the memory bus and then sends them to the heterogeneous kernel 23 through the scratchpad memory bus.
The heterogeneous kernel 23 includes a logic control core 231, multiple generic cores 232 and multiple dedicated cores 233; the logic control core 231, the generic cores 232 and the dedicated cores 233 are all connected to the shared scratchpad memory 22 through the scratchpad memory bus.
The heterogeneous kernel 23 reads the neural network operation data and instructions from the shared scratchpad memory 22, completes the neural network operation, sends the operation result back to the shared scratchpad memory 22, and controls the shared scratchpad memory 22 to write the operation result back to the memory 21.
In addition, when data need to be transmitted between the logic control core 231 and a generic core 232, between the logic control core 231 and a dedicated core 233, between generic cores 232, or between dedicated cores 233, the core sending the data can first transfer the data to the shared scratchpad memory 22 through the shared scratchpad memory bus and then transfer them to the receiving core, without going through the memory 21.
For a neural network operation, the neural network model generally comprises multiple neural network layers; each neural network layer performs the corresponding operation on the operation result of the previous neural network layer and outputs its operation result to the next neural network layer, and the operation result of the last neural network layer is the result of the entire neural network operation. In the heterogeneous multi-core neural network processor of this embodiment, both the generic cores 232 and the dedicated cores 233 can execute the operation of one neural network layer, and the logic control core 231, the generic cores 232 and the dedicated cores 233 jointly complete the neural network operation; for convenience of description, a neural network layer is referred to below simply as a layer.
Each dedicated core 233 can independently execute the operation of one layer, for example the convolution operation, fully connected layer, concatenation operation, element-wise addition/multiplication, ReLU operation, pooling operation or Batch Norm operation of a neural network layer; the scale of the neural network operation layer must not be too large, i.e., it must not exceed the scale of the neural network operation layer that the corresponding dedicated core can support, which means the dedicated core operation is subject to limits on the number of neurons and synapses of the layer. After the layer operation is completed, the operation result is written back to the shared scratchpad memory 22.
The generic cores 232 are used to execute layer operations whose scale exceeds what the dedicated cores 233 can support or that no dedicated core supports, write the operation result back to the shared scratchpad memory 22, and control the shared scratchpad memory 22 to write the operation result back to the memory 21.
Further, after the dedicated cores 233 and generic cores 232 write the operation result back to the memory 21, the logic control core 231 can send a start-operation signal to the dedicated core or generic core that executes the next layer operation, notifying it to start the operation.
Further, a dedicated core 233 or generic core 232 starts the operation when it has received the start-operation signal sent by the dedicated core or generic core executing the previous layer operation and no layer operation is currently in progress; if a layer operation is currently in progress, it first completes the current layer operation, writes the operation result back to the shared scratchpad memory 22, and then starts the operation.
The logic control core 231 reads the neural network operation data from the shared scratchpad memory 22, parses each layer of the neural network model according to the type and parameters of the neural network model therein, and for each layer judges whether there is a dedicated core 233 that supports the operation of this layer and can complete the operation scale of this layer; if there is, the operation of this layer is handed to the corresponding dedicated core 233, and if not, the operation of this layer is handed to a generic core 232. The logic control core 231 also sets the addresses of the data and instructions needed by the generic cores 232 and dedicated cores 233 to perform the layer operations, and the generic cores 232 and dedicated cores 233 read the data and instructions at the corresponding addresses and execute the layer operations.
For the dedicated core 233 or generic core 232 executing the first layer operation, the logic control core 231 sends the start-operation signal to that core when the operation starts, and after the neural network operation is finished, the dedicated core 233 or generic core 232 executing the last layer operation sends a start-operation signal to the logic control core 231; after receiving the start-operation signal, the logic control core 231 controls the shared scratchpad memory 22 to write the operation result back into the memory 21.
One embodiment of the present disclosure provides a method for performing a neural network operation with the heterogeneous multi-core neural network processor of the first embodiment; referring to Fig. 5C, the steps are as follows:
Step S5-11: the logic control core 131 in the heterogeneous kernel 13 reads the data and instructions of the neural network operation from the memory 11 through the non-shared scratchpad memory 12;
Step S5-12: the logic control core 131 in the heterogeneous kernel 13 judges, according to the type and parameters of the neural network model in the data, whether there is a qualified dedicated core, where "qualified" means that the dedicated core supports the neural network operation and can complete the scale of the neural network operation (the scale limit may be intrinsic to the dedicated core, in which case the core design manufacturer can be consulted; it may also be defined artificially, for example if experiments show that beyond a certain scale the generic core is more effective; the scale limit can be set when configuring the CAM). If dedicated core m is qualified, dedicated core m serves as the target core and step S5-13 is executed; otherwise step S5-15 is executed, where m is the dedicated core number, 1 ≤ m ≤ M, and M is the number of dedicated cores.
Step S5-13: the logic control core 131 in the heterogeneous kernel 13 sends a signal to the target core to activate it, and at the same time sends the addresses corresponding to the data and instructions of the neural network operation to be performed to the target core.
Step S5-14: the target core obtains the data and instructions of the neural network operation from the memory 11 through the non-shared scratchpad memory 12 according to the obtained addresses, performs the neural network operation, and then outputs the operation result to the memory 11 through the non-shared scratchpad memory 12, completing the operation.
Further, if in step S5-12 there is no qualified dedicated core, steps S5-15 to S5-16 are executed next.
Step S5-15: the logic control core 131 in the heterogeneous kernel 13 sends a signal to the generic core 132 to activate it, and at the same time sends the addresses corresponding to the data and instructions of the neural network operation to be performed to the generic core 132.
Step S5-16: the generic core 132 obtains the data and instructions of the neural network operation from the memory 11 through the non-shared scratchpad memory 12 according to the obtained addresses, performs the neural network operation, and then outputs the operation result to the memory 11 through the non-shared scratchpad memory 12, completing the operation.
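The core selection of steps S5-12 to S5-16 can be illustrated with a minimal sketch; the Core class, its fields and the example network types are hypothetical stand-ins for the cores of Fig. 5A, not the actual hardware interface.

```python
from dataclasses import dataclass

@dataclass
class Core:
    """Hypothetical stand-in for a core of Fig. 5A."""
    name: str
    supported_types: tuple = ()     # empty tuple: generic core, supports everything
    max_scale: int = 10**9

    def qualifies(self, net_type, scale):
        return (not self.supported_types or net_type in self.supported_types) \
               and scale <= self.max_scale

def select_target(dedicated_cores, generic_core, net_type, scale):
    """Steps S5-12/S5-13/S5-15: pick a qualified dedicated core, else the generic core."""
    for core in dedicated_cores:
        if core.qualifies(net_type, scale):
            return core
    return generic_core

dedicated = [Core("dedicated_1", ("SNN",), 4096), Core("dedicated_2", ("MLP",), 2048)]
generic = Core("generic_0")
print(select_target(dedicated, generic, "MLP", 1024).name)   # dedicated_2
print(select_target(dedicated, generic, "CNN", 1024).name)   # generic_0
```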
One embodiment of the present disclosure provides a method for performing a neural network operation with the heterogeneous multi-core neural network processor of the second embodiment; referring to Fig. 5D, the steps are as follows:
Step S5-21: the logic control core 231 in the heterogeneous kernel 23 reads the data and instructions of the neural network operation from the memory 21 through the shared scratchpad memory 22.
Step S5-22: the logic control core 231 in the heterogeneous kernel 23 parses the type and parameters of the neural network model in the data, judges, for each of layers 1 to I of the neural network model (I being the number of layers of the neural network model), whether there is a qualified dedicated core, where "qualified" means that the dedicated core supports the operation of this layer and can complete the scale of this layer's operation, and allocates a corresponding generic core or dedicated core to each layer's operation.
For the i-th layer operation of the neural network model, 1 ≤ i ≤ I: if dedicated core m is qualified, dedicated core m is selected to execute the i-th layer operation, where m is the dedicated core number, 1 ≤ m ≤ M, and M is the number of dedicated cores; if no dedicated core is qualified, generic core M+n is selected to execute the i-th layer operation, where M+n is the generic core number, 1 ≤ n ≤ N, and N is the number of generic cores. Here the dedicated cores 233 and generic cores 232 are numbered uniformly (i.e., dedicated cores and generic cores are numbered together: with x dedicated cores and y generic cores they are numbered from 1 to x+y, and each dedicated core or generic core corresponds to one number in 1 to x+y). Of course, the dedicated cores and generic cores may also be numbered independently, e.g., with x dedicated cores and y generic cores the dedicated cores are numbered from 1 to x and the generic cores from 1 to y, with each dedicated core or generic core corresponding to one number; in that case the number of a dedicated core may coincide with the number of a generic core, but the numbers are only nominally identical, and the cores can be addressed according to their physical addresses. A core sequence corresponding to the operations of layers 1 to I of the neural network model is finally obtained; the core sequence has I elements, each element is a dedicated core or a generic core, and the elements correspond in order to the operations of layers 1 to I of the neural network model. For example, in the core sequence 1a, 2b, ..., il, the numbers 1, 2, ..., i denote the numbers of the neural network layers and a, b, ..., l denote the numbers of the dedicated or generic cores.
Step S5-23: the logic control core 231 in the heterogeneous kernel 23 sends the addresses corresponding to the data and instructions of the layer operation to be performed to the dedicated core or generic core executing that layer operation, and sends to that core the number of the next dedicated core or generic core in the core sequence; the number sent to the dedicated core or generic core executing the last layer operation is the number of the logic control core.
Step S5-24: the logic control core 231 in the heterogeneous kernel 23 sends a start-operation signal to the first core in the core sequence. After receiving the start-operation signal, the first dedicated core 233 or generic core 232, if there is currently an unfinished operation, continues and completes that operation, and then reads the data and instructions from the addresses corresponding to the data and instructions and performs the current layer operation.
Step S5-25: after completing the current layer operation, the first dedicated core 233 or generic core 232 transmits the operation result to the designated address of the shared scratchpad memory 22 and at the same time sends a start-operation signal to the second core in the core sequence.
Step S5-26: in the same way, after each core in the core sequence receives the start-operation signal, it continues and completes any currently unfinished operation, then reads the data and instructions from the addresses corresponding to the data and instructions, performs its corresponding layer operation, transmits the operation result to the designated address of the shared scratchpad memory 22, and at the same time sends a start-operation signal to the next core in the core sequence; the last core of the core sequence sends a start-operation signal to the logic control core 231.
Step S5-27: after receiving the start-operation signal, the logic control core 231 controls the shared scratchpad memory 22 to write the operation results of the neural network layers back into the memory 21, completing the operation.
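The per-layer allocation and the chained execution of steps S5-22 to S5-27 can be summarized with a minimal sketch; the core dictionaries, the layer descriptions and the lambda "run" functions are hypothetical placeholders standing in for the cores of Fig. 5B and their layer computations.

```python
def build_core_sequence(layers, dedicated_cores, generic_cores):
    """Step S5-22: allocate one core per layer, preferring a qualified dedicated core.
    'layers' is a list of (layer_type, scale); the core objects are hypothetical."""
    sequence = []
    for layer_type, scale in layers:
        chosen = next((c for c in dedicated_cores
                       if layer_type in c["types"] and scale <= c["max_scale"]),
                      None) or generic_cores[0]
        sequence.append(chosen)
    return sequence

def run_layer_chain(sequence, layers, scratchpad):
    """Steps S5-24 to S5-27: each core runs its layer on the previous layer's
    result in the shared scratchpad, then hands control to the next core."""
    for core, (layer_type, _) in zip(sequence, layers):
        scratchpad["activation"] = core["run"](layer_type, scratchpad["activation"])
    return scratchpad["activation"]   # last result, written back by the logic control core

dedicated = [{"types": {"conv"}, "max_scale": 4096, "run": lambda t, x: [v * 2 for v in x]}]
generic = [{"types": set(), "max_scale": 10**9, "run": lambda t, x: [v + 1 for v in x]}]
layers = [("conv", 1024), ("fc", 512)]
spad = {"activation": [1, 2, 3]}
seq = build_core_sequence(layers, dedicated, generic)
print(run_layer_chain(seq, layers, spad))   # conv on the dedicated core, fc on the generic core
```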
As shown in Fig. 5E, this embodiment further extends the first embodiment above. In the embodiment of Fig. 5A, each scratchpad memory 121 is dedicated to one core: dedicated core 1 can only access scratchpad memory 3 and cannot access the other scratchpad memories, and the same holds for the other cores, so the whole 12 formed by the units 121 has a non-shared property. However, if core j wants to use the computation result of core i (i ≠ j), which is initially stored in the scratchpad memory corresponding to core i, core i must first write this result from its scratchpad memory into the memory 11, and core j must then read it from the memory 11 into a scratchpad memory it can access; only after this process can core j use the result. To make this process more convenient, an N x N data exchange network 34 is added on this basis, which can for example be realized with a crossbar, so that each core (331, 332 or 333) can access all scratchpad memories (321); the unit 32 then has a shared property.
The method for performing a neural network operation with the device of this embodiment (corresponding to Fig. 5E) is as follows:
Step S5-31: the logic control core 331 in the heterogeneous kernel 33 reads the data and instructions of the neural network operation from the memory 31 through the scratchpad memory 32;
Step S5-32: the logic control core 331 in the heterogeneous kernel 33 judges, according to the type and parameters of the neural network model in the data, whether there is a qualified dedicated core, where "qualified" means that the dedicated core supports the neural network operation and can complete the scale of the neural network operation. If dedicated core m is qualified, dedicated core m serves as the target core and step S5-33 is executed; otherwise step S5-35 is executed, where m is the dedicated core number.
Step S5-33: the logic control core 331 in the heterogeneous kernel 33 sends a signal to the target core to activate it, and at the same time sends the addresses corresponding to the data and instructions of the neural network operation to be performed to the target core.
Step S5-34: the target core obtains the data and instructions of the neural network operation (from the scratchpad memory 32) according to the obtained addresses, performs the neural network operation, and then stores the operation result in the scratchpad memory 32, completing the operation.
Step S5-35: the logic control core 331 in the heterogeneous kernel 33 sends a signal to the generic core 332 to activate it, and at the same time sends the addresses corresponding to the data and instructions of the neural network operation to be performed to the generic core 332.
Step S5-36: the generic core 332 obtains the data and instructions of the neural network operation (from the scratchpad memory 32) according to the obtained addresses, performs the neural network operation, and then stores the operation result in the scratchpad memory 32, completing the operation.
Further, the connection mode between the memory and the scratchpad memory can be modified, which yields a new embodiment, as shown in Fig. 5F. Compared with the embodiment described in Fig. 5E, the difference lies in the connection mode between the memory 41 and the scratchpad memory 42. The original connection uses a bus, so when the multiple scratchpad memories (321) write to the memory (31) they must queue, which is not efficient (see Fig. 5E). The structure here is now abstracted as a data exchange network with one input and N outputs, and this function can be realized with various topologies, for example a star structure (the memory 41 is connected to each of the N scratchpad memories 421 through a dedicated path) or a tree structure (the memory 41 is at the root position and the scratchpad memories 421 are at the leaf positions).
It should be noted that the present disclosure places no restriction on the number of logic control cores, the number of dedicated cores, the number of generic cores, the number of shared or non-shared scratchpad memories, or the number of memories; they can be adjusted appropriately according to the specific requirements of the neural network operation.
The embodiments of the present disclosure have thus been described in detail with reference to the accompanying drawings. From the above description, those skilled in the art should have a clear understanding of the heterogeneous multi-core neural network processor and the neural network operation method of the present disclosure.
In some embodiments, the present disclosure further provides a chip including the above-described device.
In some embodiments, the present disclosure further provides a chip package structure including the above-described chip.
In some embodiments, the present disclosure further provides a board including the above-described chip package structure.
In some embodiments, the present disclosure further provides an electronic device including the above-described board.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance and/or medical device.
The vehicle includes an airplane, ship and/or car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove and range hood; the medical device includes a nuclear magnetic resonance instrument, B-mode ultrasound scanner and/or electrocardiograph.
It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are alternative embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of the units is only a logical function division, and there may be other division manners in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, server or network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash disk, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk or optical disk.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by instructing the relevant hardware through a program, and the program may be stored in a computer-readable memory, which may include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, etc.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
It should be noted that, in the drawings or the text of the specification, implementations not shown or described are forms known to those of ordinary skill in the art and are not described in detail. In addition, the above definitions of the elements and methods are not limited to the specific structures, shapes or modes mentioned in the embodiments, which those of ordinary skill in the art may simply change or replace, for example:
the control module of the present disclosure is not limited to the specific composition and structure of the embodiments; control modules known to those of ordinary skill in the art that can realize the interaction of data and operation instructions between the storage module and the computing unit can all be used to implement the present disclosure.
The specific embodiments described above further describe the purpose, technical solutions and beneficial effects of the present disclosure in detail. It should be understood that the above are merely specific embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.

Claims (22)

1. A neural network processor, comprising: a memory, a scratchpad memory and a heterogeneous kernel; wherein,
the memory is configured to store the data and instructions of a neural network operation;
the scratchpad memory is connected to the memory through a memory bus;
the heterogeneous kernel is connected to the scratchpad memory through a scratchpad memory bus, reads the data and instructions of the neural network operation through the scratchpad memory, completes the neural network operation, sends the operation result back to the scratchpad memory, and controls the scratchpad memory to write the operation result back to the memory.
2. The neural network processor according to claim 1, wherein the heterogeneous kernel includes:
multiple computing cores, of at least two different types, for executing a neural network operation or a neural network layer operation; and
one or more logic control cores, for determining, according to the data of the neural network operation, the type of computing core that executes the neural network operation or neural network layer operation.
3. The neural network processor according to claim 2, wherein the multiple computing cores include x generic cores and y dedicated cores; wherein the dedicated cores are dedicated to executing specified neural network/neural network layer operations, and the generic cores are used to execute any neural network/neural network layer operation;
the logic control core is configured to determine, according to the data of the neural network operation, whether the neural network operation or neural network layer operation is executed by the dedicated cores and/or the generic cores.
4. The neural network processor according to claim 3, wherein the generic core is a CPU and the dedicated core is an NPU.
5. The neural network processor according to claim 1, wherein the scratchpad memory includes a shared scratchpad memory and/or a non-shared scratchpad memory; wherein a shared scratchpad memory is correspondingly connected, through the scratchpad memory bus, to at least two cores of the heterogeneous kernel, and a non-shared scratchpad memory is correspondingly connected, through the scratchpad memory bus, to one core of the heterogeneous kernel.
6. The neural network processor according to claim 3, wherein the logic control core is connected to the scratchpad memory through the scratchpad memory bus, reads the data of the neural network operation through the scratchpad memory, and, according to the type and parameters of the neural network model in the data of the neural network operation, determines that a dedicated core and/or a generic core executes the neural network operation and/or neural network layer operation as the target core.
7. The neural network processor according to claim 6, wherein the logic control core sends a signal to the target core directly through a control bus, or sends a signal to the target core via the scratchpad memory, thereby controlling the target core to execute the neural network operation and/or neural network layer operation.
8. The neural network processor according to claim 2, wherein the logic control core is configured to read the data and instructions of the neural network operation from the scratchpad memory and, according to the type and parameters of the neural network model in the data, judge whether there is a dedicated core that supports the neural network operation and can complete the scale of the neural network operation; if there is, the neural network operation is completed by the dedicated core, and otherwise the neural network operation is completed by a generic core.
9. The neural network processor according to claim 8, wherein the neural network processor is provided with a dedicated/generic core information table, which includes the type, number, address and idle state information of the cores and is used to judge whether there is a dedicated core that supports the neural network operation and can complete the scale of the neural network operation.
10. The neural network processor according to claim 9, wherein the logic control core includes a decoder, which is used to judge the type of the instruction and to judge the type of the network layer according to the instruction.
11. The neural network processor according to claim 9, wherein the logic control core includes a content addressable memory for determining the type and number of the core to be used according to the network layer number and scale.
12. The neural network processor according to claim 9, wherein the dedicated cores and generic cores are numbered uniformly or numbered independently; if the number m of a dedicated core is identical to the number n of a generic core, the cores are addressed according to their physical addresses.
13. The neural network processor according to claim 5, wherein the scratchpad memory is a non-shared scratchpad memory, and the neural network processor further includes a data exchange network connected simultaneously to the scratchpad memories, the logic control core, the generic cores and the dedicated cores.
14. The neural network processor according to claim 1, wherein
the scratchpad memory and the memory are connected by a bus, or the scratchpad memory and the memory are connected through a data exchange network.
15. A neural network operation method, comprising:
a scratchpad memory reading the data and instructions of a neural network operation from a memory; and
a heterogeneous kernel receiving the data and instructions of the neural network operation sent by the scratchpad memory and executing the neural network operation.
16. The neural network operation method according to claim 15, wherein after completing the neural network operation, the heterogeneous kernel sends the operation result back to the scratchpad memory and controls the scratchpad memory to write the operation result back to the memory.
17. The neural network operation method according to claim 16, wherein the heterogeneous kernel receiving the data and instructions of the neural network operation sent by the scratchpad memory and executing the neural network operation comprises:
a logic control core in the heterogeneous kernel receiving the data and instructions of the neural network operation sent by the scratchpad memory; and
the logic control core in the heterogeneous kernel determining, according to the type and parameters of the neural network model in the data of the neural network operation, that a dedicated core and/or a generic core executes the neural network operation or neural network layer operation.
18. The neural network operation method according to claim 17, wherein the logic control core in the heterogeneous kernel determining, according to the type and parameters of the neural network model in the data of the neural network operation, that a dedicated core and/or a generic core executes the neural network layer operation comprises:
the logic control core in the heterogeneous kernel judging, according to the type and parameters of the neural network model in the data of the neural network operation, whether there is a qualified dedicated core;
if dedicated core m is qualified, dedicated core m serving as the target core, the logic control core in the heterogeneous kernel sending a signal to the target core and sending the addresses corresponding to the data and instructions of the neural network operation to the target core;
the target core obtaining the data and instructions of the neural network operation from the memory through a shared or non-shared scratchpad memory according to the addresses, performing the neural network operation, and outputting the operation result to the memory through the shared or non-shared scratchpad memory, completing the operation;
if there is no qualified dedicated core, the logic control core in the heterogeneous kernel sending a signal to a generic core and sending the addresses corresponding to the data and instructions of the neural network operation to the generic core;
the generic core obtaining the data and instructions of the neural network operation from the memory through the shared or non-shared scratchpad memory according to the addresses, performing the neural network operation, and outputting the operation result to the memory through the shared or non-shared scratchpad memory, completing the operation.
19. The neural network operation method according to claim 18, wherein the qualified dedicated core refers to a dedicated core that supports the specified neural network operation and can complete the scale of the specified neural network operation.
20. The neural network operation method according to claim 18, wherein the logic control core in the heterogeneous kernel determining, according to the type and parameters of the neural network model in the data of the neural network operation, that a dedicated core and/or a generic core executes the neural network operation comprises:
the logic control core in the heterogeneous kernel parsing the type and parameters of the neural network model in the data, judging for each neural network layer whether there is a qualified dedicated core, allocating a corresponding generic core or dedicated core to each neural network layer, and obtaining a core sequence corresponding to the neural network layers;
the logic control core in the heterogeneous kernel sending the addresses corresponding to the data and instructions of the neural network layer operation to the dedicated core or generic core corresponding to the neural network layer, and sending to the dedicated core or generic core corresponding to the neural network layer the number of the next dedicated core or generic core in the core sequence;
the dedicated cores and generic cores corresponding to the neural network layers reading the data and instructions of the neural network layer operations from the addresses, performing the neural network layer operations, and transmitting the operation results to designated addresses of the shared and/or non-shared scratchpad memory;
the logic control core controlling the shared and/or non-shared scratchpad memory to write the operation results of the neural network layers back into the memory, completing the operation.
21. The neural network operation method according to claim 18, wherein the qualified dedicated core refers to a dedicated core that supports the specified neural network layer operation and can complete the scale of the specified neural network layer operation.
22. The neural network operation method according to claim 15, wherein the neural network operation includes a spiking neural network operation; and the neural network layer operation includes a convolution operation of a neural network layer, a fully connected layer, a concatenation operation, element-wise addition/multiplication, a ReLU operation, a pooling operation and/or a Batch Norm operation.
CN201811423421.XA 2017-04-06 2018-04-04 Network processing unit and network operations method Pending CN109359736A (en)

Applications Claiming Priority (11)

Application Number Priority Date Filing Date Title
CN201710222232.5A CN108694181B (en) 2017-04-06 2017-04-06 Data screening device and method
CN2017102222325 2017-04-06
CN201710227493.6A CN108694441B (en) 2017-04-07 2017-04-07 Network processor and network operation method
CN2017102274936 2017-04-07
CN2017102564445 2017-04-19
CN201710256444.5A CN108733625B (en) 2017-04-19 2017-04-19 Arithmetic device and method
CN2017102660527 2017-04-21
CN201710266052.7A CN108734280A (en) 2017-04-21 2017-04-21 A kind of arithmetic unit and method
CN201710312415.6A CN108805271B (en) 2017-05-05 2017-05-05 Arithmetic device and method
CN2017103124156 2017-05-05
CN201880001242.9A CN109219821B (en) 2017-04-06 2018-04-04 Arithmetic device and method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201880001242.9A Division CN109219821B (en) 2017-04-06 2018-04-04 Arithmetic device and method

Publications (1)

Publication Number Publication Date
CN109359736A true CN109359736A (en) 2019-02-19

Family

ID=63712007

Family Applications (4)

Application Number Title Priority Date Filing Date
CN201880001242.9A Active CN109219821B (en) 2017-04-06 2018-04-04 Arithmetic device and method
CN201811413244.7A Pending CN109344965A (en) 2017-04-06 2018-04-04 Arithmetic unit and method
CN201811423295.8A Active CN109409515B (en) 2017-04-06 2018-04-04 Arithmetic device and method
CN201811423421.XA Pending CN109359736A (en) 2017-04-06 2018-04-04 Network processing unit and network operations method

Family Applications Before (3)

Application Number Title Priority Date Filing Date
CN201880001242.9A Active CN109219821B (en) 2017-04-06 2018-04-04 Arithmetic device and method
CN201811413244.7A Pending CN109344965A (en) 2017-04-06 2018-04-04 Arithmetic unit and method
CN201811423295.8A Active CN109409515B (en) 2017-04-06 2018-04-04 Arithmetic device and method

Country Status (4)

Country Link
US (1) US10896369B2 (en)
EP (6) EP3579150B1 (en)
CN (4) CN109219821B (en)
WO (1) WO2018184570A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930668A (en) * 2020-08-03 2020-11-13 中国科学院计算技术研究所 Operation device and method, multi-core intelligent processor and multi-core heterogeneous intelligent processor
CN111930669A (en) * 2020-08-03 2020-11-13 中国科学院计算技术研究所 Multi-core heterogeneous intelligent processor and operation method
CN111985634A (en) * 2020-08-21 2020-11-24 北京灵汐科技有限公司 Operation method and device of neural network, computer equipment and storage medium
WO2020253383A1 (en) * 2019-06-21 2020-12-24 北京灵汐科技有限公司 Streaming data processing method based on many-core processor, and computing device
US11403104B2 (en) 2019-12-09 2022-08-02 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Neural network processor, chip and electronic device
CN115185878A (en) * 2022-05-24 2022-10-14 中科驭数(北京)科技有限公司 Multi-core packet network processor architecture and task scheduling method

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11551067B2 (en) 2017-04-06 2023-01-10 Shanghai Cambricon Information Technology Co., Ltd Neural network processor and neural network computation method
US11586910B1 (en) 2018-04-20 2023-02-21 Perceive Corporation Write cache for neural network inference circuit
US11568227B1 (en) 2018-04-20 2023-01-31 Perceive Corporation Neural network inference circuit read controller with multiple operational modes
US11531727B1 (en) 2018-04-20 2022-12-20 Perceive Corporation Computation of neural network node with large input values
US11783167B1 (en) 2018-04-20 2023-10-10 Perceive Corporation Data transfer for non-dot product computations on neural network inference circuit
US11468145B1 (en) 2018-04-20 2022-10-11 Perceive Corporation Storage of input values within core of neural network inference circuit
US10740434B1 (en) 2018-04-20 2020-08-11 Perceive Corporation Reduced dot product computation circuit
GB2577732B (en) * 2018-10-04 2022-02-23 Advanced Risc Mach Ltd Processing data in a convolutional neural network
US11995533B1 (en) 2018-12-05 2024-05-28 Perceive Corporation Executing replicated neural network layers on inference circuit
CN109711538B (en) * 2018-12-14 2021-01-15 安徽寒武纪信息科技有限公司 Operation method, device and related product
US11347297B1 (en) 2019-01-23 2022-05-31 Perceive Corporation Neural network inference circuit employing dynamic memory sleep
CN111695686B (en) * 2019-03-15 2022-11-01 上海寒武纪信息科技有限公司 Address allocation method and device
US11941533B1 (en) 2019-05-21 2024-03-26 Perceive Corporation Compiler for performing zero-channel removal
CN113435591B (en) * 2019-08-14 2024-04-05 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
KR102653745B1 (en) * 2023-06-02 2024-04-02 라이프앤사이언스주식회사 Educational robot controller with optimized calculation speed

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120011089A1 (en) * 2010-07-08 2012-01-12 Qualcomm Incorporated Methods and systems for neural processor training by encouragement of correct output
CN105930902A (en) * 2016-04-18 2016-09-07 中国科学院计算技术研究所 Neural network processing method and system
CN106228238A (en) * 2016-07-27 2016-12-14 中国科学技术大学苏州研究院 The method and system of degree of depth learning algorithm is accelerated on field programmable gate array platform

Family Cites Families (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8619452D0 (en) * 1986-08-08 1986-12-17 Dobson V G Signal generating & processing
DE327817T1 (en) * 1988-01-11 1990-04-12 Ezel Inc., Tokio/Tokyo ASSOCIATIVE SAMPLE CONVERSION SYSTEM AND ADAPTATION PROCEDURE THEREFOR.
US5931945A (en) * 1994-04-29 1999-08-03 Sun Microsystems, Inc. Graphic system for masking multiple non-contiguous bytes having decode logic to selectively activate each of the control lines based on the mask register bits
US5629884A (en) * 1995-07-28 1997-05-13 Motorola, Inc. Log converter utilizing offset and method of use thereof
US6247114B1 (en) * 1999-02-19 2001-06-12 Advanced Micro Devices, Inc. Rapid selection of oldest eligible entry in a queue
US6330657B1 (en) * 1999-05-18 2001-12-11 Ip-First, L.L.C. Pairing of micro instructions in the instruction queue
US6581049B1 (en) * 1999-11-08 2003-06-17 Saffron Technology, Inc. Artificial neurons including power series of weights and counts that represent prior and next association
WO2001086431A1 (en) * 2000-05-05 2001-11-15 Lee Ruby B A method and system for performing subword permutation instructions for use in two-dimensional multimedia processing
KR100660437B1 (en) 2001-09-20 2006-12-22 마이크로칩 테크놀로지 인코포레이티드 Serial communication device with dynamic filter allocation
US7333567B2 (en) * 2003-12-23 2008-02-19 Lucent Technologies Inc. Digital detector utilizable in providing closed-loop gain control in a transmitter
CN100437520C (en) * 2004-09-21 2008-11-26 中国科学院计算技术研究所 Method of making calculation against matrix using computer
US8280936B2 (en) * 2006-12-29 2012-10-02 Intel Corporation Packed restricted floating point representation and logic for conversion to single precision float
CN101093474B (en) * 2007-08-13 2010-04-07 北京天碁科技有限公司 Method for implementing matrix transpose by using vector processor, and processing system
US8775340B2 (en) * 2008-12-19 2014-07-08 The Board Of Trustees Of The University Of Illinois Detection and prediction of physiological events in people with sleep disordered breathing using a LAMSTAR neural network
CN201726420U (en) 2010-08-06 2011-01-26 北京国科环宇空间技术有限公司 Blind equalization device
KR101782373B1 (en) * 2010-11-10 2017-09-29 삼성전자 주식회사 Computing apparatus and method using X-Y stack memory
US8712940B2 (en) * 2011-05-31 2014-04-29 International Business Machines Corporation Structural plasticity in spiking neural networks with symmetric dual of an electronic neuron
CN102508803A (en) 2011-12-02 2012-06-20 南京大学 Matrix transposition memory controller
US9159020B2 (en) * 2012-09-14 2015-10-13 International Business Machines Corporation Multiplexing physical neurons to optimize power and area
CN102930336A (en) * 2012-10-29 2013-02-13 哈尔滨工业大学 Self-adaptive correction method of resistor array
EP2943875A4 (en) * 2013-01-10 2016-11-30 Freescale Semiconductor Inc Data processor and method for data processing
US9864834B2 (en) * 2013-03-15 2018-01-09 Syracuse University High-resolution melt curve classification using neural networks
CN103679710B (en) * 2013-11-29 2016-08-17 杭州电子科技大学 Weak-edge detection method for images based on multilayer neuron pool discharge information
US9978014B2 (en) * 2013-12-18 2018-05-22 Intel Corporation Reconfigurable processing unit
EP3035249B1 (en) 2014-12-19 2019-11-27 Intel Corporation Method and apparatus for distributed and cooperative computation in artificial neural networks
CN104598391A (en) 2015-01-21 2015-05-06 佛山市智海星空科技有限公司 Partitioning linear storage and reading method and system for two-dimensional matrix to be transposed
US9712146B2 (en) * 2015-09-18 2017-07-18 University Of Notre Dame Du Lac Mixed signal processors
CN105426160B (en) 2015-11-10 2018-02-23 北京时代民芯科技有限公司 Multi-issue method for instruction classification based on the SPRAC V8 instruction set
CN107563497B (en) * 2016-01-20 2021-03-19 中科寒武纪科技股份有限公司 Computing device and operation method for sparse artificial neural network
CN105844330B (en) * 2016-03-22 2019-06-28 华为技术有限公司 Data processing method of a neural network processor, and neural network processor
JP2017199167A (en) * 2016-04-27 2017-11-02 ルネサスエレクトロニクス株式会社 Semiconductor device
JP6890615B2 (en) * 2016-05-26 2021-06-18 タータン エーアイ リミテッド Accelerator for deep neural networks
CN106066783A (en) * 2016-06-02 2016-11-02 华为技术有限公司 Neural network forward operation hardware structure based on power-weight quantization
EP3469522A4 (en) * 2016-06-14 2020-03-18 The Governing Council of the University of Toronto Accelerator for deep neural networks
US10579583B2 (en) * 2016-08-09 2020-03-03 International Business Machines Corporation True random generator (TRNG) in ML accelerators for NN dropout and initialization
KR20180034853A (en) * 2016-09-28 2018-04-05 에스케이하이닉스 주식회사 Apparatus and method for test operation of a convolutional neural network
US10949736B2 (en) * 2016-11-03 2021-03-16 Intel Corporation Flexible neural network accelerator and methods therefor

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253383A1 (en) * 2019-06-21 2020-12-24 北京灵汐科技有限公司 Streaming data processing method based on many-core processor, and computing device
US11403104B2 (en) 2019-12-09 2022-08-02 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Neural network processor, chip and electronic device
CN111930668A (en) * 2020-08-03 2020-11-13 中国科学院计算技术研究所 Operation device and method, multi-core intelligent processor and multi-core heterogeneous intelligent processor
CN111930669A (en) * 2020-08-03 2020-11-13 中国科学院计算技术研究所 Multi-core heterogeneous intelligent processor and operation method
CN111930669B (en) * 2020-08-03 2023-09-01 中国科学院计算技术研究所 Multi-core heterogeneous intelligent processor and operation method
CN111930668B (en) * 2020-08-03 2023-09-26 中国科学院计算技术研究所 Arithmetic device, method, multi-core intelligent processor and multi-core heterogeneous intelligent processor
CN111985634A (en) * 2020-08-21 2020-11-24 北京灵汐科技有限公司 Operation method and device of neural network, computer equipment and storage medium
WO2022037490A1 (en) * 2020-08-21 2022-02-24 北京灵汐科技有限公司 Computation method and apparatus for neural network, and computer device and storage medium
CN111985634B (en) * 2020-08-21 2024-06-14 北京灵汐科技有限公司 Operation method and device of neural network, computer equipment and storage medium
CN115185878A (en) * 2022-05-24 2022-10-14 中科驭数(北京)科技有限公司 Multi-core packet network processor architecture and task scheduling method

Also Published As

Publication number Publication date
EP3579150A4 (en) 2020-03-25
EP3620992B1 (en) 2024-05-29
EP3620992A1 (en) 2020-03-11
CN109409515B (en) 2022-08-09
EP3624018B1 (en) 2022-03-23
CN109409515A (en) 2019-03-01
CN109219821A (en) 2019-01-15
EP3627437A1 (en) 2020-03-25
EP3627437B1 (en) 2022-11-09
EP3633526A1 (en) 2020-04-08
US10896369B2 (en) 2021-01-19
CN109219821B (en) 2023-03-31
EP3579150B1 (en) 2022-03-23
EP3624018A1 (en) 2020-03-18
US20190205739A1 (en) 2019-07-04
CN109344965A (en) 2019-02-15
EP4372620A2 (en) 2024-05-22
EP3579150A1 (en) 2019-12-11
WO2018184570A1 (en) 2018-10-11

Similar Documents

Publication Publication Date Title
CN109359736A (en) Network processing unit and network operations method
Chang et al. Recurrent neural networks hardware implementation on FPGA
CN109997154A (en) Information processing method and terminal device
CN110036369A (en) Calculation method and related product
CN107657581A (en) Convolutional neural network CNN hardware accelerator and acceleration method
CN108874745A (en) Primary tensor processor and segmentation of tensor contractions
CN109284823A (en) Arithmetic unit and related product
US20190347542A1 (en) Neural network processor and neural network computation method
CN107025317A (en) Method and apparatus for implementing layers on a convolutional neural network accelerator
CN108875956B (en) Primary tensor processor
CN108369562A (en) Intelligent coded memory architecture with enhanced access scheduler
CN109409510A (en) Neuron circuit, chip, system and method, storage medium
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
CN110059797A (en) Computing device and related product
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
Liu et al. A cloud server oriented FPGA accelerator for LSTM recurrent neural network
CN113469326B (en) Integrated circuit device and board for executing pruning optimization in neural network model
CN109740729A (en) Operation method, device and related product
CN106407137A (en) Hardware accelerator and method of collaborative filtering recommendation algorithm based on neighborhood model
Borges AlexNet deep neural network on a many core platform
CN110766150A (en) Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
CN110472734A (en) Computing device and related product
US20240143525A1 (en) Transferring non-contiguous blocks of data using instruction-based direct-memory access (DMA)
WO2022257980A1 (en) Computing apparatus, method for implementing convolution operation by using computing apparatus, and related product
CN109471612A (en) Arithmetic unit and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination