CN111930669B - Multi-core heterogeneous intelligent processor and operation method

Info

Publication number: CN111930669B
Application number: CN202010770240.5A
Authority: CN (China)
Other versions: CN111930669A; original language: Chinese (zh)
Inventor: name withheld at the inventor's request
Assignee (original and current): Institute of Computing Technology of CAS
Legal status: Active (granted)
Prior art keywords: data, quantization, intermediate result, subunit, operated

Classifications

    • G06F15/167: Interprocessor communication using a common memory, e.g. mailbox
    • G06F7/483: Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485: Adding; subtracting
    • G06N3/061: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides a multi-core heterogeneous intelligent processor and an operation method. The multi-core heterogeneous intelligent processor includes a general purpose processor and/or at least one intelligent processor, and the intelligent processor includes a storage unit, a controller unit and an operation unit, where the storage unit stores data to be operated. In the operation method, the controller unit receives an operation instruction and parses it to obtain the address of the data to be operated and the operation corresponding to the operation instruction; the operation unit accesses the address of the data to be operated, acquires the data to be operated, and performs the operation, obtaining intermediate result data corresponding to the data to be operated from a preset table entry storage subunit based on the data to be operated, and obtaining an output result based on the intermediate result data. This can increase the operation speed and reduce the power consumption.

Description

Multi-core heterogeneous intelligent processor and operation method
Technical Field
The disclosure relates to the technical field of data processing, in particular to a multi-core heterogeneous intelligent processor and an operation method.
Background
Data processing is an essential step in neural networks, and neural network operations usually involve a large amount of data. As a result, conventional operation devices are slow when performing neural network data operations, and their power and energy consumption are high.
Disclosure of Invention
The main purpose of the present disclosure is to provide a multi-core heterogeneous intelligent processor and an operation method, which can increase the operation speed and reduce the power consumption.
To achieve the above object, a first aspect of embodiments of the present disclosure provides a multi-core heterogeneous intelligent processor, the multi-core heterogeneous intelligent processor including a general purpose processor and/or at least one intelligent processor, the intelligent processor including:
the device comprises a storage unit, a controller unit and an operation unit;
the storage unit is used for storing data to be operated;
the controller unit is used for receiving an operation instruction and analyzing the operation instruction to obtain an address and operation of data to be operated corresponding to the operation instruction;
the operation unit is used for accessing the address of the data to be operated, obtaining the data to be operated, executing the operation, obtaining the intermediate result data corresponding to the data to be operated in a preset table entry storage subunit based on the data to be operated, and obtaining an output result based on the intermediate result data.
Optionally, the controller unit is further configured to receive and parse a write entry instruction to obtain a write entry operation;
The table entry storage subunit is used for executing the table entry writing operation, writing and storing table entry data.
Optionally, the operation unit includes a table lookup subunit, configured to find, according to the data to be operated, the intermediate result data corresponding to the data to be operated in preset table entry data in a table entry storage subunit.
Optionally, the table entry data includes the intermediate result data after the neuron data subjected to the finite value quantization and the weight data are subjected to the specified operation.
Optionally, when the arithmetic operation includes a weight data multiplexing operation, the table look-up subunit is specifically configured to look up the intermediate result data in the table entry data stored in the table entry storage subunit according to the neuron data subjected to the finite value quantization.
Optionally, the table entry data includes the intermediate result data after performing the specified operation on the neuron data subjected to the finite value quantization and at least one preset data.
Optionally, when the arithmetic operation includes a neuron data multiplexing operation, the table look-up subunit is specifically configured to look up the intermediate result data in the table entry data stored in the table entry storage subunit according to the weight data.
Optionally, the operation unit further comprises an operation subunit and a quantization subunit;
the operator subunit is configured to obtain the neuron data based on the intermediate result data;
and the quantization subunit is used for carrying out finite value quantization on the neuron data to obtain the output result.
Optionally, the operation unit includes a quantization table storage subunit in which a quantization table is stored, and the quantization table stores the mapping relationship of the neuron data before and after the finite value quantization.
Optionally, the quantization subunit is specifically configured to look up the neuron data subjected to the finite value quantization according to the quantization table stored in the quantization table storage subunit, and the output result is the neuron data subjected to the finite value quantization.
Optionally, the controller unit is further configured to receive and parse a quantization table writing instruction to obtain a quantization table writing operation;
the quantization table storage subunit is configured to perform the quantization table writing operation, writing and storing the mapping relationship of the neuron data before and after the finite value quantization.
Optionally, the operation subunit includes:
a first register, configured to store a first intermediate result, the first intermediate result including first partial sum data;
a second register, configured to store a second intermediate result, the second intermediate result including second partial sum data;
a shifter, configured to perform a shift operation on the first partial sum data and send the shifted first partial sum data to a first vector adder;
the first vector adder, configured to correspondingly add the shifted first partial sum data and the intermediate result data to obtain addition result data;
a second vector adder, configured to correspondingly add the addition result data and the second partial sum data to obtain the neuron data;
a first selector, configured to select the first partial sum data from the first register based on the intermediate result data;
and a second selector, configured to select the second partial sum data from the second register based on the addition result data.
Optionally, when the data multiplexing operation includes a neuron data multiplexing operation, the first selector is specifically configured to select the first partial sum data according to the specific finite value quantization category into which the intermediate result data falls and the position of the current operation cycle.
Optionally, when the data multiplexing operation includes a neuron data multiplexing operation, the intermediate result data is sent to the first vector adder;
when the data multiplexing operation includes a weight data multiplexing operation, the intermediate result data is sent to the second vector adder.
A second aspect of the disclosed embodiments provides an operation method, the operation method being performed by a multi-core heterogeneous intelligent processor, the multi-core heterogeneous intelligent processor including a general purpose processor and/or at least one intelligent processor, the intelligent processor including: the device comprises a storage unit, a controller unit and an operation unit, wherein the storage unit stores data to be operated, and the operation method comprises the following steps:
the controller unit receives an operation instruction and analyzes the operation instruction to obtain an address and operation of data to be operated, which correspond to the operation instruction;
the operation unit accesses the address of the data to be operated, acquires the data to be operated, and performs the operation, obtaining intermediate result data corresponding to the data to be operated from a preset table entry storage subunit based on the data to be operated, and obtaining an output result based on the intermediate result data.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of an arithmetic device according to an embodiment of the disclosure;
fig. 2 is a schematic structural diagram of an arithmetic device according to an embodiment of the disclosure;
fig. 3 is a schematic structural diagram of an arithmetic device according to an embodiment of the disclosure;
FIG. 4 is a flow chart of an operation method according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram of a multi-core heterogeneous intelligent processor according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a multi-core heterogeneous intelligent processor according to an embodiment of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
The terms "first," "second," "third," and "fourth" in the description and claims of the present disclosure and in the drawings, etc. are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present disclosure. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a multi-core heterogeneous intelligent processor according to an embodiment of the present disclosure, where the multi-core heterogeneous intelligent processor includes a general purpose processor and/or at least one intelligent processor, and the intelligent processor includes:
A storage unit 101, a controller unit 102, and an arithmetic unit 103;
the storage unit 101 is configured to store data to be operated;
the controller unit 102 is configured to receive an operation instruction, and parse the operation instruction to obtain an address of data to be operated and an operation corresponding to the operation instruction;
the operation unit 103 is configured to access the address of the data to be operated, obtain the data to be operated, and perform the operation, which includes obtaining, based on the data to be operated, intermediate result data corresponding to the data to be operated from a preset table entry storage subunit 1031, and obtaining an output result based on the intermediate result data.
The storage unit 101 may store, in addition to the data to be operated, the operation instruction, the table entry data and the output result. The data to be operated, that is, the input data of the operation unit 103, includes at least one piece of input neuron data or at least one piece of weight data. Furthermore, different types of data may be stored in different storage subunits; for example, the neuron data and weight data may be stored in a matrix storage subunit, the table entry data in a vector storage subunit, and the operation instructions in an instruction storage subunit. The above is merely one illustrative division and is not limiting in this disclosure.
The operation instruction in the controller unit 102 includes an operation code and an operation field. The operation code indicates the function of the operation instruction, and the controller unit 102 confirms the operation by identifying the operation code; the operation field indicates the data information of the operation, which may be an immediate or a register number. For example, when data to be operated is to be acquired, related information such as the data address of the data to be operated may be obtained from the corresponding register according to the register number, and the data to be operated is then acquired from the storage unit 101 according to that information. By way of example, the arithmetic operations parsed from the operation instructions in the controller unit 102 may include a neuron data multiplexing operation, a weight data multiplexing operation, an addition operation, a shift operation, a selection operation, and the like. The acquired data to be operated may include, for example, neuron data and/or weight data acquired from the matrix storage subunit and table entry data acquired from the vector storage subunit; the data information may also include, for example, the on-chip and off-chip addresses to which information is to be written when writing a quantization table.
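To make the opcode/operation-field description above concrete, the following is a minimal Python sketch of how such an instruction might be decoded. The field widths, opcode values and register-indirect addressing shown here are illustrative assumptions, not the patent's actual encoding.

```python
# Minimal sketch of a controller unit decoding an operation instruction into an
# operation code and an operand. All field positions and opcode values are assumed.

OPCODES = {0x1: "neuron_reuse", 0x2: "weight_reuse", 0x3: "write_entry", 0x4: "write_quant_table"}

def decode(instruction: int, registers: list) -> dict:
    opcode = (instruction >> 24) & 0xFF          # which operation to perform
    use_register = (instruction >> 23) & 0x1     # operand is a register number or an immediate
    operand = instruction & 0x7FFFFF
    data_address = registers[operand] if use_register else operand
    return {"operation": OPCODES[opcode], "data_address": data_address}

regs = [0] * 32
regs[5] = 0x4000                                 # register 5 holds the address of the data to be operated
print(decode((0x2 << 24) | (1 << 23) | 5, regs)) # {'operation': 'weight_reuse', 'data_address': 16384}
```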
The above-mentioned entry storage subunit 1031 may be disposed outside the operation unit 103 or disposed inside the operation unit 103, which is not limited by the present disclosure. For convenience of description, the following disclosure will take an example in which the entry storage subunit 1031 is disposed in the operation unit 103. The intermediate result data stored in the table entry storage subunit 1031 may be calculated in advance before the data to be operated is acquired, and may be input in advance and stored in the table entry storage subunit 1031, or may be calculated according to the data to be operated after the data to be operated is acquired, and then the intermediate result data is written into and stored in the table entry storage subunit 1031. The present disclosure is not limited in this regard.
In an alternative, the operation unit 103 may obtain the data to be operated from a preset data input buffer, where the data to be operated may be input through a data input unit. The data input unit may specifically be one or more data I/O interfaces or I/O pins. In another alternative, the data input unit may acquire the data to be operated through DMA (Direct Memory Access). Taking the transfer of data from off-chip space to on-chip space as an example, the off-chip data address and the transfer size are obtained from the instruction, and the DMA receives the data to be operated of the specified transfer size from the off-chip data address and copies it to the on-chip storage space specified by the instruction, that is, the specified position of the data input buffer. The data input buffer may be the storage unit 101, a part of it, or another block of storage space, which is not limited here.
In an alternative, the operation unit 103 may temporarily store the output result in a preset data output buffer and then output it through a data output unit. The data output unit may be one or more data I/O interfaces or I/O pins. In another alternative, the data output unit may output the data through DMA (Direct Memory Access). Taking the transfer of data from on-chip space to off-chip space as an example, the on-chip data address and the transfer size are obtained from the instruction, and the DMA receives the data of the specified transfer size from the on-chip data address (that is, the specified position in the data output buffer) and copies it to the off-chip storage space specified by the instruction. The data output buffer may be the storage unit 101, a part of it, or another block of storage space, which is not limited here.
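As an illustration of the DMA-style transfers described in the two paragraphs above, the following Python sketch models a copy between an off-chip memory and an on-chip buffer driven by a source address, a destination address and a transfer size; all names and sizes are assumptions for illustration only.

```python
# Software model of a DMA-style copy: addresses and a transfer size taken from an
# instruction drive a block copy into a specified position of the data input buffer.

def dma_copy(off_chip: bytearray, on_chip: bytearray,
             src_addr: int, dst_addr: int, size: int) -> None:
    on_chip[dst_addr:dst_addr + size] = off_chip[src_addr:src_addr + size]

off_chip_mem = bytearray(range(256)) * 4   # stand-in for off-chip memory
on_chip_buf = bytearray(1024)              # stand-in for the data input buffer
dma_copy(off_chip_mem, on_chip_buf, src_addr=128, dst_addr=0, size=64)
print(on_chip_buf[:4])                     # bytearray(b'\x80\x81\x82\x83')
```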
In one embodiment of the present disclosure, the controller unit 102 is further configured to receive and parse an entry writing instruction to obtain an entry writing operation;
the table entry storage subunit 1031 is configured to perform the table entry writing operation, write and store table entry data.
It will be appreciated that the write entry instruction also includes an operation code and an operation field, where the operation code indicates the function of the write entry instruction, that is, writing table entry data into the table entry storage subunit 1031. The controller unit 102 confirms the write entry operation by identifying the operation code, and the operation field indicates the data information of the write entry operation, where the data information may be an immediate or a register number of the data to be written.
In one embodiment of the present disclosure, referring to fig. 2, the operation unit 103 includes a table look-up subunit 1032, configured to look up the intermediate result data corresponding to the data to be operated in the preset table entry data of the table entry storage subunit 1031 according to the data to be operated.
The table entry data may include the result of performing a specified operation on the neuron data subjected to the finite value quantization and the weight data, and may further include the result of performing a specified operation on the neuron data subjected to the finite value quantization and at least one preset data.
In one example, when the data to be operated acquired from the storage unit 101 includes the finite-value-quantized neuron data and the weight data, the corresponding intermediate result data is looked up in the preset table entry data of the table entry storage subunit 1031 according to the finite-value-quantized neuron data and the weight data. In another example, when the data to be operated acquired from the storage unit 101 includes the finite-value-quantized neuron data, the intermediate result data corresponding to the finite-value-quantized neuron data is looked up in the preset table entry data of the table entry storage subunit 1031 according to the finite-value-quantized neuron data.
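The following Python sketch illustrates the table-lookup idea just described: the table entry data is precomputed for every (quantized neuron value, weight value) pair, so the table look-up subunit can return the product without performing a multiplication at run time. The quantization levels and weight values are illustrative assumptions.

```python
# Sketch of the table-lookup subunit: a precomputed entry table maps quantized
# neuron values and weight values to their products, replacing run-time multiplies.

quantized_levels = [-1.0, -0.5, 0.5, 1.0]          # assumed finite-value quantization levels
weights = [0.25, 0.75]                             # assumed weight values

# Precompute entry data: intermediate result for every (neuron level, weight) pair.
entry_table = {(q, w): q * w for q in quantized_levels for w in weights}

def lookup(neuron_q: float, weight: float) -> float:
    return entry_table[(neuron_q, weight)]          # no multiplier needed at run time

print(lookup(0.5, 0.75))                            # 0.375
```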
In one embodiment of the present disclosure, the table entry data includes the intermediate result data after the neuron data subjected to the finite value quantization and the weight data are subjected to the specified operation. When the arithmetic operation includes a weight data multiplexing operation, the table look-up subunit 1032 is specifically configured to find the intermediate result data in the table entry data stored in the table entry storage subunit 1031 according to the neuron data subjected to the finite value quantization.
In this embodiment, the specified operation may be a multiplication operation, an addition operation, or the like. Taking a multiplication operation as an example, when the data to be operated includes weight data and neuron data subjected to finite value quantization, a neural network multiplication operation is performed, and the operation includes a weight data multiplexing operation, the same multiplexed weight data is multiplied by different pieces of finite-value-quantized neuron data; the table look-up subunit 1032 then needs to obtain from the table entry storage subunit 1031 the intermediate result data obtained by multiplying the finite-value-quantized neuron data and the weight data.
In one embodiment of the disclosure, the table entry data includes the intermediate result data after performing the specified operation on the neuron data subjected to the finite value quantization and at least one preset data. When the operation includes a neuron data multiplexing operation, the data to be operated includes weight data, and the table look-up subunit 1032 is specifically configured to look up the intermediate result data in the table entry data stored in the table entry storage subunit 1031 according to the weight data.
The at least one preset data may be N bits of 1-bit data, where each 1-bit value is 0 or 1. For example, all possible combinations of 3 bits of 1-bit data are 000, 001, 010, 011, 100, 101, 110 and 111.
In this embodiment, the specified operation may be a multiplication operation, an addition operation, or the like. For example, when the data to be operated includes weight data and finite-value-quantized neuron data, a neural network multiplication operation is performed, and the operation includes a neuron data multiplexing operation, the same finite-value-quantized neuron data is multiplexed and multiplied by different weight data; the table look-up subunit 1032 then needs to obtain from the table entry storage subunit 1031 the multiply-accumulate results of the finite-value-quantized neuron data and the N bits of 1-bit data, that is, the intermediate result data corresponding to the finite-value-quantized neuron data, where N is a configurable parameter.
For example, if the data to be operated includes finite-value-quantized neuron data Ia, the table entry data includes the results of multiplying and accumulating Ia with the 3-bit patterns 000, 001, 010, 011, 100, 101, 110 and 111; these are the intermediate result data found in the table entry storage subunit 1031.
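One plausible reading of this example, written as a hedged Python sketch: with N = 3, the entry table stores Ia multiplied by every possible 3-bit pattern, so a weight can later be processed one group of bits at a time using only lookups, shifts and additions. N, the value of Ia, the weight value and the bit-group interpretation are all assumptions made for illustration.

```python
# Sketch: precompute Ia times every N-bit pattern, then multiply by a wider weight
# using only table lookups, shifts and adds (no hardware multiplier).

N = 3
Ia = 0.5                                             # an assumed finite-value-quantized neuron value

# Entry data: intermediate result for every N-bit pattern 000..111.
entry_table = {bits: Ia * bits for bits in range(2 ** N)}

def multiply_by_lookup(weight_bits: int, width: int = 6) -> float:
    """Multiply Ia by an unsigned `width`-bit weight using lookups, shifts and adds."""
    acc = 0.0
    for group in range(0, width, N):
        chunk = (weight_bits >> group) & (2 ** N - 1)
        acc += entry_table[chunk] * (1 << group)     # the shift stands in for the shifter in fig. 3
    return acc

print(multiply_by_lookup(0b101101))                  # 22.5 == Ia * 45
```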
Optionally, the table entry storage subunit 1031 may be further divided into a first table entry storage subunit and a second table entry storage subunit, where the first table entry storage subunit stores the intermediate result data after the specified operation is performed on the neuron data subjected to the finite value quantization and the weight data, and the second table entry storage subunit stores the intermediate result data after the specified operation is performed on the neuron data subjected to the finite value quantization and at least one preset data. When the operation includes the weight data multiplexing operation, the intermediate result data is obtained directly from the first table entry storage subunit; when the operation includes the neuron data multiplexing operation, the intermediate result data is obtained directly from the second table entry storage subunit.
In one embodiment of the disclosure, the operation unit 103 further includes an operation subunit 1033 and a quantization subunit 1034; the operator subunit 1033 is configured to obtain the neuron data based on the intermediate result data; the quantization subunit 1034 is configured to perform finite quantization on the neuron data, to obtain the output result.
In this embodiment, since the intermediate result of the data to be operated obtained from the entry storage subunit 1031 is already data after the multiply-accumulate operation, the operation subunit 1033 does not need to perform multiplication operation on the intermediate result data any more, and only needs simple addition operation, shift operation, etc., so as to obtain the operation result obtained after the data to be operated performs the operation, that is, the neuron data. Then, after the neuron data passes through the quantization subunit 1034, the quantization subunit 1034 performs finite quantization on the neuron data, so as to obtain the neuron data after finite quantization, i.e. the output result.
In one embodiment of the disclosure, the operation unit 103 includes a quantization table storage subunit 1035, where the quantization table storage subunit 1035 stores a quantization table in which a mapping relationship between the neuron data before and after the finite value quantization is stored.
In this embodiment, the quantization table in the quantization table storage subunit 1035 stores the mapping relationship of at least one piece of neuron data before and after the finite value quantization, including the mapping relationship, before and after the finite value quantization, of the neuron data obtained after the operation of the present disclosure is executed. The mapping relationship may be written into the quantization table in the quantization table storage subunit 1035 before the neuron data is input into the quantization table storage subunit 1035, or before the arithmetic operation is performed. The present disclosure is not limited in this regard.
In one embodiment of the present disclosure, the quantization subunit 1034 is specifically configured to look up the neuron data subjected to the finite value quantization according to the quantization table stored in the quantization table storage subunit 1035, and the output result is the neuron data subjected to the finite value quantization.
For example, the mapping relationship of the neuron data before and after the finite value quantization stored in the quantization table is I1-Ia (I1 is the neuron data and Ia is the neuron data after the finite value quantization). The quantization subunit 1034 looks up the finite-value-quantized neuron data Ia according to the quantization table.
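A minimal sketch of this lookup, assuming the quantization table simply maps each neuron value that can occur to its finite-value-quantized counterpart (the patent's example pairs are written as I1-Ia, I2-Ib); the numeric values below are illustrative assumptions.

```python
# Sketch of the quantization subunit: quantization is a table lookup, not arithmetic.

quantization_table = {2.37: 2.0, -0.81: -1.0, 0.12: 0.0}   # neuron data -> quantized neuron data

def quantize(neuron_value: float) -> float:
    return quantization_table[neuron_value]

print(quantize(-0.81))   # -1.0
```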
In one embodiment of the present disclosure, the controller unit 102 is further configured to receive and parse a quantization table writing instruction to obtain a quantization table writing operation; the quantization table storage subunit 1035 is configured to perform the quantization table writing operation, write and store the mapping relationship between the neuron data and the neuron data before and after the finite quantization.
In this embodiment, the controller unit 102 receives and parses the quantization table writing instruction to obtain the quantization table writing operation; the quantization table storage subunit 1035 then performs the quantization table writing operation, writing and storing the mapping relationship of the neuron data before and after the finite value quantization.
It is understood that the quantization table writing instruction also includes an operation code and an operation field. The operation code indicates the function of the quantization table writing instruction, that is, writing the mapping relationship of the neuron data before and after the finite value quantization into the quantization table storage subunit 1035. The controller unit 102 confirms the quantization table writing operation by identifying the operation code, and the operation field indicates the data information of the quantization table writing operation, where the data information may be the neuron data, the finite-value-quantized neuron data, and an immediate or register number of the mapping relationship parameters.
It is understood that the mapping relationship of the neuron data before and after the finite value quantization involves the neuron data, the mapping relationship parameters, and the finite-value-quantized neuron data. When the mapping relationship is written and stored, the neuron data and the finite-value-quantized neuron data may be stored as pairs. For example, if I1 and I2 are neuron data, the finite-value-quantized neuron data of I1 is Ia, and the finite-value-quantized neuron data of I2 is Ib, and the mapping relationship parameters are stored separately, the pairs are stored in the form I1-Ia, I2-Ib, and so on. Alternatively, the neuron data, the mapping relationship parameters and the finite-value-quantized neuron data may be stored together as a group; taking the mapping relationship parameter as T(x) or G(x) as an example, the groups may be stored as I1-T(x)-Ia, I2-T(x)-Ib, I1-G(x)-Ia and I2-G(x)-Ib. It can be appreciated that, within the same trained neural network, the neuron data may be quantized to finite values using the same mapping relationship parameters. The type, number and form of the mapping relationship parameters are not limited in this disclosure; the mapping relationship parameters may be one or more numerical coefficients, one or more mapping functions, or a combination thereof. For example, the mapping relationship parameter may be T(x) or G(x) as in the above example, or the mapping relationship parameters may include an exp(x) function, a preset function T(x) and an optional constant L, which together form the mapping relationship G(x) = exp((1/L) × T(x)).
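The following hedged sketch shows one way the mapping relationship parameters described above could be used to fill a quantization table, using a preset function T(x), a constant L and the composed mapping G(x) = exp((1/L) × T(x)). The choice of T, the value of L, the candidate neuron values and the final snapping to a finite value set are all assumptions made for illustration, not the patent's prescribed method.

```python
# Hedged sketch: build a quantization table from assumed mapping-relationship parameters.
import math

L = 4.0                                   # assumed constant
def T(x: float) -> float:                 # assumed preset function: a simple clamp
    return max(-8.0, min(8.0, x))

def G(x: float) -> float:                 # composed mapping G(x) = exp((1/L) * T(x))
    return math.exp((1.0 / L) * T(x))

levels = [0.25, 0.5, 1.0, 2.0, 4.0]       # assumed finite value set
def quantize(x: float) -> float:
    return min(levels, key=lambda lv: abs(lv - G(x)))   # snap G(x) to the nearest finite level

neuron_values = [-3.0, -1.0, 0.0, 1.0, 3.0]
quantization_table = {x: quantize(x) for x in neuron_values}
print(quantization_table)   # {-3.0: 0.5, -1.0: 1.0, 0.0: 1.0, 1.0: 1.0, 3.0: 2.0}
```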
In one embodiment of the disclosure, the operator subunit 1033 includes: a first register 201 for storing a first intermediate result, the first intermediate result including first partial sum data; a second register 202 for storing a second intermediate result, the second intermediate result including second partial sum data; a shifter 203, configured to perform a shift operation on the first partial sum data and send the shifted first partial sum data to a first vector adder; the first vector adder 204, configured to correspondingly add the shifted first partial sum data and the intermediate result data to obtain addition result data; a second vector adder 205, configured to correspondingly add the addition result data and the second partial sum data to obtain the neuron data; a first selector 206 for selecting the first partial sum data from the first register 201 based on the intermediate result data; and a second selector 207 for selecting the second partial sum data from the second register 202 based on the addition result data.
Referring to fig. 3, in the operator subunit 1033, the first vector adder 204 is connected to the first register 201, the first register 201 is connected to the first selector 206, the first selector 206 is connected to the shifter 203, the shifter 203 is connected to the first vector adder 204, the first register 201 is connected to the second vector adder 205, the second vector adder 205 is connected to the second register 202, the second register 202 is connected to the second selector 207, and the second selector 207 is connected to the second vector adder 205.
In one embodiment of the present disclosure, the first selector 206 is specifically configured to select the first partial sum data based on the specific finite value quantization category into which the intermediate result data falls and the position of the current arithmetic operation cycle, wherein the intermediate result data is neuron data subjected to the finite value quantization.
Specifically, let M be the number of categories of the neuron data after the finite value quantization, and RT be the multiplexing count of the multiplexing operation. The first register 201 then stores M×RT first intermediate results, and the first selector 206 selects RT pieces of data from the M×RT first intermediate results, according to the specific one of the M quantization categories into which the intermediate result data falls and the position of the current multiplexing operation cycle, and sends them to the first vector adder 204. It will be appreciated that when the multiplexing count RT is 1, the first register 201 stores M first intermediate results, and the first selector 206 selects one piece of first partial sum data from the M first intermediate results, according to the specific one of the M quantization categories into which the intermediate result data falls and the position of the current multiplexing operation cycle, and sends it to the first vector adder 204.
In some embodiments, similarly to the first selector 206, the second selector 207 is specifically configured to select the second partial sum data based on the specific finite value quantization category into which the addition result data falls and the position of the current arithmetic operation cycle.
Specifically, let M be the number of categories of the neuron data after the finite value quantization, and RT be the multiplexing count of the multiplexing operation. The second register 202 then stores M×RT second intermediate results, and the second selector 207 selects RT pieces of data from the M×RT second intermediate results, according to the specific one of the M quantization categories into which the addition result data falls and the position of the current multiplexing operation cycle, and sends them to the second vector adder 205. It will be appreciated that when the multiplexing count RT is 1, the second register 202 stores M second intermediate results, and the second selector 207 selects one piece of second partial sum data from the M second intermediate results, according to the specific one of the M quantization categories into which the addition result data falls and the position of the current multiplexing operation cycle, and sends it to the second vector adder 205.
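A hedged sketch of the selection just described: a register holds M×RT partial sums, and the selector picks the RT entries that belong to the quantization category of the incoming data, with the position of the current multiplexing cycle determining their order of use. The category-major layout and the rotation rule are assumptions made for illustration.

```python
# Sketch of a selector picking RT partial sums out of M*RT register entries.

M, RT = 4, 2                                             # M quantization categories, multiplexing count RT
register = [0.0] * (M * RT)                              # M*RT partial sums (intermediate results)

def select_partial_sums(category: int, cycle_pos: int) -> list:
    row = register[category * RT:(category + 1) * RT]   # the RT partial sums of this category
    k = cycle_pos % RT
    return row[k:] + row[:k]                             # the cycle position rotates the order of use

print(select_partial_sums(category=2, cycle_pos=1))      # [0.0, 0.0] for the all-zero initial state
```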
In one embodiment of the present disclosure, when the data multiplexing operation comprises a neuron data multiplexing operation, the intermediate result data is sent to the first vector adder 204; when the data multiplexing operation includes a weight data multiplexing operation, the intermediate result data is sent to the second vector adder 205.
The execution flow of the operator subunit 1033 provided in the present disclosure is described below, assuming the neuron data subjected to the finite value quantization falls into M cases.
In the case where the arithmetic operation includes a neuron data multiplexing operation, the intermediate result data output by the table look-up subunit 1032 is input to the first vector adder 204. The first selector 206 selects the first partial sum data from the first register 201 based on the intermediate result data, the shifter 203 performs a shift operation on the first partial sum data and sends the shifted first partial sum data to the first vector adder 204, and the first vector adder 204 adds the intermediate result data and the shifted first partial sum data to obtain addition result data. The addition result data is then written back to the location in the first register 201 from which the first partial sum data was fetched. The first register 201 inputs the addition result data to the second vector adder 205, and the second selector 207 selects the second partial sum data from the second register 202 based on the addition result data and inputs it to the second vector adder 205. The second vector adder 205 adds the addition result data and the second partial sum data to obtain the above-mentioned operation result, that is, the neuron data. The neuron data is then written to the location in the second register 202 from which the second partial sum data was fetched.
In the case where the arithmetic operation includes a weight data multiplexing operation, the intermediate result data output by the table look-up subunit 1032 is input directly to the second vector adder 205, and the second selector 207 selects the second partial sum data from the second register 202 based on the intermediate result data and inputs it to the second vector adder 205. The second vector adder 205 adds the intermediate result data and the second partial sum data to obtain the above-mentioned operation result, that is, the neuron data. The neuron data is then written to the location in the second register 202 from which the second partial sum data was fetched.
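The two flows above can be summarized in a small software model of fig. 3. The register sizes, the shift amount and the way a category indexes the registers are assumptions; the point is only the order in which the selectors, the shifter and the two vector adders are traversed in the neuron-reuse and weight-reuse cases.

```python
# Software model of the operator subunit dataflow (fig. 3), under stated assumptions.

class OperatorSubunit:
    def __init__(self, m_categories: int, shift: int = 1):
        self.first_register = [0.0] * m_categories     # first partial sums
        self.second_register = [0.0] * m_categories    # second partial sums
        self.shift = shift

    def neuron_reuse_step(self, intermediate: float, category: int) -> float:
        first = self.first_register[category]          # first selector
        shifted = first * (1 << self.shift)            # shifter
        added = shifted + intermediate                 # first vector adder
        self.first_register[category] = added          # write back the addition result
        second = self.second_register[category]        # second selector
        neuron = added + second                        # second vector adder
        self.second_register[category] = neuron        # write back the neuron data
        return neuron

    def weight_reuse_step(self, intermediate: float, category: int) -> float:
        second = self.second_register[category]        # second selector
        neuron = intermediate + second                 # second vector adder only
        self.second_register[category] = neuron
        return neuron

unit = OperatorSubunit(m_categories=4)
print(unit.neuron_reuse_step(0.375, category=2))       # 0.375
print(unit.weight_reuse_step(0.125, category=2))       # 0.5
```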
Referring to fig. 4, an embodiment of the present disclosure further provides an operation method, where the operation method is performed by the multi-core heterogeneous intelligent processor shown in fig. 1. The multi-core heterogeneous intelligent processor includes a general purpose processor and/or at least one intelligent processor, the intelligent processor includes a storage unit, a controller unit and an operation unit, the storage unit stores the data to be operated, and the method includes the following steps:
s401, a controller unit receives an operation instruction and analyzes the operation instruction to obtain an address and operation of data to be operated, which correspond to the operation instruction;
S402, the operation unit accesses the address of the data to be operated, acquires the data to be operated, and performs the operation, obtaining intermediate result data corresponding to the data to be operated from a preset table entry storage subunit based on the data to be operated, and obtaining an output result based on the intermediate result data.
In one embodiment of the disclosure, the controller unit receives and parses a write entry instruction to obtain a write entry operation; and the table entry storage subunit executes the table entry writing operation, and writes and stores table entry data.
In one embodiment of the disclosure, a table look-up subunit of the operation unit searches, according to the data to be operated, the intermediate result data corresponding to the data to be operated in preset table entry data of a table entry storage subunit.
In one embodiment of the present disclosure, the table entry data includes the intermediate result data after the neuron data subjected to the finite value quantization and the weight data are subjected to the specified operation.
In one embodiment of the present disclosure, when the arithmetic operation includes a weight data multiplexing operation, the table look-up subunit looks up the intermediate result data in the table entry data stored in the table entry storage subunit according to the neuron data subjected to the finite value quantization.
In one embodiment of the disclosure, the entry data includes the intermediate result data after performing a specified operation on the neuron data subjected to the finite value quantization and at least one preset data.
In one embodiment of the present disclosure, when the arithmetic operation includes a neuron data multiplexing operation, the table look-up subunit looks up the intermediate result data in the table entry data stored in the table entry storage subunit according to the weight data.
In one embodiment of the disclosure, the operation unit further includes an operation subunit and a quantization subunit; the operator subunit obtains the neuron data based on the intermediate result data; and the quantization subunit performs finite value quantization on the neuron data to obtain the output result.
In one embodiment of the disclosure, the operation unit includes a quantization table storage subunit in which a quantization table is stored, and the quantization table stores the mapping relationship of the neuron data before and after the finite value quantization.
In one embodiment of the disclosure, the quantization subunit searches the neuron data subjected to finite value quantization according to the quantization table stored in the quantization table storage subunit, and the output result is the neuron data subjected to finite value quantization.
In one embodiment of the disclosure, the controller unit receives and parses a write quantization table instruction to obtain a write quantization table operation; and the quantization table storage subunit executes the quantization table writing operation, and writes and stores the mapping relation before and after the neuron data is quantized in a limited value.
In one embodiment of the disclosure, a first register in the operator subunit is configured to store a first intermediate result, where the first intermediate result includes first partial sum data;
a second register in the operator subunit is configured to store a second intermediate result, where the second intermediate result includes second partial sum data; the shifter in the operator subunit performs a shift operation on the first partial sum data and sends the shifted first partial sum data to a first vector adder; the first vector adder in the operator subunit correspondingly adds the shifted first partial sum data and the intermediate result data to obtain addition result data; a second vector adder in the operator subunit correspondingly adds the addition result data and the second partial sum data to obtain the neuron data; a first selector in the operator subunit selects the first partial sum data from the first register according to the intermediate result data; and a second selector in the operator subunit selects the second partial sum data from the second register according to the addition result data.
In one embodiment of the present disclosure, when the data multiplexing operation includes a neuron data multiplexing operation, the first selector selects the first partial sum data according to the specific finite value quantization category into which the intermediate result data falls and the position of the current operation cycle.
In one embodiment of the present disclosure, when the data multiplexing operation includes a neuron data multiplexing operation, the intermediate result data is sent to the first vector adder; when the data multiplexing operation includes a weight data multiplexing operation, the intermediate result data is sent to the second vector adder.
In the multi-core heterogeneous intelligent processor provided by the embodiments of the disclosure, when a plurality of processing cores jointly process the same task, the neuron data and/or the weights are multiplexed among them.
The present disclosure provides a multi-core heterogeneous intelligent processor, including a general purpose processor and/or at least one multi-core intelligent processor as shown in fig. 5 or fig. 6; the general purpose processor is used for generating program instructions, and the multi-core intelligent processor is used for receiving the program instructions and completing the operation according to the program instructions.
The multi-core heterogeneous intelligent processor provided by the present disclosure includes a memory, a buffer and a heterogeneous core. The memory is used for storing the data to be operated (hereinafter referred to as data) and the operation instructions (hereinafter referred to as instructions) of the neural network operation. The buffer is connected to the memory through a memory bus. The heterogeneous core is connected to the buffer through a buffer bus; it reads the data and instructions of the neural network operation through the buffer, completes the neural network operation, sends the operation result back to the buffer, and controls the buffer to write the operation result back to the memory.
Here, the heterogeneous core means a set of cores that includes at least two different types of core, that is, cores of two different structures.
In some embodiments, the heterogeneous core includes: a plurality of operation cores, among which there are at least two different types of operation core, for performing a neural network operation or a neural network layer operation; and one or more logic control cores, for deciding, based on the data of the neural network operation, whether the neural network operation or the neural network layer operation is performed by a dedicated core and/or a general purpose core.
Further, the plurality of operation cores includes m general purpose cores and n dedicated cores; a dedicated core is dedicated to executing specified neural network/neural network layer operations, and a general purpose core is used for executing arbitrary neural network/neural network layer operations. Optionally, the general purpose core may be a CPU and the dedicated core may be an NPU. The dedicated cores may have the same or different structures.
In some embodiments, a buffer may also be included. The buffer comprises a shared buffer and/or an unshared buffer; the shared buffer is correspondingly connected with at least two cores in the heterogeneous cores through a buffer bus; the non-shared buffer is correspondingly connected with one core in the heterogeneous cores through a buffer bus. The buffer may be any structure of a scratch pad memory, a cache memory, etc., as this disclosure is not limited in this regard.
In particular, the buffer may include only one or more shared buffers, each connected to multiple cores (logic control cores, dedicated cores or general purpose cores) of the heterogeneous core. The buffer may also include only one or more non-shared buffers, each connected to one core (a logic control core, dedicated core or general purpose core) of the heterogeneous core. The buffer may also include one or more shared buffers and one or more non-shared buffers, where each shared buffer is connected to multiple cores of the heterogeneous core and each non-shared buffer is connected to one core (a logic control core, dedicated core or general purpose core) of the heterogeneous core.
In some embodiments, the logic control core is connected to the buffer through a buffer bus, reads the data of the neural network operation through the buffer, and decides, according to the type and parameters of the neural network model in the data of the neural network operation, whether a dedicated core and/or a general purpose core is taken as the target core to perform the neural network operation and/or the neural network layer operation. A direct path may also be added between the cores, so that the logic control core can send signals to the target core directly through a control bus, or send signals to the target core through the buffer, thereby controlling the target core to perform the neural network operation and/or the neural network layer operation.
The multi-core heterogeneous intelligent processor provided by the embodiment of the present disclosure may also refer to fig. 5, including: memory 11, non-shared buffer 12, and heterogeneous core 13.
The memory 11 is used for storing data and instructions of neural network operation, wherein the data comprises weight values, input neuron data, output neuron data, bias, gradient, types and parameters of a neural network model and the like. Of course, the output neuron data may not be stored in the memory; the operation instructions comprise various instructions corresponding to the neural network operation, such as a data multiplexing instruction, a write operation table instruction and the like. Data and instructions stored in memory 11 may be transferred to heterogeneous core 13 via non-shared cache 12.
The non-shared buffer 12 includes a plurality of buffers 121, each buffer 121 is connected to the memory 11 through a memory bus, and is connected to the heterogeneous core 13 through a buffer bus, so as to implement data exchange between the heterogeneous core 13 and the non-shared buffer 12, and between the non-shared buffer 12 and the memory 11. When the neural network operation data or instructions required by the heterogeneous core 13 are not stored in the non-shared buffer 12, the non-shared buffer 12 reads the required data or instructions from the memory 11 through the memory bus, and then feeds them into the heterogeneous core 13 through the buffer bus.
The heterogeneous core 13 is configured to read instructions and data of the neural network operation from the non-shared buffer 12, complete the neural network operation, send the operation result back to the non-shared buffer 12, and control the non-shared buffer 12 to write the operation result back to the memory 11.
The logic control core 131 reads the neural network operation data and instructions from the non-shared buffer 12 and determines, according to the type and parameters of the neural network model in the data, whether there is a dedicated core 133 that supports the neural network operation and can handle its scale. If so, the neural network operation is handed to the corresponding dedicated core 133; if not, it is handed to a general purpose core 132. In order to determine the location of each dedicated core and whether it is idle, a table (called the dedicated/general core information table) can be maintained for each type of core (dedicated cores supporting the same layer belong to one type, and general purpose cores belong to one type). The table records the number (or address) of each core of that type and whether it is currently idle; all cores are initially idle, and the idle state is then updated through direct or indirect communication between the logic control core and the cores. The core numbers in the table can be obtained by a single scan when the network processor is initialized, which supports dynamically configurable heterogeneous cores (that is, the type and number of dedicated processors in the heterogeneous core can be changed at any time, and the core information table is scanned and updated after a change). Alternatively, dynamic configuration of the heterogeneous core may not be supported; in that case the core numbers in the table are fixed and no repeated scanning and updating is needed. Alternatively, if the numbers of the dedicated cores of each type are always consecutive, a base address can be recorded and the dedicated cores represented by consecutive bits, with a 0 or 1 indicating whether each is idle.
In order to determine the type and parameters of the network model, a decoder can be provided in the logic control core to determine the type of a network layer from the instruction, to determine whether an instruction targets a general purpose core or a dedicated core, and to parse parameters, data addresses and the like from the instruction. Optionally, the data may further include a data header containing the number and scale of each network layer and the addresses of the corresponding computation data and instructions, and a dedicated parser (software or hardware) may be provided to parse this information. Optionally, the parsed information is stored in a designated area.
In order to determine which core to use according to the parsed network layer number and scale, a content addressable memory (CAM) can be provided in the logic control core, and its contents can be configurable, which requires the logic control core to provide instructions to configure/write the CAM. Each CAM entry contains the network layer number, the maximum scale that can be supported in each dimension, the address of the dedicated core information table for cores supporting that layer, and the address of the general core information table. Under this scheme, the parsed layer number is used to find the corresponding entry and the scale limits are compared. If a matching entry is found and the scale is within the limit, the address of the dedicated core information table is taken, an idle dedicated core is searched for, a control signal is sent according to its number, and the computation task is assigned to it. If no corresponding layer is found in the CAM, or the scale limit is exceeded, or there is no idle core in the dedicated core information table, an idle general purpose core is searched for in the general core information table, a control signal is sent according to its number, and the computation task is assigned to it. If no idle core is found in either table, the task and some necessary information are added to a waiting queue; once a core that can compute the task becomes idle, the task is assigned to it for computation.
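A hedged Python sketch of this dispatch logic follows: the logic control core looks up the parsed layer type in a CAM-like table, checks the scale limit, tries an idle dedicated core, falls back to an idle general purpose core, and otherwise queues the task. The table contents, layer names and the scheduling policy shown are illustrative assumptions, not the patent's exact scheme.

```python
# Sketch of CAM-based dispatch by the logic control core (all data is illustrative).
from collections import deque

cam = {"conv": {"max_size": 1024, "dedicated": [0, 1]},    # layer type -> supported scale and core ids
       "fc":   {"max_size": 4096, "dedicated": [2]}}
dedicated_free = {0: True, 1: False, 2: True}              # dedicated-core information table
general_free = {10: True, 11: True}                        # general-core information table
wait_queue = deque()

def dispatch(layer_type, size):
    entry = cam.get(layer_type)
    if entry and size <= entry["max_size"]:
        for core in entry["dedicated"]:
            if dedicated_free[core]:
                dedicated_free[core] = False
                return core                                # assign to an idle dedicated core
    for core, free in general_free.items():
        if free:
            general_free[core] = False
            return core                                    # fall back to an idle general purpose core
    wait_queue.append((layer_type, size))                  # no idle core: queue the task
    return None

print(dispatch("conv", 512))   # 0
```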
Of course, there may be many ways of determining the position of a dedicated core and whether it is idle, and the ways described above are merely illustrative. Each dedicated core 133 may independently perform a neural network operation, such as a spiking neural network (SNN) operation, write the operation result back to its correspondingly connected buffer 121, and control the buffer 121 to write the operation result back to the memory 11.
The general-purpose core 132 can independently complete a neural network operation whose scale exceeds what the dedicated cores can support, or which no dedicated core 133 supports, write the operation result back to its correspondingly connected buffer 121, and control the buffer 121 to write the operation result back to the memory 11.
One embodiment of the present disclosure proposes a multi-core heterogeneous intelligent processor, referring to fig. 6, including: memory 21, shared buffer 22, and heterogeneous core 23.
The memory 21 is used for storing the data and instructions of the neural network operation, where the data include biases, weights, input data, output data, and the type and parameters of the neural network model, and the instructions include the various instructions corresponding to the neural network operation. The data and instructions stored in the memory are transferred to the heterogeneous core 23 via the shared buffer 22.
The shared buffer 22 is connected to the memory 21 through a memory bus, and is connected to the heterogeneous core 23 through a shared buffer bus, so that data exchange between the heterogeneous core 23 and the shared buffer 22 and between the shared buffer 22 and the memory 21 is realized.
When neural network operation data or instructions required by the heterogeneous core 23 are not present in the shared buffer 22, the shared buffer 22 first reads the required data or instructions from the memory 21 over the memory bus and then feeds them to the heterogeneous core 23 over the buffer bus.
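Purely as a software analogy of the behaviour just described (the real shared buffer is a hardware cache controlled by buses, not a dictionary), a read-through sketch might look like this; all names are illustrative.

```python
class SharedBuffer:
    """Toy read-through model of a shared buffer backed by a memory."""

    def __init__(self, memory):
        self.memory = memory   # backing store (the memory)
        self.lines = {}        # address -> data currently held in the buffer

    def read(self, address):
        if address not in self.lines:
            # Miss: fetch over the memory bus, then serve over the buffer bus.
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, data):
        # Operation results are written here first ...
        self.lines[address] = data

    def write_back(self, address):
        # ... and flushed to memory when the logic control core requests it.
        self.memory[address] = self.lines[address]
```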
The heterogeneous core 23 includes a logic control core 231, a plurality of general-purpose cores 232, and a plurality of special-purpose cores 233, where the logic control core 231, the general-purpose cores 232, and the special-purpose cores 233 are all connected to the shared buffer 22 through a buffer bus.
The heterogeneous core 23 is used for reading the neural network operation data and instructions from the shared buffer 22, completing the neural network operation, and sending the operation result back to the shared buffer 22, and controlling the shared buffer 22 to write the operation result back to the memory 21.
In addition, when data transmission is required between the logic control core 231 and the general core 232, between the logic control core 231 and the special core 233, between the general core 232 and between the special cores 233, the core sending data may transmit data to the shared buffer 22 through the shared buffer bus first, and then transmit data to the core receiving data, without going through the memory 21.
For a neural network operation, the neural network model generally comprises a plurality of neural network layers; each neural network layer performs its operation on the operation result of the previous layer and outputs its own result to the next layer, and the operation result of the last layer is the result of the whole neural network operation. In the multi-core heterogeneous intelligent processor of this embodiment, both the general-purpose cores 232 and the dedicated cores 233 can execute the operation of one neural network layer, and the logic control core 231, the general-purpose cores 232 and the dedicated cores 233 jointly complete the neural network operation; for convenience of description, the operation of a neural network layer is hereinafter referred to simply as a layer.
Each dedicated core 233 can independently execute the operation of one layer, for example a convolution operation, a fully-connected layer, a concatenation operation, an element-wise add/multiply operation, a ReLU operation, a pooling operation, or a Batch Norm operation, provided the scale of the neural network operation layer is not too large, i.e. does not exceed the scale that the corresponding dedicated core can support; in other words, a dedicated core's operation is limited in the number of neurons and synapses of the layer. After the layer operation is finished, the operation result is written back to the shared buffer 22.
The general-purpose core 232 is used to execute layer operations that exceed the operation scale the dedicated cores 233 can support, or that no dedicated core supports, write the operation results back to the shared buffer 22, and control the shared buffer 22 to write the operation results back to the memory 21.
Further, after the special core 233 and the general core 232 write the operation result back to the memory 21, the logic control core 231 sends a start operation signal to the special core or the general core that performs the next operation, and notifies the special core or the general core that performs the next operation to start operation.
Further, a dedicated core 233 or general-purpose core 232 starts an operation upon receiving the start operation signal sent by the dedicated or general-purpose core that executed the previous layer, provided that no layer operation is currently in progress; if a layer operation is in progress, the core first completes the current layer operation, writes its result back to the shared buffer 22, and then starts the new operation.
The logic control core 231 reads the neural network operation data from the shared buffer 22 and, from the type and parameters of the neural network model contained therein, analyzes each layer of the neural network model. For each layer it determines whether there is a dedicated core 233 that supports the layer's operation and can complete the layer's operation scale; if so, the layer's operation is handed to the corresponding dedicated core 233, and if not, the layer's operation is handed to a general-purpose core 232. The logic control core 231 also sets the addresses of the data and instructions required by the general-purpose cores 232 and dedicated cores 233 for their layer operations, and the general-purpose cores 232 and dedicated cores 233 read the data and instructions from the corresponding addresses and execute the layer operations.
For the dedicated core 233 or general-purpose core 232 that executes the first layer, the logic control core 231 sends a start operation signal when the operation begins; after the neural network operation ends, the dedicated core 233 or general-purpose core 232 that executed the last layer sends a start operation signal to the logic control core 231, and upon receiving it the logic control core 231 controls the shared buffer 22 to write the operation result back to the memory 21.
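The start-signal chain described in the preceding paragraphs can be paraphrased by the sequential Python sketch below; it only mimics the handshake and omits the real hardware concurrency, and the core objects and their method names are assumptions rather than the patented interface.

```python
def run_network(layers, logic_control_core, shared_buffer, memory):
    # layers: list of (core, layer_descriptor) pairs chosen by the dispatch step.
    logic_control_core.send_start_signal(layers[0][0])      # kick off the first layer
    for i, (core, layer) in enumerate(layers):
        core.wait_for_start_signal()
        core.compute_layer(layer)            # inputs are read from the shared buffer
        core.write_result(shared_buffer)     # layer result stays in the shared buffer
        if i + 1 < len(layers):
            core.send_start_signal(layers[i + 1][0])    # wake the next layer's core
        else:
            core.send_start_signal(logic_control_core)  # last layer notifies control
    logic_control_core.wait_for_start_signal()
    shared_buffer.flush_result_to(memory)    # final result is written back to memory
```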
It should be noted that, in the present disclosure, the number of logic control cores, the number of special cores, the number of general cores, the number of shared or non-shared buffers, and the number of memories are not limited, and may be appropriately adjusted according to the specific requirements of the neural network operation.
In this embodiment, heterogeneous cores are used to perform the neural network operation, so different cores can be selected for the computation according to the type and scale of the actual neural network; this makes full use of the actual computing capability of the hardware and reduces cost and power consumption. Having different cores execute different layers allows the layers to be computed in parallel, fully exploiting the parallelism of the neural network and improving the efficiency of the neural network operation.
In some embodiments, the electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device. The vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a logical functional division, and there may be other ways of dividing them in an actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned memory includes: a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by a program that instructs associated hardware, and the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
It should be noted that all units or modules provided in the present disclosure may be hardware circuits; for example, the operation subunit may be an operation circuit, the quantization subunit may be a quantization circuit, the table entry storage subunit may be a table entry storage circuit, and so on. The operation circuit may include, for example, a first register circuit, a second register circuit, a shift circuit, a first vector addition circuit, a second vector addition circuit, a first selection circuit, and a second selection circuit. The first register circuit is used to store a first intermediate result, which includes first partial-sum data; the second register circuit is used to store a second intermediate result, which includes second partial-sum data; the shift circuit is used to shift the first partial-sum data and send the shifted first partial-sum data to the first vector addition circuit; the first vector addition circuit is used to correspondingly add the shifted first partial-sum data and the intermediate result data to obtain addition result data; the second vector addition circuit is used to correspondingly add the addition result data and the second partial-sum data to obtain the neuron data; the first selection circuit is used to select first partial-sum data from the first register according to the intermediate result data; and the second selection circuit is used to select second partial-sum data from the second register according to the addition result data.
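As a rough functional analogy of how the listed circuits could cooperate in one operation cycle (the selection rule, register contents and shift amount below are placeholders, not taken from the disclosure), consider the following sketch.

```python
import numpy as np

def select_index(vector):
    # Placeholder selection rule (an assumption): keyed here by the first
    # element; real hardware would decode dedicated bits of the data.
    return int(vector[0]) % 4

def operation_circuit_pass(intermediate_result, first_register, second_register,
                           shift_amount):
    # First selection circuit: pick first partial-sum data from the first
    # register according to the intermediate result data.
    first_partial = first_register[select_index(intermediate_result)]

    # Shift circuit: shift the selected first partial-sum data.
    shifted = np.left_shift(first_partial, shift_amount)

    # First vector addition circuit: element-wise add of the shifted first
    # partial-sum data and the intermediate result data.
    addition_result = shifted + intermediate_result

    # Second selection circuit: pick second partial-sum data according to the
    # addition result data.
    second_partial = second_register[select_index(addition_result)]

    # Second vector addition circuit: element-wise add producing the neuron data.
    return addition_result + second_partial

# Toy example with 4-element vectors and four partial-sum slots per register.
first_reg = [np.arange(4, dtype=np.int64) + i for i in range(4)]
second_reg = [np.arange(4, dtype=np.int64) * i for i in range(4)]
neurons = operation_circuit_pass(np.array([3, 1, 2, 0]), first_reg, second_reg, 2)
```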
The foregoing describes the embodiments of the present invention in detail; specific examples are used herein to explain the principles and implementations of the invention, and the above description of the embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, those skilled in the art may, in accordance with the idea of the present invention, make changes to the specific implementations and the scope of application; in summary, the content of this disclosure should not be construed as limiting the present invention.

Claims (13)

1. A multi-core heterogeneous intelligent processor, the multi-core heterogeneous intelligent processor comprising a general purpose processor and/or at least one intelligent processor, the intelligent processor comprising:
the device comprises a storage unit, a controller unit and an operation unit;
the storage unit is used for storing data to be operated;
the controller unit is used for receiving an operation instruction and analyzing the operation instruction to obtain an address and operation of data to be operated corresponding to the operation instruction;
the operation unit is used for accessing the address of the data to be operated, acquiring the data to be operated, executing the operation, acquiring, based on the data to be operated, the intermediate result data corresponding to the data to be operated from a preset table entry storage subunit, and obtaining an output result based on the intermediate result data;
The table entry storage subunit is used for executing a table entry writing operation and writing and storing table entry data, wherein the table entry data comprises the intermediate result data obtained after a specified operation is performed on the finite-value-quantized neuron data and the weight data;
the operation unit comprises a quantization table storage subunit, wherein a quantization table is stored in the quantization table storage subunit, and the quantization table stores the mapping relationship of the neuron data before and after finite value quantization.
2. The intelligent processor of claim 1, wherein the controller unit is further configured to receive and parse a write entry instruction to obtain a write entry operation.
3. The intelligent processor according to claim 2, wherein the operation unit includes a table look-up subunit, configured to look up the intermediate result data corresponding to the data to be operated in preset table entry data of a table entry storage subunit according to the data to be operated.
4. The intelligent processor of claim 2, wherein when the arithmetic operation comprises a weight data multiplexing operation;
and the table look-up subunit is specifically configured to look up the intermediate result data in the table item data stored in the table item storage subunit according to the neuron data subjected to the finite value quantization.
5. The intelligent processor according to claim 2, wherein the table entry data includes the intermediate result data obtained by performing a specified operation on the neuron data subjected to the finite value quantization and at least one preset data.
6. The intelligent processor of claim 5, wherein when the arithmetic operation comprises a neuron data multiplexing operation;
and the table look-up subunit is specifically configured to look up the intermediate result data in the table item data stored in the table item storage subunit according to the weight data.
7. The intelligent processor of claim 1, wherein the operation unit further comprises an operation subunit and a quantization subunit;
the operation subunit is used for obtaining the neuron data based on the intermediate result data;
and the quantization subunit is used for performing finite value quantization on the neuron data to obtain the output result.
8. The intelligent processor according to claim 7, wherein the quantization subunit is specifically configured to look up the finite-value-quantized neuron data according to the quantization table stored in the quantization table storage subunit, and the output result is the finite-value-quantized neuron data.
9. The intelligent processor of claim 1, wherein the controller unit is further configured to receive and parse a write quantization table instruction to obtain a write quantization table operation;
the quantization table storage subunit is configured to perform the quantization table writing operation and to write and store the mapping relationship of the neuron data before and after the finite value quantization.
10. The intelligent processor of claim 7, wherein the operation subunit comprises:
a first register for storing a first intermediate result, the first intermediate result comprising first partial-sum data;
a second register for storing a second intermediate result, the second intermediate result comprising second partial-sum data;
a shifter for shifting the first partial-sum data and sending the shifted first partial-sum data to a first vector adder;
the first vector adder, configured to correspondingly add the shifted first partial-sum data and the intermediate result data to obtain addition result data;
a second vector adder, configured to correspondingly add the addition result data and the second partial-sum data to obtain the neuron data;
a first selector for selecting first partial-sum data from the first register according to the intermediate result data;
and a second selector for selecting second partial-sum data from the second register according to the addition result data.
11. The intelligent processor of claim 10, wherein when the data multiplexing operation comprises a neuron data multiplexing operation, the first selector is configured to select the first partial-sum data according to the finite quantization category into which the intermediate result data falls and the position of the current operation cycle.
12. The intelligent processor of claim 10, wherein the intermediate result data is sent to the first vector adder when the data multiplexing operation comprises a neuron data multiplexing operation;
when the data multiplexing operation includes a weight data multiplexing operation, the intermediate result data is sent to the second vector adder.
13. An operation method executed by a multi-core heterogeneous intelligent processor, wherein the multi-core heterogeneous intelligent processor comprises a general purpose processor and/or at least one intelligent processor, the intelligent processor comprising: the device comprises a storage unit, a controller unit and an operation unit, wherein the storage unit stores data to be operated, and the operation method comprises the following steps:
The controller unit receives an operation instruction and analyzes the operation instruction to obtain an address and operation of data to be operated, which correspond to the operation instruction;
the operation unit accesses the address of the data to be operated, acquires the data to be operated, executes the operation, acquires, based on the data to be operated, the intermediate result data corresponding to the data to be operated from a preset table entry storage subunit, and obtains an output result based on the intermediate result data;
the table entry storage subunit is used for executing a table entry writing operation and writing and storing table entry data, wherein the table entry data comprises the intermediate result data obtained after a specified operation is performed on the finite-value-quantized neuron data and the weight data;
the operation unit comprises a quantization table storage subunit, wherein a quantization table is stored in the quantization table storage subunit, and the quantization table stores the mapping relationship of the neuron data before and after finite value quantization.
CN202010770240.5A 2020-08-03 2020-08-03 Multi-core heterogeneous intelligent processor and operation method Active CN111930669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010770240.5A CN111930669B (en) 2020-08-03 2020-08-03 Multi-core heterogeneous intelligent processor and operation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010770240.5A CN111930669B (en) 2020-08-03 2020-08-03 Multi-core heterogeneous intelligent processor and operation method

Publications (2)

Publication Number Publication Date
CN111930669A CN111930669A (en) 2020-11-13
CN111930669B true CN111930669B (en) 2023-09-01

Family

ID=73306660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010770240.5A Active CN111930669B (en) 2020-08-03 2020-08-03 Multi-core heterogeneous intelligent processor and operation method

Country Status (1)

Country Link
CN (1) CN111930669B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103645954A (en) * 2013-11-21 2014-03-19 华为技术有限公司 CPU scheduling method, device and system based on heterogeneous multi-core system
CN108805792A (en) * 2017-04-28 2018-11-13 英特尔公司 Programmable coarseness with advanced scheduling and sparse matrix computing hardware
CN109359736A (en) * 2017-04-06 2019-02-19 上海寒武纪信息科技有限公司 Network processing unit and network operations method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10853074B2 (en) * 2014-05-01 2020-12-01 Netronome Systems, Inc. Table fetch processor instruction using table number to base address translation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103645954A (en) * 2013-11-21 2014-03-19 华为技术有限公司 CPU scheduling method, device and system based on heterogeneous multi-core system
CN109359736A (en) * 2017-04-06 2019-02-19 上海寒武纪信息科技有限公司 Network processing unit and network operations method
CN108805792A (en) * 2017-04-28 2018-11-13 英特尔公司 Programmable coarseness with advanced scheduling and sparse matrix computing hardware

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of an addressing and operand storage IP core based on FPGA; Li Kejian; Li Yang; Ke Baozhong; Lei Lin; Journal of Guangxi University of Science and Technology; Vol. 28 (No. 04); full text *

Also Published As

Publication number Publication date
CN111930669A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN109101273B (en) Neural network processing device and method for executing vector maximum value instruction
CN110298443B (en) Neural network operation device and method
CN109219821B (en) Arithmetic device and method
CN111381871A (en) Operation method, device and related product
CN111930668B (en) Arithmetic device, method, multi-core intelligent processor and multi-core heterogeneous intelligent processor
CN111930669B (en) Multi-core heterogeneous intelligent processor and operation method
CN116860665A (en) Address translation method executed by processor and related product
CN111079925B (en) Operation method, device and related product
CN111078284B (en) Operation method, system and related product
CN111078291B (en) Operation method, system and related product
CN112395003A (en) Operation method, device and related product
CN111400341B (en) Scalar lookup instruction processing method and device and related product
CN111381872A (en) Operation method, device and related product
CN111399905B (en) Operation method, device and related product
CN111079924B (en) Operation method, system and related product
CN111078282B (en) Operation method, device and related product
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111047035B (en) Neural network processor, chip and electronic equipment
CN111382390B (en) Operation method, device and related product
CN111078125B (en) Operation method, device and related product
CN111079907B (en) Operation method, device and related product
CN111078283B (en) Operation method, device and related product
CN113033791A (en) Computing device for order preservation, integrated circuit device, board card and order preservation method
CN113032298A (en) Computing device for order preservation, integrated circuit device, board card and order preservation method
CN115878184A (en) Method, storage medium and device for moving multiple data based on one instruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant