CN111027690B - Combined processing device, chip and method for performing deterministic reasoning - Google Patents

Combined processing device, chip and method for performing deterministic reasoning

Info

Publication number
CN111027690B
Authority
CN
China
Prior art keywords
instruction
module
vector
neural network
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911176119.3A
Other languages
Chinese (zh)
Other versions
CN111027690A (en)
Inventor
陈子祺
田甲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201911176119.3A priority Critical patent/CN111027690B/en
Publication of CN111027690A publication Critical patent/CN111027690A/en
Application granted granted Critical
Publication of CN111027690B publication Critical patent/CN111027690B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

The application relates to a combined processing device for executing deterministic reasoning, comprising an instruction fetch module, an instruction pre-decode module, an instruction execution module, a memory access module and a register write-back module. The instruction fetch module reads an instruction from the instruction memory and updates the program counter to point to the next instruction; the instruction pre-decode module decodes compressed instructions into native instructions, and the instruction decode module accesses the register file and determines branch control; the instruction execution module obtains a result by performing vector or scalar calculation for the instruction, accesses the data memory for Load/Store instructions, and calculates branches and jumps and checks them against the expected results; in the memory access module, the memory access of the pipeline is completed; the register write-back module writes the results of the execution stage to the register file. The application also relates to an arithmetic device, a computing chip and applications of a deterministic neural network algorithm comprising the combined processing device.

Description

Combined processing device, chip and method for performing deterministic reasoning
Technical Field
The application relates to a combined processing device, a chip and a method for executing deterministic reasoning, and belongs to the technical field of computers.
Background
An artificial neural network (neural network for short) is a mathematical or computational model in the fields of machine learning and cognitive science that imitates the structure and function of a biological neural network and is used to estimate or approximate functions. A neural network is made up of a large number of interconnected nodes (or neurons). Each connection between two nodes carries a weighted value, called a weight, for the signal passing through that connection; this corresponds to the memory of the artificial neural network. The output of the network differs according to its connection pattern, weight values and activation function. The network itself is usually an approximation of some algorithm or function in nature, and may also express a logical policy.
Research on artificial neural networks has been ongoing for decades. Neural networks have successfully solved many practical problems that are difficult for modern computers in fields such as pattern recognition, intelligent robotics, automatic control, predictive estimation, biology, medicine and economics, have shown good intelligent characteristics, and have driven the continuous development of information technology and artificial intelligence.
While neural networks have achieved extensive success in many areas, most neural network algorithms are designed without considering data security or computational verifiability. First, existing neural network algorithms do not guarantee the reproducibility and consistency of operations: results may differ across architectures, or even within the same computing environment. This uncertainty has multiple causes, including rounding errors of floating-point operations and contention in parallel computation. Second, existing algorithms provide no security protection for training data and reasoning results. These limitations greatly restrict the application of neural network algorithms in fields with high security requirements such as finance, trusted computing, blockchain and smart contracts. Realizing deterministic neural network computation and ensuring the security, reproducibility and credibility of the data computed by a neural network model has therefore become an urgent demand both inside and outside industry.
Disclosure of Invention
The invention aims to provide an arithmetic device and method for a computing chip capable of executing neural network calculations in a determined order, which eliminates factors such as rounding errors and parallel contention during neural network calculation, avoids inconsistent neural network results, and addresses the verifiability problem of deep neural network computation.
The combined processing device for executing deterministic reasoning comprises an instruction fetch module, an instruction pre-decode module, an instruction execution module, a memory access module and a register write-back module. The instruction fetch module reads an instruction from the instruction memory and updates the program counter to point to the next instruction; the instruction pre-decode module decodes compressed instructions into native instructions, and the instruction decode module accesses the register file and determines branch control; the instruction execution module obtains a result by performing vector or scalar calculation for the instruction, accesses the data memory for Load/Store instructions, and calculates branches and jumps and checks them against the expected results; in the memory access module, the memory access of the pipeline is completed; the register write-back module writes the results of the execution stage to the register file.
An arithmetic device of a calculation chip of a deterministic neural network algorithm according to the present application, comprising:
an integer vector operator for carrying out integer vector operations;
an execution pipeline for general purpose operations and reading, decoding and execution of instructions;
a storage unit for providing the execution pipeline access, and storing input values, output values, and temporary values for the integer vector operator to execute each instruction;
the execution pipeline is a combined processing apparatus as described above.
Preferably, the integer vector operator is provided with a temporary value storage unit for storing temporary values calculated according to instructions and for performing read and write operations on the storage unit. The integer vector operator and the storage unit are configured as random access memories, and the input values and temporary values in the integer vector operator and the storage unit are discarded after each operation instruction completes. The integer vector operator performs integer vector addition, integer vector multiply-add, and integer vector nonlinear function value calculation.
Preferably, the integer vector operator accesses the storage unit through an address index and reads and writes input values, output values and temporary values; it is invoked in the instruction execution stage of the execution pipeline, obtains the address indexes corresponding to the input, output and temporary values in the storage unit, and completes the read and write operations on the storage unit at those indexes through the address indexes allocated by the instruction execution pipeline. More preferably, an integer vector operation instruction reads the address index of each input vector in the memory from the register file and provides it, together with the address index of the output vector's placeholder, to the integer vector operator to complete the calculation.
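For illustration only, the following minimal Python sketch shows the address-indexed access pattern described above; the names (StorageUnit, vec_add) and the discard-on-completion step are simplified assumptions about the behaviour, not the claimed hardware interface.

```python
# Minimal sketch (assumed names): address-indexed storage shared by the
# execution pipeline and the integer vector operator.
class StorageUnit:
    def __init__(self, size):
        self.slots = [None] * size            # one vector per address index

    def read(self, idx):
        return self.slots[idx]

    def write(self, idx, vec):
        self.slots[idx] = vec

def vec_add(storage, in_idx_a, in_idx_b, out_idx):
    """Integer vector add: operands are fetched by address index and the
    result is written to the output placeholder's index."""
    a, b = storage.read(in_idx_a), storage.read(in_idx_b)
    storage.write(out_idx, [x + y for x, y in zip(a, b)])
    # input/temporary values are not kept once the instruction completes
    storage.write(in_idx_a, None)
    storage.write(in_idx_b, None)

storage = StorageUnit(8)
storage.write(0, [1, 2, 3])
storage.write(1, [10, 20, 30])
vec_add(storage, 0, 1, 2)                      # storage.read(2) -> [11, 22, 33]
```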
A deterministic deep neural network computing chip according to the present application, comprising: the system comprises an execution pipeline, an integer vector arithmetic unit, a data interface, an instruction interface, a branch prediction module, a data cache module and a trusted computing coprocessor;
the execution pipeline interacts with the instruction cache unit, the register file, and the control and status registers to complete instruction execution, and the branch prediction module, the data cache module and the trusted computing coprocessor serve as additional modules of the execution pipeline that provide branch prediction, secure-computing-related functions and instruction-cache-related functions; the execution pipeline is a combined processing apparatus as described above.
Preferably, the trusted computing coprocessor receives external instructions and communicates via a data interface and a communication bus, and the security algorithm is accelerated during the instruction execution phase by a security engine comprising a symmetric encryption algorithm, an asymmetric encryption algorithm, and a random number generator. Preferably, the trusted computing coprocessor comprises an encryption and decryption coprocessor, and the encryption and decryption coprocessor comprises an encryption operation unit, a decryption operation unit, a random number generator, an encryption and decryption module and a security protection module.
The application also relates to a method for calculating by using the deterministic deep neural network calculation chip, which comprises the following steps:
step 1: the neural network compiler converts the neural network computational graph into a vector operation expression list through a deterministic topological ordering algorithm; obtains, through a memory planning algorithm, the storage indexes of the placeholders corresponding to all vectors in the vector operation expression list; and stores all input vectors into the storage indexes of their corresponding placeholders for the neural network reasoning operation;
step 2: the instruction interface reads the instructions of the vector operation expression list from the instruction cache;
step 3: the controller decodes the instruction into a microinstruction for each functional component;
step 4: the operation unit obtains an input vector or scalar from the register file, determines according to the operation code and operand type whether the operation is to be completed by the integer vector operator, and writes the result back to the register file;
step 5: the integer vector operation instruction reads the address index of the input vector in the memory from the register file and provides it, together with the address index of the placeholder of the output vector, to the integer vector operator to finish the calculation;
step 6: executing memory access and register write-back according to the operation code;
and iteratively executing the steps until an operation result of the neural network calculation graph is obtained.
In step 3, the controller accesses the register file, calculates immediate values, sets branches, checks for illegal operation codes and operation code combinations, and splits vector instructions of low-frequency operators and vector instructions whose operands fall outside the vector operator's domain into a plurality of general scalar instructions through vector expansion before decoding them.
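As an illustration of the vector expansion mentioned in step 3, the sketch below (Python, with an assumed micro-instruction format) splits one vector instruction into per-element scalar instructions in a fixed lane order; it is a simplified model, not the actual decoder logic.

```python
# Minimal sketch (assumed instruction format): a vector instruction whose
# operator is low-frequency, or whose operands fall outside the vector
# operator's domain, is expanded element by element into scalar instructions.
def expand_vector_instruction(opcode, in_idx_a, in_idx_b, out_idx, length):
    """Split one vector instruction into `length` general scalar
    instructions, preserving the element order (and hence determinism)."""
    scalar_ops = []
    for lane in range(length):
        scalar_ops.append({
            "opcode": opcode,             # e.g. a scalar "div"
            "src_a": (in_idx_a, lane),    # (address index, element offset)
            "src_b": (in_idx_b, lane),
            "dst":   (out_idx, lane),
        })
    return scalar_ops

# A hypothetical vector division of length 4 becomes four scalar divisions,
# decoded and executed one by one through the pipeline.
micro_ops = expand_vector_instruction("div", 3, 4, 5, length=4)
```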
The application also relates to an electronic device comprising a deterministic deep neural network computing chip as described above.
The application also relates to a board card comprising a memory device, a communication device and a control device, and a deterministic deep neural network computing chip as described above; the computing chip is respectively connected with the storage device, the control device and the communication device.
Preferably, the board card further comprises a main board bus for connecting the memory device, the communication device and the control device, wherein the main board bus comprises a data bus, an address bus and a control bus;
the main control of the main board bus is the deterministic neural network computing chip, the computing chip sends control signals to all components through a control bus, the components needing to be accessed are designated through address signals of an address bus, and data information is transmitted through a data bus.
The deterministic deep neural network data processor, chip and electronic equipment provided by the application have the following beneficial technical effects:
(1) Through the scheduling of the execution pipeline, the instructions of the neural network computation graph are executed strictly in the topological order of the graph, eliminating the uncertainty caused by contention in parallel computation.
(2) The uncertainty caused by rounding errors of conventional floating-point vector operations is eliminated through the equivalent transformation into integer vector operators and nonlinear vector operations (a minimal quantization sketch follows this list).
(3) After the operation module generates the output value, the input and temporary values stored in the storage unit are discarded, which frees storage for subsequent calculations and improves the utilization of the storage unit.
(4) The computing device and the computing method of the application provide deterministic neural network computation.
(5) The trusted computing coprocessor ensures data security from the bottom layer, meets the requirements of high performance, high integration and miniaturization, and provides security protection for data transmission.
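A minimal sketch of the idea behind effect (2), under the assumption of a simple fixed-point scale: once operands are mapped to integers, the multiply-accumulate itself is exact integer arithmetic and therefore bit-reproducible. This is for illustration only and is not the patent's concrete equivalent transformation.

```python
# Minimal sketch, not the patent's concrete transformation: floating-point
# weights/activations are mapped to integers with a fixed scale, so that the
# multiply-accumulate is exact integer arithmetic and reproducible everywhere.
SCALE = 2 ** 8                                   # assumed fixed-point scale

def quantize(values, scale=SCALE):
    return [int(round(v * scale)) for v in values]

def int_dot(q_a, q_b):
    return sum(x * y for x, y in zip(q_a, q_b))  # exact integer accumulation

a = quantize([0.50, -1.25, 0.75])
b = quantize([1.00,  2.00, -0.50])
acc = int_dot(a, b)                              # identical on every platform
approx = acc / (SCALE * SCALE)                   # rescale only when needed
```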
Drawings
FIG. 1 illustrates one embodiment of a deterministic deep neural network computing chip of the present invention.
FIG. 2 illustrates one embodiment of a trusted computing co-processor of the present invention.
Fig. 3 depicts a computational flow of a neural network according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be arbitrarily combined with each other.
An arithmetic device of a calculation chip of a deterministic neural network algorithm according to the present invention includes:
an integer vector operator for performing integer vector addition, subtraction, multiply-add and nonlinear evaluation operations;
an execution pipeline for general operations and for instruction reading, decoding and execution;
a storage unit for providing the execution pipeline with access and for storing the input values, output values and temporary values of each instruction executed by the integer vector operator;
the integer vector operator is provided with a temporary value storage unit for storing temporary values calculated according to instructions and for performing read and write operations on the storage unit.
Preferably, the integer vector operator and the storage unit are configured as random access memories; the integer vector operator accesses the storage unit through an address index, and reads and writes input values, output values and temporary values; the input and temporary values in the integer vector operator and the storage unit are discarded after each operation instruction completes.
Preferably, the operator is invoked in the instruction execution stage of the execution pipeline; the address indexes corresponding to the input, output and temporary values in the storage unit are obtained, and the read and write operations on the storage unit at the corresponding address indexes are completed through the address indexes allocated by the instruction execution pipeline.
FIG. 1 illustrates one embodiment of a deterministic deep neural network computing chip 100 with privacy protection in accordance with the present invention. As shown, the deterministic deep neural network computing chip 100 includes an execution pipeline 101, an integer vector operator 102, a data interface 103, an instruction interface 104, a branch prediction module 105, a data cache module 106 and a trusted computing coprocessor 200. The deterministic deep neural network computing chip 100 implements a Harvard architecture for accessing both instruction and data memory. The execution pipeline 101 is a combined processing device that performs deterministic artificial intelligence reasoning. The Harvard architecture is a memory architecture that separates program instruction storage from data storage. It is a parallel architecture whose main characteristic is that programs and data are stored in different memory spaces, i.e. the program memory and the data memory are two independent memories, each addressed and accessed independently.
The deterministic deep neural network computing chip 100 of the present application has an optimized, folded 6-stage pipeline that optimizes the overlap between execution and memory access, thereby reducing stalls and improving efficiency. Specifically, the deterministic deep neural network computing chip 100 has an optimized folded 6-stage execution pipeline 101, which includes an instruction fetch module 110, an instruction pre-decode module 111, an instruction decode module 112, an instruction execution module 113, a memory access module 114 and a register write-back module 115. By overlapping execution stages, the execution pipeline 101 is able to execute one instruction per clock cycle.
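The following toy Python model illustrates how the six stages listed above overlap so that, once the pipeline is full, one instruction retires per cycle; the timing model is a simplification for illustration, not the chip's actual pipeline control.

```python
# Toy model of the folded 6-stage pipeline (stage names from the description;
# the timing model is illustrative only and ignores stalls and flushes).
STAGES = ["fetch", "pre_decode", "decode", "execute", "memory", "write_back"]

def run(instructions):
    pipeline = [None] * len(STAGES)              # one slot per stage
    cycle, retired = 0, []
    while instructions or any(slot is not None for slot in pipeline):
        if pipeline[-1] is not None:             # instruction in the last
            retired.append(pipeline[-1])         # stage retires this cycle
        next_instr = instructions.pop(0) if instructions else None
        pipeline = [next_instr] + pipeline[:-1]  # everything moves one stage
        cycle += 1
    return cycle, retired

cycles, done = run(["addi", "vadd", "load", "beq"])
# once the pipeline is full, one instruction completes per clock cycle
```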
The execution pipeline 101 interacts with the instruction cache unit 120, the register file 121, and the control and status registers 122 to complete instruction execution, and the branch prediction module 105, the data cache module 106 and the trusted computing coprocessor 200 provide branch prediction, secure-computing-related functions and instruction-cache-related functions as additional modules of the execution pipeline 101.
The instruction cache unit 120 speeds up instruction fetching by caching recently fetched instructions. The instruction cache can fetch one packet per cycle on any 16-bit boundary, but cannot fetch across block boundaries. On a cache miss, the complete block is loaded from the instruction memory. The instruction cache can be configured according to user needs: the cache size, block length, associativity and replacement algorithm are configurable. The register file 121 may consist of 32 register units (X0-X31); register X0 is always zero. The register file has two read ports and one write port. The state of the deterministic deep neural network computing chip 100 is maintained by the control and status registers 122 (CSRs). The control and status registers 122 determine the set of functions, set interrupts and interrupt masks, and determine privilege levels. The CSRs map to an internal 12-bit address space and can be accessed using special commands.
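A minimal sketch of the register file behaviour described above (32 registers, X0 hardwired to zero, two read ports and one write port); the class and method names are illustrative assumptions, not the hardware interface.

```python
# Minimal sketch of the register file: 32 registers X0-X31, X0 always zero.
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, idx_a, idx_b):                # two read ports
        return self.regs[idx_a], self.regs[idx_b]

    def write(self, idx, value):                 # one write port
        if idx != 0:                             # writes to X0 are ignored
            self.regs[idx] = value

rf = RegisterFile()
rf.write(0, 123)                                 # silently discarded
rf.write(5, 42)
assert rf.read(0, 5) == (0, 42)
```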
The deterministic deep neural network computing chip 100 of the present invention can execute one instruction per clock cycle, but due to the pipelined architecture each instruction takes several clock cycles to complete. When a branch instruction is decoded, its condition and outcome are unknown, and waiting for the branch outcome before fetching new instructions would cause excessive processor stalls and hurt performance. The present application therefore provides a branch prediction module 105, so that the processor can predict the outcome of a branch without waiting and continue fetching instructions from the predicted address. When a branch is mispredicted, the processor must flush its pipeline and resume fetching from the calculated branch address. The processor state is not affected, because the pipeline is flushed before any erroneously fetched instruction is actually executed. However, branch prediction may already have forced the instruction cache to load new instructions; the instruction cache state is not restored, meaning that the predicted instructions remain in the instruction cache.
The data cache module 106 is configured to accelerate data memory accesses by buffering recently accessed memory locations. The data cache can handle byte, half-word and word accesses as long as they are located on their natural boundaries. Accessing a memory location on an unnatural boundary (e.g., a word access at address 0x003) results in a data load trap. During a cache miss, the complete block is written back to memory if necessary, and the newly requested block is then loaded into the cache.
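The natural-boundary rule and the resulting data load trap can be illustrated with a small alignment check; this is illustrative only, as the trap mechanism itself is implemented in hardware.

```python
# Illustrative alignment check for the data cache: byte/half-word/word
# accesses must fall on their natural boundaries, otherwise a load trap is
# raised (e.g. the word access at address 0x003 mentioned above).
ACCESS_SIZE = {"byte": 1, "half_word": 2, "word": 4}

def check_aligned(addr, kind):
    size = ACCESS_SIZE[kind]
    if addr % size != 0:
        raise RuntimeError(f"data load trap: {kind} access at {hex(addr)}")

check_aligned(0x004, "word")       # fine, natural boundary
# check_aligned(0x003, "word")     # would raise the data load trap
```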
The trusted computing coprocessor 200 receives external instructions and communicates via the data interface 202 and the communication bus 203, and security algorithms are accelerated by the security engine 204 during the instruction execution phase. The security engine comprises common symmetric encryption algorithms, asymmetric encryption algorithms and a random number generator, specifically: a hardware random number generator 210, an AES/3DES algorithm operator 211, a HASH and HMAC algorithm operator 212, and a public/private key algorithm operator 213. The HASH and HMAC algorithm operator supports accelerated calculation of common hash function algorithms such as SHA1 and SHA256; the public/private key algorithm operator supports accelerated calculation of common public/private key algorithms such as RSA, Diffie-Hellman and ECC. Pairing the trusted computing coprocessor, based on trusted cryptography technology at the edge, with the neural network computing processor enables efficient, low-cost trusted edge computing and functionally supports digital signature and digital authentication technologies.
Preferably, the trusted computing coprocessor comprises an encryption and decryption coprocessor supporting cryptographic algorithms such as ECC, AES and SHA, and specifically comprises an encryption operation unit, a decryption operation unit, a data interface and a communication bus; the encryption and decryption coprocessor further comprises a random number generator, an encryption and decryption module and a security protection module. The encryption operation unit and the decryption operation unit encrypt and decrypt the target data received by the cryptographic chip according to the random numbers generated by the random number generator. The security protection module acquires real-time environment data of the cryptographic chip and performs security control over the cryptographic chip.
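For reference, the following uses Python's standard library to show, in software, two of the primitives the security engine is said to accelerate (SHA-256 hashing and HMAC); this is purely illustrative and is not the coprocessor's interface.

```python
# Software illustration (Python standard library) of primitives the security
# engine accelerates in hardware; this is not the coprocessor's API.
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)                    # random key material
message = b"inference result to be attested"

digest = hashlib.sha256(message).hexdigest()     # SHA-256 hash of the message
tag = hmac.new(key, message, hashlib.sha256).hexdigest()  # HMAC-SHA256 tag
```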
The instruction fetch module 110 reads an instruction from the instruction memory and updates the program counter to point to the next instruction. The instruction pre-decode module 111 decodes 16-bit packed instructions into native 32-bit instructions, and the instruction decode module 112 accesses the register file and determines branch control. The instruction execution module 113 performs vector or scalar calculations for ALU, MUL and DIV instructions, accesses the data memory for Load/Store instructions, and calculates branches and jumps and checks them against their expected results. In the memory access module 114, the memory access of the pipeline is completed; including this stage ensures high performance of the pipeline. The register write-back module 115 writes the results of the execution stage to the register file.
Wherein the instruction fetch unit in the instruction fetch module 110 loads a new package from the program memory. A package is a code field containing one or more instructions. The address of the package to be loaded is maintained by the program counter (PC), which is 32 or 64 bits wide. The program counter is updated as long as the instruction pipeline is not stalled. If the pipeline is flushed, the program counter restarts from the given address.
Wherein the pre-decode unit in the instruction pre-decode module 111 converts 16-bit compressed instructions into basic 32-bit instructions and then handles program-counter-modifying instructions such as "jump and link" and "branch". This avoids waiting for the execution stage to trigger an update and reduces the need for pipeline flushes. The target address of a branch is predicted from data provided by an optional branch prediction unit, or is statically determined from an offset.
Wherein the instruction decode unit in the instruction decode module 112 ensures that operands of the execution unit are available. It accesses the register file, calculates the immediate value, sets the branch, and checks for illegal opcodes and opcode combinations.
Wherein the instruction execution module 113 performs the required operations on the data provided by the instruction decode module 112. The execution stage has a plurality of execution units, each with a unique function. The integer vector operator 102 performs integer vector addition, integer vector multiply-add and integer vector nonlinear function value calculation. The ALU performs logical and arithmetic operations. The multiplier unit calculates signed/unsigned multiplications. The divider unit calculates signed/unsigned divisions and remainders. The load-store unit accesses the data memory. The branch unit calculates jump and branch addresses and validates predicted branches. Each operation can only be issued once per clock cycle. Most operations complete in one clock cycle, except for division instructions and integer vector operation instructions, which always require multiple clock cycles. The multiplier supports a configurable latency to improve performance.
In addition, the instruction execution module 113 may also perform the required operations on the data provided by the integer vector operator 102. An integer vector operation instruction reads the address index of each input vector in the memory from the register file and provides it, together with the address index of the output vector's placeholder, to the integer vector operator to complete the corresponding vector calculation.
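A minimal sketch of an integer vector multiply-add as performed by the integer vector operator: each lane is computed in a fixed order with exact integer arithmetic, so the result is deterministic. This is illustrative Python, not the hardware datapath.

```python
# Minimal sketch of integer vector multiply-add: out[i] = a[i]*b[i] + c[i],
# computed lane by lane in a fixed order with exact integer arithmetic.
def vec_mul_add(a, b, c):
    assert len(a) == len(b) == len(c)
    return [x * y + z for x, y, z in zip(a, b, c)]

print(vec_mul_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))   # [11, 18, 27]
```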
In the memory access module 114, the memory access stage waits for memory read accesses to complete. When the memory is accessed, the address, data and control signals are calculated during the execution stage. The memory latches these signals and then performs the actual access, which means that read data is not available until one clock cycle later. In the register write-back module 115, the write-back stage writes the results from the execution units and from memory read operations to the register file.
Preferably, the integer vector operator 102 provides an adder 130, a multiply-add operator 131 and a function value operator 132. The integer vector operator takes as input one-dimensional vectors of length at most V_LEN and outputs a one-dimensional vector holding the corresponding result, whose length also does not exceed V_LEN. In this embodiment V_LEN is 2^16, i.e. the input vector length of the integer vector operator does not exceed 2^16. The integer vector operator obtains a vector from its address index in the memory and writes the result back into the output placeholder vector at the corresponding address index. The integer vector operator performs integer numerical operations for the three operators addition, subtraction and multiply-add through different components; evaluates common nonlinear operators such as tanh, sigmoid and exp by table lookup, with a domain that does not exceed the range of 8-bit unsigned integers; computes transformation operators such as activation (ReLU), max pooling, transpose and matrix extremum (max, min) through a built-in hardware sorting network; and, for other low-frequency operators and vector calculations outside the operators' domain, computes each element of the vector one by one through the pipeline by vector expansion.
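The table-lookup evaluation of a nonlinear operator over an 8-bit unsigned domain can be sketched as follows; the input/output scale factors here are assumptions chosen for illustration, not values specified by the patent.

```python
# Minimal sketch of table-lookup evaluation of a nonlinear operator: the
# domain is restricted to 8-bit unsigned integers (0..255), so the whole
# function fits in a 256-entry table built once; the scales are assumptions.
import math

IN_SCALE, OUT_SCALE = 32.0, 255.0               # assumed fixed-point scales

SIGMOID_LUT = [
    int(round(OUT_SCALE / (1.0 + math.exp(-(i - 128) / IN_SCALE))))
    for i in range(256)
]                                                # precomputed, deterministic

def sigmoid_u8(x_u8):
    """Evaluate sigmoid on an 8-bit unsigned input via table lookup."""
    return SIGMOID_LUT[x_u8 & 0xFF]

print(sigmoid_u8(128), sigmoid_u8(255))          # same result on any platform
```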
The neural network computation graph is a directed graph: each node represents a layer of neurons in the neural network, and each edge represents a connection relationship between layers. Such graphs can generally be divided into directed acyclic graphs and directed cyclic graphs; the neural network computational graph of the present invention is specifically a directed acyclic graph. To perform neural network reasoning on the deterministic neural network computing chip 100 according to the present invention, a neural network computation graph in the agreed execution format and the input data corresponding to each placeholder in the graph should be provided. The neural network computation graph is ordered by a deterministic topological sort to determine the computation order of all vector operators, and all vector operation instructions are executed one by one in that topological order.
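The patent does not fix a particular topological ordering algorithm; the sketch below uses Kahn's algorithm with a sorted tie-break as one illustrative way to obtain a reproducible execution order for the vector operators.

```python
# Illustrative deterministic topological ordering of a computation graph
# (Kahn's algorithm with a sorted tie-break makes the visit order
# reproducible; the patent does not specify the algorithm).
from collections import defaultdict

def deterministic_topo_sort(edges, nodes):
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    ready = sorted(n for n in nodes if indeg[n] == 0)    # fixed tie-break
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for m in sorted(succ[n]):
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
        ready.sort()                                     # keep order canonical
    return order

# conv -> relu -> pool, plus a parallel bias edge into relu
print(deterministic_topo_sort([("conv", "relu"), ("bias", "relu"),
                               ("relu", "pool")],
                              ["conv", "bias", "relu", "pool"]))
```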
Fig. 3 depicts a computational flow of a neural network according to an embodiment of the present disclosure.
Step 1: the neural network compiler converts the neural network computational graph 300 into a vector operation expression list 301, for use by the computing chip of the present invention, through a deterministic topological ordering algorithm; obtains, through a memory planning algorithm, the storage indexes of the placeholders corresponding to all vectors in the vector operation expression list; and stores all input vectors into the storage indexes 302 of their corresponding placeholders for the neural network reasoning operation.
Step 2: the instruction interface 104 reads the instructions of the vector operation expression list 301 from the instruction cache, such as general vector instructions suitable for performing neural network operations (vector add instructions, vector multiply-add instructions, vector activation function instructions and the like) or general scalar instructions suitable for integer calculations.
Step 3: the controller decodes the instructions into micro-instructions for each functional component. It accesses the register file, calculates immediate values, sets branches, and checks for illegal opcodes and opcode combinations. Vector instructions for low-frequency operators and vector instructions whose operands fall outside the vector operator's domain are split into a plurality of general scalar instructions through vector expansion and then decoded.
Step 4: the arithmetic unit obtains the input vector/scalar from the register file and determines, based on the opcode and operand type, whether the operation is completed by the integer vector operator and writes the result back to the register file. The integer vector operation instruction reads the corresponding address index of the input vector in the memory from the register file, and provides the corresponding address index of the placeholder of the output vector to the integer vector operator to complete the related vector calculation. Integer scalar operation instructions, floating point scalar operation instructions, and vector operation instructions that have been split are executed by an operator built in an operation unit.
Step 5: memory access and register write back are performed according to the opcode.
The above five steps are executed iteratively until the operation result of the neural network computation graph is obtained.
The application also relates to a computing board card carrying the computing chip; the computing board card is further provided with a storage unit module using random access memory as its medium, a SATA bus module for connecting to external storage, an Ethernet bus module for external communication, and a main board bus connecting these modules.
The computing board card carrying the computing chip comprises a main board bus including a data bus, an address bus and a control bus. The master of the main board bus is the computing chip of the deterministic neural network algorithm: the computing chip sends control signals to each component through the control bus, specifies the components to be accessed through the address signals of the address bus, and transmits data information through the data bus.
The application also relates to an electronic device comprising a deterministic deep neural network computing chip as described above.
The application also relates to a board card comprising a memory device, a communication device and a control device, and a deterministic deep neural network computing chip as described above; the computing chip is respectively connected with the storage device, the control device and the communication device. The storage device is used for storing data, the communication device is used for realizing data transmission between the chip and external equipment, and the control device is used for monitoring the state of the chip.
Preferably, the board card further comprises a main board bus for connecting the memory device, the communication device and the control device, wherein the main board bus comprises a data bus, an address bus and a control bus; the main control of the main board bus is the deterministic neural network computing chip, the computing chip sends control signals to all components through a control bus, the components needing to be accessed are designated through address signals of an address bus, and data information is transmitted through a data bus.
Although the embodiments disclosed in the present application are described above, the description is merely provided to facilitate understanding of the present application and is not intended to limit it. Any person skilled in the art to which this application pertains may make modifications and variations in the form and details of implementation without departing from the spirit and scope of the disclosure; however, the scope of patent protection of this application shall be subject to the scope of the appended claims.

Claims (15)

1. A combined processing device for executing deterministic reasoning, characterized by comprising an instruction fetch module, an instruction pre-decode module, an instruction execution module, a memory access module and a register write-back module; the instruction fetch module reads an instruction from the instruction memory and updates the program counter to point to the next instruction; the instruction pre-decode module decodes compressed instructions into native instructions, an instruction decode unit in the instruction decode module ensures that the operands of the execution units are available, and the instruction decode module accesses the register file and determines branch control; the instruction execution module performs integer vector or scalar calculation for the instruction to obtain an integer result, accesses the data memory for Load/Store instructions, and calculates branches and jumps and checks them against the expected results; in the memory access module, the memory access of the pipeline is completed; the register write-back module writes the results of the execution stage to the register file.
2. An arithmetic device of a calculation chip of a deterministic neural network algorithm, comprising:
an integer vector operator for carrying out integer vector operations;
an execution pipeline for general purpose operations and reading, decoding and execution of instructions;
a storage unit for providing the execution pipeline access, and storing input values, output values, and temporary values for the integer vector operator to execute each instruction;
the execution pipeline is a combined processing apparatus according to claim 1.
3. The arithmetic device according to claim 2, wherein the integer vector operator is provided therein with a temporary value storage unit for storing temporary values calculated in accordance with instructions and performing read and write operations to the storage unit.
4. The arithmetic device according to claim 2, wherein the integer vector operator and the storage unit are configured as random access memories, and the input values and temporary values in the integer vector operator and the storage unit are discarded after each operation instruction is completed.
5. The arithmetic device according to claim 2 or 3 or 4, wherein the integer vector operator performs integer vector addition, integer vector multiply-add, and integer vector nonlinear function value calculation.
6. The arithmetic device according to claim 2, 3 or 4, wherein the integer vector operator accesses the storage unit through an address index and reads and writes input values, output values and temporary values; it is invoked and executed in the instruction execution stage of the execution pipeline, obtains the address indexes corresponding to the input, output and temporary values in the storage unit, and completes the read and write operations on the storage unit at the corresponding address indexes through the address indexes allocated by the instruction execution pipeline.
7. The computing device of claim 6, wherein the integer vector operation instruction reads a corresponding address index of the input vector from the register file and provides the address index of the placeholder of the output vector to the integer vector operator to perform the computation.
8. A deterministic deep neural network computing chip, comprising: the system comprises an execution pipeline, an integer vector arithmetic unit, a data interface, an instruction interface, a branch prediction module, a data cache module and a trusted computing coprocessor;
the execution pipeline interacts with the instruction cache unit, the register file, and the control and status registers to complete instruction execution, and the branch prediction module, the data cache module and the trusted computing coprocessor serve as additional modules of the execution pipeline that provide branch prediction, secure-computing-related functions and instruction-cache-related functions;
the execution pipeline is a combined processing apparatus according to claim 1.
9. The deterministic deep neural network computing chip according to claim 8, wherein the trusted computing coprocessor receives external instructions and communicates via a data interface and a communication bus, and the security algorithm is accelerated during instruction execution by a security engine comprising a symmetric encryption algorithm, an asymmetric encryption algorithm, and a random number generator.
10. The deterministic deep neural network computing chip according to claim 8 or 9, wherein the trusted computing co-processor comprises an encryption and decryption co-processor comprising an encryption operation unit, a decryption operation unit, a random number generator, an encryption and decryption module and a security protection module.
11. A method of computing using the deterministic deep neural network computing chip of any one of claims 8-10, characterized by: the method comprises the following steps:
step 1: the neural network compiler converts the neural network computational graph into a vector operation expression list through a deterministic topological ordering algorithm; obtains, through a memory planning algorithm, the storage indexes of the placeholders corresponding to all vectors in the vector operation expression list; and stores all input vectors into the storage indexes of their corresponding placeholders for the neural network reasoning operation;
step 2: the instruction interface reads the instructions of the vector operation expression list from the instruction cache;
step 3: the controller decodes the instruction into a microinstruction for each functional component;
step 4: the operation unit obtains an input vector or scalar from the register file, determines according to the operation code and operand type whether the operation is to be completed by the integer vector operator, and writes the result back to the register file;
step 5: the integer vector operation instruction reads the address index of the input vector in the memory from the register file and provides it, together with the address index of the placeholder of the output vector, to the integer vector operator to finish the calculation;
step 6: executing memory access and register write-back according to the operation code;
and iteratively executing the steps until an operation result of the neural network calculation graph is obtained.
12. The method according to claim 11, characterized in that: in step 3, the controller accesses the register file, calculates immediate values, sets branches, checks for illegal operation codes and operation code combinations, and splits vector instructions of low-frequency operators and vector instructions whose operands fall outside the vector operator's domain into a plurality of general scalar instructions through vector expansion before decoding them.
13. An electronic device, characterized in that it comprises a deterministic deep neural network computing chip according to any one of claims 8-10.
14. A board card, comprising a memory device, a communication device, a control device, and the deterministic deep neural network computing chip of any one of claims 8-10;
the computing chip is respectively connected with the storage device, the control device and the communication device.
15. The board card of claim 14, further comprising a motherboard bus for connecting the memory device, the communication device, and the control device, the motherboard bus including a data bus, an address bus, and a control bus;
the main control of the main board bus is the deterministic deep neural network computing chip, the computing chip sends control signals to all components through a control bus, the components needing to be accessed are specified through address signals of an address bus, and data information is transmitted through a data bus.
CN201911176119.3A 2019-11-26 2019-11-26 Combined processing device, chip and method for performing deterministic reasoning Active CN111027690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911176119.3A CN111027690B (en) 2019-11-26 2019-11-26 Combined processing device, chip and method for performing deterministic reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911176119.3A CN111027690B (en) 2019-11-26 2019-11-26 Combined processing device, chip and method for performing deterministic reasoning

Publications (2)

Publication Number Publication Date
CN111027690A CN111027690A (en) 2020-04-17
CN111027690B true CN111027690B (en) 2023-08-04

Family

ID=70202282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911176119.3A Active CN111027690B (en) 2019-11-26 2019-11-26 Combined processing device, chip and method for performing deterministic reasoning

Country Status (1)

Country Link
CN (1) CN111027690B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688623A (en) * 2020-07-01 2024-03-12 陈子祺 Trusted computing chip based on blockchain
CN112633505B (en) * 2020-12-24 2022-05-27 苏州浪潮智能科技有限公司 RISC-V based artificial intelligence reasoning method and system
CN113010213B (en) * 2021-04-15 2022-12-09 清华大学 Simplified instruction set storage and calculation integrated neural network coprocessor based on resistance change memristor
CN115421788B (en) * 2022-08-31 2024-05-03 苏州发芯微电子有限公司 Register file system, method and automobile control processor using register file
CN115686635B (en) * 2023-01-03 2023-04-18 杭州米芯微电子有限公司 MCU structure without clock circuit and corresponding electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529668A (en) * 2015-11-17 2017-03-22 中国科学院计算技术研究所 Operation device and method of accelerating chip which accelerates depth neural network algorithm
CN107329936A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing neural network computing and matrix/vector computing
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN109523020A (en) * 2017-10-30 2019-03-26 上海寒武纪信息科技有限公司 A kind of arithmetic unit and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060259737A1 (en) * 2005-05-10 2006-11-16 Telairity Semiconductor, Inc. Vector processor with special purpose registers and high speed memory access

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529668A (en) * 2015-11-17 2017-03-22 中国科学院计算技术研究所 Operation device and method of accelerating chip which accelerates depth neural network algorithm
CN107329936A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing neural network computing and matrix/vector computing
CN109523020A (en) * 2017-10-30 2019-03-26 上海寒武纪信息科技有限公司 A kind of arithmetic unit and method
CN110084361A (en) * 2017-10-30 2019-08-02 上海寒武纪信息科技有限公司 A kind of arithmetic unit and method
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium

Also Published As

Publication number Publication date
CN111027690A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN111027690B (en) Combined processing device, chip and method for performing deterministic reasoning
CN109661647B (en) Data processing apparatus and method
EP3391195B1 (en) Instructions and logic for lane-based strided store operations
US10031765B2 (en) Instruction and logic for programmable fabric hierarchy and cache
BR102020019657A2 (en) apparatus, methods and systems for instructions of a matrix operations accelerator
US20200210516A1 (en) Apparatuses, methods, and systems for fast fourier transform configuration and computation instructions
US20170286122A1 (en) Instruction, Circuits, and Logic for Graph Analytics Acceleration
US10157059B2 (en) Instruction and logic for early underflow detection and rounder bypass
WO2017112176A1 (en) Instructions and logic for load-indices-and-prefetch-gathers operations
US20170177363A1 (en) Instructions and Logic for Load-Indices-and-Gather Operations
WO2017112177A1 (en) Instructions and logic for lane-based strided scatter operations
WO2017112193A1 (en) Instruction and logic for reoccurring adjacent gathers
US10061746B2 (en) Instruction and logic for a vector format for processing computations
US20170177353A1 (en) Instructions and Logic for Get-Multiple-Vector-Elements Operations
JP2019511056A (en) Complex multiplication instruction
WO2017112175A1 (en) Instructions and logic for load-indices-and-scatter operations
WO2017172172A1 (en) Instruction, circuits, and logic for piecewise linear approximation
TW201729077A (en) Instructions and logic for SET-multiple-vector-elements operations
TW201732556A (en) Hardware content-associative data structure for acceleration of set operations
WO2017105716A1 (en) Instructions and logic for even and odd vector get operations
US10467006B2 (en) Permutating vector data scattered in a temporary destination into elements of a destination register based on a permutation factor
JP2021507348A (en) Addition instruction with vector carry
US9588765B2 (en) Instruction and logic for multiplier selectors for merging math functions
Stepchenkov et al. Recurrent data-flow architecture: features and realization problems
Glossner et al. HSA-enabled DSPs and accelerators

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant