CN113032013A - Data transmission method, chip, equipment and storage medium - Google Patents

Data transmission method, chip, equipment and storage medium

Info

Publication number
CN113032013A
CN113032013A (application CN202110518021.2A; granted publication CN113032013B)
Authority
CN
China
Prior art keywords
instruction
execution data
data
chip
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110518021.2A
Other languages
Chinese (zh)
Other versions
CN113032013B (en)
Inventor
周军
常亮
周亮
王文强
杨雨桐
徐宁仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Chengdu Sensetime Technology Co Ltd
Original Assignee
University of Electronic Science and Technology of China
Chengdu Sensetime Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China and Chengdu Sensetime Technology Co Ltd
Publication of CN113032013A
Application granted
Publication of CN113032013B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3004 — Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30098 — Register arrangements
    • G06F9/30145 — Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/38 — Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 — Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851 — Instruction issuing from multiple instruction streams, e.g. multistreaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Advance Control (AREA)

Abstract

The application provides a data transmission method, a chip, a device, and a storage medium. The chip may include an instruction processor and an instruction memory connected to the instruction processor. The method may include acquiring, through the instruction memory, an instruction and execution data stored in an external memory, where the execution data includes data required for executing the instruction; inputting the instruction and the execution data into the instruction processor; and transmitting, through the instruction processor, the instruction and the execution data to a register array included in the chip, so that each compute core executes the instruction according to the execution data.

Description

Data transmission method, chip, equipment and storage medium
Technical Field
The present application relates to computer technologies, and in particular, to a data transmission method, a chip, a device, and a storage medium.
Background
A chip generally includes a plurality of compute cores, and each compute core needs to execute a corresponding operation according to a received instruction.
For example, an AI chip typically includes a two-dimensional shift register array (hereinafter referred to as a register array). The array includes a plurality of compute cores (hereinafter referred to as PEs) arranged in rows and columns. On receiving an instruction, the PEs perform the operation corresponding to that instruction.
When operating according to an instruction, each PE usually needs to rely on execution data, so a method for transmitting this data is needed.
Disclosure of Invention
In view of the above, the present application provides a data transmission method applied to a chip, where the chip includes an instruction processor and an instruction memory connected to the instruction processor; the method may include:
acquiring, through the instruction memory, an instruction and execution data stored in an external memory; wherein the execution data includes data required for executing the instruction;
inputting the instruction and the execution data into the instruction processor;
transmitting, through the instruction processor, the instruction and the execution data to a register array included in the chip, so that each compute core executes the instruction according to the execution data.
In some embodiments shown, the instructions comprise Very Long Instruction Word (VLIW) instructions.
In some embodiments, the transmitting the instruction and the execution data to a register array included in the chip by the instruction processor includes:
transmitting, by the instruction processor, the instruction and the execution data separately to the register array included in the chip.
In some embodiments, the transmitting the instruction and the execution data to a register array included in the chip by the instruction processor includes:
adding the execution data to the instruction to form an instruction with the execution data;
and transmitting the instruction with the execution data to a register array included in the chip through the instruction processor.
In some embodiments shown, the instruction processor comprises an instruction processing subunit;
the inputting the instruction and the execution data into the instruction processor includes:
inputting the instruction and the execution data into the instruction processing subunit of the instruction processor; the method further includes:
adding the execution data to the instruction through an instruction processing subunit to form an instruction with the execution data;
the transmitting the instruction and the execution data to a register array included in the chip through the instruction processor includes:
the instruction processor transmits the instruction with the execution data to the register array included in the chip in single-instruction-multiple-data (SIMD) fashion.
In some embodiments shown, the adding the execution data to the instruction includes:
adding the execution data to a reserved field included in the instruction.
In some embodiments shown, each of the compute cores includes a decoder; the executing of the instructions by the computing cores according to the execution data includes:
the compute core, in response to receiving the instruction with the execution data, parses the instruction through the decoder to obtain the execution data it carries;
the compute core stores the execution data in a corresponding register, so that when executing the instruction it can obtain the execution data from the register and execute the instruction according to that data.
In some embodiments, the instruction memory is connected to an instruction compiler corresponding to the chip; the obtaining of the instruction through the instruction memory includes:
acquiring, through the instruction memory, the instruction compiled by the compiler.
In some embodiments shown, the chip includes a data transfer controller, and a data transfer memory connected to the data transfer controller;
the transmitting the instruction and the execution data to a register array included in the chip through the instruction processor includes:
sending the instruction and the execution data to the data transmission memory through the instruction processor;
the instructions and the execution data are read from the data transfer memory through the data transfer controller and sent to the register array.
In some embodiments, the executing of the instructions by the computing cores according to the execution data includes:
each compute core, upon reaching the step of obtaining the execution data indicated by the instruction, obtains the execution data stored in its corresponding register;
and executes the subsequent steps indicated by the instruction according to the obtained execution data.
In some embodiments shown, the instruction processor comprises a scalar processor; the instruction memory includes a scalar memory.
In some embodiments shown, the VLIW instruction comprises a convolution operation instruction; the execution data includes weight data.
The application also provides a chip, which can comprise a data transmitter, an instruction processor and an instruction memory connected with the instruction processor;
the instruction memory is used for acquiring instructions and executing data stored in an external memory; wherein the execution data includes data required for executing the instruction;
the data transmitter is used for inputting the instruction and the execution data into the instruction processor;
the instruction processor is configured to transmit the instruction and the execution data to a register array included in the chip, so that each compute core executes the instruction according to the execution data.
In some illustrative embodiments, the instruction processor is configured to:
transmit the instruction and the execution data separately to the register array included in the chip.
In some illustrative embodiments, the instruction processor is configured to:
add the execution data to the instruction to form an instruction with the execution data;
and transmit the instruction with the execution data to the register array included in the chip.
In some embodiments shown, the instruction processor comprises an instruction processing subunit; the data input device is configured to:
an instruction processing subunit for inputting the instruction and the execution data into the instruction processor;
the processing subunit is configured to add the execution data to the instruction to form an instruction with execution data;
the instruction processor is configured to:
transmit the instruction with the execution data to the register array included in the chip in single-instruction-multiple-data (SIMD) fashion.
In some embodiments shown, each of the compute cores includes a decoder;
each of the compute kernels is configured to, in response to receiving the instruction with execution data, parse the instruction with execution data through the decoder to obtain execution data included in the instruction with execution data; and storing the execution data into a corresponding register so as to obtain the execution data from the register when the instruction is executed, and executing the instruction according to the execution data.
In some embodiments, the instruction memory is connected to an instruction compiler corresponding to the chip;
the instruction memory is used for acquiring the instruction compiled by the compiler.
In some embodiments shown, the chip includes a data transfer controller, and a data transfer memory connected to the data transfer controller;
the instruction processor is used for sending the instruction and the execution data to the data transmission memory;
the data transmission memory is used for storing the instruction and the execution data;
the data transfer controller is configured to read the command and the execution data from the data transfer memory and send the read command and the execution data to the register array.
In some embodiments shown, the compute core is configured to, upon reaching the step of obtaining the execution data indicated by the instruction, obtain the execution data stored in a corresponding register;
and executing the subsequent steps indicated by the instruction according to the acquired execution data.
In some embodiments shown, the instruction processor comprises a scalar processor; the instruction memory includes a scalar memory.
In some embodiments shown, the instructions comprise Very Long Instruction Word (VLIW) instructions.
In some embodiments shown, the VLIW instruction comprises a convolution operation instruction; the execution data includes weight data.
The application also provides an electronic device, which comprises the chip shown in any one of the embodiments.
In the above scheme, the instruction processor is used to transmit, efficiently and quickly, the execution data and the instruction stored in the external memory corresponding to the chip to the register array included in the chip, so that each compute core can execute the instruction according to the execution data. This avoids allocating storage space inside the chip specifically for the execution data, which on one hand simplifies the design of the chip, and on the other hand reduces the occupation of the chip's storage space.
The present application also provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a controller, implementing any of the data transmission methods described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
To more clearly illustrate one or more embodiments of the present application or the technical solutions in the related art, the drawings needed for describing the embodiments or the related art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments of the present application; those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of an AI chip shown in the present application;
fig. 2 is a schematic method flow diagram of a data transmission method according to the present application;
fig. 3 is a schematic structural diagram of an AI chip shown in the present application;
FIG. 4 is a schematic diagram of the present application illustrating the addition of execution data for a VLIW instruction;
fig. 5 is a flow chart of data transmission according to the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It should also be understood that the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
Currently, in order to simplify the hardware structure of a chip, a Very Long Instruction Word (VLIW) Instruction (hereinafter referred to as a VLIW Instruction) is usually used to schedule the chip. Usually, a chip needs to execute a plurality of tasks with dependency relationship, and by adopting the VLIW instruction, the control of the tasks with dependency relationship can be realized by replacing a hardware form with a software form, so that the hardware structure of the chip is simplified.
In this application, taking the case that the AI chip executes the VLIW instruction as an example, when the AI chip executes the VLIW instruction, it is usually necessary to obtain some execution data required for executing the instruction, and then the AI chip can continue to execute the instruction. It is a common practice in the related art to divide a hardware memory space for the execution data in advance in the AI chip to store the execution data. And when the execution data is needed, each computation core included in the AI chip acquires the execution data from the hardware storage space and continues to execute the VLIW instruction.
However, this approach has drawbacks: on the one hand, since the size of the execution data is not fixed, the size of the hardware storage space cannot be determined in advance, which is unfriendly to the chip designer; on the other hand, it consumes capacity on the AI chip, wasting storage space.
In view of the above, the present application provides a data transmission method. The method uses the instruction processor to transmit, efficiently and quickly, the execution data and the instruction stored in the external memory corresponding to the chip to the register array included in the chip, so that each compute core can execute the instruction according to the execution data. This avoids allocating storage space inside the chip specifically for the execution data, which on one hand simplifies the design of the chip, and on the other hand reduces the occupation of the chip's storage space.
The method can be applied to any type of chip, taking an AI chip as an example. Referring to fig. 1, fig. 1 is a schematic structural diagram of an AI chip shown in the present application. It should be noted that the structure shown in fig. 1 is only schematic, and all devices included in the chip are not shown in the drawing.
As shown in fig. 1, the AI chip may include an instruction processor 11 and an instruction memory 12 connected to the instruction processor. The instruction memory 12 is connected to an external memory 13 corresponding to the AI chip.
The instruction processor 11 can process the VLIW instruction received by the AI chip (e.g., re-dispatch it): it divides the VLIW instruction into a plurality of VLIW sub-instructions and sends them to the module units included in the AI chip (e.g., the compute cores in the two-dimensional register array, the data shaping unit controller, the line buffer unit controller), so that each module unit executes the corresponding steps according to the VLIW sub-instruction it receives.
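As a rough sketch of this split-and-dispatch step — the slot names and unit names below are illustrative assumptions, not taken from the patent — the routing might look like:

```python
# Hypothetical sketch: splitting a VLIW word into per-unit sub-instructions,
# as the instruction processor does. Slot names ("pe_op", "reshape_op",
# "linebuf_op") and unit names are assumptions for illustration only.
def split_vliw(vliw: dict) -> dict:
    """Map each slot of a VLIW word to its target module unit."""
    routing = {
        "pe_op": "register_array",      # compute cores in the 2-D array
        "reshape_op": "data_reshaper",  # data shaping unit controller
        "linebuf_op": "line_buffer",    # line buffer unit controller
    }
    return {routing[slot]: op for slot, op in vliw.items() if slot in routing}

subs = split_vliw({"pe_op": "shift_mac", "reshape_op": "tile", "linebuf_op": "fill"})
# subs now maps each module unit to the sub-instruction it should execute
```

Each module unit then executes only the sub-instruction routed to it, which is how the VLIW scheme replaces hardware dependency tracking with software scheduling.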
In some examples, the chip may use Single Instruction Multiple Data (SIMD) streams for data transmission. In this case, the instruction processor 11 may be a scalar processor. Transferring data in SIMD fashion through a scalar processor facilitates the AI chip's execution of the VLIW instruction described above.
The instruction memory 12 is used for temporarily storing data required by the instruction processor. In some examples, the instruction memory is connected to an external memory, and data stored in the external memory may be acquired from the external memory. In the present application, the connection mode between the instruction memory and the external memory is not particularly limited. It is to be understood that when the instruction processor 11 is a scalar processor, the instruction memory 12 may be a scalar memory allocated for the scalar processor.
The instruction Memory 12 may be a Static Random-Access Memory (SRAM), a Random-Access Memory (RAM), a Dynamic Random-Access Memory (DRAM), or the like. The application does not limit the type of memory.
The external memory may be a memory accessible by the AI chip. The external memory may store various types of data. In this application, the external memory may store VLIW instructions and/or execution data required for executing the instructions. The external memory may transmit data to the AI chip in response to an access request with the AI chip.
The external memory may be, for example, a Double Data Rate (DDR) memory or a Synchronous Dynamic Random-Access Memory (SDRAM); the type of the external memory is not limited in the present application.
Referring to fig. 2, fig. 2 is a schematic flow chart of a data transmission method according to the present application.
As shown in fig. 2, the method may include:
s202, obtaining an instruction through the instruction memory and executing data stored in an external memory; wherein the execution data includes data required for executing the instruction.
The instructions in the disclosed embodiments may comprise VLIW instructions. The following description uses the VLIW instruction as an example; other instruction types are handled in the same way and are not described separately.
The VLIW instruction may be an instruction developed according to actual service requirements. For example, when the AI chip performs a convolution operation, the VLIW instruction may include a convolution operation instruction. It is understood that the VLIW instruction may include the steps of inputting a feature map, inputting weight data, shifting a multiply-add operation, outputting a convolution operation result, and the like. For another example, when the AI chip performs a pooling operation, the VLIW instruction may include a pooling operation instruction.
In some examples, the VLIW instruction may be developed and compiled by a developer and stored in the external memory in advance. At this time, the instruction memory may obtain the VLIW instruction from the external memory in response to a data transfer instruction issued by a main controller (the main controller, which is equivalent to a central processing unit of a computer and executes an instruction developed by a developer to control each module unit included in the AI chip) in the AI chip.
In some examples, the instruction memory is connected to an instruction compiler corresponding to the AI chip. The VLIW instruction needs to be compiled by the compiler before being executed by the AI chip. At this time, the very long instruction word VLIW instruction compiled by the compiler may be received through the instruction memory.
The execution data may be data required for executing the VLIW instruction. For example, when the AI chip performs a convolution operation, the execution data may be weight data (convolution kernel data) necessary for performing the convolution operation.
The execution data may be stored in the external memory. When the AI chip needs to perform processing that uses the execution data, it may obtain the execution data from the external memory and transmit it to the compute cores for computation using the data transmission method disclosed in this application. Therefore, no storage space needs to be specially allocated in the AI chip for the execution data, which on one hand simplifies the design of the AI chip, and on the other hand reduces the occupation of the AI chip's storage space.
After the instruction and the execution data are obtained, S204 may be executed to input the instruction and the execution data into the instruction processor.
In some examples, the instruction processor may be a scalar processor and the instruction memory may be a scalar memory. The AI chip may transfer VLIW instructions and the execution data from the scalar memory to the scalar processor in response to instructions developed by a developer.
Then, S206 may be executed, in which the instruction processor transmits the instruction and the execution data to a register array included in the chip, so that each compute core executes the instruction according to the execution data.
In some examples, the instruction processor may be a scalar processor. The scalar processor may process the received VLIW instructions and send the VLIW instructions to each compute core in the register array in SIMD fashion.
In some examples, the AI chip includes a data transfer controller, and a data transfer memory coupled to the data transfer controller.
The data transmission memory is used for temporarily storing data required to be sent to the register array.
The data transmission controller is used for transmitting data into the register array.
In S206, the instruction processor may send the instruction and the execution data to the data transmission memory. Then, the command and the execution data are read from the data transfer memory through the data transfer controller and transferred to the register array.
In some examples, the VLIW instruction and the execution data may be transmitted in SIMD fashion. At this time, a plurality of data transfer memories may be allocated to temporarily store a plurality of data streams, thereby ensuring that the data is successfully written into the register array.
Therefore, data transmission is carried out through the special hardware module, and the stability of data transmission can be ensured.
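The SIMD-style transfer described above — one instruction word, with its execution data, reaching every compute core in the register array — can be sketched as follows; the PE grid and `send()` interface are assumptions for illustration:

```python
# Illustrative sketch of SIMD-style transfer: the same word is broadcast
# to every compute core (PE) of the two-dimensional register array.
# The PE class and send() interface are assumptions, not the patent's API.
class PE:
    def __init__(self):
        self.received = None

    def send(self, word):
        self.received = word

def broadcast(register_array, word):
    """Single instruction, multiple data: one word reaches all PEs."""
    for row in register_array:
        for pe in row:
            pe.send(word)

grid = [[PE() for _ in range(4)] for _ in range(4)]  # 4x4 array of PEs
broadcast(grid, word=0xABCD)
assert all(pe.received == 0xABCD for row in grid for pe in row)
```

A real chip would drive this over the data transmission controller rather than a Python loop; the point is only that a single issued word fans out to every core.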
The compute core may begin executing the VLIW instruction after receiving the VLIW instruction and the execution data.
When the VLIW instruction indicates that the execution data needs to be fetched, the compute kernel may fetch the execution data stored in the corresponding register in response to the step of fetching the execution data indicated by the VLIW instruction. And then executing the subsequent steps indicated by the VLIW instruction according to the acquired execution data.
In some examples, the VLIW instruction and the execution data may be written to the registers of the respective compute cores. Each compute core may include a decoder matching the VLIW instruction format.
The computing kernel can read the VLIW instruction from the register and analyze the VLIW instruction through the decoder to execute corresponding steps. When the execution is performed to the step of acquiring the execution data, the computational core may acquire the execution data from the register, and execute a subsequent step indicated by the VLIW instruction according to the execution data.
Therefore, the VLIW instruction and the execution data can be efficiently transmitted inside the chip.
In the above-described scheme, the instruction memory and the instruction processor mounted on the AI chip may be used to transmit the obtained VLIW instruction and the execution data stored in the external memory to each compute core in the register array included in the AI chip, so that each compute core may execute the VLIW instruction according to the execution data. This avoids specially allocating storage space in the AI chip for the execution data, which on one hand simplifies the design of the AI chip, and on the other hand reduces the occupation of its storage space.
In some examples, the instructions and the execution data may be respectively transmitted to a register array included in the chip through the instruction processor when S206 is executed.
In some examples, the instruction processor may send the instructions and the execution data to a data transfer memory, respectively, so that the data transfer controller reads the instructions and/or the execution data from the memory according to business requirements and transfers the instructions and/or the execution data to the register array. Therefore, data can be flexibly transmitted to the register array.
In some examples, the execution data may be added to the instruction to form an instruction with the execution data when executing S206; and then transmitting the instruction with the execution data to a register array included in the chip through the instruction processor.
In some examples, the instruction processor includes an instruction processing subunit. The instruction processing subunit may perform modification processing on the instruction, for example, may add the execution data to the VLIW instruction.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an AI chip shown in the present application. It should be noted that the structure shown in fig. 3 is only schematic, and all devices included in the chip are not shown in the drawing.
As shown in fig. 3, the instruction processor 11 may include an instruction processing subunit.
The instruction memory may input the instruction and the execution data to an instruction processing subunit in the instruction processor.
The instruction processing subunit may then add the execution data to the instruction to form an instruction with the execution data.
In some examples, the execution data may be added to the beginning or end of the VLIW instruction, thereby ensuring that the execution data does not cause interference to the VLIW instruction.
In some examples, a reserved field may be included when designing the instruction format. This reserved field is used to add an immediate (data that can be directly fetched from the instruction) to the instruction. It should be noted that the present application does not limit the location of the reserved field.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating adding execution data to a VLIW instruction according to the present application.
As shown in fig. 4, when the execution data is added to the VLIW instruction, the execution data may be added to a reserved field included in the VLIW instruction.
In this way, the compute core can conveniently obtain the execution data without the execution data interfering with the VLIW instruction, which further improves the computing efficiency of the compute core.
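The reserved-field mechanism described above can be sketched as a simple bit-field write. This is a hedged illustration only: the field position (low 16 bits), its width, and the function name are assumptions, since the application does not fix a concrete instruction layout.

```python
# Hedged sketch: writing execution data into the reserved field of a
# VLIW instruction word. Field position and width are assumed.

RESERVED_SHIFT = 0                       # assumed: reserved field occupies the low bits
RESERVED_WIDTH = 16                      # assumed field width
RESERVED_MASK = (1 << RESERVED_WIDTH) - 1

def add_execution_data(vliw_word: int, execution_data: int) -> int:
    """Place execution_data in the reserved field, leaving all other
    bits of the VLIW word (opcodes, operands) untouched."""
    if execution_data > RESERVED_MASK:
        raise ValueError("execution data does not fit the reserved field")
    cleared = vliw_word & ~(RESERVED_MASK << RESERVED_SHIFT)
    return cleared | (execution_data << RESERVED_SHIFT)
```

Because only the reserved bits are overwritten, the execution data cannot disturb the rest of the instruction word, which is the property the paragraph above relies on.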
After the instruction with execution data is formed, it can be transmitted by the instruction processor to the register array included in the chip in single-instruction-multiple-data-stream (SIMD) fashion.
The compute cores in the register array may, in response to receiving the instruction with execution data, parse the instruction with execution data through a decoder to obtain execution data included in the instruction with execution data. The execution data may then be stored in a corresponding register, such that when the instruction is executed, the execution data may be retrieved from the register and the instruction may be executed according to the execution data.
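The decoder-side behavior just described might be sketched as follows. The field layout, register names, and class shape are illustrative assumptions; the sketch only shows extracting the immediate from a fixed reserved field and latching it into a register for the later execute step.

```python
# Hedged sketch: a compute core parses an instruction with execution
# data, extracts the immediate from the reserved field (layout assumed),
# and stores it in a register for later use.

RESERVED_SHIFT = 0
RESERVED_WIDTH = 16
RESERVED_MASK = (1 << RESERVED_WIDTH) - 1

class ComputeCore:
    def __init__(self) -> None:
        self.registers = {}              # register file, name -> value

    def decode(self, instr_with_data: int, dest_reg: str = "r0") -> int:
        """Extract the execution data from the reserved field and latch
        it into dest_reg, so executing the instruction can read it back."""
        execution_data = (instr_with_data >> RESERVED_SHIFT) & RESERVED_MASK
        self.registers[dest_reg] = execution_data
        return execution_data
```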
The following takes a convolution operation performed by the AI chip as an example.
Referring to fig. 5, fig. 5 is a flow chart of data transmission according to the present application.
As shown in fig. 5, the AI chip includes a scalar memory, a scalar processor, a number of data transfer memories, a data transfer controller, an array controller, and a register array.
The devices have the connection relationship shown in the figure. The scalar processor may include a VLIW instruction processing sub-unit. The register array may include an array controller for controlling each compute core and memory RAM included in the register array. Registers (not shown in FIG. 5) may be included within the compute cores.
The AI chip is connected to a compiler and an external memory.
The compiler is used to compile input VLIW instructions. In this example, the VLIW instruction may include execution steps such as inputting a feature map, inputting weight data, performing shift multiply-add operations, and outputting a convolution result. The VLIW instruction includes a reserved field.
The external memory stores weight data required for convolution operation.
In fig. 5, the compiler may send the compiled VLIW instruction to a scalar memory after completing the VLIW instruction compilation. The scalar memory may retrieve the weight data from the external memory in response to control of the main controller.
The scalar memory may send the fetched weight data and VLIW instructions to the VLIW instruction processing sub-unit.
The VLIW instruction processing sub-unit may add the weight data to the reserved field of the VLIW instruction. The scalar processor can then redistribute the VLIW instruction with the weight data added in SIMD fashion: it divides the VLIW instruction into a number of VLIW sub-instructions and sends them to the module units included in the AI chip (such as the compute cores in the register array, the data shaping unit controller, and the line buffer unit controller), so that each module unit executes its corresponding steps according to the received VLIW sub-instruction.
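The SIMD-style redistribution described above, one wide VLIW word cut into sub-instructions for the individual module units, might be sketched like this. The fixed 32-bit slot width, the slot ordering, and the `receive` interface are assumptions for illustration, not details given by the application.

```python
# Hedged sketch: splitting one wide VLIW word into fixed-width
# sub-instructions and handing one to each module unit (compute core,
# data shaping unit controller, line buffer unit controller, ...).

SLOT_WIDTH = 32  # assumed width of one VLIW sub-instruction

def split_vliw(vliw_word: int, n_slots: int) -> list:
    """Slice the wide instruction word into n_slots sub-instructions,
    lowest slot first."""
    mask = (1 << SLOT_WIDTH) - 1
    return [(vliw_word >> (SLOT_WIDTH * i)) & mask for i in range(n_slots)]

def dispatch(vliw_word: int, units: list) -> None:
    """Send one sub-instruction to each module unit."""
    for unit, sub in zip(units, split_vliw(vliw_word, len(units))):
        unit.receive(sub)
```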
When the VLIW sub-instructions are sent to the register array, the multiple data streams corresponding to them may first be stored in the data transfer memories. The data transfer controller then reads the data streams from the data transfer memories and sends them to the array controller.
The array controller can transmit the multiple data streams, through the memory RAM corresponding to each compute core in the register array, to the registers of the compute cores.
Each compute core can then obtain its VLIW sub-instruction from the register, decode it through the decoder, and execute the decoding result:
inputting feature map data to be convolved;
acquiring the weight data from the VLIW instruction;
performing shift, multiply, and add operations on the register array to complete the convolution operation;
and finally, outputting the convolution result.
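The four steps listed above amount to an ordinary multiply-accumulate convolution. As a hedged sketch, plain nested loops stand in for the chip's shift/multiply/add hardware, and the shapes and valid-mode output are assumptions for illustration:

```python
# Hedged sketch of the compute-core steps: take an input feature map
# tile and weight data (fetched from the instruction's reserved field),
# and produce the convolution output via multiply-accumulate.

def conv2d(feature_map, weights):
    """Valid-mode 2-D convolution (cross-correlation, as is conventional
    for neural-network accelerators): slide the weight window over the
    feature map and accumulate element-wise products."""
    fh, fw = len(feature_map), len(feature_map[0])
    kh, kw = len(weights), len(weights[0])
    out = [[0] * (fw - kw + 1) for _ in range(fh - kh + 1)]
    for i in range(fh - kh + 1):
        for j in range(fw - kw + 1):
            acc = 0
            for u in range(kh):
                for v in range(kw):
                    acc += feature_map[i + u][j + v] * weights[u][v]
            out[i][j] = acc
    return out
```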
It can be seen that the weight data is not stored in the AI chip permanently; it is read from the external memory when a convolution operation is performed and input into the register array by the data transmission method shown in the present application. This avoids allocating dedicated storage space in the AI chip for the execution data, which on the one hand simplifies the design of the AI chip, and on the other hand reduces the occupation of the AI chip's storage space.
The application also provides a chip. The chip comprises a data transmitter, an instruction processor and an instruction memory connected with the instruction processor.
The instruction memory is used for acquiring instructions and execution data stored in an external memory; wherein the execution data includes data required for executing the instructions;
the data transmitter is used for inputting the instruction and the execution data into the instruction processor;
the instruction processor is configured to transmit the instruction and the execution data to a register array included in the chip, so that each compute core executes the instruction according to the execution data.
In some illustrative embodiments, the instruction processor is configured to:
respectively transmit the instruction and the execution data to a register array included in the chip.
In some illustrative embodiments, the instruction processor is configured to:
adding the execution data to the instruction to form an instruction with the execution data;
and transmitting the instruction with the execution data to a register array included in the chip through the instruction processor.
In some embodiments shown, the instruction processor comprises an instruction processing subunit; the data transmitter is configured to:
input the instruction and the execution data into the instruction processing subunit in the instruction processor;
the instruction processing subunit is configured to add the execution data to the instruction to form an instruction with execution data;
the instruction processor is configured to:
transmit the instruction with execution data to a register array included in the chip in single-instruction-multiple-data-stream fashion.
In some embodiments shown, each of the compute cores includes a decoder;
each of the compute kernels is configured to, in response to receiving the instruction with execution data, parse the instruction with execution data through the decoder to obtain execution data included in the instruction with execution data; and storing the execution data into a corresponding register so as to obtain the execution data from the register when the instruction is executed, and executing the instruction according to the execution data.
In some embodiments, the instruction memory is connected to an instruction compiler corresponding to the chip;
the instruction memory is used for acquiring the instruction compiled by the compiler.
In some embodiments shown, the chip includes a data transfer controller, and a data transfer memory connected to the data transfer controller;
the instruction processor is used for sending the instruction and the execution data to the data transmission memory;
the data transmission memory is used for storing the instruction and the execution data;
the data transfer controller is configured to read the command and the execution data from the data transfer memory and send the read command and the execution data to the register array.
In some embodiments shown, the compute kernel is configured to, in response to the step of obtaining the execution data indicated by the instruction, obtain the execution data stored in a corresponding register;
and executing the subsequent steps indicated by the instruction according to the acquired execution data.
In some embodiments shown, the instruction processor comprises a scalar processor; the instruction memory includes a scalar memory.
In some embodiments shown, the instructions comprise Very Long Instruction Word (VLIW) instructions.
In some embodiments shown, the VLIW instruction comprises a convolution operation instruction; the execution data includes weight data.
The application also provides an electronic device, which comprises the chip shown in any one of the embodiments.
For example, the electronic device may be a smart terminal such as a mobile phone, or may be another device that has a camera and can perform image processing. For example, when the electronic device performs convolution processing on the acquired image, the chip of the embodiment of the present application may be used to transmit weight data required for the convolution task.
Because the chip can efficiently input the weight data into the compute cores for convolution operations and offers high performance, it can help improve the processing efficiency of convolution tasks and thus the performance of the electronic device.
The present application also provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a controller, implementing any of the data transmission methods described above.
One skilled in the art will recognize that one or more embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
"And/or" as used herein means at least one of the two; for example, "A and/or B" covers three cases: A alone, B alone, and both A and B.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this application may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this application and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows described above can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this application contains many implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as merely describing features of particular disclosed embodiments. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing is merely a preferred embodiment of one or more embodiments of the present application and is not intended to limit the scope of the one or more embodiments of the present application; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of one or more embodiments of the present application shall be included within that scope.

Claims (23)

1. A data transmission method is applied to a chip, and is characterized in that the chip comprises an instruction processor and an instruction memory connected with the instruction processor; the method comprises the following steps:
obtaining, through the instruction memory, instructions and execution data stored in an external memory; wherein the execution data comprises data required to execute the instruction;
inputting the instructions and the execution data into the instruction processor;
and transmitting the instructions and the execution data to a register array included in the chip through the instruction processor so that each computing core executes the instructions according to the execution data.
2. The method of claim 1, wherein the instruction comprises a Very Long Instruction Word (VLIW) instruction.
3. The method of claim 1 or 2, wherein transferring the instructions and the execution data by the instruction processor into a register array included in the chip comprises:
and respectively transmitting the instructions and the execution data to a register array included in the chip through the instruction processor.
4. The method of claim 1 or 2, wherein transferring the instructions and the execution data by the instruction processor into a register array included in the chip comprises:
adding the execution data to the instruction to form an instruction with the execution data;
and transmitting the instruction with the execution data to a register array included in the chip through the instruction processor.
5. The method of any of claims 1-4, wherein the instruction processor comprises an instruction processing subunit;
said inputting said instructions and said execution data into said instruction processor comprising:
inputting the instructions and the execution data into an instruction processing subunit in the instruction processor;
the method further comprises the following steps:
adding the execution data to the instruction through an instruction processing subunit to form an instruction with the execution data;
the transmitting, by the instruction processor, the instruction and the execution data into a register array included in the chip includes:
transmitting, by the instruction processor, the instruction with execution data into a register array included in the chip in a single instruction multiple data stream.
6. The method of claim 4 or 5, wherein adding the execution data to the instruction comprises:
adding the execution data to a reserved field included by the instruction.
7. The method of any of claims 4-6, wherein each compute kernel includes a decoder; and the executing, by each compute kernel, of the instruction according to the execution data comprises:
the computing kernel responds to the received instruction with the execution data, analyzes the instruction with the execution data through the decoder, and obtains the execution data included in the instruction with the execution data;
and the computing kernel stores the execution data into a corresponding register so as to acquire the execution data from the register when the instruction is executed and execute the instruction according to the execution data.
8. The method according to any one of claims 1-7, wherein the instruction memory is connected to an instruction compiler corresponding to the chip;
the fetching of instructions by the instruction memory includes:
and acquiring the instruction compiled by the compiler through the instruction memory.
9. The method of any of claims 1-8, wherein the chip comprises a data transfer controller, and a data transfer memory coupled to the data transfer controller;
the transmitting, by the instruction processor, the instruction and the execution data into a register array included in the chip includes:
sending, by the instruction processor, the instructions and the execution data to the data transfer memory;
and reading the instruction and the execution data from the data transmission memory through the data transmission controller, and sending the instruction and the execution data to the register array.
10. The method of any of claims 1-9, wherein executing the instructions according to the execution data by the compute cores comprises:
each computing kernel responds to the step of obtaining the execution data indicated by the instruction, and obtains the execution data stored in a corresponding register;
and executing the subsequent steps indicated by the instruction according to the acquired execution data.
11. The method of any of claims 1-10, wherein the instruction processor comprises a scalar processor; the instruction memory includes a scalar memory.
12. The method of any of claims 2-11, wherein the VLIW instruction comprises a convolution operation instruction; the execution data includes weight data.
13. A chip, wherein the chip comprises a data transmitter, an instruction processor, and an instruction memory coupled to the instruction processor;
the instruction memory is used for acquiring instructions and execution data stored in the external memory; wherein the execution data comprises data required to execute the instruction;
the data transmitter is used for inputting the instruction and the execution data into the instruction processor;
the instruction processor is used for transmitting the instructions and the execution data to a register array included in the chip so that each computing core executes the instructions according to the execution data.
14. The chip of claim 13, wherein the instruction processor is configured to:
and respectively transmitting the instructions and the execution data to a register array included in the chip through the instruction processor.
15. The chip of claim 13, wherein the instruction processor is configured to:
adding the execution data to the instruction to form an instruction with the execution data;
and transmitting the instruction with the execution data to a register array included in the chip through the instruction processor.
16. The chip according to any one of claims 13 to 15, wherein the instruction processor comprises an instruction processing subunit; the data transmitter is used for:
inputting the instructions and the execution data into the instruction processing subunit in the instruction processor;
the instruction processing subunit is configured to add the execution data to the instruction to form an instruction with execution data;
the instruction processor is to:
transmitting, by the instruction processor, the instruction with execution data into a register array included in the chip in a single instruction multiple data stream.
17. The chip of any one of claims 15 or 16, wherein each compute core includes a decoder;
each computing kernel is configured to, in response to receiving the instruction with the execution data, parse the instruction with the execution data through the decoder to obtain the execution data included in the instruction with the execution data; and storing the execution data into a corresponding register so as to obtain the execution data from the register when the instruction is executed and execute the instruction according to the execution data.
18. The chip of any one of claims 13-17, wherein the instruction memory is coupled to an instruction compiler corresponding to the chip;
the instruction memory is used for acquiring the instruction compiled by the compiler.
19. The chip according to any one of claims 13-18, wherein the chip comprises a data transfer controller, and a data transfer memory connected to the data transfer controller;
the instruction processor is used for sending the instructions and the execution data to the data transmission memory;
the data transmission memory is used for storing the instructions and the execution data;
and the data transmission controller is used for reading the instruction and the execution data from the data transmission memory and sending the instruction and the execution data to the register array.
20. The chip according to any one of claims 13 to 19, wherein the compute core is configured to, in response to the step of obtaining the execution data indicated by the instruction, obtain the execution data stored in a corresponding register;
and executing the subsequent steps indicated by the instruction according to the acquired execution data.
21. The chip according to any of claims 13-20, wherein said instruction processor comprises a scalar processor; the instruction memory includes a scalar memory.
22. An electronic device comprising a chip according to any of claims 13-21.
23. A computer-readable storage medium on which a computer program is stored, the program, when executed by a controller, implementing the data transmission method of any one of claims 1 to 12.
CN202110518021.2A 2021-01-29 2021-05-12 Data transmission method, chip, equipment and storage medium Active CN113032013B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021101275938 2021-01-29
CN202110127593.8A CN112860318A (en) 2021-01-29 2021-01-29 Data transmission method, chip, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113032013A true CN113032013A (en) 2021-06-25
CN113032013B CN113032013B (en) 2023-03-28

Family

ID=75986887

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110127593.8A Pending CN112860318A (en) 2021-01-29 2021-01-29 Data transmission method, chip, equipment and storage medium
CN202110518021.2A Active CN113032013B (en) 2021-01-29 2021-05-12 Data transmission method, chip, equipment and storage medium

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110127593.8A Pending CN112860318A (en) 2021-01-29 2021-01-29 Data transmission method, chip, equipment and storage medium

Country Status (1)

Country Link
CN (2) CN112860318A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702852A (en) * 2023-08-02 2023-09-05 电子科技大学 Dynamic reconfiguration neural network acceleration circuit and system based on multistage event driving

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114428630B (en) * 2022-03-31 2022-07-01 浙江地芯引力科技有限公司 Chip algorithm upgrading method and device and chip
CN116756079B (en) * 2023-08-21 2023-11-21 电子科技大学 Multi-task intelligent processor based on high-capacity nonvolatile storage

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4573118A (en) * 1983-03-31 1986-02-25 Fairchild Camera & Instrument Corporation Microprocessor with branch control
CN101021778A (en) * 2007-03-19 2007-08-22 中国人民解放军国防科学技术大学 Computing group structure for superlong instruction word and instruction flow multidata stream fusion
CN103761213A (en) * 2014-02-14 2014-04-30 上海交通大学 On-chip array system based on circulating pipeline computation
CN104049943A (en) * 2013-03-15 2014-09-17 英特尔公司 Limited Range Vector Memory Access Instructions, Processors, Methods, And Systems
CN105893660A (en) * 2016-03-30 2016-08-24 桂林电子科技大学 CPU design method and calculating system oriented at symbol BDD operation
US20170083314A1 (en) * 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Initiating instruction block execution using a register access instruction
US20170357512A1 (en) * 2016-06-14 2017-12-14 Imagination Technologies Limited Executing Memory Requests Out of Order
WO2018000765A1 (en) * 2016-06-27 2018-01-04 深圳市中兴微电子技术有限公司 Co-processor, data reading method, processor system and storage medium
CN108369515A (en) * 2015-12-30 2018-08-03 英特尔公司 System, apparatus and method for the load that strides
CN109978749A (en) * 2017-12-22 2019-07-05 三星电子株式会社 Graphics processor, rendering system and the method for operating graphics processor
CN110149802A (en) * 2015-04-23 2019-08-20 谷歌有限责任公司 Compiler for being translated between the target hardware with two-dimensional shift array structure in Virtual Image Processor instruction set architecture (ISA)
CN110622134A (en) * 2017-05-17 2019-12-27 谷歌有限责任公司 Special neural network training chip
CN110895445A (en) * 2018-09-12 2020-03-20 华为技术有限公司 Data processing method and system
CN110990060A (en) * 2019-12-06 2020-04-10 北京瀚诺半导体科技有限公司 Embedded processor, instruction set and data processing method of storage and computation integrated chip
CN111125617A (en) * 2019-12-23 2020-05-08 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAVID MONEY HARRIS: "Instruction memory", HTTPS://WWW.SCIENCEDIRECT.COM/TOPICS/COMPUTER-SCIENCE/INSTRUCTION-MEMORY *
ZHAO Weisheng: "Research progress of STT-MRAM", Scientia Sinica: Physica, Mechanica & Astronomica *
SHAO Xiang: "Design and optimization of a multi-core hybrid reconfigurable computing system", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702852A (en) * 2023-08-02 2023-09-05 University of Electronic Science and Technology of China Dynamically reconfigurable neural network acceleration circuit and system based on multi-level event driving
CN116702852B (en) * 2023-08-02 2023-10-20 University of Electronic Science and Technology of China Dynamically reconfigurable neural network acceleration circuit and system based on multi-level event driving

Also Published As

Publication number Publication date
CN113032013B (en) 2023-03-28
CN112860318A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN113032013B (en) Data transmission method, chip, equipment and storage medium
CN109388595B (en) High bandwidth memory system and logic die
CN111656367A (en) System and architecture for neural network accelerator
US8819345B2 (en) Method, apparatus, and computer program product for inter-core communication in multi-core processors
CN111913652A (en) Memory device including processing circuit, memory controller, and memory system
TWI768383B (en) Instructions for operating accelerator circuit
US11003429B1 (en) Compile-time scheduling
CN111324294B (en) Method and device for accessing tensor data
CN112506437A (en) Chip, data moving method and electronic equipment
JP6998991B2 (en) Information processing method and apparatus
CN111047036B (en) Neural network processor, chip and electronic equipment
CN110991619A (en) Neural network processor, chip and electronic equipment
US11175919B1 (en) Synchronization of concurrent computation engines
US8825465B2 (en) Simulation apparatus and method for multicore system
US11500962B1 (en) Emulating fine-grained sparsity in a systolic array
CN111047035A (en) Neural network processor, chip and electronic equipment
WO2022227563A1 (en) Hardware circuit, data migration method, chip, and electronic device
CN102542525A (en) Information processing equipment and information processing method
CN111832714B (en) Operation method and device
US11803736B1 (en) Fine-grained sparsity computations in systolic array
CN108268280B (en) Processor of semiconductor device and operation method thereof
Liu et al. ePUMA embedded parallel DSP processor with Unique Memory Access
CN113077042A (en) Data reuse and efficient processing method of convolutional neural network
CN111209041B (en) Neural network processor, system on chip and electronic equipment
JP6222079B2 (en) Computer system, processing method thereof, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant