CN117033298A - Tile processor, SOC chip and electronic equipment - Google Patents


Info

Publication number
CN117033298A
Authority
CN
China
Prior art keywords
data
tile
instruction
shared memory
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310722611.6A
Other languages
Chinese (zh)
Other versions
CN117033298B (en)
Inventor
邵平平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tiantian Zhixin Semiconductor Technology Co ltd
Original Assignee
Shanghai Tiantian Smart Core Semiconductor Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tiantian Smart Core Semiconductor Co ltd filed Critical Shanghai Tiantian Smart Core Semiconductor Co ltd
Publication of CN117033298A publication Critical patent/CN117033298A/en
Application granted granted Critical
Publication of CN117033298B publication Critical patent/CN117033298B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application relates to a tile processor, an SOC chip and electronic equipment, and belongs to the field of electronic circuits. The tile processor includes: an L2 shared memory, a tile scheduler, a computer unit, and a tile load store unit. The computer unit is electrically connected with the tile scheduler and the L2 shared memory respectively; the computer unit is configured to send the acquired first instruction of the first tile type to the tile scheduler. The tile load store unit is electrically connected with the tile scheduler and the L2 shared memory respectively; the tile load store unit is configured to receive the first instruction sent by the tile scheduler and, in response to the first instruction, read data of a first size from the global memory and write the data to the L2 shared memory, or read data of the first size from the L2 shared memory and write the data to the global memory. The tile processor can rapidly read and write data over a newly added path (LLC-L2 shared memory-computer unit internal path), greatly improving data read-write efficiency.

Description

Tile processor, SOC chip and electronic equipment
Technical Field
The application belongs to the field of electronic circuits, and particularly relates to a tile processor, an SOC chip and electronic equipment.
Background
In current SOC chip designs, each thread has its own private register (4 bytes of storage) to hold data, and a certain number of threads form a block (group) within which the threads exchange data via the L1 shared memory. With the rise of AI (Artificial Intelligence) applications, matrix multiplication and convolution have become important and fundamental operations; large matrix or convolution operations require more threads to participate in the computation and to exchange data so that data can be reused, reducing repeated reads of the data.
Disclosure of Invention
In view of the above, an object of the present application is to provide a tile processor, an SOC chip and an electronic device, so as to improve data read-write efficiency and the data reuse rate.
Embodiments of the present application are implemented as follows:
In a first aspect, an embodiment of the present application provides a tile processor, including: an L2 shared memory, a tile scheduler, a computer unit, and a tile load store unit. The L2 shared memory is configured to be electrically connected with a last level cache, and the last level cache is configured to be electrically connected with the global memory. The computer unit is electrically connected with the tile scheduler and the L2 shared memory respectively; the computer unit is configured to send the acquired first instruction of the first tile type to the tile scheduler. The tile load store unit is electrically connected with the tile scheduler and the L2 shared memory respectively; the tile load store unit is configured to receive the first instruction sent by the tile scheduler and, in response to the first instruction, read data of a first size from the global memory and write the data into the L2 shared memory, or read data of the first size from the L2 shared memory and write the data into the global memory.
In the embodiment of the application, the structure of the existing tile processor is improved by at least adding the L2 shared memory, the tile scheduler and the tile load store unit, so that when the computer unit reads and writes data it can use the existing path (LLC-L2 Cache-computer unit internal path) and can also rapidly read and write data over the newly added path (LLC-L2 shared memory-computer unit internal path), greatly improving data read-write efficiency. Meanwhile, the newly added L2 shared memory allows data to be shared among the modules in the computer unit, improving data utilization efficiency. In addition, the value of the first size can be changed by configuring parameters in the first instruction to meet various data loading or storage requirements, giving the design a programmable character.
With reference to a possible implementation manner of the embodiment of the first aspect, the tile load store unit is specifically configured to parse the first instruction. If the first instruction is a load instruction, it acquires the first read base address, first read offset and first write offset carried in the first instruction, generates a first data read address from the first read base address and the first read offset, generates a first data write address from a preset first write base address and the first write offset, and then reads data of the first size from the global memory location pointed to by the first data read address and writes it to the L2 shared memory location pointed to by the first data write address. If the first instruction is a store instruction, it acquires the second read offset, second write base address and second write offset carried in the first instruction, generates a second data read address from a preset second read base address and the second read offset, generates a second data write address from the second write base address and the second write offset, and then reads data of the first size from the L2 shared memory location pointed to by the second data read address and writes it to the global memory location pointed to by the second data write address.
In the embodiment of the application, generating the data read and write addresses in this way and reading and writing data accordingly enables accurate data access and strengthens the reliability of the scheme.
With reference to a possible implementation manner of the embodiment of the first aspect, the computer unit is further configured to, in response to an acquired second instruction of a second tile type, read data of a second size from the L2 shared memory and write the data into a register included in the computer unit, or read data of the second size from the register included in the computer unit and write the data into the L2 shared memory.
In the embodiment of the application, the internal logic of the computer unit is improved so that it can directly read data of the second size from the L2 shared memory and write it into a register included in the computer unit, or read data of the second size from a register included in the computer unit and write it into the L2 shared memory. That is, the L1 cache can be skipped: data in the L2 shared memory can be loaded directly into the registers of the computer unit, and data in those registers can be stored directly into the L2 shared memory, greatly improving data processing efficiency.
With reference to a possible implementation manner of the first aspect embodiment, the computer unit includes: a pipelined processing unit electrically connected to the tile scheduler, the pipelined processing unit configured to send the first instruction to the tile scheduler.
With reference to a possible implementation manner of the embodiment of the first aspect, the pipelined processing unit is further electrically connected to the L2 shared memory, and is further configured to, in response to an acquired second instruction of the second tile type, read data of the second size from the L2 shared memory and write the data into a register included in the pipelined processing unit, or read data of the second size from the register included in the pipelined processing unit and write the data into the L2 shared memory.
In the embodiment of the application, the internal logic of the pipelined processing unit is improved so that it can directly read data of the second size from the L2 shared memory and write it into a register included in the pipelined processing unit, or read data of the second size from that register and write it into the L2 shared memory. That is, the L1 cache can be skipped: data in the L2 shared memory can be loaded directly into the registers of the pipelined processing unit, and data in those registers can be stored directly into the L2 shared memory, greatly improving data processing efficiency.
With reference to a possible implementation manner of the first aspect embodiment, the computer unit further includes: an L1 shared memory and a block load store unit. The block load store unit is electrically connected with the L2 shared memory and the L1 shared memory respectively. The pipelined processing unit is further configured to send the acquired third instruction of the first block type to the block load store unit through the tile scheduler. The block load store unit is configured to, in response to the third instruction, read data of a third size from the L2 shared memory and write the data into the L1 shared memory, or read data of the third size from the L1 shared memory and write the data into the L2 shared memory.
In the embodiment of the application, the internal logic and structure of the computer unit are further improved by newly adding the block load store unit, so that data can be loaded directly from the L2 shared memory into the L1 shared memory, increasing data loading efficiency, and data in the L1 shared memory can be stored directly into the L2 shared memory, increasing data storage efficiency.
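The direct L2-to-L1 transfer described above can be sketched as plain data movement between hierarchy levels. This is a minimal illustrative model, not the patent's actual hardware interface; the class and function names are assumptions for illustration only.

```python
# Hypothetical sketch of the data movement performed by the block load store
# unit: it copies a third-size span of bytes between the L2 shared memory and
# the L1 shared memory directly, without passing through the L1 cache.
class MemoryLevel:
    """A memory level modeled as a named byte buffer (illustrative only)."""
    def __init__(self, name, size):
        self.name = name
        self.data = bytearray(size)

def block_load(l2, l1, src_off, dst_off, third_size):
    # Block load: L2 shared memory -> L1 shared memory, bypassing the L1 cache.
    l1.data[dst_off:dst_off + third_size] = l2.data[src_off:src_off + third_size]

def block_store(l1, l2, src_off, dst_off, third_size):
    # Block store: L1 shared memory -> L2 shared memory, the reverse direction.
    l2.data[dst_off:dst_off + third_size] = l1.data[src_off:src_off + third_size]
```

Because each call moves a whole third-size span in one step, one block-type instruction replaces many small per-thread copies through the cache path.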
With reference to a possible implementation manner of the embodiment of the first aspect, the pipelined processing unit is further electrically connected to the L1 shared memory, and is further configured to, in response to an acquired fourth instruction of a second block type, read data of a fourth size from the L1 shared memory and write the data into a register included in the pipelined processing unit, or read data of the fourth size from the register included in the pipelined processing unit and write the data into the L1 shared memory.
In the embodiment of the application, block-type instructions are introduced (one such instruction loads or stores as much data as several ordinary load or store instructions), so that a single fourth instruction can perform the work of several ordinary instructions, reducing the number of instructions and improving execution efficiency.
With reference to a possible implementation manner of the first aspect embodiment, the pipelined processing unit includes: a base scheduler electrically connected to the tile scheduler, the base scheduler being configured to send the acquired first instruction of the first tile type to the tile scheduler.
With reference to a possible implementation manner of the first aspect embodiment, the pipelined processing unit further includes: a register file; the base scheduler is further configured to send the acquired third instruction of the first block type to the block load store unit via the tile scheduler.
In the embodiment of the application, the register file is added, and the base scheduler can be further configured to send the acquired third instruction of the first block type to the block load store unit through the tile scheduler.
With reference to a possible implementation manner of the first aspect embodiment, the pipelined processing unit further includes a thread load store unit. The thread load store unit is electrically connected with the register file, the base scheduler and the L2 shared memory respectively, and is configured to, in response to an acquired second instruction of the second tile type, read data of a second size from the L2 shared memory and write the data into the register file, or read data of the second size from the register file and write the data into the L2 shared memory. And/or, the thread load store unit is electrically connected with the register file, the base scheduler and the L1 shared memory respectively, and is configured to, in response to an acquired fourth instruction of the second block type, read data of a fourth size from the L1 shared memory and write the data into the register file, or read data of the fourth size from the register file and write the data into the L1 shared memory.
In the embodiment of the application, the thread load store unit is given these new functions, which improves data read-write efficiency.
With reference to a possible implementation manner of the first aspect embodiment, the base scheduler is further configured to count the instructions to be executed that it has sent, obtaining a first value; to count the completion signals it has received, obtaining a second value; and, in response to an acquired wait instruction, to suspend instruction scheduling while the difference between the first value and the second value is greater than the wait value carried in the wait instruction, where each instruction to be executed generates one completion signal after its execution completes.
In the embodiment of the application, counting the instructions to be executed and counting the completion signals makes it possible to track accurately how many of the issued instructions have finished executing. When the difference between the first value and the second value is greater than the wait value carried in the wait instruction, instruction scheduling is suspended, which avoids run-time errors (for example, some subsequent operations need the results of earlier operations, so the later instructions can only run after the earlier instructions have finished) and congestion.
In a second aspect, an embodiment of the present application further provides an SOC chip, including: a last level cache and at least one tile processor as provided in the above-described first aspect embodiment and/or in combination with any one of the possible implementations of the first aspect embodiment, each of the tile processors being electrically connected to the last level cache.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a global memory and the SOC chip described above, the global memory being electrically connected with the SOC chip.
Additional features and advantages of the application will be set forth in the description which follows. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. The above and other objects, features and advantages of the present application will become more apparent from the accompanying drawings.
FIG. 1 shows a block diagram of a tile processor according to an embodiment of the present application.
Fig. 2 shows a block diagram of a computer unit according to an embodiment of the present application.
FIG. 3 shows a block diagram of a pipelined processing unit according to an embodiment of the present application.
FIG. 4 shows a block diagram of yet another tile processor provided by an embodiment of the present application.
Fig. 5 shows a schematic diagram of a tile processor connected to an LLC according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. The following examples are given by way of illustration for more clearly illustrating the technical solution of the present application, and are not to be construed as limiting the scope of the application. Those skilled in the art will appreciate that the embodiments described below and features of the embodiments can be combined with one another without conflict.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely to distinguish one entity or action from another entity or action in the description of the application without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Furthermore, the term "and/or" in the present application is merely an association relationship describing the association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone.
In the description of the embodiments of the present application, unless explicitly specified and limited otherwise, the term "electrically coupled" may be either directly or indirectly via an intermediate medium.
First embodiment
In order to improve the read-write efficiency of data and the data reuse rate, the embodiment of the application provides a tile processor (Tile Processing Unit, TPU) as shown in fig. 1. The tile processor includes: an L2 shared memory, a Tile Scheduler (TS), a Tile Load Store Unit (Tile LSU for short), and at least one Computer Unit (CU). The tile processor may further include an L2 Cache, i.e., a level two cache (not shown), the L2 Cache being electrically connected to each computer unit and configured to be electrically connected to a last level cache.
The L2 shared Memory is configured to be electrically connected to a Last Level Cache (LLC) that is configured to be electrically connected to a Global Memory (Global Memory). Each computer unit is electrically connected to the tile scheduler and the L2 shared memory, respectively, and the tile load store unit is electrically connected to the tile scheduler, the L2 shared memory, and is configured to be electrically connected to the last level cache and the global memory, respectively.
In some implementations, each computer unit is configured to send the acquired first instruction of the first tile type to the tile scheduler. Specifically, each computer unit obtains a set of instructions to be executed (from user programming), decodes each instruction in the set in sequence, and when it decodes a first instruction of the first tile type, sends the first instruction to the tile scheduler. Furthermore, the computer unit is configured to execute the instructions in the decoded set other than the first instruction.
The first instruction of the first tile type may be a load instruction, for example a load instruction such as tile_load_64, tile_load_128 or tile_load_256. The first instruction may also be a store instruction, such as tile_store_64, tile_store_128 or tile_store_256.
The three load instructions tile_load_64, tile_load_128 and tile_load_256 differ only in the size of the data to be loaded: for example, the tile_load_64 instruction reads 64 bytes of data from the global memory and writes them to the L2 shared memory, and similarly the tile_load_128 instruction reads 128 bytes of data from the global memory and writes them to the L2 shared memory. The store instructions are analogous: for example, the tile_store_64 instruction reads 64 bytes of data from the L2 shared memory and writes them to the global memory, and the tile_store_256 instruction reads 256 bytes of data from the L2 shared memory and writes them to the global memory. In the embodiment of the application, the value of the first size can be set according to the user's requirements, so the design is programmable and data is no longer loaded or stored only in a fixed size.
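The tile_load_N / tile_store_N family described above can be modeled as one opcode template whose numeric suffix selects the transfer size. The following sketch is an illustrative assumption of those semantics, not the actual instruction encoding; the table and function names are invented for illustration.

```python
# Illustrative model: the suffix of each tile instruction selects the first
# size, so one template covers the 64/128/256-byte load and store variants.
TILE_SIZES = {
    "tile_load_64": 64, "tile_load_128": 128, "tile_load_256": 256,
    "tile_store_64": 64, "tile_store_128": 128, "tile_store_256": 256,
}

def execute_tile_instruction(opcode, global_mem, l2_shared, src, dst):
    """Move `n` bytes between global memory and L2 shared memory, where `n`
    (the first size) is determined by the opcode suffix."""
    n = TILE_SIZES[opcode]
    if opcode.startswith("tile_load"):
        # Load: global memory -> L2 shared memory.
        l2_shared[dst:dst + n] = global_mem[src:src + n]
    else:
        # Store: L2 shared memory -> global memory.
        global_mem[dst:dst + n] = l2_shared[src:src + n]
    return n
```

Making the size a parameter of the instruction, rather than a hardware constant, is what gives the scheme its programmable character.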
The tile scheduler is responsible for scheduling instructions and is configured to schedule the first instruction to the tile load store unit after receiving it from a computer unit.
The tile load store unit is configured to receive a first instruction sent by the tile scheduler and, in response to the first instruction, read data of a first size from the global memory and write the data to the L2 shared memory, or read data of the first size from the L2 shared memory and write the data to the global memory.
Specifically, the tile load store unit is configured to parse the first instruction. If the first instruction is a load instruction, it acquires the first read base address, first read offset and first write offset carried in the first instruction, generates a first data read address from the first read base address and the first read offset, generates a first data write address from a preset first write base address (the base address of the L2 shared memory) and the first write offset, and reads data of the first size from the global memory location pointed to by the first data read address into the L2 shared memory location pointed to by the first data write address. If the first instruction is a store instruction, it acquires the second read offset, second write base address and second write offset carried in the first instruction, generates a second data read address from a preset second read base address (the base address of the L2 shared memory) and the second read offset, generates a second data write address from the second write base address and the second write offset, and reads data of the first size from the L2 shared memory location pointed to by the second data read address into the global memory location pointed to by the second data write address.
The tile load store unit includes an address generator (denoted AddGen) that generates the data read and write addresses from the information carried in the first instruction, such as the first read base address, first read offset and first write offset, or the second read offset, second write base address and second write offset.
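The address generation just described follows one pattern: for a load, the read side comes from the base/offset carried in the instruction and the write side from the preset L2 base; for a store, the roles are flipped. A minimal sketch under that assumption (field names are invented for illustration, not the patent's encoding):

```python
# Hypothetical model of AddGen: combine a base address with an offset on each
# side of the transfer, taking the preset L2 shared memory base for whichever
# side is not carried in the instruction.
def gen_addresses(instr, l2_base):
    if instr["kind"] == "load":
        # Load: read from global memory, write into L2 shared memory.
        read_addr = instr["read_base"] + instr["read_offset"]
        write_addr = l2_base + instr["write_offset"]
    else:
        # Store: read from L2 shared memory, write into global memory.
        read_addr = l2_base + instr["read_offset"]
        write_addr = instr["write_base"] + instr["write_offset"]
    return read_addr, write_addr
```

Keeping the L2 base preset in hardware means the instruction only needs to carry offsets on the shared-memory side, shortening the instruction encoding.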
In some embodiments, as shown in fig. 2, the computer unit may include: at least one pipelined processing unit (SPP), an L1 shared memory, and an L1 Cache. Each pipelined processing unit is electrically connected with the L1 shared memory, the L1 Cache, the tile scheduler and the tile load store unit. The L1 shared memory, the L1 Cache and the L2 Cache are electrically connected with one another, and the L1 Cache is electrically connected with the L2 shared memory.
The pipelined processing unit is configured to send the first instruction to the tile scheduler. Specifically, the pipelined processing unit obtains a set of instructions to be executed (from user programming), decodes each instruction in the set in sequence, and when it decodes a first instruction of the first tile type, sends the first instruction to the tile scheduler. The pipelined processing unit is further configured to execute the instructions in the decoded set other than the first instruction, e.g., in response to an acquired ordinary load or store instruction (which may be considered a fifth instruction), to write data from the register file (RF) to the L1 Cache and the L1 shared memory, or to load data from the L1 Cache and the L1 shared memory into the RF. It should be noted that an ordinary load or store instruction loads or stores at most 4 bytes at a time, i.e., the amount of data loaded or stored by the fifth instruction is small.
In some embodiments, the pipelined processing unit may include a base scheduler (NS), a Thread Load Store Unit (Thr LSU for short), at least one Register File (RF), and at least one arithmetic logic unit (NR for short); a schematic diagram is shown in fig. 3.
The base scheduler is electrically connected with the thread load store unit, each arithmetic logic unit and the tile scheduler respectively. Each arithmetic logic unit corresponds to a register file, and each register file is electrically connected with its corresponding arithmetic logic unit, the L1 shared memory and the L1 Cache. Each NR has multiple parallel processing lanes (e.g., 16 threads), each with its own register.
The base scheduler is responsible for instruction fetch, decode, scheduling and related tasks. For example, the base scheduler obtains a set of instructions to be executed (from user programming) and decodes each instruction in sequence: when it decodes a first instruction of the first tile type, it sends the first instruction to the tile scheduler; when it decodes an operation instruction to be executed by an arithmetic logic unit, it sends that instruction to the arithmetic logic unit for execution; and when it decodes a load or store instruction to be executed by the thread load store unit, it sends that instruction to the thread load store unit for execution.
Optionally, the base scheduler is further configured to count the instructions to be executed that it has sent, obtaining a first value; to count the completion signals received, obtaining a second value; and, in response to an acquired wait instruction, to suspend instruction scheduling while the difference between the first value and the second value is greater than the wait value (which is configurable) carried in the wait instruction, where each instruction to be executed generates one completion signal after its execution completes. In this embodiment, the base scheduler counts each instruction to be executed that it sends to the thread load store unit, the arithmetic logic unit or the tile scheduler, and each time the thread load store unit, the arithmetic logic unit or the tile scheduler finishes executing an instruction, a completion signal is returned to the base scheduler.
The wait instruction may be one of the instructions in the set to be executed. While the difference between the first value and the second value is greater than the wait value carried in the wait instruction, instruction scheduling is suspended, which avoids run-time errors (for example, some subsequent operations need the results of earlier operations, so the later instructions can only run after the earlier instructions have finished) and congestion.
It will be appreciated that the counting process described above may also be implemented as a single counter: the base scheduler adds 1 for each instruction it issues and subtracts 1 for each completion signal it receives.
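The scheduler's bookkeeping above amounts to tracking how many issued instructions are still in flight and stalling while that count exceeds the wait value. A minimal sketch, with illustrative method names (the patent does not specify this interface):

```python
# Hypothetical model of the base scheduler's wait mechanism: the first value
# counts issued instructions, the second value counts completion signals, and
# a wait instruction stalls scheduling while (first - second) > wait_value.
class BaseScheduler:
    def __init__(self):
        self.issued = 0      # first value: instructions sent for execution
        self.completed = 0   # second value: completion signals received

    def issue(self):
        # Called each time an instruction is sent to the thread LSU,
        # an arithmetic logic unit, or the tile scheduler.
        self.issued += 1

    def on_completion(self):
        # Called each time an executing unit returns a completion signal.
        self.completed += 1

    def must_stall(self, wait_value):
        # Suspend scheduling while too many instructions are outstanding.
        return (self.issued - self.completed) > wait_value
```

A wait value of 0 forces all earlier instructions to finish before scheduling resumes, which is how a later instruction can safely consume an earlier instruction's result.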
The thread load store unit is configured to write data from the RF to the L1 Cache and the L1 shared memory in response to an ordinary store instruction sent by the base scheduler, or to load data from the L1 Cache and the L1 shared memory into the RF in response to an ordinary load instruction sent by the base scheduler.
According to the embodiment of the application, the structure of the existing tile processor is improved by at least adding the L2 shared memory, the tile scheduler and the tile load store unit, so that when a computer unit reads and writes data it can use the existing path (LLC-L2 Cache-computer unit internal path) and can also rapidly read and write data over the newly added path (LLC-L2 shared memory-computer unit internal path), greatly improving data read-write efficiency. Meanwhile, the newly added L2 shared memory allows data to be shared among the modules in the computer unit, improving data utilization efficiency. In addition, the value of the first size can be changed by configuring parameters in the first instruction to meet various data loading or storage requirements, giving the design a programmable character.
Second embodiment
In some embodiments, the computer unit is further configured, in response to an acquired second instruction of the second tile type, to read data of the second size from the L2 shared memory and write it into a register contained in the computer unit, or to read data of the second size from a register contained in the computer unit and write it into the L2 shared memory. In this embodiment, the L2 shared memory is also directly electrically connected to each register file (RF) in the pipelined processing unit.
In the embodiment of the application, besides improving the structure of the tile processor, the internal logic of the computer unit is improved so that it can directly read data of the second size from the L2 shared memory and write it into a register contained in the computer unit, or read data of the second size from a register contained in the computer unit and write it into the L2 shared memory. That is, the L1 Cache can be skipped: data in the L2 shared memory can be loaded directly into a register contained in the computer unit, or data in such a register can be stored directly into the L2 shared memory, greatly improving data processing efficiency.
The second instruction of the second tile type may be a load instruction, for example, may be a load instruction such as tile_load_4, tile_load_8, tile_load_16, and the like. The second instruction may also be a store instruction, such as tile_store_4, tile_store_8, tile_store_16, and the like.
It will be appreciated that the second instruction and the first instruction are both tile-type instructions; the difference is that the second instruction loads (or stores) less data than the first instruction. That is, to load (or store) the same amount of data, multiple second instructions must be executed to match the amount handled by one first instruction. For example, the amount of data loaded by one tile_load_64 instruction equals that loaded by 16 tile_load_4 instructions, by 8 tile_load_8 instructions, or by 4 tile_load_16 instructions. In brief, the second size is smaller than the first size.
The tile-type instructions in the embodiments of the present application thus include two types: the first tile type (belonging to the tile-layer instructions) and the second tile type (belonging to the thread-layer instructions). Tile-layer instructions are executed by the tile load store unit, and thread-layer instructions are executed by the thread load store unit.
Specifically, the computer unit is configured to parse the second instruction. If the second instruction is a load instruction, it acquires the register number and the read offset carried in the second instruction, generates a first data read address based on a preset read base address (the base address of the L2 shared memory) and the read offset, reads data of the second size from the L2 shared memory location pointed to by the first data read address, and writes it into the register, contained in the computer unit, pointed to by the register number. If the second instruction is a store instruction, it acquires the register number and the write offset carried in the second instruction, generates a data write address based on a preset write base address (the base address of the L2 shared memory) and the write offset, reads data of the second size from the register, contained in the computer unit, pointed to by the register number, and writes it into the designated location of the L2 shared memory (the location pointed to by the data write address).
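The decode and address-generation steps just described can be sketched as follows; the function name, return fields, and use of the mnemonic prefix as a dispatch key are illustrative assumptions, not a specification of the actual hardware decode:

```python
def decode_second_tile_instruction(mnemonic, reg_num, offset,
                                   l2_read_base, l2_write_base):
    """Sketch of second-instruction handling: a load adds the read
    offset to the L2 shared memory read base; a store adds the write
    offset to the L2 shared memory write base."""
    if mnemonic.startswith("tile_load"):
        # Data flows L2 shared memory -> register `reg_num`.
        return {"read_addr": l2_read_base + offset, "dest_reg": reg_num}
    if mnemonic.startswith("tile_store"):
        # Data flows register `reg_num` -> L2 shared memory.
        return {"write_addr": l2_write_base + offset, "src_reg": reg_num}
    raise ValueError(f"not a second-tile-type instruction: {mnemonic}")
```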
In the second embodiment, the structure of the computer unit may be as shown in fig. 2, and includes: at least one pipelined processing unit, an L1 shared memory, and an L1 Cache.
The structure of each pipelined processing unit may be as shown in fig. 3, and includes: a base scheduler (NS), a thread load store unit (Thread Load Store Unit, Thr LSU for short), at least one register file (RF), and at least one arithmetic logic unit (NR for short).
It will be appreciated that the second embodiment differs from the first embodiment in that the L2 shared memory is connected to the computer unit in a different way, which also gives the computer unit new functions. Specifically, besides the L1 Cache in the computer unit, the L2 shared memory is also electrically connected to the registers and the thread load store unit contained in the computer unit.
In the second embodiment, the pipelined processing unit in the computer unit is specifically configured, in response to the acquired second instruction of the second tile type, to read data of the second size from the L2 shared memory and write it into a register contained in the computer unit, or to read data of the second size from a register contained in the computer unit and write it into the L2 shared memory.
Alternatively, the thread load store unit in the pipelined processing unit may be configured to read data of the second size from the L2 shared memory and write it to a register included in the computer unit, or to read data of the second size from a register included in the computer unit and write it to the L2 shared memory, in response to the fetched second instruction of the second tile type. At this time, the thread load store unit is also electrically connected to the L2 shared memory.
Third embodiment
In some embodiments, the computer unit may include, besides the pipelined processing unit, an L1 shared memory and a block load store unit (Block LSU, abbreviated as Blk LSU), as shown in the schematic diagram of fig. 4. The block load store unit is electrically connected to the L2 shared memory and the L1 shared memory respectively. In this way, not only is the internal logic of the computer unit improved, but its structure is further improved by newly adding the block load store unit.
At this point, the pipelined processing unit is further configured to send the acquired third instruction of the first block type to the block load store unit through the tile scheduler. Alternatively, it may be the base scheduler in the pipelined processing unit that is configured to send the acquired third instruction of the first block type to the block load store unit via the tile scheduler.
The block load store unit is configured, in response to the third instruction, to read data of the third size from the L2 shared memory and write it into the L1 shared memory, or to read data of the third size from the L1 shared memory and write it into the L2 shared memory. In this way, data can be loaded directly from the L2 shared memory into the L1 shared memory, increasing data loading efficiency, and data in the L1 shared memory can be stored directly into the L2 shared memory, increasing data storage efficiency.
The third instruction may be a load instruction, for example, may be a load instruction such as block_load_64, block_load_128, or the like. The third instruction may also be a store instruction, which may be a store instruction such as block_store_64, block_store_128, or the like.
Specifically, the block load store unit is configured to parse the third instruction. If the third instruction is a load instruction, it acquires the write offset and the read offset carried in the third instruction, generates a first data read address based on a preset read base address (the base address of the L2 shared memory) and the read offset, generates a first data write address based on a preset write base address (the base address of the L1 shared memory) and the write offset, and reads data of the third size from the L2 shared memory location pointed to by the first data read address into the designated location of the L1 shared memory (the location pointed to by the first data write address). If the third instruction is a store instruction, it acquires the write offset and the read offset carried in the third instruction, generates a second data read address based on a preset read base address (the base address of the L1 shared memory) and the read offset, generates a second data write address based on a preset write base address (the base address of the L2 shared memory) and the write offset, and reads data of the third size from the L1 shared memory location pointed to by the second data read address into the designated location of the L2 shared memory (the location pointed to by the second data write address).
The block load store unit comprises an address generator (denoted AddGen) for generating data read-write addresses based on the information carried in the third instruction, such as the write offset and the read offset.
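A hedged sketch of the AddGen behaviour described above (the base addresses are taken as parameters here; in the text they are preset, and the string labels are ours):

```python
def addgen(kind, read_offset, write_offset, l2_base, l1_base):
    """Sketch of the block LSU address generator (AddGen).

    block_load  : L2 shared memory -> L1 shared memory
    block_store : L1 shared memory -> L2 shared memory
    Returns (data read address, data write address).
    """
    if kind == "block_load":
        return (l2_base + read_offset, l1_base + write_offset)
    if kind == "block_store":
        return (l1_base + read_offset, l2_base + write_offset)
    raise ValueError(f"not a first-block-type instruction: {kind}")
```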
In some possible embodiments, the pipelined processing unit is further electrically connected to the L1 shared memory, and is further configured, in response to the acquired fourth instruction of the second block type, to read data of the fourth size from the L1 shared memory and write it into a register contained in the pipelined processing unit, or to read data of the fourth size from a register contained in the pipelined processing unit and write it into the L1 shared memory.
Alternatively, it may be the thread load store unit in the pipelined processing unit that is configured, in response to the acquired fourth instruction of the second block type, to read data of the fourth size from the L1 shared memory and write it into a register contained in the pipelined processing unit, or to read data of the fourth size from a register contained in the pipelined processing unit and write it into the L1 shared memory.
The effect of a data load or store by one fourth instruction of the second block type equals that of multiple normal load or store instructions. Because the fourth instruction realizes the function of several normal instructions, the number of instructions can be reduced and execution efficiency improved.
The fourth instruction may be a load instruction, for example, a load instruction such as block_load_4, block_load_8, block_load_16, and the like. The fourth instruction may also be a store instruction, such as block_store_4, block_store_8, block_store_16, and the like.
It will be appreciated that the third instruction and the fourth instruction are both block-type instructions; the difference is that the fourth instruction loads (or stores) less data than the third instruction. That is, to load (or store) the same amount of data, multiple fourth instructions must be executed to match the amount handled by one third instruction. For example, the amount of data loaded by one block_load_64 instruction equals that loaded by 16 block_load_4 instructions, by 8 block_load_8 instructions, or by 4 block_load_16 instructions. In brief, the fourth size is smaller than the third size.
The block-type instructions in the embodiments of the present application include two types: the first block type (belonging to the block-layer instructions) and the second block type (belonging to the thread-layer instructions). Block-layer instructions are executed by the block load store unit, and thread-layer instructions are executed by the thread load store unit.
The structure of the pipelined processing unit can be seen in fig. 3. It will be appreciated that the third embodiment differs from the first and second embodiments in that it introduces the concept of block-type instructions; accordingly, the internal logic and structure of the computer unit are improved, the block load store unit is newly added, and the pipelined processing unit is given a new function (in response to the acquired fourth instruction of the second block type, reading data of the fourth size from the L1 shared memory into a register contained in the pipelined processing unit, or reading data of the fourth size from such a register into the L1 shared memory).
In an embodiment of the application, the TPU involves three layers of storage: registers <-> L1 Cache and L1 shared memory <-> L2 Cache and L2 shared memory. The corresponding three-layer spatial division is: thread <-> block <-> tile. A block contains a number of threads, and a tile contains a number of blocks. The sizes of a tile and a block are programmable and user-determined. The number of registers required per thread, the size of the L1 shared memory required per block, and the size of the L2 shared memory required per tile may also be determined by the user.
The first size, the second size, the third size, and the fourth size in the embodiment of the application are all configurable, with the first size greater than the third size, which in turn is greater than the second size or the fourth size. The second size and the fourth size may be the same. Different application requirements are satisfied by configuring the first, second, third, or fourth size in a load or store instruction, exhibiting a programmable feature. That is, the first, second, third, and fourth instructions in the present application are all configurable, providing a programmable data read-write mode.
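The three-level division and size ordering can be summarised in a small sketch; every numeric value below is an illustrative placeholder (the patent leaves all of them user-configurable):

```python
# Three storage layers <-> three spatial layers (values illustrative).
tile_config = {
    "threads_per_block": 64,           # a block contains a number of threads
    "blocks_per_tile": 8,              # a tile contains a number of blocks
    "regs_per_thread": 32,             # registers required per thread
    "l1_shared_per_block": 16 * 1024,  # bytes of L1 shared memory per block
    "l2_shared_per_tile": 256 * 1024,  # bytes of L2 shared memory per tile
}

def sizes_valid(first, second, third, fourth):
    """Check the ordering stated in the text: the first size exceeds
    the third size, which exceeds the second and fourth sizes."""
    return first > third and third > second and third > fourth
```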
For a better understanding of the present solution, a block diagram of the TPU in one embodiment may be as shown in fig. 5. It is understood that fig. 5 is only one of many embodiments of the present application. The TPU in fig. 5 is electrically connected to the LLC through the Fabric. The TPU comprises a tile scheduler (TS), a tile load store unit (Tile LSU), an L2 Cache and L2 shared memory (L2 Cache & L2 shared memory), and a plurality of computer units, which share the L2 Cache and the L2 shared memory.
Each computer unit comprises an L1 Cache & L1 shared memory, a block load store unit (Blk LSU), and a plurality of pipelined processing units, which share the L1 Cache and the L1 shared memory. Each pipelined processing unit includes a base scheduler (NS), a thread load store unit (Thr LSU), a plurality of register files (RF), and a plurality of arithmetic logic units (NR).
When the NS acquires an instruction that needs to be executed by the Blk LSU or the Tile LSU (such as the first instruction or the third instruction), the NS sends it to the TS, and the TS schedules it. If an instruction (e.g., the third instruction) needs to be processed by the Blk LSU, the TS sends it to the Blk LSU and tracks its completion; if an instruction (e.g., the first instruction) needs to be executed by the Tile LSU, the TS sends it to the Tile LSU and tracks its completion. When the NS acquires an instruction that needs to be executed by the Thr LSU (e.g., the second instruction, the fourth instruction, or a normal load or store instruction), it sends it to the Thr LSU for execution.
The NR, Thr LSU, Blk LSU, and Tile LSU each return a completion signal to the NS after the corresponding instruction has been executed.
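The routing through the NS, TS, and the three load store units can be sketched as follows (the string labels and type names are ours, used only to make the dispatch rules concrete):

```python
def route(instruction_type):
    """Sketch of NS dispatch: tile-layer and block-layer instructions
    go through the TS, which forwards them to the Tile LSU or Blk LSU
    and tracks completion; thread-layer and normal load/store
    instructions go straight to the Thr LSU."""
    if instruction_type == "first_tile":    # e.g. tile_load_64
        return ("TS", "Tile LSU")
    if instruction_type == "first_block":   # e.g. block_load_64
        return ("TS", "Blk LSU")
    if instruction_type in ("second_tile", "second_block", "normal"):
        return ("Thr LSU",)
    raise ValueError(instruction_type)
```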
The embodiment of the application also provides an SOC chip, comprising a last level cache and at least one tile processor as provided in any one of the above embodiments, each tile processor being electrically connected to the last level cache, as shown in the schematic diagram of fig. 5.
The SOC chip may be an integrated circuit chip with signal processing capabilities. The SOC chip may be a general-purpose processor including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), a microprocessor, a graphics processor (Graphics Processing Units, GPU), a general-purpose graphics processor (General Purpose Graphics Processing Units, GPGPU), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. Or the SOC chip may be any conventional processor or the like.
The tile processor provided in the SOC chip embodiment has the same implementation principle and technical effects as the foregoing tile processor embodiments; for brevity, reference may be made to the corresponding content in the tile processor embodiments.
The embodiment of the application also provides electronic equipment, which comprises the global memory and the SOC chip, wherein the global memory is electrically connected with the SOC chip.
The global memory may be, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), etc.
The electronic equipment provided by the embodiment of the application includes, but is not limited to, electronic products such as mobile phones, tablets, computers, vehicle-mounted equipment, and servers.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for the identical and similar parts between the embodiments, reference may be made to one another.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The foregoing is merely illustrative of the present application and does not limit it; any variation or substitution readily conceivable by a person skilled in the art shall fall within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A tile processor, comprising:
an L2 shared memory, the L2 shared memory configured to be electrically connected to a last level cache, the last level cache configured to be electrically connected to a global memory;
the system comprises a tile scheduler and a computer unit, wherein the computer unit is electrically connected with the tile scheduler and the L2 shared memory respectively; the computer unit is configured to send the acquired first instruction of the first tile type to the tile scheduler;
the tile loading storage unit is electrically connected with the tile dispatcher and the L2 shared memory respectively; the tile loading storage unit is configured to receive the first instruction sent by the tile scheduler, and respond to the first instruction, read data with a first size from the global memory and write the data into the L2 shared memory, or read data with the first size from the L2 shared memory and write the data into the global memory.
2. The tile processor of claim 1, wherein the tile load store unit is specifically configured to:
analyzing the first instruction, if the first instruction is a loading instruction, acquiring a first read base address, a first read offset and a first write offset carried in the first instruction, generating a first data read address based on the first read base address and the first read offset, and generating a first data write address based on a preset first write base address and the first write offset; reading data with a first size from the global memory pointed by the first data reading address and writing the data into the position pointed by the first data writing address of the L2 shared memory;
if the first instruction is a storage instruction, acquiring a second read offset, a second write base address and a second write offset carried in the first instruction, generating a second data read address based on a preset second read base address and the second read offset, generating a second data write address based on the second write base address and the second write offset, and reading data with a first size from the L2 shared memory pointed by the second data read address to a position pointed by the second data write address of the global memory.
3. The tile processor of claim 1, wherein the computer unit is further configured to read a second size of data from the L2 shared memory to write to a register contained in the computer unit or read a second size of data from a register contained in the computer unit to write to the L2 shared memory in response to the fetched second instruction of the second tile type.
4. The tile processor of claim 1, wherein the computer unit comprises:
a pipelined processing unit electrically connected to the tile scheduler, the pipelined processing unit configured to send the first instruction to the tile scheduler.
5. The tile processor of claim 3, wherein the computer unit comprises: the pipeline processing unit is further electrically connected with the L2 shared memory, and is further configured to read data with a second size from the L2 shared memory and write the data into a register contained in the pipeline processing unit or read data with the second size from the register contained in the pipeline processing unit and write the data into the L2 shared memory in response to the acquired second instruction with the second tile type.
6. The tile processor of claim 4 or 5, wherein the computer unit further comprises: l1 shares memory, block load store unit; the block loading storage unit is respectively and electrically connected with the L2 shared memory and the L1 shared memory;
the pipelined processing unit is further configured to send the acquired third instruction of the first block type to the block load store unit through the tile scheduler;
the block load storage unit is configured to read data with a third size from the L2 shared memory and write the data into the L1 shared memory in response to the third instruction, or read data with the third size from the L1 shared memory and write the data into the L2 shared memory.
7. The tile processor of claim 6, wherein the pipelined processing unit is further electrically coupled to the L1 shared memory, the pipelined processing unit further configured to read a fourth size of data from the L1 shared memory into a register contained in the pipelined processing unit or read a fourth size of data from a register contained in the pipelined processing unit into the L1 shared memory in response to a fourth instruction of the acquired second block type.
8. The tile processor of any one of claims 4-7, wherein the pipelined processing unit comprises:
and a base scheduler electrically connected to the tile scheduler, the base scheduler configured to send the acquired first instruction of the first tile type to the tile scheduler.
9. The tile processor of claim 8, wherein the pipelined processing unit further comprises: a register file; the base scheduler is further configured to send the acquired third instruction of the first block type to a block load store unit via the tile scheduler.
10. The tile processor of claim 9, wherein the pipelined processing unit further comprises: a thread load store unit;
the thread loading storage unit is electrically connected with the register file, the basic scheduler and the L2 shared memory respectively; the thread load store unit is configured to: reading data with a second size from the L2 shared memory and writing the data into the register file, or reading data with the second size from the register file and writing the data into the L2 shared memory in response to the acquired second instruction with the second tile type; and/or;
The thread loading storage unit is electrically connected with the register file, the basic scheduler and the L1 shared memory respectively; the thread load storage unit is configured to respond to the acquired fourth instruction of the second block type, read data with a fourth size from the L1 shared memory and write the data into the register file, or read data with the fourth size from the register file and write the data into the L1 shared memory.
11. The tile processor of claim 8, wherein the base scheduler is further configured to count instructions to be executed sent by the base scheduler to obtain a first value; counting the received completion signals to obtain a second numerical value; and responding to the acquired waiting instruction, and suspending instruction scheduling under the condition that the difference value between the first numerical value and the second numerical value is larger than the waiting value carried in the waiting instruction, wherein each waiting instruction generates one completion signal after execution is completed.
12. An SOC chip, comprising: a last level cache and at least one tile processor according to any one of claims 1-11, each of said tile processors being electrically connected to said last level cache.
13. An electronic device, comprising: a global memory and the SOC chip as claimed in claim 12, the global memory being electrically connected to the SOC chip.
CN202310722611.6A 2022-10-21 2023-06-16 Tile processor, SOC chip and electronic equipment Active CN117033298B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263418466P 2022-10-21 2022-10-21
US63/418,466 2022-10-21

Publications (2)

Publication Number Publication Date
CN117033298A true CN117033298A (en) 2023-11-10
CN117033298B CN117033298B (en) 2024-06-18

Family

ID=88623272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310722611.6A Active CN117033298B (en) 2022-10-21 2023-06-16 Tile processor, SOC chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN117033298B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309702A (en) * 2012-03-05 2013-09-18 辉达公司 Uniform load processing for parallel thread sub-sets
US20140129799A1 (en) * 2012-11-08 2014-05-08 International Business Machines Corporation Address generation in an active memory device
CN110147248A (en) * 2019-04-19 2019-08-20 中国科学院计算技术研究所 The single precision Matrix Multiplication optimization method and system accelerated using AMD GPU assembly instruction
CN110659118A (en) * 2019-09-11 2020-01-07 南京天数智芯科技有限公司 Configurable hybrid heterogeneous computing core architecture for multi-field chip design
CN111831328A (en) * 2019-04-18 2020-10-27 华为技术有限公司 Data processing method and device
KR20200138413A (en) * 2018-11-21 2020-12-09 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 Network-on-chip data processing method and device
CN112463415A (en) * 2020-12-17 2021-03-09 盛科网络(苏州)有限公司 Multi-port shared memory management system and method based on random address
CN112463719A (en) * 2020-12-04 2021-03-09 上海交通大学 In-memory computing method realized based on coarse-grained reconfigurable array
DE102020127704A1 (en) * 2019-10-29 2021-04-29 Nvidia Corporation TECHNIQUES FOR EFFICIENT TRANSFER OF DATA TO A PROCESSOR
US20210294638A1 (en) * 2020-03-20 2021-09-23 Nvidia Corporation Asynchronous data movement pipeline
CN113485834A (en) * 2021-07-12 2021-10-08 深圳华锐金融技术股份有限公司 Shared memory management method and device, computer equipment and storage medium
JP2021170234A (en) * 2020-04-15 2021-10-28 株式会社デンソー Multiprocessor system

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309702A (en) * 2012-03-05 2013-09-18 辉达公司 Uniform load processing for parallel thread sub-sets
US20140129799A1 (en) * 2012-11-08 2014-05-08 International Business Machines Corporation Address generation in an active memory device
KR20200138413A (en) * 2018-11-21 2020-12-09 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 Network-on-chip data processing method and device
CN111831328A (en) * 2019-04-18 2020-10-27 华为技术有限公司 Data processing method and device
CN110147248A (en) * 2019-04-19 2019-08-20 中国科学院计算技术研究所 The single precision Matrix Multiplication optimization method and system accelerated using AMD GPU assembly instruction
CN110659118A (en) * 2019-09-11 2020-01-07 南京天数智芯科技有限公司 Configurable hybrid heterogeneous computing core architecture for multi-field chip design
DE102020127704A1 (en) * 2019-10-29 2021-04-29 Nvidia Corporation TECHNIQUES FOR EFFICIENT TRANSFER OF DATA TO A PROCESSOR
US20210294638A1 (en) * 2020-03-20 2021-09-23 Nvidia Corporation Asynchronous data movement pipeline
JP2021170234A (en) * 2020-04-15 2021-10-28 株式会社デンソー Multiprocessor system
CN112463719A (en) * 2020-12-04 2021-03-09 上海交通大学 In-memory computing method realized based on coarse-grained reconfigurable array
CN112463415A (en) * 2020-12-17 2021-03-09 盛科网络(苏州)有限公司 Multi-port shared memory management system and method based on random address
WO2022127874A1 (en) * 2020-12-17 2022-06-23 苏州盛科通信股份有限公司 Multi-port shared memory management system and method based on random address
CN113485834A (en) * 2021-07-12 2021-10-08 深圳华锐金融技术股份有限公司 Shared memory management method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN117033298B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
US7473293B2 (en) Processor for executing instructions containing either single operation or packed plurality of operations dependent upon instruction status indicator
US8345053B2 (en) Graphics processors with parallel scheduling and execution of threads
CN102460420B (en) Conditional operation in an internal processor of a memory device
US8291431B2 (en) Dependent instruction thread scheduling
US20140359225A1 (en) Multi-core processor and multi-core processor system
US7958336B2 (en) System and method for reservation station load dependency matrix
CN106575220B (en) Multiple clustered VLIW processing cores
CN103513964A (en) Loop buffer packing
CN111989655B (en) SOC chip, method for determining hotspot function and terminal equipment
TW201719398A (en) Scheduling method and processing device using the same
US9354850B2 (en) Method and apparatus for instruction scheduling using software pipelining
US20080005537A1 (en) Quantifying core reliability in a multi-core system
She et al. Scheduling for register file energy minimization in explicit datapath architectures
CN115562838A (en) Resource scheduling method and device, computer equipment and storage medium
CN117033298B (en) Tile processor, SOC chip and electronic equipment
CN105988773B (en) Hardware interface assembly and method for hardware interface assembly
US8555097B2 (en) Reconfigurable processor with pointers to configuration information and entry in NOP register at respective cycle to deactivate configuration memory for reduced power consumption
Meenderinck et al. Nexus: Hardware support for task-based programming
US20140013087A1 (en) Processor system with predicate register, computer system, method for managing predicates and computer program product
CN115905040B (en) Counter processing method, graphics processor, device and storage medium
EP4020216A1 (en) Performance circuit monitor circuit and method to concurrently store multiple performance monitor counts in a single register
CN112463218B (en) Instruction emission control method and circuit, data processing method and circuit
US11144322B2 (en) Code and data sharing among multiple independent processors
CN111798363B (en) Graphics processor
US9436624B2 (en) Circuitry for a computing system, LSU arrangement and memory arrangement as well as computing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240527

Address after: 8002, 8th Floor, No. 36 Haidian West Street, Haidian District, Beijing, 100080

Applicant after: Beijing Tiantian Zhixin Semiconductor Technology Co.,Ltd.

Country or region after: China

Address before: Room 101-5, building 3, 2388 Chenhang Road, Minhang District, Shanghai 201100

Applicant before: Shanghai Tiantian smart core semiconductor Co.,Ltd.

Country or region before: China

GR01 Patent grant