WO2023184900A1 - Processor, chip, electronic device, and data processing method - Google Patents

Info

Publication number
WO2023184900A1
Authority
WO
WIPO (PCT)
Prior art keywords
register
thread
target
register group
group
Prior art date
Application number
PCT/CN2022/120893
Other languages
French (fr)
Chinese (zh)
Inventor
王文强
夏晓旭
孙海涛
徐宁仪
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023184900A1 publication Critical patent/WO2023184900A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/461Saving or restoring of program or task context
    • G06F9/462Saving or restoring of program or task context with multiple register sets
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of chip technology, and in particular to processors, chips, electronic devices, data processing methods and computer-readable storage media.
  • a graphics processing unit (Graphics Processing Unit, GPU)
  • Threads can form thread blocks to collaborate to complete an overall computing task.
  • the register file can be used for data multiplexing. For example, when performing a convolution operation, a feature map can first be stored on the register file for multiple use by the computing unit.
  • the register file is generally thread-private, and data on the register file cannot be reused between different threads.
  • an embodiment of the present disclosure provides a processor.
  • the processor includes: a first register file, the first register file including at least one first register group and at least one second register group, each first register group and each second register group including at least one register, each first register group being used for allocation to one of a plurality of threads, and each second register group being used for allocation to at least two of the plurality of threads; and a processing unit for scheduling each of the plurality of threads and, in response to a data access request of a target thread of the plurality of threads, accessing a target register in a register group allocated to the target thread, wherein the register group includes the first register group or the second register group.
  • the data access request of the target thread carries the logical address of the target register; the processing unit is configured to: map the logical address to the physical address of the target register; and access the target register based on the physical address.
  • when mapping the logical address to the physical address of the target register, the processing unit is configured to: when the target register is a register in the first register group, determine the physical address based on the total number of registers allocated to each prior thread and the logical address, where the prior threads include each thread with a thread number smaller than that of the target thread; or, when the target register is a register in the second register group, determine the physical address based on the total number of registers allocated to each thread and the logical address.
  • the first register file is divided into at least one storage unit, each storage unit includes at least one first register group and at least one second register group; different storage units are physically isolated, and different storage units correspond to different threads; a storage unit includes a first register group for allocating to a thread corresponding to the storage unit, and a storage unit includes a second register group for allocating to a thread corresponding to the storage unit.
  • the processing unit is configured to: when the target register is a register in the first register group, determine the storage unit where the target register is located based on the thread number of the target thread and the number of storage units, and determine the physical address based on the total number of register groups allocated to each previous thread, the number of storage units and the logical address, where the previous threads include each thread with a thread number smaller than that of the target thread; or, when the register accessed by the data access request is a register in the second register group, determine the storage unit where the target register is located based on the logical address and the number of storage units, and determine the physical address based on the total number of register groups allocated to each thread, the number of storage units, and the logical address.
  • the target register is a register in the first register group; the processing unit is configured to: use the data read from the target register as index information, and access a register in the second register group based on the index information.
  • the data access request includes an indication bit, which is used to indicate whether to use the data read from the second register file as index information for accessing the target register; in the indication bit
  • the processing unit is configured to: obtain the index information read from the second register file; and access the target register based on the index information read from the second register file.
  • the processor further includes: an instruction path for sending a data access request to the target register; and an execution path for obtaining data transmitted by the target register in response to the data access request, and performing operations on the obtained data.
  • the instruction path includes: an instruction reading unit, used to read the data access request sent by the target thread; an instruction decoding unit, used to decode the data access request read by the instruction reading unit; and an instruction sending unit, used to send the decoded data access request to the target register.
  • the execution path includes: an arithmetic unit, used to perform arithmetic processing on the acquired data; and a memory access unit, used to output the arithmetic results to the memory, and/or output the data stored in the memory to the arithmetic unit for arithmetic processing.
  • each of the first register groups includes the same number of registers.
  • an embodiment of the present disclosure provides a chip, which includes the processor described in any embodiment of the present disclosure.
  • an embodiment of the disclosure provides an electronic device, which includes the chip described in any embodiment of the disclosure.
  • an embodiment of the present disclosure provides a data processing method, which is applied to the processing unit in the processor according to any embodiment of the present disclosure.
  • the method includes: scheduling each thread in a plurality of threads; and, in response to a data access request of a target thread among the plurality of threads, accessing a target register in a register group allocated to the target thread.
  • embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the method described in any embodiment is implemented.
  • Embodiments of the present disclosure divide the register file into at least one first register group and at least one second register group. Each second register group can be allocated to at least two threads, so that it can be jointly accessed by two or more threads, which realizes data reuse between threads. In addition, since each first register group is allocated to only one thread and can therefore be accessed by that thread alone, the data of different threads still retains a certain degree of isolation.
  • FIG. 1 is a schematic diagram of the manner in which threads access a register file in a multi-thread situation in the related art.
  • FIG. 2 is a schematic structural diagram of a processor according to an embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of the mapping relationship between physical addresses and logical addresses according to an embodiment of the present disclosure.
  • FIG. 4A and FIG. 4B are respectively schematic diagrams of the address mapping method of the register file according to the embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of the positional relationship between the first register group and the second register group according to the embodiment of the present disclosure.
  • FIG. 6 and FIG. 7 are respectively schematic diagrams of the data operation process of the embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of a chip according to an embodiment of the present disclosure.
  • Figure 9 is a flow chart of a data processing method according to an embodiment of the present disclosure.
  • first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • first information may also be called second information, and similarly, the second information may also be called first information.
  • the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining."
  • the traditional processor storage structure can be roughly divided into three levels: external memory/cache/register file.
  • the bandwidth situation is: external memory < cache < register file.
  • the cache can be further divided into multiple layers, such as the L1/L2 cache of the central processing unit (Central Processing Unit, CPU), etc.
  • Typical data reuse is implemented through cache. When data is stored in the cache, two types of reuse may occur: 1) different computing units access the same data; 2) a single computing unit accesses the same cached data multiple times. Such reuse can effectively reduce data access to external memory.
  • processors will introduce hardware multi-threading technology.
  • the GPU will schedule the execution of multiple threads at the same time. Threads can form thread groups to collaborate to complete an overall computing task. In the process of collaborative computing, a large amount of data interaction is required between different threads. In traditional processor design, efficient inter-thread data interaction is generally achieved through on-chip storage units, such as the CPU's cache or the GPU's shared memory.
  • the register file can be further used for data multiplexing.
  • a feature map can be stored on the register file first for multiple use by the computing unit.
  • the register file is generally thread-private. Therefore, only the same thread can reuse the data on the register file, and different threads cannot reuse the data on the register file. For example, in Figure 1, thread 1 can only access register 1, register 2, and register 3 in the register file; thread 2 can only access register 4 and register 5 in the register file; thread 3 can only access register 6, register 7, and register 8 in the register file. In this way, data in the same register cannot be reused by multiple threads.
  • the processor includes:
  • the first register file 201 includes at least one first register group 2011 and at least one second register group 2012, each first register group 2011 and each second register group 2012 comprising at least one register R; each first register group 2011 is used for allocation to one thread among the plurality of threads, and each second register group 2012 is used for allocation to at least two threads among the plurality of threads; and
  • the processing unit 202 is configured to schedule each thread in the plurality of threads and, in response to a data access request of a target thread in the plurality of threads, access a target register in a register group allocated to the target thread, wherein the register group includes the first register group or the second register group.
  • the processor in the embodiment of the present disclosure may be a CPU, a GPU, a neural network processor (Neural Network Processing Unit, NPU), and other types of multi-threaded processors.
  • the embodiment of the present disclosure does not limit the type of processor.
  • the processor can schedule multiple threads to process data in parallel so that the target register is accessed through that thread.
  • the first register file 201 in the embodiment of the present disclosure may include at least two register groups.
  • the first register file 201 may be divided into at least one first register group 2011 and at least one second register group 2012, and the number of the first register group 2011 and the number of the second register group 2012 may or may not be equal.
  • the first register group 2011 may include at least one register, and the second register group 2012 may also include at least one register.
  • the number of registers included in the first register group 2011 and the number of registers included in the second register group 2012 may be equal or different, and the number of registers included in different first register groups 2011 may be equal or different.
  • the number of the first register set 2011 may be determined based on the number of threads.
  • the number of the first register set 2011 may be equal to the number of threads, such that each thread may be allocated one first register set 2011.
  • the number of first register sets 2011 may be an integral multiple of the number of threads, so that each thread may be allocated one or more first register sets 2011.
  • Figure 2 takes as an example the case where the number of first register groups 2011 is 2, the number of second register groups 2012 is 1, the number of threads is 2, and each thread is allocated one first register group, where R represents a register.
  • the number of the first register group 2011, the number of the second register group 2012, the number of threads and/or the number of first register groups allocated to each thread can also take other values, which will not be described again here.
  • the solution of the embodiment of the present disclosure will be described below by taking the example that each thread is allocated a first register group and the number of registers included in each first register group is equal.
  • the number of the first register group 2011, the number of the second register group 2012, the number of registers in the first register group 2011, and the number of registers in the second register group 2012 can be configured respectively through configuration information.
  • the configuration information only needs to specify the number of the first register group 2011 and the number of the second register group 2012, without specifying which register or registers specifically constitute the first register group 2011 and the second register group 2012. In this way, configuration flexibility is high.
  • the configuration information may be generated by the controller, and the processor may automatically specify a corresponding number of registers to form the first register group 2011 and the second register group 2012 based on the configuration information. When at least one of the number of the first register group 2011, the number of the second register group 2012, the number of registers in the first register group 2011, and the number of registers in the second register group 2012 needs to be changed, you only need to change the corresponding configuration information.
  • first register group 2011 and the second register group 2012 are collectively referred to as a register group below.
  • the register group mentioned below may be either the first register group 2011 or the second register group 2012.
  • a first register group 2011 can only be allocated to one thread. After allocation, the first register group 2011 is not visible to other threads. In this way, data isolation between different threads can be achieved.
  • Each second register group 2012 can be allocated to at least two threads, so that the at least two threads can achieve data multiplexing. For example, in FIG. 2 , one of the first register sets 2011 may be assigned to thread 0, the other first register set 2011 may be assigned to thread 1, and the second register set 2012 may be assigned to both thread 0 and thread 1.
  • Each thread can access a target register in the register group assigned to that thread to read data from or write data to the target register.
  • the number of target registers can be greater than or equal to 1.
  • Each thread can be scheduled by the processing unit 202. When scheduling a target thread, the processing unit 202 can access the target register in the register group assigned to the target thread in response to the target thread's data access request.
  • the data access request can carry the logical address of the target register.
  • the processing unit needs to map the logical address to the physical address of the target register, and then access the target register based on the physical address.
  • the logical address of a register can be used to represent the identification information of the register in the register group assigned to a certain thread
  • the physical address of a register can be used to represent the identification information of the register in the register file.
  • the physical address and logical address of each register can be sequentially numbered using integers (for example, 0, 1, 2, 3, ...).
  • the logical addresses of registers in the register groups assigned to different threads can be the same, but the physical addresses of different registers must be different. For example, in the embodiment shown in FIG.
  • the registers with physical addresses 0, 1, and 2 are registers in the register group allocated to thread 0, and the logical addresses of the above registers are 0, 1, and 2 respectively.
  • the registers with physical addresses 3, 4, and 5 are registers in the register group assigned to thread 1, and the logical addresses of the above registers are also 0, 1, and 2 respectively.
  • the processing unit 202 can determine a unique target register in order to access the correct register. For example, for thread 1, when the logical address of the target register it accesses is 0, the logical address needs to be mapped to the physical address of the register in the first register file 201 (that is, 3). In some embodiments, the logical addresses of the registers in the first register group and the registers in the second register group can be set independently.
  • the logical addresses of the registers in the first register group allocated to thread 0 may be integers starting from 0 (for example, 0, 1, 2, 3, ...), and the logical addresses of the registers in the second register group allocated to thread 0 can also be set to integers starting from 0 (for example, 0, 1, 2, 3, ...).
  • the processing unit 202 may determine the physical address of the target register jointly based on the type of the register group to which the target register belongs and the location of that register group in the register file.
  • the type of a register group is used to characterize whether the register group is the first register group or the second register group.
  • the physical address may be determined based on the total number of registers allocated to each previous thread and the logical address.
  • the previous thread includes each thread whose thread number (ie, thread ID) is smaller than the target thread. For example, if the thread IDs of each thread are integers such as 0, 1, 2, etc., and the thread ID of the target thread is 2, then the previous threads include the thread with thread ID 0 and the thread with thread ID 1.
  • the number of registers allocated to each preceding thread can be summed to obtain the total number of registers allocated to each preceding thread.
  • the total number of registers allocated to each preceding thread can be obtained based on the product of the number of preceding threads and the number of registers allocated to a single thread.
  • the number of registers allocated to a single thread is the total number of registers included in the k register groups.
  • the second register group is shared by all N threads.
  • the storage space composed of the second register group is also called a shared space.
  • the physical address physical_addr of the target register can be recorded as:
  • reg_id is the logical address of the target register
  • thread_id is the thread ID
  • M is the number of registers in the first register group.
  • Adders and multipliers can be used to implement the addition and multiplication operations in the above formula respectively to obtain the physical address.
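  • A minimal sketch of this mapping for a register in the first register group, assuming each thread is allocated one first register group of M registers, thread IDs are consecutive integers starting from 0, and the first register groups are laid out contiguously from physical address 0. The function name `private_physical_addr` is hypothetical, and the expression `thread_id * M + reg_id` is one reading consistent with the symbol definitions above, not a quotation of the disclosure.

```python
def private_physical_addr(thread_id, reg_id, M):
    """Map a logical address in a thread's first register group to a physical address.

    Assumption: the preceding threads occupy thread_id * M registers in total,
    so the target register sits reg_id positions after that offset.
    """
    if not 0 <= reg_id < M:
        raise ValueError("logical address outside the first register group")
    return thread_id * M + reg_id  # one multiplier and one adder
```

  • With M = 3, logical address 0 of thread 1 maps to physical address 3, matching the example in which physical addresses 3, 4, and 5 belong to thread 1.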
  • the physical address may be determined based on the total number of registers allocated to each thread and the logical address. Still taking the situation shown in Figure 4A as an example, when the target register is a register in the second register group, the physical address physical_addr of the target register can be recorded as:
  • N is the total number of threads.
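  • A minimal sketch of the mapping for a register in the second register group, assuming the shared registers are placed after the first register groups of all N threads and each thread holds M private registers. The function name `shared_physical_addr` and the expression `N * M + reg_id` are assumptions derived from the description, not a quotation of the disclosure.

```python
def shared_physical_addr(reg_id, N, M):
    """Map a logical address in the second register group to a physical address.

    Assumption: all N threads together occupy N * M private registers,
    and the shared space starts immediately after them.
    """
    if reg_id < 0:
        raise ValueError("logical address must be non-negative")
    return N * M + reg_id
```

  • Because the offset `N * M` does not depend on the thread ID, every thread that is allocated the second register group resolves the same logical address to the same physical register.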
  • the physical address of the register in the second register group is greater than the physical address of the register in the first register group, that is, the register in the second register group is the later register in the register file, and the first register group The register in is the previous register in the register file (case 1).
  • the physical address of the register in the second register group may also be smaller than the physical address of the register in the first register group. That is, the register in the second register group is the previous register in the register file, and the first register The register in the group is the last register in the register file (case 2).
  • several registers in the middle of the register file can also be used as registers in the second register group, and the front and back registers in the register file can be used as registers in the first register group (case 3).
  • the gray squares represent the registers in the second register group
  • the white squares represent the registers in the first register group
  • the numbers in the squares represent the physical addresses of each register.
  • the physical address physical_addr of the target register can be recorded as:
  • the physical address of the target register is equal to the logical address of the target register.
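  • For case 2, where the shared registers occupy the lowest physical addresses, the two mappings can be sketched as below. The symbol S (the number of registers in the second register group) and both function names are hypothetical; the private-register expression is an assumption that simply shifts the first register groups past the shared space.

```python
def shared_addr_case2(reg_id):
    """Shared registers start at physical address 0, so the addresses coincide."""
    return reg_id


def private_addr_case2(thread_id, reg_id, M, S):
    """Private registers follow the S shared registers (assumed layout)."""
    return S + thread_id * M + reg_id
```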
  • the physical address physical_addr of the target register can be recorded as:
  • the first register file is divided into at least one storage unit (Bank), each storage unit includes at least one first register group and at least one second register group; different storage units are physically isolated, and different storage units correspond to different threads; one storage unit includes a first register group for allocation to a thread corresponding to the storage unit, and one storage unit includes a second register group for allocation to the corresponding
  • this approach is called interleaving. For register files with multiple banks, in order to ensure uniform access, the shared space will be evenly distributed on each bank through interleaving.
  • In FIG. 4B, there are K Banks, and P registers are reserved for the shared space in each Bank.
  • the overall shared space capacity is K*P.
  • the shared storage space on each storage unit can be allocated to any thread.
  • the logical addresses of the shared space are 0 to K*P-1.
  • the logical addresses of the registers in the shared space (i.e. the second register group) can be interleaved on each Bank.
  • the register number in the second register group on Bank_0 is 0/3/6/9/12
  • the register number in the second register group on Bank_1 is 1/4/7/10/13
  • the register number in the second register group on Bank_2 is 2/5/8/11/14.
  • the interleaving method is not limited to this.
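  • The interleaving in this example can be reproduced with a short sketch. It assumes the simple modulo scheme implied by the listed register numbers; the function name `interleave_shared` is hypothetical.

```python
def interleave_shared(K, P):
    """Distribute shared logical addresses 0 .. K*P-1 across K banks, P per bank."""
    banks = [[] for _ in range(K)]
    for reg_id in range(K * P):
        banks[reg_id % K].append(reg_id)  # consecutive addresses land on consecutive banks
    return banks
```

  • With K = 3 and P = 5, the sketch yields 0/3/6/9/12 on Bank_0, 1/4/7/10/13 on Bank_1, and 2/5/8/11/14 on Bank_2, so accesses to consecutive shared addresses are spread evenly over the banks.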
  • Each Bank also reserves storage space dedicated to each thread (i.e., the first register group).
  • Bank_0 reserves dedicated storage space for thread 0
  • Bank_1 reserves dedicated storage space for thread 1, thread K+1, ..., thread N-K+1.
  • M the size of the dedicated storage space reserved for each thread in a Bank
  • the registers with addresses 0 to M-1 reserved for a thread form a first register group, that is, the first register group allocated to a thread includes M registers.
  • the storage unit Bank_0 corresponds to thread 0, thread 3 and thread 6, then the first register groups included in storage unit Bank_0 can be allocated to thread 0, thread 3 and thread 6 respectively.
  • Storage unit Bank_1 corresponds to thread 1, thread 4 and thread 7, then the first register group included in storage unit Bank_1 can be allocated to thread 1, thread 4 and thread 7 respectively.
  • Storage unit Bank_2 corresponds to thread 2, thread 5 and thread 8, then the first register group included in storage unit Bank_2 can be allocated to thread 2, thread 5 and thread 8 respectively.
  • other interleaving methods can also be used, and no examples are given here.
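  • The thread-to-bank correspondence in this example can be sketched as follows, assuming threads are assigned to banks round-robin by thread ID. The function name `threads_on_bank` is hypothetical.

```python
def threads_on_bank(bank, K, N):
    """List the thread IDs whose first register groups live on the given bank."""
    return list(range(bank, N, K))  # bank, bank + K, bank + 2K, ...
```

  • With K = 3 banks and N = 9 threads, Bank_0 holds threads 0, 3, and 6, Bank_1 holds threads 1, 4, and 7, and Bank_2 holds threads 2, 5, and 8.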
  • the storage unit where the target register is located can be determined based on the number of the target thread and the number of the storage units, and the physical address is determined based on the total number of register groups allocated to each preceding thread, the number of storage units, and the logical address; the preceding threads include each thread with a thread number smaller than that of the target thread.
  • the number physical_bank of the storage unit where the target register is located can be recorded as:
  • the physical address physical_addr of the target register on the corresponding storage unit can be recorded as:
  • the target register is a register with physical address 1 in the first register group allocated to thread 0, the number of the storage unit where the target register is located is 0%3, that is, Bank_0.
  • the target register is a register with a physical address of 2 in the first register group allocated to thread 1
  • the number of the storage unit where the target register is located is 1%3, that is, Bank_1
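  • A minimal sketch of the banked mapping for a register in the first register group. The bank expression `thread_id % K` follows the 0%3 and 1%3 examples; the in-bank address expression is an assumption that each bank stores the first register groups of its threads back to back starting at in-bank address 0, with M registers per group. The function name `private_bank_addr` is hypothetical.

```python
def private_bank_addr(thread_id, reg_id, K, M):
    """Return (physical_bank, physical_addr) for a private register.

    Assumption: threads are assigned to banks round-robin, so thread_id // K
    earlier threads share the same bank and each occupies M registers there.
    """
    physical_bank = thread_id % K
    physical_addr = (thread_id // K) * M + reg_id
    return physical_bank, physical_addr
```

  • For example, with K = 3 and M = 4, logical address 1 of thread 4 resolves to Bank_1 at in-bank address 5, because thread 1 already occupies in-bank addresses 0 to 3 on that bank.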
  • the storage unit where the target register is located can be determined based on the logical address and the number of the storage units, and the physical address can be determined based on the total number of register groups allocated to each thread, the number of storage units, and the logical address.
  • the number physical_bank of the storage unit where the target register is located can be recorded as:
  • the physical address physical_addr of the target register on the corresponding storage unit can be recorded as:
  • the number of storage units K equal to 3
  • K can be set to a power of 2, and K can be set to an integer multiple of N. If the value of reg_id is not an integer, it can be rounded down so that the resulting physical address is an integer.
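  • A minimal sketch of the banked mapping for a register in the second register group. It assumes that the N threads divide evenly over the K banks, that each bank places its shared registers after the private registers of its N / K threads, and that shared logical addresses are interleaved by modulo K. The function name `shared_bank_addr` and both expressions are assumptions consistent with the description, including the rounding down of the quotient.

```python
def shared_bank_addr(reg_id, N, K, M):
    """Return (physical_bank, physical_addr) for a shared register.

    Assumption: each bank holds (N // K) * M private registers first,
    followed by its slice of the interleaved shared space.
    """
    if N % K != 0:
        raise ValueError("this sketch assumes N is a multiple of K")
    physical_bank = reg_id % K
    physical_addr = (N // K) * M + reg_id // K  # floor division rounds down
    return physical_bank, physical_addr
```

  • With N = 9, K = 3, and M = 4, shared logical address 7 resolves to Bank_1 at in-bank address 14: the first 12 in-bank addresses are private, and 7 is the third shared address interleaved onto that bank.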
  • the target register can be accessed, for example, data is read from the target register.
  • the data read from the target register can be used as an access address of another register to access data in the other register.
  • the target register may be a register in the first register group.
  • the processing unit 202 may first obtain the physical address of the target register (i.e., the index register number), and use the data read from the target register based on the index register number as index information (i.e., address information, i.e., the index register value in the figure) , and access registers in the second register group based on the index information.
  • the physical address of the target register can be calculated through the above method; assuming it is A2, the register with physical address A2 is accessed to obtain data A3; A3 is used as the logical address of a register in the second register group, and the physical address of that register is calculated based on A3; assuming it is A4, the register with address A4 can then be accessed.
  • the above process can be seen in Figure 6.
  • the instruction path can read instructions sent by multiple threads (ie, the aforementioned data access requests) based on multi-thread context information (context), decode the data access requests, and send them to the execution path.
  • the data access request may be used to access a target register in the first register group. After obtaining the data in the target register, the data is used as index information to access the register in the second register group. This method is called indirect addressing.
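  • Indirect addressing can be sketched as a two-step lookup. The sketch models the register file as a flat Python list and assumes a layout in which N private groups of M registers are followed by the shared space; the function name `indirect_read` and the layout are assumptions for illustration only.

```python
def indirect_read(regfile, thread_id, index_reg_id, N, M):
    """Read a shared register whose logical address is stored in a private register."""
    index_phys = thread_id * M + index_reg_id  # A2: physical address of the index register
    shared_logical = regfile[index_phys]       # A3: index register value
    shared_phys = N * M + shared_logical       # A4: physical address in the shared space
    return regfile[shared_phys]
```

  • For instance, with N = 2, M = 3, and the register file contents [0, 0, 0, 2, 0, 0, 10, 11, 12], thread 1 reading through its logical register 0 obtains the index 2 and then the shared value 12.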
  • the processing unit 202 can first obtain the physical address of the register that needs to be accessed in the second register file (i.e., the index register number), obtain the index information read from the second register file (i.e., the index register value in the figure) based on the index register number, and then access the target register based on the index information read from the second register file.
  • the target register here can be either a register in the first register group or a register in the second register group.
  • the processor may contain two independent register files, wherein the second register file may be a vector register file used to store Single Instruction Multiple Data (SIMD) data or Single Instruction Multiple Threads (SIMT) data for parallel calculations; the first register file can be a scalar register file, used to store simple scalar data or control information.
  • the vector register file can be accessed using the value read from the scalar register file as an index, and then the data obtained from the vector register file is sent to the execution path. Since core operations occur in the vector register file, shared space support is mainly added to the vector register file.
  • the data read from the register can also be directly used for data operations. This method is called direct addressing.
  • in direct addressing, the data read from the registers in the second register file or in the first register group is no longer used as index information to access other registers, but is used directly in data operations (such as multiplication and addition operations).
  • the data access request may include an indication bit, and the indication bit is used to indicate whether the data read from the target register will be used as the index information.
  • the indication bit may include two indication states.
  • when the indication bit is in the first indication state, the data read from the target register is used as the index information; when the indication bit is in the second indication state, the data read from the target register is not used as the index information.
  • the first indication state and the second indication state may be represented by at least one data bit. For example, binary data "0" can be used to represent the first indication state, and binary data "1" can be used to represent the second indication state.
  • the way of expressing the indication status is not limited to this. Those skilled in the art can use other ways to express different indication status according to the actual situation, which will not be listed here one by one.
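How the indication bit selects between indirect and direct addressing can be sketched as below. The encoding follows the example in the text ("0" for the first indication state, "1" for the second); the `DataAccessRequest` class, its field names, and the list-based register groups are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

USE_AS_INDEX = 0  # first indication state: the data read is index information
USE_AS_DATA = 1   # second indication state: the data read is used directly


@dataclass
class DataAccessRequest:
    """Hypothetical request layout; the disclosure does not fix one."""
    target_logical: int  # logical address of the target register
    indication_bit: int  # USE_AS_INDEX or USE_AS_DATA


def resolve_operand(request, first_group, second_group):
    value = first_group[request.target_logical]
    if request.indication_bit == USE_AS_INDEX:
        # Indirect addressing: the value selects a register in the second group.
        return second_group[value]
    # Direct addressing: the value itself is the operand.
    return value


first_group = [2, 7]
second_group = [10, 20, 30]
indirect = resolve_operand(DataAccessRequest(0, USE_AS_INDEX), first_group, second_group)
direct = resolve_operand(DataAccessRequest(1, USE_AS_DATA), first_group, second_group)
```

In this sketch `indirect` is fetched from the second group through the index stored in the first group, while `direct` is the stored value itself.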
  • the processor further includes an instruction path for sending a data access request to the target register; and an execution path for obtaining data transmitted by the target register in response to the data access request, and performing operations on the obtained data.
  • the instruction path may include: an instruction reading unit, used to read the data access request sent by the target thread; an instruction decoding unit, used to decode the data access request read by the instruction reading unit; and an instruction issuing unit, used to send the decoded data access request to the target register.
  • the target register can output the stored data to the execution path for calculation processing, or can return the stored data to the instruction issuing unit as index information, so that the instruction issuing unit can, based on the index information, send the data stored in the corresponding register to the execution path for calculation processing.
  • the execution path includes: an operation unit, used to perform operation processing on the acquired data; and a memory access unit, used to output the operation results to the memory, and/or output the data stored in the memory to the operation unit for operation processing.
  • the operation unit may include one or more sub-operation units, such as an addition unit, a multiplication unit, a convolution unit, etc. The number and type of sub-operation units included in the operation unit may be set based on actual requirements.
  • the memory access unit is used to implement data transmission between the operation unit and the memory. When the register file does not hold the data required for an operation, the memory access unit can be used to access the memory to obtain the corresponding data. Furthermore, the data obtained from a scalar register can also be output to the scalar execution unit as data to be operated on.
  • an embodiment of the present disclosure also provides a chip.
  • the chip includes a processor 801, and the processor 801 can be the processor described in any of the above embodiments.
  • the chip can be applied in an AI accelerator card.
  • the chip further includes a controller 802 for configuring at least one of the following: first quantity information on the number of registers included in the first register group, second quantity information on the number of registers included in the second register group, the number of first register groups, and the number of second register groups.
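The four quantities the controller 802 can configure might be grouped as in the sketch below. The class and field names are hypothetical, and the sketch assumes that all first register groups share one size and all second register groups share another; the "at least one" checks mirror the wording of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFileConfig:
    """Hypothetical container for the values the controller configures."""
    first_group_size: int    # registers per first (thread-private) register group
    second_group_size: int   # registers per second (shared) register group
    num_first_groups: int    # number of first register groups
    num_second_groups: int   # number of second register groups

    def __post_init__(self):
        for name in ("first_group_size", "second_group_size",
                     "num_first_groups", "num_second_groups"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")

    @property
    def total_registers(self):
        # Total capacity implied by the four configured quantities.
        return (self.first_group_size * self.num_first_groups
                + self.second_group_size * self.num_second_groups)


cfg = RegisterFileConfig(first_group_size=8, second_group_size=16,
                         num_first_groups=4, num_second_groups=1)
total = cfg.total_registers
```

Changing the split between private and shared capacity then only requires constructing a different configuration.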
  • An embodiment of the present disclosure also provides an electronic device, including the chip described in any of the above embodiments.
  • an embodiment of the present disclosure also provides a data processing method, which is applied to the processing unit in the processor according to any embodiment of the present disclosure.
  • the method includes:
  • Step 901: Schedule each thread in the plurality of threads.
  • Step 902: In response to the data access request of the target thread among the plurality of threads, access the target register in the register group allocated to the target thread, wherein the register group includes the first register group or the second register group.
  • the data access request of the target thread carries the logical address of the target register; accessing, in response to the data access request of the target thread among the plurality of threads, the target register in the register group allocated to the target thread includes: mapping the logical address to the physical address of the target register; and accessing the target register based on the physical address.
  • mapping the logical address to the physical address of the target register includes: when the target register is a register in the first register group, determining the physical address based on the total number of registers allocated to each prior thread and the logical address, where the prior threads are the threads whose thread numbers are smaller than that of the target thread; and/or when the target register is a register in the second register group, determining the physical address based on the total number of registers allocated to each thread and the logical address.
  • the first register file is divided into at least one storage unit, and each storage unit includes at least one first register group and at least one second register group; different storage units are physically isolated and correspond to different threads; a first register group in a storage unit is to be allocated to one thread corresponding to that storage unit, and a second register group in a storage unit is to be allocated to at least two threads corresponding to that storage unit.
  • mapping the logical address to the physical address of the target register includes: when the target register is a register in the first register group, determining the storage unit where the target register is located based on the thread number of the target thread and the number of storage units, and determining the physical address based on the total number of register groups allocated to each prior thread, the number of storage units, and the logical address, where the prior threads are the threads whose thread numbers are smaller than that of the target thread; and/or when the register accessed by the data access request is a register in the second register group, determining the storage unit where the target register is located based on the logical address and the number of storage units, and determining the physical address based on the total number of register groups allocated to each thread, the number of storage units, and the logical address.
  • the target register is a register in the first register group; accessing, in response to a data access request of a target thread among the plurality of threads, the target register in the register group allocated to the target thread includes: taking the data read from the target register as index information, and accessing the registers in the second register group based on the index information.
  • accessing the target register in the register group allocated to the target thread includes: obtaining index information read from a second register file; and accessing the target register based on the index information read from the second register file.
  • the data access request includes an indication bit, and the indication bit is used to indicate whether to use the data read from the second register file as index information for accessing the target register.
  • the method further includes: sending a data access request to the target register through an instruction path; and obtaining, through an execution path, the data transmitted by the target register in response to the data access request, and performing operation processing on the obtained data.
  • sending a data access request to the target register through an instruction path includes: reading the data access request sent by the target thread through an instruction reading unit in the instruction path; decoding the data access request read by the instruction reading unit through an instruction decoding unit in the instruction path; and sending the decoded data access request to the target register through an instruction issuing unit in the instruction path; and/or obtaining, through the execution path, the data transmitted by the target register in response to the data access request and performing operation processing on the obtained data includes: performing operation processing on the obtained data through the operation unit in the execution path; and outputting the operation results to the memory through the memory access unit in the execution path, and/or outputting the data stored in the memory to the operation unit for operation processing.
  • each first register group includes the same number of registers.
  • An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the method described in any of the foregoing embodiments is implemented.
  • Computer-readable media include both persistent and non-persistent, removable and non-removable media that can implement information storage by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD), magnetic tape cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • the embodiments of this specification can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product.
  • the computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of this specification.
  • a typical implementation device is a computer, which may take the form of a personal computer, a laptop, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, a game console, a tablet, a wearable device, or a combination of any of these devices.
  • each embodiment in this specification is described in a progressive manner.
  • the same and similar parts between the various embodiments can be referred to each other.
  • Each embodiment focuses on its differences from other embodiments.
  • the description is relatively simple.
  • the device embodiments described above are only illustrative.
  • the modules described as separate components may or may not be physically separated.
  • the functions of each module may be implemented in the same piece or in multiple pieces of software and/or hardware. Some or all of the modules can also be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement it without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

Provided in the embodiments of the present disclosure are a processor, a chip, an electronic device, and a data processing method. The processor comprises: a first register file, wherein the first register file comprises at least one first register group and at least one second register group, each first register group and each second register group comprise at least one register, each first register group is allocated to one thread among a plurality of threads, and each second register group is allocated to at least two threads among the plurality of threads; and a processing unit, which is used for scheduling each thread among the plurality of threads, and accessing, in response to a data access request of a target thread among the plurality of threads, a target register in a register group which is allocated to the target thread, wherein the register group comprises the first register group or the second register group. The embodiments of the present disclosure implement data multiplexing between threads.

Description

Processor, Chip, Electronic Device, and Data Processing Method
Cross-Reference Statement
This application claims priority to Chinese patent application No. 202210345686.2, filed with the China Patent Office on March 31, 2022, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of chip technology, and in particular to a processor, a chip, an electronic device, a data processing method, and a computer-readable storage medium.
Background
To improve scheduling efficiency, many processors introduce hardware multithreading. For example, a graphics processing unit (GPU) schedules the execution of multiple threads at the same time. Threads can be grouped into thread blocks that cooperate to complete an overall computing task. During such cooperative computing, different threads need to exchange large amounts of data. To increase data transmission bandwidth, the register file can be used for data reuse. For example, when performing a convolution operation, a block of a feature map can first be stored in the register file and then used multiple times by the operation unit. However, in conventional processor designs the register file is generally thread-private, so data in the register file cannot be reused across different threads.
Summary
In a first aspect, an embodiment of the present disclosure provides a processor. The processor includes: a first register file, where the first register file includes at least one first register group and at least one second register group, each first register group and each second register group includes at least one register, each first register group is to be allocated to one thread among a plurality of threads, and each second register group is to be allocated to at least two threads among the plurality of threads; and a processing unit, configured to schedule each thread among the plurality of threads and, in response to a data access request of a target thread among the plurality of threads, access a target register in a register group allocated to the target thread, where the register group includes the first register group or the second register group.
In some embodiments, the data access request of the target thread carries a logical address of the target register; the processing unit is configured to: map the logical address to a physical address of the target register; and access the target register based on the physical address.
In some embodiments, when mapping the logical address to the physical address of the target register, the processing unit is configured to: when the target register is a register in the first register group, determine the physical address based on the total number of registers allocated to each prior thread and the logical address, where the prior threads are the threads whose thread numbers are smaller than that of the target thread; or, when the target register is a register in the second register group, determine the physical address based on the total number of registers allocated to each thread and the logical address.
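One way to read this mapping is sketched below in Python. The text only says the physical address is determined "based on" these quantities; the sketch assumes a plain sum, with the first register groups laid out consecutively in thread-number order and a single second register group placed after them. Thread numbers start at 0 here, and the function and parameter names are hypothetical.

```python
def map_private(logical, thread_id, regs_per_thread):
    """Physical address of a register in the target thread's first register group.

    Assumed layout: first register groups are stored back to back in
    thread-number order, so the offset is the total number of registers
    allocated to all prior threads.
    """
    if not 0 <= logical < regs_per_thread[thread_id]:
        raise IndexError("logical address outside the thread's register group")
    return sum(regs_per_thread[:thread_id]) + logical


def map_shared(logical, regs_per_thread, shared_size):
    """Physical address of a register in the second (shared) register group.

    Assumed layout: the shared group follows all first register groups, so
    the offset is the total number of registers allocated to every thread.
    """
    if not 0 <= logical < shared_size:
        raise IndexError("logical address outside the shared register group")
    return sum(regs_per_thread) + logical


# Three threads owning 3, 2 and 3 private registers, plus 4 shared registers.
regs_per_thread = [3, 2, 3]
private_addr = map_private(logical=1, thread_id=2, regs_per_thread=regs_per_thread)
shared_addr = map_shared(logical=0, regs_per_thread=regs_per_thread, shared_size=4)
```

Under these assumptions a shared logical address resolves to the same physical register no matter which thread issues the request, which is what allows the data to be reused across threads.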
In some embodiments, the first register file is divided into at least one storage unit, and each storage unit includes at least one first register group and at least one second register group. Different storage units are physically isolated from one another and correspond to different threads; a first register group in a storage unit is to be allocated to one thread corresponding to that storage unit, and a second register group in a storage unit is to be allocated to at least two threads corresponding to that storage unit. The processing unit is configured to: when the target register is a register in the first register group, determine the storage unit in which the target register is located based on the thread number of the target thread and the number of storage units, and determine the physical address based on the total number of register groups allocated to each prior thread, the number of storage units, and the logical address, where the prior threads are the threads whose thread numbers are smaller than that of the target thread; or, when the register accessed by the data access request is a register in the second register group, determine the storage unit in which the target register is located based on the logical address and the number of storage units, and determine the physical address based on the total number of register groups allocated to each thread, the number of storage units, and the logical address.
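The storage-unit variant can be sketched in the same spirit. Every formula below is an assumption chosen to be consistent with the quantities named in the text, not the mapping of the disclosure: threads are assigned to storage units round-robin by thread number, each thread owns one first register group of a fixed size, shared registers are interleaved across storage units by logical address, and the number of threads is a multiple of the number of storage units.

```python
def map_private_banked(logical, thread_id, num_units, group_size):
    """Return (storage unit, offset in unit) for a first-register-group access.

    Assumptions: thread t lives in unit t % num_units, and the groups of the
    prior threads that share that unit are stored before it.
    """
    if not 0 <= logical < group_size:
        raise IndexError("logical address outside the register group")
    unit = thread_id % num_units
    prior_groups_in_unit = thread_id // num_units
    return unit, prior_groups_in_unit * group_size + logical


def map_shared_banked(logical, num_threads, num_units, group_size):
    """Return (storage unit, offset in unit) for a second-register-group access.

    Assumptions: shared registers are interleaved across units by logical
    address and sit after all first register groups inside each unit.
    """
    if num_threads % num_units != 0:
        raise ValueError("this sketch assumes threads divide evenly over units")
    unit = logical % num_units
    private_regs_per_unit = (num_threads // num_units) * group_size
    return unit, private_regs_per_unit + logical // num_units


# 4 threads, 2 storage units, 4 registers per first register group.
private_location = map_private_banked(logical=1, thread_id=3, num_units=2, group_size=4)
shared_location = map_shared_banked(logical=5, num_threads=4, num_units=2, group_size=4)
```

With a single storage unit these formulas reduce to the flat layout in which first register groups are stored back to back and the shared registers follow them.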
In some embodiments, the target register is a register in the first register group; the processing unit is configured to: take the data read from the target register as index information, and access a register in the second register group based on the index information.
In some embodiments, the data access request includes an indication bit, which indicates whether data read from a second register file is to be used as index information for accessing the target register. When the indication bit indicates that the data read from the second register file is to be used as index information for accessing the target register, the processing unit is configured to: obtain the index information read from the second register file; and access the target register based on the index information read from the second register file.
In some embodiments, the processor further includes: an instruction path, configured to send a data access request to the target register; and an execution path, configured to obtain the data transmitted by the target register in response to the data access request and to perform operation processing on the obtained data.
In some embodiments, the instruction path includes: an instruction reading unit, configured to read the data access request sent by the target thread; an instruction decoding unit, configured to decode the data access request read by the instruction reading unit; and an instruction issuing unit, configured to send the decoded data access request to the target register.
In some embodiments, the execution path includes: an operation unit, configured to perform operation processing on the obtained data; and a memory access unit, configured to output the operation results to the memory, and/or output the data stored in the memory to the operation unit for operation processing.
In some embodiments, each of the first register groups includes the same number of registers.
In a second aspect, an embodiment of the present disclosure provides a chip, which includes the processor described in any embodiment of the present disclosure.
In a third aspect, an embodiment of the present disclosure provides an electronic device, which includes the chip described in any embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure provides a data processing method, applied to the processing unit in the processor described in any embodiment of the present disclosure. The method includes: scheduling each thread among a plurality of threads; and, in response to a data access request of a target thread among the plurality of threads, accessing a target register in a register group allocated to the target thread.
In a fifth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the method described in any embodiment is implemented.
The embodiments of the present disclosure divide the register file into a first register group and at least one second register group. A second register group can be allocated to at least two threads, so that it can be accessed jointly by two or more threads, which realizes data reuse between threads. In addition, each first register group is allocated to only one thread and can therefore be accessed by that thread alone, so that the data of different threads still retains a certain degree of isolation.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Brief Description of the Drawings
The drawings herein illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
Figure 1 is a schematic diagram of how threads access a register file under multithreading in the related art.
Figure 2 is a schematic structural diagram of a processor according to an embodiment of the present disclosure.
Figure 3 is a schematic diagram of the mapping relationship between physical addresses and logical addresses according to an embodiment of the present disclosure.
Figures 4A and 4B are schematic diagrams of address mapping methods of the register file according to an embodiment of the present disclosure.
Figure 5 is a schematic diagram of the positional relationship between the first register group and the second register group according to an embodiment of the present disclosure.
Figures 6 and 7 are schematic diagrams of data operation processes according to an embodiment of the present disclosure.
Figure 8 is a schematic diagram of a chip according to an embodiment of the present disclosure.
Figure 9 is a flowchart of a data processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
To enable those skilled in the art to better understand the technical solutions in the embodiments of the present disclosure, and to make the above objects, features and advantages of the embodiments more apparent, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
In fields such as artificial intelligence and scientific computing, the design of high-performance processors is very important. In compute-intensive scenarios, improving processor performance requires solving the memory wall problem: data reuse lowers the demand for external bandwidth and raises the utilization of the operation units.
A traditional processor storage structure can be roughly divided into three levels: external memory, cache, and register file, with bandwidth ordered as external memory < cache < register file. The cache can be further divided into multiple levels, such as the L1/L2 caches of a central processing unit (CPU). Typical data reuse is implemented through the cache. Once data is stored in the cache, two kinds of reuse may occur: 1) different operation units access the same data; 2) a single operation unit accesses the same cached data multiple times. Such reuse can effectively reduce data accesses to external memory.
To improve scheduling efficiency, many processors introduce hardware multithreading. For example, a GPU schedules the execution of multiple threads at the same time. Threads can be organized into thread groups that cooperate to complete an overall computing task. During such cooperative computing, different threads need to exchange large amounts of data. In traditional processor designs, efficient inter-thread data exchange is generally achieved through on-chip storage units, such as the cache of a CPU or the shared memory of a GPU.
However, in compute-intensive scenarios, the bandwidth of the on-chip storage units is sometimes still insufficient for the computation. In that case, the register file can additionally be used for data reuse. For example, when performing a convolution operation, a block of a feature map can first be stored in the register file and then used multiple times by the operation unit. In conventional processor designs, the register file is generally thread-private, so only a single thread can reuse the data in its part of the register file, and different threads cannot reuse each other's register data. For example, in Figure 1, thread 1 can only access register 1, register 2, and register 3 in the register file; thread 2 can only access register 4 and register 5; and thread 3 can only access register 6, register 7, and register 8. As a result, the data in a given register cannot be reused by multiple threads.
Based on this, an embodiment of the present disclosure provides a processor. Referring to Figure 2, the processor includes:
a first register file 201, the first register file 201 including at least one first register group 2011 and at least one second register group 2012, where each first register group 2011 and each second register group 2012 includes at least one register R, each first register group 2011 is to be allocated to one thread among a plurality of threads, and each second register group 2012 is to be allocated to at least two threads among the plurality of threads; and
a processing unit 202, configured to schedule each of the plurality of threads and, in response to a data access request from a target thread among the plurality of threads, to access a target register in a register group allocated to the target thread, where the register group includes the first register group or the second register group.
The processor in the embodiments of the present disclosure may be any type of multithreaded processor, such as a CPU, a GPU, or a neural network processing unit (NPU); the embodiments of the present disclosure do not limit the type of processor. The processor may schedule multiple threads to process data in parallel, with each thread accessing its target registers.
The first register file 201 in the embodiments of the present disclosure may include at least two register groups. The first register file 201 may be divided into at least one first register group 2011 and at least one second register group 2012, and the number of first register groups 2011 may or may not equal the number of second register groups 2012. A first register group 2011 may include at least one register, and a second register group 2012 may also include at least one register. In addition, the number of registers in a first register group 2011 may or may not equal the number of registers in a second register group 2012, and different first register groups 2011 may or may not include the same number of registers.
In some embodiments, the number of first register groups 2011 may be determined based on the number of threads. For example, the number of first register groups 2011 may equal the number of threads, so that each thread is allocated one first register group 2011. Alternatively, the number of first register groups 2011 may be an integer multiple of the number of threads, so that each thread is allocated one or more first register groups 2011. For simplicity, Figure 2 illustrates the case in which there are two first register groups 2011, one second register group 2012, and two threads, with each thread allocated one first register group, where R denotes a register. Those skilled in the art will understand that this is merely an example; in practice, the number of first register groups 2011, the number of second register groups 2012, the number of threads, and/or the number of first register groups allocated to each thread may take other values, which are not detailed here. For ease of explanation, the embodiments below assume that each thread is allocated one first register group and that all first register groups include the same number of registers.
The number of first register groups 2011, the number of second register groups 2012, the number of registers in each first register group 2011, and the number of registers in each second register group 2012 may each be set through configuration information. In some embodiments, the configuration information only needs to specify the number of first register groups 2011 and the number of second register groups 2012, without specifying which particular registers make up each first register group 2011 and second register group 2012, which makes the configuration more flexible. The configuration information may be generated by a controller, and the processor may, based on the configuration information, automatically designate the corresponding number of registers to form the first register groups 2011 and the second register groups 2012. When at least one of these four quantities needs to change, only the corresponding configuration information needs to be changed.
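The configuration step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the processor assigns registers contiguously (all first register groups first, then the second register groups); the function name `partition_register_file` and its parameters are hypothetical and do not come from the disclosure.

```python
def partition_register_file(total_regs, num_first, regs_per_first,
                            num_second, regs_per_second):
    """Split physical registers 0..total_regs-1 into first and second groups.

    Assumption for this sketch: groups are laid out contiguously, with all
    first (private) groups placed before the second (shared) groups.
    """
    needed = num_first * regs_per_first + num_second * regs_per_second
    if needed > total_regs:
        raise ValueError("configuration needs more registers than available")
    first = [list(range(i * regs_per_first, (i + 1) * regs_per_first))
             for i in range(num_first)]
    base = num_first * regs_per_first  # first physical address after private groups
    second = [list(range(base + j * regs_per_second,
                         base + (j + 1) * regs_per_second))
              for j in range(num_second)]
    return first, second
```

Changing the layout then only requires calling the function with different counts, which mirrors the point that only the configuration information has to change.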
For ease of explanation, the first register group 2011 and the second register group 2012 are collectively referred to below as register groups; a register group mentioned below may be either a first register group 2011 or a second register group 2012.
A first register group 2011 can be allocated to only one thread; once allocated, it is invisible to all other threads. This provides data isolation between threads. Each second register group 2012 can be allocated to at least two threads, which allows those threads to reuse the same data. For example, in Figure 2, one first register group 2011 may be allocated to thread 0, the other first register group 2011 may be allocated to thread 1, and the second register group 2012 may be allocated to both thread 0 and thread 1.
Each thread can access a target register in a register group allocated to that thread, either to read data from the target register or to write data to it. The number of target registers may be one or more. Each thread can be scheduled by the processing unit 202. When scheduling a target thread, the processing unit 202 can, in response to a data access request from that thread, access the target register in the register group allocated to the target thread.
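The visibility rule for the Figure 2 example can be sketched as follows. The dictionaries and the group names (`first_group_0`, `second_group_0`, and so on) are hypothetical labels introduced only for this illustration.

```python
# Hypothetical model of the Figure 2 example: two private (first) register
# groups and one shared (second) register group.
PRIVATE_OWNER = {"first_group_0": 0, "first_group_1": 1}  # group -> single owner
SHARED_USERS = {"second_group_0": {0, 1}}                 # group -> sharing threads


def visible_groups(thread_id):
    """Return the names of the register groups that thread_id may access."""
    private = [g for g, owner in PRIVATE_OWNER.items() if owner == thread_id]
    shared = [g for g, users in SHARED_USERS.items() if thread_id in users]
    return sorted(private + shared)
```

Under these assumptions, thread 0 sees its own first register group plus the shared group, while the first register group of thread 1 stays invisible to it.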
The data access request may carry the logical address of the target register. The processing unit maps the logical address to the physical address of the target register and then accesses the target register based on that physical address. The logical address of a register identifies the register within the register group allocated to a given thread, whereas the physical address of a register identifies the register within the register file. Both physical and logical addresses may be numbered sequentially with integers (for example, 0, 1, 2, 3, ...). Registers in register groups allocated to different threads may have the same logical address, but different registers always have different physical addresses. For example, in the embodiment shown in Figure 3, the registers with physical addresses 0, 1, and 2 belong to the register group allocated to thread 0, and their logical addresses are 0, 1, and 2, respectively. The registers with physical addresses 3, 4, and 5 belong to the register group allocated to thread 1, and their logical addresses are also 0, 1, and 2, respectively.
By mapping the logical address to a physical address, the processing unit 202 identifies a unique target register and therefore accesses the correct register. For example, when thread 1 accesses a target register whose logical address is 0, that logical address must be mapped to the physical address of the register in the first register file 201 (namely, 3). In some embodiments, the logical addresses of registers in the first register group and in the second register group may be assigned independently. For example, in the embodiment shown in Figure 2, the logical addresses of the registers in the first register group allocated to thread 0 may be integers starting from 0 (for example, 0, 1, 2, 3, ...), and the logical addresses of the registers in the second register group allocated to thread 0 may also be integers starting from 0. Because the first register group and the second register group occupy different positions in the register file, the processing unit 202 can determine the physical address of the target register from two pieces of information together: the type of the register group to which the target register belongs, and the position of that register group in the register file. The type of a register group indicates whether it is a first register group or a second register group.
Specifically, when the target register is a register in the first register group, the physical address may be determined based on the logical address and the total number of registers allocated to all preceding threads. The preceding threads are the threads whose thread number (thread ID) is smaller than that of the target thread. For example, if the thread IDs are the integers 0, 1, 2, and so on, and the thread ID of the target thread is 2, the preceding threads are the thread with thread ID 0 and the thread with thread ID 1. Summing the number of registers allocated to each preceding thread gives the total number of registers allocated to the preceding threads. When every thread is allocated the same number of registers, this total equals the product of the number of preceding threads and the number of registers allocated to a single thread. When k first register groups (k being a positive integer) are allocated to each thread, the number of registers allocated to a single thread is the total number of registers in those k register groups.
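The general rule (sum over preceding threads, then add the logical address) can be sketched as below. The sketch assumes that private registers are laid out in thread-ID order starting at physical address 0; the list `regs_per_thread` and both function names are hypothetical.

```python
def private_base(thread_id, regs_per_thread):
    """Total number of registers allocated to threads with a smaller ID."""
    return sum(regs_per_thread[:thread_id])


def private_physical_addr(thread_id, reg_id, regs_per_thread):
    """Map a logical address in a thread's private registers to a physical one.

    Assumption for this sketch: private registers are packed in thread-ID
    order starting at physical address 0.
    """
    if not 0 <= reg_id < regs_per_thread[thread_id]:
        raise ValueError("logical address outside the thread's private registers")
    return private_base(thread_id, regs_per_thread) + reg_id
```

When all threads own the same number of registers, the sum collapses to a single multiplication, which is the simplification described above.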
As shown in Figure 4A, assume that the total number of threads is N and k = 1, so that the total number of first register groups is also N. Assume further that each first register group includes M registers, that the second register group includes P registers, and that the second register group is shared by all N threads. The storage space formed by the second register group is also called the shared space. When the target register is a register in the first register group, the physical address physical_addr of the target register can be written as:
physical_addr = reg_id + thread_id * M;
where reg_id is the logical address of the target register, thread_id is the thread ID, and M is the number of registers in a first register group. An adder and a multiplier can implement the addition and the multiplication in this formula, respectively, to obtain the physical address.
When the target register is a register in the second register group, the physical address may be determined based on the logical address and the total number of registers allocated privately to all threads. Still taking the situation shown in Figure 4A as an example, when the target register is a register in the second register group, the physical address physical_addr of the target register can be written as:
physical_addr = reg_id + N * M;
where N is the total number of threads.
In the above embodiment, the physical addresses of the registers in the second register group are larger than those of the registers in the first register groups; that is, the second register group occupies the later registers of the register file and the first register groups occupy the earlier registers (case one). In practice, the physical addresses of the registers in the second register group may instead be smaller than those of the registers in the first register groups; that is, the second register group occupies the earlier registers of the register file and the first register groups occupy the later registers (case two). Alternatively, several registers in the middle of the register file may form the second register group, with the registers before and after them forming the first register groups (case three).
Figure 5 shows these three layouts of the first and second register groups in the register file. Gray squares represent registers in the second register group, white squares represent registers in the first register groups, and the number in each square is the physical address of that register. The physical address is calculated differently for each layout. As before, assume that the total number of threads is N and k = 1, so that the total number of first register groups is also N; that each first register group includes M registers; that the second register group includes P registers; and that the second register group is shared by all N threads. In case two, when the target register is a register in the first register group, the physical address physical_addr of the target register can be written as:
physical_addr = P + reg_id + thread_id * M.
When the target register is a register in the second register group, the physical address of the target register equals its logical address.
In case three, when the target register is a register in the first register group, assume that X first register groups are located before the second register group in the register file. The physical address physical_addr of the target register can then be written as:
physical_addr = reg_id + thread_id * M, 0 ≤ thread_id < X;
physical_addr = P + reg_id + thread_id * M, thread_id ≥ X.
When the target register is a register in the second register group, the physical address physical_addr of the target register can be written as:
physical_addr = reg_id + X * M.
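The formulas for the three layouts can be collected into one sketch and checked for consistency: every register should receive a distinct physical address and the addresses should fill the register file without gaps. The function names and the `layout` and `shared` parameters are assumptions introduced for this illustration; N, M, P, and X follow the definitions in the text.

```python
def physical_addr(layout, reg_id, thread_id, shared, N, M, P, X=0):
    """Logical-to-physical mapping for layout cases one, two, and three.

    shared=True means the target register is in the second register group;
    thread_id is ignored in that case.
    """
    if layout == 1:  # shared space after all private groups
        return reg_id + N * M if shared else reg_id + thread_id * M
    if layout == 2:  # shared space before all private groups
        return reg_id if shared else P + reg_id + thread_id * M
    if layout == 3:  # shared space after the first X private groups
        if shared:
            return reg_id + X * M
        if thread_id < X:
            return reg_id + thread_id * M
        return P + reg_id + thread_id * M
    raise ValueError("layout must be 1, 2, or 3")


def all_addresses(layout, N, M, P, X=0):
    """Sorted physical addresses of every private and shared register."""
    private = [physical_addr(layout, r, t, False, N, M, P, X)
               for t in range(N) for r in range(M)]
    shared = [physical_addr(layout, r, 0, True, N, M, P, X) for r in range(P)]
    return sorted(private + shared)
```

For N = 4, M = 3, P = 2, and X = 2, each layout yields exactly the addresses 0 through 13, which confirms that the piecewise formula for case three needs the lower bound to include thread 0.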
In some embodiments, the first register file is divided into at least one storage unit (Bank), and each storage unit includes at least one first register group and at least one second register group. Different storage units are physically isolated from one another and correspond to different threads. A first register group in a storage unit is allocated to one thread corresponding to that storage unit, and a second register group in a storage unit is allocated to at least two threads corresponding to that storage unit. This arrangement is called interleaving. For a register file with multiple Banks, the shared space is distributed evenly across the Banks by interleaving so that accesses are spread uniformly.
Taking Figure 4B as an example, there are K Banks, and a shared space of P registers is reserved on each Bank, giving a total shared capacity of K*P registers. The shared space on each storage unit can be allocated to any thread. The logical addresses of the shared space run from 0 to K*P-1, and the logical addresses of the registers in the shared space (that is, the second register group) may be interleaved across the Banks. As shown in the figure, the registers of the second register group on Bank_0 have logical addresses K*(0~P-1), that is, register numbers 0, K, 2K, ..., (P-1)*K; the registers of the second register group on Bank_1 have logical addresses 1+K*(0~P-1), that is, register numbers 1, K+1, 2K+1, ..., (P-1)*K+1; and so on. For example, with K=3 and P=5, the registers of the second register group are numbered 0/3/6/9/12 on Bank_0, 1/4/7/10/13 on Bank_1, and 2/5/8/11/14 on Bank_2. Those skilled in the art will understand that the interleaving scheme is not limited to this one.
Each Bank also reserves storage space dedicated to individual threads (that is, first register groups). For example, Bank_0 reserves dedicated space for thread 0, thread K, ..., thread N-K, and Bank_1 reserves dedicated space for thread 1, thread K+1, ..., thread N-K+1. For ease of explanation, assume that the dedicated space reserved for each thread on a Bank has size M, so that each thread's logical addresses on its Bank run from 0 to M-1. The registers with addresses 0 to M-1 reserved for a thread form one first register group; that is, the first register group allocated to a thread includes M registers.
Different storage units correspond to different threads. Taking 9 threads and 3 storage units as an example, Bank_0 corresponds to thread 0, thread 3, and thread 6, and the first register groups in Bank_0 can be allocated to thread 0, thread 3, and thread 6, respectively. Bank_1 corresponds to thread 1, thread 4, and thread 7, and the first register groups in Bank_1 can be allocated to thread 1, thread 4, and thread 7, respectively. Bank_2 corresponds to thread 2, thread 5, and thread 8, and the first register groups in Bank_2 can be allocated to thread 2, thread 5, and thread 8, respectively. Interleaving schemes other than the one described above may also be used; they are not enumerated here.
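The interleaved numbering of the shared space can be reproduced with a short sketch. The function name `shared_logical_ids_on_bank` is hypothetical; K and P follow the text.

```python
def shared_logical_ids_on_bank(bank, K, P):
    """Logical addresses of the shared registers that reside on a given Bank.

    With K Banks and P shared registers per Bank, Bank b holds the logical
    addresses b, b + K, b + 2K, ..., b + (P - 1) * K.
    """
    if not 0 <= bank < K:
        raise ValueError("bank index out of range")
    return [bank + K * slot for slot in range(P)]
```

With K = 3 and P = 5 the three Banks together cover the logical addresses 0 to 14 exactly once, so consecutive shared addresses fall on different Banks.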
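The correspondence between storage units and threads in this example follows a simple modulo rule, sketched below. The function name `threads_on_bank` is hypothetical.

```python
def threads_on_bank(bank, N, K):
    """IDs of the threads whose private registers reside on the given Bank.

    Assumes the interleaving of the example above: thread t is placed on
    Bank t % K.
    """
    return [t for t in range(N) if t % K == bank]
```

For N = 9 and K = 3 this gives threads 0, 3, 6 on Bank_0, threads 1, 4, 7 on Bank_1, and threads 2, 5, 8 on Bank_2, matching the listing in the text.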
In the multi-Bank case above, if the target register is a register in the first register group, the storage unit containing the target register can be determined from the number of the target thread and the number of storage units, and the physical address can be determined from the total number of register groups allocated to the preceding threads, the number of storage units, and the logical address. The preceding threads are the threads whose thread number is smaller than that of the target thread. For example, the number physical_bank of the storage unit containing the target register can be written as:
physical_bank = thread_id % K.
The physical address physical_addr of the target register on that storage unit can be written as:
physical_addr = reg_id + thread_id / K * M;
where % denotes the remainder operator, / denotes division with the quotient rounded down to an integer, and the remaining symbols have the same meaning as in the preceding embodiments. Again take N = 9 threads and K = 3 storage units as an example, and assume M = 3. The thread numbers are then 0 to 8, and the logical addresses of the registers in the first register group allocated to each thread are 0 to 2. When the target register is the register with logical address 1 in the first register group allocated to thread 0, the number of the storage unit containing the target register is 0 % 3 = 0, that is, Bank_0, and the physical address of the target register on Bank_0 is 1 + 0 / 3 * 3 = 1. When the target register is the register with logical address 2 in the first register group allocated to thread 1, the number of the storage unit containing the target register is 1 % 3 = 1, that is, Bank_1, and the physical address of the target register on Bank_1 is 2 + 1 / 3 * 3 = 2, because the integer quotient 1 / 3 is 0 and thread 1 is the first thread placed on Bank_1.
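The multi-Bank mapping for private registers can be written out as a small sketch. It assumes that the division in the formula is integer (floor) division, which is what keeps the resulting address an integer; the function name `map_private` is hypothetical.

```python
def map_private(reg_id, thread_id, K, M):
    """Return (physical_bank, physical_addr) for a private register.

    Assumption for this sketch: thread_id / K in the formula is floor
    division, so it counts how many earlier threads share the same Bank.
    """
    if not 0 <= reg_id < M:
        raise ValueError("logical address outside the first register group")
    bank = thread_id % K
    addr = reg_id + (thread_id // K) * M
    return bank, addr
```

With K = 3 and M = 3, thread 4 is the second thread on Bank_1, so its logical address 0 lands at physical address 3 on that Bank, directly after the three registers of thread 1.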
If the register accessed by the data access request is a register in the second register group, the storage unit containing the target register can be determined from the logical address and the number of storage units, and the physical address can be determined from the total number of register groups allocated to all threads, the number of storage units, and the logical address. For example, the number physical_bank of the storage unit containing the target register can be written as:
physical_bank = reg_id % K.
The physical address physical_addr of the target register on that storage unit can be written as:
physical_addr = reg_id / K + N / K * M.
Again take N = 9 threads, K = 3 storage units, and M = 3 as an example, and assume P = 5. When the target register is the register numbered 6 in the second register group, the number of the storage unit containing the target register is 6 % 3 = 0, that is, the target register is on Bank_0, and the physical address of the target register is 6 / 3 + 9 / 3 * 3 = 11.
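The shared-register mapping can be sketched in the same way. The sketch assumes floor division and that N is divisible by K, so that every Bank holds the same number of private registers; the function name `map_shared` is hypothetical.

```python
def map_shared(reg_id, N, K, M):
    """Return (physical_bank, physical_addr) for a shared register.

    Assumptions for this sketch: divisions are floor divisions, and N is a
    multiple of K so each Bank holds N // K private register groups.
    """
    if N % K != 0:
        raise ValueError("this sketch assumes N is a multiple of K")
    bank = reg_id % K
    addr = reg_id // K + (N // K) * M  # skip the private area of the Bank
    return bank, addr
```

For N = 9, K = 3, M = 3, and P = 5, the 15 shared logical addresses map to physical addresses 9 to 13 on each of the three Banks, directly above the 9 private registers of each Bank.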
In some embodiments, K may be set to a power of 2 and N may be set to an integer multiple of K. If a quotient in the above formulas (for example, reg_id / K) is not an integer, it may be rounded down so that the resulting physical address is an integer.
After the target register has been determined in the manner described above, it can be accessed, for example to read data from it. In some embodiments, the data read from the target register can serve as the access address of another register and be used to access the data in that other register. For example, the target register may be a register in the first register group. The processing unit 202 may first obtain the physical address of the target register (the index register number), use the data read from the target register at that address as index information (that is, address information, shown as the index register value in Figure 6), and access a register in the second register group based on the index information. For example, suppose the logical address of the target register in the first register group is A1. Its physical address, say A2, is calculated as described above. The register at physical address A2 is accessed to obtain its data A3. A3 is then treated as the logical address of a register in the second register group, the physical address of that register, say A4, is calculated from A3, and the register at address A4 is accessed. This process is illustrated in Figure 6. The instruction path reads the instructions issued by multiple threads (that is, the data access requests described above) based on the multithread context information, decodes the data access requests, and sends them to the execution path. The data access request may be used to access a target register in the first register group. After the data in the target register has been obtained, that data is used as index information to access a register in the second register group. This is called indirect addressing.
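The A1 to A4 chain of indirect addressing can be traced with a minimal sketch. It models the register file as a Python list and assumes the single-Bank layout in which the shared space follows the private groups; the constants and function names are hypothetical.

```python
N, M, P = 2, 3, 2  # hypothetical sizes: 2 threads, 3 private and 2 shared registers


def private_addr(thread_id, reg_id):
    """Physical address of a private register (shared space placed last)."""
    return reg_id + thread_id * M


def shared_addr(reg_id):
    """Physical address of a shared register (shared space placed last)."""
    return reg_id + N * M


def indirect_read(regfile, thread_id, index_reg_id):
    """Read a shared register whose logical address is stored in a private one."""
    a2 = private_addr(thread_id, index_reg_id)  # A1 -> A2
    a3 = regfile[a2]                            # data A3 read from address A2
    a4 = shared_addr(a3)                        # A3 -> A4
    return regfile[a4]
```

In this model, two threads that store the same index value in their own private registers end up reading the same shared register, which is the reuse the shared space is meant to enable.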
The above solution can be extended to processors that contain parallel computing units, such as CPUs or GPUs containing Single Instruction Multiple Data (SIMD) units. Referring to Figure 7, the processing unit 202 may first obtain the physical address of the register to be accessed in the second register file (i.e., the index register number), obtain, based on the index register number, the index information read from the second register file (i.e., the index register value in the figure), and access the target register based on the index information read from the second register file. The target register here may be either a register in the first register group or a register in the second register group. The main idea of the embodiments of the present disclosure is that the processor may contain two independent register files: the second register file may be a vector register file used to store SIMD data or Single Instruction Multiple Threads (SIMT) data for parallel computation, and the first register file may be a scalar register file used to store simple scalar data or control information. A value read from the scalar register file can be used as an index to access the vector register file, and the data obtained from the vector register file is then sent to the execution path. Since the core operations take place in the vector register file, shared-space support is added mainly to the vector register file.
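As a non-limiting illustration, the two-step access described above (read an index register value from the scalar register file, then use it to select a register in the vector register file) may be sketched as follows. The class `RegisterFile`, the function `indirect_read`, and all sizes are hypothetical names chosen for this sketch and do not appear in the disclosure.

```python
# Hypothetical software model; names and sizes are illustrative only.
class RegisterFile:
    """A flat array of registers addressed by physical register number."""

    def __init__(self, num_registers, init=0):
        self.regs = [init] * num_registers

    def read(self, number):
        return self.regs[number]

    def write(self, number, value):
        self.regs[number] = value


def indirect_read(scalar_rf, vector_rf, index_register_number):
    """Read a vector register whose number is stored in a scalar register."""
    # Step 1: read the index register value from the scalar register file.
    index_value = scalar_rf.read(index_register_number)
    # Step 2: use that value as the register number in the vector register file.
    return vector_rf.read(index_value)


scalar_rf = RegisterFile(8)
vector_rf = RegisterFile(16, init=None)
vector_rf.write(5, [1.0, 2.0, 3.0, 4.0])  # one SIMD register holding 4 lanes
scalar_rf.write(2, 5)                      # scalar register 2 holds the index 5
operand = indirect_read(scalar_rf, vector_rf, 2)
```

In this sketch `operand` is the content of vector register 5, which would then be sent to the execution path.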
In some embodiments, in addition to indirect addressing, the data read from a register can also be used directly for data operations; this approach is called direct addressing. With direct addressing, the data read from a register in the second register file or from a register in the first register group is no longer used as index information to access other registers, but is used directly in data operations (e.g., multiplication, addition, etc.). To distinguish direct addressing from indirect addressing, the data access request may include an indication bit that indicates whether the data read from the target register is to be used as the index information. Specifically, the indication bit may have two indication states: when the indication bit is in the first indication state, the data read from the target register is used as the index information; when the indication bit is in the second indication state, the data read from the target register is not used as the index information. In some embodiments, the first indication state and the second indication state may be represented by at least one data bit. For example, binary "0" may represent the first indication state and binary "1" the second indication state. Of course, the representation of the indication states is not limited to this; those skilled in the art may represent the different indication states in other ways according to the actual situation, which are not enumerated here.
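As a non-limiting illustration, the role of the indication bit may be sketched as follows, using the example encoding given above ("0" for the first indication state, "1" for the second). The names `AccessRequest`, `fetch_operand`, `index_regs`, and `data_regs` are hypothetical and are introduced only for this sketch.

```python
from dataclasses import dataclass

INDIRECT = 0  # first indication state: the value read is index information
DIRECT = 1    # second indication state: the value read is the operand itself


@dataclass
class AccessRequest:
    register_number: int  # register named by the request
    indication_bit: int   # INDIRECT or DIRECT


def fetch_operand(request, index_regs, data_regs):
    """Return the operand selected by a request, honoring the indication bit."""
    value = index_regs[request.register_number]
    if request.indication_bit == INDIRECT:
        # Indirect addressing: the value read selects another register.
        return data_regs[value]
    # Direct addressing: the value read is used in the operation as-is.
    return value


index_regs = [3, 7, 1, 0]
data_regs = [10, 20, 30, 40]
direct_operand = fetch_operand(AccessRequest(0, DIRECT), index_regs, data_regs)
indirect_operand = fetch_operand(AccessRequest(0, INDIRECT), index_regs, data_regs)
```

The same register number yields different operands depending only on the indication bit, which is why a single bit is enough to tell the two addressing modes apart.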
In some embodiments, the processor further includes an instruction path for sending a data access request for the target register, and an execution path for obtaining the data transmitted by the target register in response to the data access request and performing operations on the obtained data. The solutions of the embodiments of the present disclosure can be applied to both the direct addressing scenario and the indirect addressing scenario described above.
Specifically, the instruction path may include: an instruction reading unit for reading the data access request sent by the target thread; an instruction decoding unit for decoding the data access request read by the instruction reading unit; and an instruction issuing unit for sending the decoded data access request to the target register. The target register may output its stored data to the execution path for operation processing, or may return its stored data to the instruction issuing unit as index information, so that the instruction issuing unit sends the data stored in the corresponding register to the execution path for operation processing based on the index information.
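As a non-limiting illustration, the three stages of the instruction path may be modeled as three functions applied in sequence. The request encoding used here (bit 0 as the indication bit, bits 1 to 8 as the register number) is an assumption made only for this sketch; the disclosure does not specify an encoding. The names `fetch`, `decode`, and `issue` are likewise hypothetical.

```python
from collections import deque


def fetch(queues, thread_id):
    """Instruction reading unit: take the next raw request of a thread."""
    return queues[thread_id].popleft()


def decode(raw):
    """Instruction decoding unit: split a raw word into fields.

    Assumed encoding: bit 0 = indication bit (0 means "use as index"),
    bits 1..8 = register number.
    """
    return {"use_as_index": (raw & 1) == 0, "register": (raw >> 1) & 0xFF}


def issue(decoded, registers):
    """Instruction issuing unit: send the request to the target register.

    The register's data either goes on to the execution path, or comes back
    as index information that selects the register whose data is sent on.
    """
    value = registers[decoded["register"]]
    if decoded["use_as_index"]:
        return registers[value]  # second access, selected by the index
    return value


registers = [4, 11, 22, 33, 44]
queues = {0: deque([0b1, 0b0])}  # direct read of r0, then indirect read via r0
direct_value = issue(decode(fetch(queues, 0)), registers)
indirect_value = issue(decode(fetch(queues, 0)), registers)
```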
In some embodiments, the execution path includes: an operation unit for performing operation processing on the obtained data; and a memory access unit for outputting operation results to memory, and/or outputting data stored in memory to the operation unit for operation processing. The operation unit may include one or more sub-operation units, such as an addition unit, a multiplication unit, or a convolution multiplication unit; the number and types of sub-operation units included in the operation unit may be set according to actual requirements. The memory access unit implements data transfer between the operation unit and memory: when the register file does not contain the data required for an operation, the memory access unit can access memory to obtain the corresponding data. Further, data obtained from a scalar register may also be output, as data to be operated on, to a scalar execution unit for processing.
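As a non-limiting illustration, the division of labor between the operation unit and the memory access unit may be sketched as follows. The class `ExecutionPath`, the dictionary standing in for memory, and the addresses are hypothetical and serve only this sketch.

```python
import operator


class ExecutionPath:
    """Hypothetical model of an operation unit plus a memory access unit."""

    def __init__(self, memory):
        self.memory = memory  # dict: address -> value, standing in for memory
        # Sub-operation units; the set of units is configurable in practice.
        self.sub_units = {"add": operator.add, "mul": operator.mul}

    def load(self, address):
        """Memory access unit: bring an operand in from memory."""
        return self.memory[address]

    def store(self, address, value):
        """Memory access unit: write an operation result back to memory."""
        self.memory[address] = value

    def execute(self, op, a, b):
        """Operation unit: dispatch to the matching sub-operation unit."""
        return self.sub_units[op](a, b)


memory = {0x10: 6}
path = ExecutionPath(memory)
register_operand = 7               # value delivered by a register
memory_operand = path.load(0x10)   # value not present in the register file
result = path.execute("mul", register_operand, memory_operand)
path.store(0x20, result)
```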
Referring to Figure 8, an embodiment of the present disclosure further provides a chip. The chip includes a processor 801, which may be the processor described in any of the above embodiments. In some embodiments, the chip may be used in an AI accelerator card. In some embodiments, the chip further includes a controller 802 for configuring at least one of the following: first quantity information of the registers included in the first register group, second quantity information of the registers included in the second register group, the number of first register groups, and the number of second register groups.
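As a non-limiting illustration, the four configurable quantities may be collected in one record and checked against the capacity of the register file before being applied. The names `RegisterFileConfig` and `configure`, and the capacity check itself, are assumptions of this sketch rather than features stated in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFileConfig:
    regs_per_private_group: int  # first quantity information
    regs_per_shared_group: int   # second quantity information
    num_private_groups: int      # number of first register groups
    num_shared_groups: int       # number of second register groups

    def total_registers(self):
        return (self.regs_per_private_group * self.num_private_groups
                + self.regs_per_shared_group * self.num_shared_groups)


def configure(config, physical_registers):
    """Accept a configuration only if it fits in the physical register file."""
    if min(config.regs_per_private_group, config.regs_per_shared_group,
           config.num_private_groups, config.num_shared_groups) < 1:
        raise ValueError("every group count and group size must be at least 1")
    if config.total_registers() > physical_registers:
        raise ValueError("configuration exceeds the register file capacity")
    return config


config = configure(RegisterFileConfig(16, 64, 8, 1), physical_registers=256)
```

Here eight private groups of 16 registers and one shared group of 64 registers occupy 192 of the 256 assumed physical registers.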
For details of this embodiment of the present disclosure, refer to the foregoing processor embodiments; they are not repeated here.
An embodiment of the present disclosure further provides an electronic device including the chip described in any of the above embodiments.
Referring to Figure 9, an embodiment of the present disclosure further provides a data processing method applied to the processing unit in the processor described in any embodiment of the present disclosure. The method includes:
Step 901: schedule each thread of a plurality of threads;
Step 902: in response to a data access request of a target thread of the plurality of threads, access a target register in a register group allocated to the target thread, wherein the register group includes the first register group or the second register group.
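As a non-limiting illustration, steps 901 and 902 may be sketched with a small software model in which each thread owns one private register group and all threads see one shared register group. The class `ProcessingUnit`, the string labels `"private"` and `"shared"`, and the scheduling order are hypothetical choices made for this sketch.

```python
class ProcessingUnit:
    """Hypothetical model of the processing unit's two method steps."""

    def __init__(self, private_groups, shared_group):
        # private_groups: thread id -> registers owned by that thread only
        # shared_group: registers visible to every thread in this model
        self.private_groups = private_groups
        self.shared_group = shared_group

    def schedule(self):
        """Step 901: visit every thread (ascending order is an assumption)."""
        return sorted(self.private_groups)

    def access(self, thread_id, group, offset):
        """Step 902: read the target register in a group allocated to the thread."""
        if group == "private":
            return self.private_groups[thread_id][offset]
        if group == "shared":
            return self.shared_group[offset]
        raise ValueError(f"unknown register group: {group}")


unit = ProcessingUnit({0: [1, 2], 1: [3, 4]}, shared_group=[9, 8, 7])
reads = [(t, unit.access(t, "private", 0), unit.access(t, "shared", 2))
         for t in unit.schedule()]
```

Each thread reads a different value from its private group but the same value from the shared group.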
In some embodiments, the data access request of the target thread carries a logical address of the target register; accessing, in response to the data access request of the target thread of the plurality of threads, the target register in the register group allocated to the target thread includes: mapping the logical address to a physical address of the target register; and accessing the target register based on the physical address.
In some embodiments, mapping the logical address to the physical address of the target register includes: in a case where the target register is a register in the first register group, determining the physical address based on the logical address and the total number of registers allocated to the prior threads, the prior threads being the threads whose thread numbers are smaller than that of the target thread; and/or, in a case where the target register is a register in the second register group, determining the physical address based on the logical address and the total number of registers allocated to the threads.
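As a non-limiting illustration, one mapping consistent with the above description is sketched below. It assumes that the first register groups are laid out contiguously from physical address 0 in thread-number order, that the second register group follows them, and that a logical address is an offset within its group; these layout assumptions, and the names `map_private` and `map_shared`, are not stated in the disclosure.

```python
def map_private(logical, thread_id, regs_per_thread):
    """Physical address of a register in a thread's first (private) group.

    regs_per_thread[i] is the number of private registers allocated to thread i.
    """
    if not 0 <= logical < regs_per_thread[thread_id]:
        raise IndexError("logical address outside the thread's private group")
    prior_total = sum(regs_per_thread[:thread_id])  # registers of prior threads
    return prior_total + logical


def map_shared(logical, regs_per_thread, shared_size):
    """Physical address of a register in the second (shared) group."""
    if not 0 <= logical < shared_size:
        raise IndexError("logical address outside the shared group")
    # Assumed layout: the shared group follows all private groups.
    return sum(regs_per_thread) + logical


regs_per_thread = [4, 4, 8]
private_address = map_private(2, 1, regs_per_thread)
shared_address = map_shared(3, regs_per_thread, shared_size=16)
```

Under these assumptions, the same logical address used by two different threads maps to different physical registers in the private case and to the same physical register in the shared case.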
In some embodiments, the first register file is divided into at least one storage unit, each storage unit including at least one first register group and at least one second register group; different storage units are physically isolated from one another and correspond to different threads; a first register group included in a storage unit is for allocation to one thread corresponding to that storage unit, and a second register group included in a storage unit is for allocation to at least two threads corresponding to that storage unit. Mapping the logical address to the physical address of the target register includes: in a case where the target register is a register in the first register group, determining the storage unit in which the target register is located based on the thread number of the target thread and the number of storage units, and determining the physical address based on the total number of register groups allocated to the prior threads, the number of storage units, and the logical address, the prior threads being the threads whose thread numbers are smaller than that of the target thread; and/or, in a case where the register accessed by the data access request is a register in the second register group, determining the storage unit in which the target register is located based on the logical address and the number of storage units, and determining the physical address based on the total number of register groups allocated to the threads, the number of storage units, and the logical address.
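As a non-limiting illustration, one possible mapping for a register file divided into storage units is sketched below. The disclosure names the inputs of the mapping but this passage does not give a formula, so the following are assumptions of the sketch only: threads are assigned to storage units round-robin by thread number, every first register group has the same size, and shared registers are interleaved across storage units by logical address. The function names are hypothetical.

```python
def map_private_banked(logical, thread_id, regs_per_group, num_units):
    """Return (storage unit, address within the unit) for a private register."""
    unit = thread_id % num_units              # assumed round-robin placement
    prior_in_unit = thread_id // num_units    # earlier threads in the same unit
    return unit, prior_in_unit * regs_per_group + logical


def map_shared_banked(logical, threads_per_unit, regs_per_group, num_units):
    """Return (storage unit, address within the unit) for a shared register."""
    unit = logical % num_units                # assumed interleaving by address
    private_span = threads_per_unit * regs_per_group  # private groups come first
    return unit, private_span + logical // num_units


# Two storage units, four threads, four private registers per thread.
private_location = map_private_banked(1, thread_id=3, regs_per_group=4, num_units=2)
shared_location = map_shared_banked(5, threads_per_unit=2, regs_per_group=4, num_units=2)
```

With this placement, consecutive thread numbers land in different storage units, so their register accesses do not contend for the same physically isolated unit.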
In some embodiments, the target register is a register in the first register group; accessing, in response to the data access request of the target thread of the plurality of threads, the target register in the register group allocated to the target thread includes: using the data read from the target register as index information, and accessing a register in the second register group based on the index information.
In some embodiments, accessing, in response to the data access request of the target thread of the plurality of threads, the target register in the register group allocated to the target thread includes: obtaining index information read from a second register file; and accessing the target register based on the index information read from the second register file.
In some embodiments, the data access request includes an indication bit, and the indication bit indicates whether the data read from the second register file is to be used as index information for accessing the target register.
In some embodiments, the method further includes: sending a data access request for the target register through an instruction path; and obtaining, through an execution path, the data transmitted by the target register in response to the data access request, and performing operation processing on the obtained data.
In some embodiments, sending the data access request for the target register through the instruction path includes: reading, by an instruction reading unit in the instruction path, the data access request sent by the target thread; decoding, by an instruction decoding unit in the instruction path, the data access request read by the instruction reading unit; and sending, by an instruction issuing unit in the instruction path, the decoded data access request to the target register; and/or, obtaining through the execution path the data transmitted by the target register in response to the data access request and performing operation processing on the obtained data includes: performing operation processing on the obtained data by an operation unit in the execution path; and outputting, by a memory access unit in the execution path, operation results to memory, and/or outputting data stored in memory to the operation unit for operation processing.
In some embodiments, each first register group includes the same number of registers.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method described in any of the foregoing embodiments is implemented.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
From the description of the above implementations, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in certain parts of the embodiments, of this specification.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments. The apparatus embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and when the solutions of the embodiments of this specification are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of an embodiment. Persons of ordinary skill in the art can understand and implement this without creative effort.
The above are only specific implementations of the embodiments of this specification. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the embodiments of this specification, and these improvements and refinements shall also be regarded as falling within the protection scope of the embodiments of this specification.

Claims (15)

  1. A processor, characterized in that the processor comprises:
    a first register file, the first register file comprising at least one first register group and at least one second register group, each first register group and each second register group comprising at least one register, each first register group being for allocation to one thread of a plurality of threads, and each second register group being for allocation to at least two threads of the plurality of threads; and
    a processing unit, configured to schedule each thread of the plurality of threads and, in response to a data access request of a target thread of the plurality of threads, access a target register in a register group allocated to the target thread, wherein the register group comprises the first register group or the second register group.
  2. The processor according to claim 1, characterized in that the data access request of the target thread carries a logical address of the target register; the processing unit is configured to:
    map the logical address to a physical address of the target register; and
    access the target register based on the physical address.
  3. The processor according to claim 2, characterized in that, when mapping the logical address to the physical address of the target register, the processing unit is configured to:
    in a case where the target register is a register in the first register group, determine the physical address based on the logical address and the total number of registers allocated to the prior threads, the prior threads comprising the threads whose thread numbers are smaller than that of the target thread; or
    in a case where the target register is a register in the second register group, determine the physical address based on the logical address and the total number of registers allocated to the threads.
  4. The processor according to claim 2 or 3, characterized in that the first register file is divided into at least one storage unit, each storage unit comprising at least one first register group and at least one second register group; different storage units are physically isolated from one another and correspond to different threads; a first register group comprised in a storage unit is for allocation to one thread corresponding to that storage unit, and a second register group comprised in a storage unit is for allocation to at least two threads corresponding to that storage unit; the processing unit is configured to:
    in a case where the target register is a register in the first register group, determine the storage unit in which the target register is located based on the thread number of the target thread and the number of storage units, and determine the physical address based on the total number of register groups allocated to the prior threads, the number of storage units, and the logical address, the prior threads comprising the threads whose thread numbers are smaller than that of the target thread; or
    in a case where the register accessed by the data access request is a register in the second register group, determine the storage unit in which the target register is located based on the logical address and the number of storage units, and determine the physical address based on the total number of register groups allocated to the threads, the number of storage units, and the logical address.
  5. The processor according to claim 1, characterized in that the target register is a register in the first register group; the processing unit is configured to:
    use the data read from the target register as index information, and
    access a register in the second register group based on the index information.
  6. The processor according to claim 1, characterized in that the data access request comprises an indication bit, the indication bit indicating whether data read from a second register file is to be used as index information for accessing the target register; in a case where the indication bit indicates that the data read from the second register file is to be used as index information for accessing the target register, the processing unit is configured to:
    obtain the index information read from the second register file; and
    access the target register based on the index information read from the second register file.
  7. The processor according to any one of claims 1 to 6, characterized in that the processor further comprises:
    an instruction path, configured to send a data access request for the target register; and
    an execution path, configured to obtain the data transmitted by the target register in response to the data access request, and to perform operation processing on the obtained data.
  8. The processor according to claim 7, characterized in that the instruction path comprises:
    an instruction reading unit, configured to read the data access request sent by the target thread;
    an instruction decoding unit, configured to decode the data access request read by the instruction reading unit; and
    an instruction issuing unit, configured to send the decoded data access request to the target register.
  9. The processor according to claim 7 or 8, characterized in that the execution path comprises:
    an operation unit, configured to perform operation processing on the obtained data; and
    a memory access unit, configured to output operation results to memory, and/or to output data stored in memory to the operation unit for operation processing.
  10. The processor according to any one of claims 1 to 9, characterized in that each first register group comprises the same number of registers.
  11. A chip, characterized in that the chip comprises the processor according to any one of claims 1 to 10.
  12. The chip according to claim 11, characterized in that the chip further comprises a controller, the controller being configured to configure at least one of the following:
    first quantity information of the registers comprised in the first register group,
    second quantity information of the registers comprised in the second register group,
    the number of first register groups,
    the number of second register groups.
  13. An electronic device, characterized in that the electronic device comprises the chip according to claim 11 or 12.
  14. A data processing method, characterized in that it is applied to the processing unit in the processor according to any one of claims 1 to 10, the method comprising:
    scheduling each thread of a plurality of threads; and
    in response to a data access request of a target thread of the plurality of threads, accessing a target register in a register group allocated to the target thread, wherein the register group comprises the first register group or the second register group.
  15. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the method according to claim 14 is implemented.
PCT/CN2022/120893 2022-03-31 2022-09-23 Processor, chip, electronic device, and data processing method WO2023184900A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210345686.2 2022-03-31
CN202210345686.2A CN114942831A (en) 2022-03-31 2022-03-31 Processor, chip, electronic device and data processing method

Publications (1)

Publication Number Publication Date
WO2023184900A1 true WO2023184900A1 (en) 2023-10-05

Family

ID=82907731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/120893 WO2023184900A1 (en) 2022-03-31 2022-09-23 Processor, chip, electronic device, and data processing method

Country Status (2)

Country Link
CN (1) CN114942831A (en)
WO (1) WO2023184900A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117408194A (en) * 2023-12-15 2024-01-16 沐曦集成电路(南京)有限公司 Register access system based on chip
CN117667207A (en) * 2024-01-30 2024-03-08 北京壁仞科技开发有限公司 Scheduling method, scheduling system, processor and chip

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114942831A (en) * 2022-03-31 2022-08-26 上海阵量智能科技有限公司 Processor, chip, electronic device and data processing method
CN117076081A (en) * 2023-08-22 2023-11-17 上海合芯数字科技有限公司 Memory training method, device, storage medium, and program product
CN117389712B (en) * 2023-12-12 2024-03-12 沐曦集成电路(南京)有限公司 GPU multithread scheduling management system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1866237A (en) * 2005-05-19 2006-11-22 国际商业机器公司 Methods and apparatus for sharing processor resources
CN101344842A (en) * 2007-07-10 2009-01-14 北京简约纳电子有限公司 Multithreading processor and multithreading processing method
US20130246761A1 (en) * 2012-03-13 2013-09-19 International Business Machines Corporation Register sharing in an extended processor architecture
CN109522049A (en) * 2017-09-18 2019-03-26 展讯通信(上海)有限公司 The verification method and device of register are shared in a kind of synchronizing multiple threads system
CN113377438A (en) * 2021-08-13 2021-09-10 沐曦集成电路(上海)有限公司 Processor and data reading and writing method thereof
CN114942831A (en) * 2022-03-31 2022-08-26 上海阵量智能科技有限公司 Processor, chip, electronic device and data processing method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117408194A (en) * 2023-12-15 2024-01-16 沐曦集成电路(南京)有限公司 Register access system based on chip
CN117408194B (en) * 2023-12-15 2024-02-27 沐曦集成电路(南京)有限公司 Register access system based on chip
CN117667207A (en) * 2024-01-30 2024-03-08 北京壁仞科技开发有限公司 Scheduling method, scheduling system, processor and chip
CN117667207B (en) * 2024-01-30 2024-04-26 北京壁仞科技开发有限公司 Scheduling method, scheduling system, processor and chip

Also Published As

Publication number Publication date
CN114942831A (en) 2022-08-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934719

Country of ref document: EP

Kind code of ref document: A1