WO2023184900A1

WO2023184900A1 - Processor, chip, electronic device, and data processing method

Info

Publication number: WO2023184900A1
Application number: PCT/CN2022/120893
Authority: WO
Inventors: 王文强; 夏晓旭; 孙海涛; 徐宁仪
Original assignee: 上海商汤智能科技有限公司
Priority date: 2022-03-31
Filing date: 2022-09-23
Publication date: 2023-10-05
Also published as: CN114942831A

Abstract

Provided in the embodiments of the present disclosure are a processor, a chip, an electronic device, and a data processing method. The processor comprises: a first register file, wherein the first register file comprises at least one first register group and at least one second register group, each first register group and each second register group comprise at least one register, each first register group is allocated to one thread among a plurality of threads, and each second register group is allocated to at least two threads among the plurality of threads; and a processing unit, which is used for scheduling each thread among the plurality of threads, and accessing, in response to a data access request of a target thread among the plurality of threads, a target register in a register group which is allocated to the target thread, wherein the register group comprises the first register group or the second register group. The embodiments of the present disclosure implement data multiplexing between threads.

Description

Processors, chips, electronic devices and data processing methods

Cross reference statement

This application claims priority to the Chinese patent application with application number 202210345686.2 submitted to the China Patent Office on March 31, 2022, the entire content of which is incorporated into this application by reference.

Technical field

The present disclosure relates to the field of chip technology, and in particular to processors, chips, electronic devices, data processing methods and computer-readable storage media.

Background

In order to improve scheduling efficiency, many processors will introduce hardware multi-threading technology. For example, a graphics processor (Graphics Processing Unit, GPU) will schedule the execution of multiple threads at the same time. Threads can form thread blocks to collaborate to complete an overall computing task. In the process of collaborative computing, a large amount of data interaction is required between different threads. In order to improve the data transmission bandwidth, the register file can be used for data multiplexing. For example, when performing a convolution operation, a feature map can first be stored on the register file for multiple use by the computing unit. However, in conventional processor designs, the register file is generally thread-private, and data on the register file cannot be reused between different threads.

Summary

In a first aspect, an embodiment of the present disclosure provides a processor. The processor includes: a first register file, the first register file including at least one first register group and at least one second register group, where each first register group and each second register group includes at least one register, each first register group is used for allocation to one of a plurality of threads, and each second register group is used for allocation to at least two of the plurality of threads; and a processing unit for scheduling each of the plurality of threads and, in response to a data access request of a target thread of the plurality of threads, accessing a target register in a register group allocated to the target thread, wherein the register group includes the first register group or the second register group.

In some embodiments, the data access request of the target thread carries the logical address of the target register; the processing unit is configured to: map the logical address to the physical address of the target register; and access the target register based on the physical address.

In some embodiments, when mapping the logical address to the physical address of the target register, the processing unit is configured to: when the target register is a register in the first register group, determine the physical address based on the logical address and the total number of registers allocated to each prior thread, where the prior threads include each thread with a thread number smaller than that of the target thread; or, when the target register is a register in the second register group, determine the physical address based on the logical address and the total number of registers allocated to each thread.

In some embodiments, the first register file is divided into at least one storage unit, each storage unit including at least one first register group and at least one second register group; different storage units are physically isolated and correspond to different threads; the first register group included in a storage unit is used for allocation to a thread corresponding to that storage unit, and the second register group included in a storage unit is used for allocation to at least two threads corresponding to that storage unit; the processing unit is configured to: when the target register is a register in the first register group, determine the storage unit where the target register is located based on the thread number of the target thread and the number of storage units, and determine the physical address based on the total number of register groups allocated to each previous thread, the number of storage units, and the logical address, where the previous threads include each thread with a thread number smaller than that of the target thread; or, when the target register is a register in the second register group, determine the storage unit where the target register is located based on the logical address and the number of storage units, and determine the physical address based on the total number of register groups allocated to each thread, the number of storage units, and the logical address.

In some embodiments, the target register is a register in the first register group; the processing unit is configured to: use the data read from the target register as index information, and access a register in the second register group based on the index information.

In some embodiments, the data access request includes an indication bit, which is used to indicate whether to use the data read from the second register file as index information for accessing the target register; when the indication bit indicates that the data read from the second register file is used as index information for accessing the target register, the processing unit is configured to: obtain the index information read from the second register file; and access the target register based on the index information read from the second register file.

In some embodiments, the processor further includes: an instruction path for sending a data access request to the target register; and an execution path for obtaining data transmitted by the target register in response to the data access request, and performing operations on the obtained data.

In some embodiments, the instruction path includes: an instruction reading unit, used to read the data access request sent by the target thread; an instruction decoding unit, used to decode the data access request read by the instruction reading unit; and an instruction sending unit, used to send the decoded data access request to the target register.

In some embodiments, the execution path includes: an arithmetic unit, used to perform arithmetic processing on the acquired data; and a memory access unit, used to output the arithmetic results to the memory and/or to output the data stored in the memory to the arithmetic unit for arithmetic processing.

In some embodiments, each of the first register groups includes the same number of registers.

In a second aspect, an embodiment of the present disclosure provides a chip, which includes the processor described in any embodiment of the present disclosure.

In a third aspect, an embodiment of the disclosure provides an electronic device, which includes the chip described in any embodiment of the disclosure.

In a fourth aspect, an embodiment of the present disclosure provides a data processing method, which is applied to the processing unit in the processor according to any embodiment of the present disclosure. The method includes: scheduling each thread in a plurality of threads; and, in response to a data access request of a target thread among the plurality of threads, accessing a target register in a register group allocated to the target thread.

In a fifth aspect, embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the method described in any embodiment is implemented.

Embodiments of the present disclosure divide the register file into at least one first register group and at least one second register group. The second register group can be allocated to at least two threads, so that it can be accessed jointly by two or more threads, thus realizing data reuse between threads. In addition, since each first register group is allocated to only one thread, it can be accessed by that thread alone, so the data of different threads still retains a certain degree of isolation.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the disclosure.

Description of drawings

The drawings herein illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.

FIG. 1 is a schematic diagram of the manner in which threads access a register file in a multi-thread situation in the related art.

FIG. 2 is a schematic structural diagram of a processor according to an embodiment of the present disclosure.

Figure 3 is a schematic diagram of the mapping relationship between physical addresses and logical addresses according to an embodiment of the present disclosure.

FIG. 4A and FIG. 4B are respectively schematic diagrams of the address mapping method of the register file according to the embodiment of the present disclosure.

FIG. 5 is a schematic diagram of the positional relationship between the first register group and the second register group according to the embodiment of the present disclosure.

FIG. 6 and FIG. 7 are respectively schematic diagrams of the data operation process of the embodiment of the present disclosure.

Figure 8 is a schematic diagram of a chip according to an embodiment of the present disclosure.

Figure 9 is a flow chart of a data processing method according to an embodiment of the present disclosure.

Detailed description

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of the disclosure as detailed in the appended claims.

The terminology used in this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.

It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present disclosure, the first information may also be called second information, and similarly, the second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining."

In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present disclosure, and to make the above objects, features and advantages of the embodiments of the present disclosure more obvious and easy to understand, the technical solutions in the embodiments of the present disclosure are explained in further detail below in conjunction with the accompanying drawings.

In fields such as artificial intelligence or scientific computing, the design of high-performance processors is very important. In compute-intensive scenarios, improving processor performance requires addressing the memory wall problem: reducing the demand for external bandwidth through data reuse and improving the utilization of the computing units.

The traditional processor storage structure can be roughly divided into three levels: external memory, cache, and register file. In terms of bandwidth: external memory < cache < register file. The cache can be further divided into multiple layers, such as the L1/L2 cache of the central processing unit (Central Processing Unit, CPU). Typical data reuse is implemented through the cache. When data is stored in the cache, two types of reuse may occur: 1) different computing units access the same data; 2) a single computing unit accesses the same cached data multiple times. Such reuse can effectively reduce data access to external memory.

In order to improve scheduling efficiency, many processors will introduce hardware multi-threading technology. For example, the GPU will schedule the execution of multiple threads at the same time. Threads can form thread groups to collaborate to complete an overall computing task. In the process of collaborative computing, a large amount of data interaction is required between different threads. In traditional processor design, efficient inter-thread data interaction is generally achieved through on-chip storage units, such as the CPU's cache or the GPU's shared memory.

However, in compute-intensive scenarios, the bandwidth of the on-chip storage unit sometimes still cannot meet the computing needs. In that case, the register file can additionally be used for data reuse. For example, when performing a convolution operation, a feature map can first be stored in the register file so that the computing unit can use it multiple times. In conventional processor designs, the register file is generally thread-private. Therefore, only the same thread can reuse the data in the register file, and different threads cannot. For example, in Figure 1, thread 1 can only access register 1, register 2, and register 3 in the register file; thread 2 can only access register 4 and register 5; and thread 3 can only access register 6, register 7, and register 8. As a result, data in the same register cannot be reused by multiple threads.

Based on this, an embodiment of the present disclosure provides a processor. Referring to Figure 2, the processor includes:

A first register file 201, which includes at least one first register group 2011 and at least one second register group 2012, each first register group 2011 and each second register group 2012 comprising at least one register R, where each first register group 2011 is used for allocation to one thread among the plurality of threads, and each second register group 2012 is used for allocation to at least two threads among the plurality of threads; and

A processing unit 202, configured to schedule each thread in the plurality of threads and, in response to a data access request of a target thread in the plurality of threads, access a target register in a register group allocated to the target thread, wherein the register group includes the first register group or the second register group.

The processor in the embodiment of the present disclosure may be a CPU, a GPU, a neural network processor (Neural Network Processing Unit, NPU), or another type of multi-threaded processor. The embodiment of the present disclosure does not limit the type of processor. The processor can schedule multiple threads to process data in parallel, with each scheduled thread accessing its target register.

The first register file 201 in the embodiment of the present disclosure may include at least two register groups. The first register file 201 may be divided into at least one first register group 2011 and at least one second register group 2012, and the number of first register groups 2011 and the number of second register groups 2012 may or may not be equal. The first register group 2011 may include at least one register, and the second register group 2012 may also include at least one register. In addition, the number of registers included in the first register group 2011 and the number of registers included in the second register group 2012 may be equal or different, and the numbers of registers included in different first register groups 2011 may also be equal or different.

In some embodiments, the number of first register groups 2011 may be determined based on the number of threads. For example, the number of first register groups 2011 may be equal to the number of threads, such that each thread may be allocated one first register group 2011. Alternatively, the number of first register groups 2011 may be an integral multiple of the number of threads, so that each thread may be allocated one or more first register groups 2011. For simplicity, Figure 2 illustrates the case where the number of first register groups 2011 is 2, the number of second register groups 2012 is 1, the number of threads is 2, and each thread is allocated one first register group, where R represents a register. Those skilled in the art can understand that the above situation is only illustrative. In actual applications, the number of first register groups 2011, the number of second register groups 2012, the number of threads, and/or the number of first register groups allocated to each thread can also take other values, which will not be described again here. For ease of explanation, the solution of the embodiment of the present disclosure will be described below by taking the example in which each thread is allocated one first register group and the number of registers included in each first register group is equal.

The number of the first register group 2011, the number of the second register group 2012, the number of registers in the first register group 2011, and the number of registers in the second register group 2012 can be configured respectively through configuration information. In some embodiments, the configuration information only needs to specify the number of the first register group 2011 and the number of the second register group 2012, without specifying which register or registers specifically constitute the first register group 2011 and the second register group 2012. In this way, configuration flexibility is high. The configuration information may be generated by the controller, and the processor may automatically specify a corresponding number of registers to form the first register group 2011 and the second register group 2012 based on the configuration information. When at least one of the number of the first register group 2011, the number of the second register group 2012, the number of registers in the first register group 2011, and the number of registers in the second register group 2012 needs to be changed, you only need to change the corresponding configuration information.

For ease of explanation, the first register group 2011 and the second register group 2012 are collectively referred to as a register group below. The register group mentioned below may be either the first register group 2011 or the second register group 2012.

A first register group 2011 can only be allocated to one thread. After allocation, the first register group 2011 is not visible to other threads. In this way, data isolation between different threads can be achieved. Each second register group 2012 can be allocated to at least two threads, so that the at least two threads can achieve data multiplexing. For example, in FIG. 2 , one of the first register sets 2011 may be assigned to thread 0, the other first register set 2011 may be assigned to thread 1, and the second register set 2012 may be assigned to both thread 0 and thread 1.
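The allocation in FIG. 2 can be pictured with a small behavioral model. The sketch below is only illustrative: the class name RegisterFile, its methods, and the group sizes are assumptions made for this example and do not come from the disclosure, which describes hardware rather than software.

```python
class RegisterFile:
    """Toy model: one private (first) group per thread plus one shared (second) group."""

    def __init__(self, num_threads, private_size, shared_size):
        # One first register group per thread; only its owner may touch it.
        self.private = [[0] * private_size for _ in range(num_threads)]
        # One second register group allocated to every thread.
        self.shared = [0] * shared_size

    def write(self, thread_id, reg_id, value, shared=False):
        # A thread reaches either its own private group or the shared group.
        group = self.shared if shared else self.private[thread_id]
        group[reg_id] = value

    def read(self, thread_id, reg_id, shared=False):
        group = self.shared if shared else self.private[thread_id]
        return group[reg_id]


# Two threads, as in FIG. 2 (the sizes 3 and 2 are arbitrary).
rf = RegisterFile(num_threads=2, private_size=3, shared_size=2)
rf.write(0, 0, 11)               # thread 0 writes its private register 0
rf.write(0, 0, 99, shared=True)  # thread 0 writes shared register 0
```

In this model thread 1 sees the value 99 through the shared group, while its own private register 0 is unaffected by thread 0's private write.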

Each thread can access a target register in the register group assigned to that thread to read data from or write data to the target register. Among them, the number of target registers can be greater than or equal to 1. Each thread can be scheduled by the processing unit 202. When scheduling a target thread, the processing unit 202 can access the target register in the register group assigned to the target thread in response to the target thread's data access request.

The data access request can carry the logical address of the target register. The processing unit needs to map the logical address to the physical address of the target register, and then access the target register based on the physical address. Among them, the logical address of a register can be used to represent the identification information of the register in the register group assigned to a certain thread, and the physical address of a register can be used to represent the identification information of the register in the register file. The physical address and logical address of each register can be sequentially numbered using integers (for example, 0, 1, 2, 3,...). The logical addresses of registers in the register groups assigned to different threads can be the same, but the physical addresses of different registers must be different. For example, in the embodiment shown in FIG. 3, the registers with physical addresses 0, 1, and 2 are registers in the register group allocated to thread 0, and the logical addresses of the above registers are 0, 1, and 2 respectively. The registers with physical addresses 3, 4, and 5 are registers in the register group assigned to thread 1, and the logical addresses of the above registers are also 0, 1, and 2 respectively.

By mapping logical addresses to physical addresses, the processing unit 202 can determine a unique target register in order to access the correct register. For example, for thread 1, when the logical address of the target register it accesses is 0, the logical address needs to be mapped to the physical address of the register in the first register file 201 (that is, 3). In some embodiments, the logical addresses of the registers in the first register group and the registers in the second register group can be set independently. For example, in the embodiment shown in Figure 2, the logical addresses of the registers in the first register group allocated to thread 0 may be integers starting from 0 (for example, 0, 1, 2, 3, ...), and the logical addresses of the registers in the second register group allocated to thread 0 can also be integers starting from 0 (for example, 0, 1, 2, 3, ...). Since the location of the first register group in the register file is different from the location of the second register group in the register file, the processing unit 202 may determine the physical address of the target register based jointly on the type of the register group to which the target register belongs and the location of that register group in the register file. The type of a register group is used to characterize whether the register group is a first register group or a second register group.

Specifically, in the case where the target register is a register in the first register group, the physical address may be determined based on the total number of registers allocated to each previous thread and the logical address. Wherein, the previous thread includes each thread whose thread number (ie, thread ID) is smaller than the target thread. For example, if the thread IDs of each thread are integers such as 0, 1, 2, etc., and the thread ID of the target thread is 2, then the previous threads include the thread with thread ID 0 and the thread with thread ID 1. The number of registers allocated to each preceding thread can be summed to obtain the total number of registers allocated to each preceding thread. In the case where the number of registers allocated to each thread is equal, the total number of registers allocated to each preceding thread can be obtained based on the product of the number of preceding threads and the number of registers allocated to a single thread. Wherein, when k (k is a positive integer) first register groups are allocated to each thread, the number of registers allocated to a single thread is the total number of registers included in the k register groups.
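The offset contributed by the previous threads can be sketched as a prefix sum. The function name private_base and the list regs_per_thread are hypothetical names chosen for this illustration; the disclosure only states that the per-thread register counts are summed, or multiplied out when they are equal.

```python
def private_base(regs_per_thread, target_thread):
    """Total number of registers allocated to threads with a smaller thread ID."""
    return sum(regs_per_thread[:target_thread])


# Equal allocation: 3 registers for each of 4 threads (values are arbitrary).
regs_per_thread = [3, 3, 3, 3]
base_general = private_base(regs_per_thread, 2)  # sum over threads 0 and 1
base_equal = 2 * 3                               # number of previous threads * registers per thread
```

With equal allocation the sum and the product give the same offset, which is why the later formulas can use a single multiplication.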

As shown in Figure 4A, it is assumed that the total number of threads is N and k=1, that is, the total number of first register groups is also N. It is further assumed that the number of registers included in each first register group is M, and the number of registers included in the second register group is P. The second register group is shared by all N threads. The storage space composed of the second register group is also called a shared space. In the case where the target register is a register in the first register group, the physical address physical_addr of the target register can be recorded as:

physical_addr=reg_id+thread_id*M;

Among them, reg_id is the logical address of the target register, thread_id is the thread ID, and M is the number of registers in the first register group. Adders and multipliers can be used to implement the addition and multiplication operations in the above formula respectively to obtain the physical address.

In the case where the target register is a register in the second register group, the physical address may be determined based on the total number of registers allocated to each thread and the logical address. Still taking the situation shown in Figure 4A as an example, when the target register is a register in the second register group, the physical address physical_addr of the target register can be recorded as:

physical_addr=reg_id+N*M;

Among them, N is the total number of threads.
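The two formulas for the Figure 4A layout can be checked with a short software sketch. The function name map_case1, the shared flag, and the range checks are assumptions added for illustration; in the disclosure the mapping is performed by adders and multipliers in hardware.

```python
def map_case1(reg_id, thread_id, N, M, P, shared=False):
    """Logical-to-physical mapping when the shared space follows all private groups."""
    if shared:
        if not 0 <= reg_id < P:
            raise ValueError("shared logical address out of range")
        # The shared space starts after the N private groups of M registers each.
        return reg_id + N * M
    if not (0 <= reg_id < M and 0 <= thread_id < N):
        raise ValueError("private logical address or thread ID out of range")
    return reg_id + thread_id * M


# FIG. 3 style example: 2 threads with 3 private registers each (P=2 is arbitrary).
addr_private = map_case1(0, 1, N=2, M=3, P=2)              # thread 1, logical address 0
addr_shared = map_case1(0, 1, N=2, M=3, P=2, shared=True)  # shared logical address 0
```

Here thread 1's logical address 0 lands on physical address 3, matching the FIG. 3 example, and the first shared register lands on physical address 6.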

In the above embodiment, the physical addresses of the registers in the second register group are greater than those of the registers in the first register group; that is, the registers in the second register group are located later in the register file, and the registers in the first register group are located earlier (case 1). In practical applications, the physical addresses of the registers in the second register group may also be smaller than those of the registers in the first register group; that is, the registers in the second register group are located earlier in the register file, and the registers in the first register group are located later (case 2). Alternatively, several registers in the middle of the register file can be used as registers in the second register group, and the registers before and after them can be used as registers in the first register group (case 3).

The three distributions of the positional relationship between the first register group and the second register group in the register file are shown in Figure 5. Among them, the gray squares represent the registers in the second register group, the white squares represent the registers in the first register group, and the numbers in the squares represent the physical addresses of each register. Under different location relationships, the physical address is calculated in different ways. For example, it is still assumed that the total number of threads is N and k=1, that is, the total number of first register groups is also N, and it is assumed that the number of registers included in each first register group is M, and the second register group includes The number of registers is P, and the second register set is shared by all N threads. In the above case 2, when the target register is a register in the first register group, the physical address physical_addr of the target register can be recorded as:

physical_addr=P+reg_id+thread_id*M.

In the case where the target register is a register in the second register group, the physical address of the target register is equal to the logical address of the target register.
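Case 2 can be sketched in the same way. The function name map_case2 and the shared flag are hypothetical; the arithmetic follows the two statements above.

```python
def map_case2(reg_id, thread_id, M, P, shared=False):
    """Logical-to-physical mapping when the shared space precedes all private groups."""
    if shared:
        # Shared registers occupy physical addresses 0..P-1 unchanged.
        return reg_id
    # Private groups start after the P shared registers.
    return P + reg_id + thread_id * M


# Arbitrary sizes for illustration: 2 threads, M=3 private registers, P=2 shared.
shared_addrs = [map_case2(r, 0, M=3, P=2, shared=True) for r in range(2)]
thread0_addrs = [map_case2(r, 0, M=3, P=2) for r in range(3)]
thread1_addrs = [map_case2(r, 1, M=3, P=2) for r in range(3)]
```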

In the above case 3, when the target register is a register in the first register group, assuming that the number of first register groups located before the second register group in the register file is X, the physical address physical_addr of the target register can be recorded as:

physical_addr=reg_id+thread_id*M, 0≤thread_id<X;

physical_addr=P+reg_id+thread_id*M, thread_id≥X.

In the case where the target register is a register in the second register group, the physical address physical_addr of the target register can be recorded as:

physical_addr=reg_id+X*M.
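The case 3 formulas can be sanity-checked by confirming that every register receives a distinct physical address. In the sketch below, map_case3 and the chosen sizes are assumptions for illustration, and thread 0 is assumed to count among the X threads whose private groups precede the shared space.

```python
def map_case3(reg_id, thread_id, M, P, X, shared=False):
    """Mapping when the shared space sits between private groups X-1 and X."""
    if shared:
        # The shared space starts after the first X private groups.
        return reg_id + X * M
    if thread_id < X:
        # Private groups located before the shared space.
        return reg_id + thread_id * M
    # Private groups located after the shared space are shifted by P.
    return P + reg_id + thread_id * M


# Arbitrary sizes: 4 threads, 2 private registers each, 3 shared, 2 groups in front.
N, M, P, X = 4, 2, 3, 2
private = [map_case3(r, t, M, P, X) for t in range(N) for r in range(M)]
shared = [map_case3(r, 0, M, P, X, shared=True) for r in range(P)]
```

Together the private and shared addresses cover 0 to N*M+P-1 exactly once, so no two logical registers collide.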

In some embodiments, the first register file is divided into at least one storage unit (Bank), and each storage unit includes at least one first register group and at least one second register group. Different storage units are physically isolated and correspond to different threads. The first register group included in a storage unit is used for allocation to a thread corresponding to that storage unit, and the second register group included in a storage unit is used for allocation to at least two threads corresponding to that storage unit. This approach is called interleaving. For register files with multiple Banks, in order to ensure uniform access, the shared space is evenly distributed across the Banks through interleaving.

Taking Figure 4B as an example, there are K Banks, and P registers are reserved for shared space in each Bank, so the overall shared space capacity is K*P. The shared storage space on each storage unit can be allocated to any thread. The logical addresses of the shared space are 0 to K*P-1. The logical addresses of the registers in the shared space (i.e., the second register group) can be interleaved across the Banks. As shown in the figure, the logical addresses of the registers in the second register group on Bank_0 are K*(0~P-1), that is, the register numbers are 0, K, 2K,..., (P-1)*K; the logical addresses of the registers in the second register group on Bank_1 are 1+K*(0~P-1), that is, the register numbers are 1, K+1, 2K+1,..., (P-1)*K+1, and so on. For example, assuming K=3 and P=5, the register numbers in the second register group on Bank_0 are 0/3/6/9/12, the register numbers in the second register group on Bank_1 are 1/4/7/10/13, and the register numbers in the second register group on Bank_2 are 2/5/8/11/14. Of course, those skilled in the art can understand that the interleaving method is not limited to this.
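The interleaved numbering above can be reproduced with a few lines of code. The helper names shared_bank and bank_shared_addrs are hypothetical, and the modulo rule is inferred from the listed register numbers rather than quoted from the disclosure.

```python
def shared_bank(logical_addr, K):
    """Bank holding a shared logical address under the interleaving shown above."""
    return logical_addr % K


def bank_shared_addrs(bank, K, P):
    """Shared logical addresses stored on one Bank: bank, bank+K, ..., bank+(P-1)*K."""
    return [bank + K * i for i in range(P)]


# The K=3, P=5 example from the text.
K, P = 3, 5
layout = {bank: bank_shared_addrs(bank, K, P) for bank in range(K)}
```

Consecutive shared logical addresses fall on consecutive Banks, which is what spreads accesses to the shared space evenly.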

Each Bank also reserves storage space dedicated to each of its threads (i.e., the first register groups). For example, Bank_0 reserves dedicated storage space for thread 0, thread K, ..., thread N-K, and Bank_1 reserves dedicated storage space for thread 1, thread K+1, ..., thread N-K+1. For ease of explanation, it is assumed here that the size of the dedicated storage space reserved for each thread in a Bank is M, that is, the logical addresses of the registers dedicated to each thread are 0 to M-1. The registers with logical addresses 0 to M-1 reserved for a thread form a first register group, that is, the first register group allocated to a thread includes M registers.

Here, different storage units correspond to different threads. Taking the number of threads equal to 9 and the number of storage units equal to 3 as an example, storage unit Bank_0 corresponds to thread 0, thread 3 and thread 6, so the first register groups included in Bank_0 can be allocated to thread 0, thread 3 and thread 6 respectively. Storage unit Bank_1 corresponds to thread 1, thread 4 and thread 7, so the first register groups included in Bank_1 can be allocated to thread 1, thread 4 and thread 7 respectively. Storage unit Bank_2 corresponds to thread 2, thread 5 and thread 8, so the first register groups included in Bank_2 can be allocated to thread 2, thread 5 and thread 8 respectively. Of course, in addition to the interleaving methods described in the above embodiments, other interleaving methods can also be used, which are not enumerated here.

In the above case of multiple Banks, if the target register is a register in the first register group, the storage unit where the target register is located can be determined based on the number of the target thread and the number of storage units, and the physical address can be determined based on the total number of register groups allocated to each preceding thread, the number of storage units, and the logical address; the preceding threads include each thread with a thread number smaller than that of the target thread. For example, the number physical_bank of the storage unit where the target register is located can be recorded as:

physical_bank=thread_id%K.

The physical address physical_addr of the target register on the corresponding storage unit can be recorded as:

physical_addr=reg_id+thread_id/K*M;

Here, % denotes the remainder operator, and the physical meanings of the remaining symbols are the same as in the previous embodiment. Still taking the number of threads N equal to 9 and the number of storage units K equal to 3 as an example, and assuming M=3, the thread numbers are 0~8, and the logical addresses of the registers in the first register group allocated to each thread are 0~2. When the target register is the register with logical address 1 in the first register group allocated to thread 0, the number of the storage unit where the target register is located is 0%3=0, that is, Bank_0, and the physical address of the target register on Bank_0 can be recorded as 1+0/3*3=1. When the target register is the register with logical address 2 in the first register group allocated to thread 1, the number of the storage unit where the target register is located is 1%3=1, that is, Bank_1, and, with the quotient 1/3 rounded down to 0, the physical address of the target register on Bank_1 can be recorded as 2+0*3=2.
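The mapping for registers in the first register group can be sketched as below. This is an illustrative sketch, assuming that the quotient thread_id/K is an integer (rounded-down) division; the function name `map_private_register` and its parameter names are hypothetical.

```python
def map_private_register(thread_id: int, reg_id: int,
                         num_banks: int, regs_per_thread: int) -> tuple[int, int]:
    """Map (thread, logical register) to (Bank number, physical address on that Bank)."""
    if not 0 <= reg_id < regs_per_thread:
        raise ValueError("logical address outside the thread's first register group")
    physical_bank = thread_id % num_banks  # physical_bank = thread_id % K
    # physical_addr = reg_id + thread_id / K * M, with the quotient rounded down (assumption).
    physical_addr = reg_id + (thread_id // num_banks) * regs_per_thread
    return physical_bank, physical_addr


K, M = 3, 3  # values taken from the example in the text
print(map_private_register(0, 1, K, M))
print(map_private_register(4, 2, K, M))
```

Under these assumptions, logical register 1 of thread 0 maps to address 1 on Bank_0, and logical register 2 of thread 4 (the second thread on Bank_1) maps to address 5 on Bank_1.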

If the register accessed by the data access request is a register in the second register group, the storage unit where the target register is located can be determined based on the logical address and the number of storage units, and the physical address can be determined based on the total number of register groups allocated to each thread, the number of storage units, and the logical address. For example, the number physical_bank of the storage unit where the target register is located can be recorded as:

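physical_bank=reg_id%K.

The physical address physical_addr of the target register on the corresponding storage unit can be recorded as:

physical_addr=reg_id/K+N/K*M.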

Still taking the number of threads N equal to 9, the number of storage units K equal to 3, and M=3 as an example, and assuming P=5: when the target register is the register numbered 6 in the second register group, the number of the storage unit where the target register is located is 6%3=0, that is, the target register is on Bank_0, and the physical address of the target register is 6/3+9/3*3=11.
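The mapping for registers in the second register group can be sketched in the same way. This is a minimal sketch, assuming N is an integer multiple of K and that reg_id/K is rounded down; the function name `map_shared_register` and its parameter names are hypothetical.

```python
def map_shared_register(reg_id: int, num_threads: int,
                        num_banks: int, regs_per_thread: int) -> tuple[int, int]:
    """Map a shared-space logical address to (Bank number, physical address on that Bank)."""
    if num_threads % num_banks != 0:
        raise ValueError("this sketch assumes N is an integer multiple of K")
    physical_bank = reg_id % num_banks  # physical_bank = reg_id % K
    # The shared registers sit after the N/K private groups of M registers on each Bank.
    shared_base = (num_threads // num_banks) * regs_per_thread
    physical_addr = reg_id // num_banks + shared_base  # reg_id / K + N / K * M
    return physical_bank, physical_addr


N, K, M = 9, 3, 3  # values taken from the example in the text
print(map_shared_register(6, N, K, M))  # prints (0, 11)
```

The result for logical address 6 is address 11 on Bank_0, which agrees with the worked example.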

In some embodiments, K can be set to a power of 2, and N can be set to an integer multiple of K. If a quotient in the above formulas (such as thread_id/K or reg_id/K) is not an integer, it can be rounded down so that the resulting physical address is an integer.

After the target register is determined in the above manner, the target register can be accessed, for example, data can be read from it. In some embodiments, the data read from the target register can be used as the access address of another register, so as to access the data in that other register. For example, the target register may be a register in the first register group. The processing unit 202 may first obtain the physical address of the target register (i.e., the index register number), read data from the target register based on the index register number, use the data as index information (i.e., address information, shown as the index register value in the figure), and access a register in the second register group based on the index information. For example, assuming that the logical address of the target register in the first register group is A1, the physical address of the target register, denoted A2, can be calculated by the above method. The register with physical address A2 is accessed to obtain data A3, A3 is used as the logical address of a register in the second register group, the physical address of that register, denoted A4, is calculated based on A3, and the register with address A4 can then be accessed. The above process is shown in Figure 6. The instruction path can read the instructions sent by multiple threads (i.e., the aforementioned data access requests) based on multi-thread context information (context), decode the data access requests, and send them to the execution path. The data access request may be used to access a target register in the first register group; after the data in the target register is obtained, the data is used as index information to access a register in the second register group. This method is called indirect addressing.
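The A1 to A4 sequence of indirect addressing can be sketched with a toy register file. This is an illustration under stated assumptions, not the disclosed hardware: a single Bank is modeled as a flat list, the private group of thread t is assumed to start at t*M, the shared group is assumed to start after all N private groups, and all names and sizes are hypothetical.

```python
N, M, P = 4, 3, 5  # hypothetical sizes: threads, private registers per thread, shared registers

register_file = [0] * (N * M + P)  # flat single-Bank model (assumption)


def private_addr(thread_id: int, reg_id: int) -> int:
    """Physical address of a register in a thread's first register group (assumed layout)."""
    return reg_id + thread_id * M


def shared_addr(reg_id: int) -> int:
    """Physical address of a register in the second register group (assumed layout)."""
    return reg_id + N * M


def indirect_read(thread_id: int, index_reg: int) -> int:
    a2 = private_addr(thread_id, index_reg)  # A1 -> A2: map the index register
    a3 = register_file[a2]                   # read index information A3
    if not 0 <= a3 < P:
        raise IndexError("index information is outside the shared space")
    a4 = shared_addr(a3)                     # A3 -> A4: map the shared register
    return register_file[a4]                 # access the register at A4


register_file[shared_addr(2)] = 77           # shared register 2 holds the payload
register_file[private_addr(1, 0)] = 2        # thread 1, register 0 holds the index
value = indirect_read(1, 0)
print(value)  # prints 77
```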

The above solution can be extended to processors containing parallel computing units, such as CPUs or GPUs containing Single Instruction Multiple Data (SIMD) units. Referring to Figure 7, the processing unit 202 can first obtain the physical address of the register that needs to be accessed in the second register file (i.e., the index register number), obtain the index information read from the second register file based on the index register number (i.e., the index register value in the figure), and access the target register based on the index information read from the second register file. The target register here can be either a register in the first register group or a register in the second register group. The main idea of the embodiments of the present disclosure is that the processor may contain two independent register files, wherein the second register file may be a vector register file used to store SIMD data or Single Instruction Multiple Threads (SIMT) data for parallel computation, and the first register file can be a scalar register file used to store simple scalar data or control information. The vector register file can be accessed using the value read from the scalar register file as an index, and the data obtained from the vector register file is then sent to the execution path. Since core operations occur in the vector register file, shared space support is mainly added to the vector register file.

In some embodiments, in addition to indirect addressing, the data read from a register can also be used directly for data operations. This method is called direct addressing. In the case of direct addressing, the data read from the registers included in the second register file or the registers included in the first register group is no longer used as index information to access other registers, but is used directly to perform data operations (such as multiplication operations, addition operations, etc.). To distinguish between direct addressing and indirect addressing, the data access request may include an indication bit, and the indication bit is used to indicate whether the data read from the target register will be used as the index information. Specifically, the indication bit may have two indication states. When the indication bit is in the first indication state, it is determined that the data read from the target register is used as the index information; when the indication bit is in the second indication state, it is determined that the data read from the target register is not used as the index information. In some embodiments, the first indication state and the second indication state may be represented by at least one data bit. For example, binary data "0" can be used to represent the first indication state, and binary data "1" can be used to represent the second indication state. Of course, the way of expressing the indication states is not limited to this. Those skilled in the art can use other ways to express the different indication states according to the actual situation, which will not be listed here one by one.
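How the indication bit selects between the two addressing modes can be sketched as follows. This is a hedged illustration that follows the example encoding above ("0" for the first indication state, "1" for the second); the `AccessRequest` type, the `resolve_operand` function, and the flat register list are hypothetical, and address mapping is omitted for brevity.

```python
from dataclasses import dataclass

INDIRECT = 0  # first indication state: data read is used as index information
DIRECT = 1    # second indication state: data read is used directly as an operand


@dataclass(frozen=True)
class AccessRequest:
    reg_addr: int        # physical address of the target register (mapping omitted)
    indication_bit: int  # INDIRECT or DIRECT


def resolve_operand(request: AccessRequest, registers: list[int], shared_base: int) -> int:
    """Return the operand that would be handed to the execution path."""
    data = registers[request.reg_addr]
    if request.indication_bit == INDIRECT:
        # Treat the data as the logical address of a shared register (assumed layout).
        return registers[shared_base + data]
    if request.indication_bit == DIRECT:
        return data
    raise ValueError("unknown indication state")


registers = [2, 0, 0, 0, 10, 20, 30]  # toy contents; shared space assumed to start at address 4
print(resolve_operand(AccessRequest(0, INDIRECT), registers, shared_base=4))  # prints 30
print(resolve_operand(AccessRequest(0, DIRECT), registers, shared_base=4))    # prints 2
```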

In some embodiments, the processor further includes an instruction path for sending a data access request to the target register, and an execution path for obtaining the data transmitted by the target register in response to the data access request and performing operations on the obtained data. The solution of the embodiments of the present disclosure can be applied to both the direct addressing scenario and the indirect addressing scenario described above.

Specifically, the instruction path may include: an instruction reading unit, used to read the data access request sent by the target thread; an instruction decoding unit, used to decode the data access request read by the instruction reading unit; and an instruction issuing unit, used to send the decoded data access request to the target register. The target register can output the stored data to the execution path for operation processing, or can return the stored data to the instruction issuing unit as index information, so that the instruction issuing unit can, based on the index information, cause the data stored in the corresponding register to be sent to the execution path for operation processing.

In some embodiments, the execution path includes: an operation unit, used to perform operation processing on the acquired data; and a memory access unit, used to output the operation results to the memory, and/or output the data stored in the memory to the operation unit for operation processing. The operation unit may include one or more sub-operation units, such as an addition unit, a multiplication unit, a convolution unit, etc. The number and type of sub-operation units included in the operation unit may be set based on actual requirements. The memory access unit is used to implement data transmission between the operation unit and the memory. When the register file does not contain the data required for an operation, the memory access unit can be used to access the memory to obtain the corresponding data. Furthermore, the data obtained from the scalar register can also be output to the scalar execution unit for processing as data to be operated on.

Referring to FIG. 8, an embodiment of the present disclosure also provides a chip. The chip includes a processor 801, and the processor 801 can be the processor described in any of the above embodiments. In some embodiments, the chip can be applied in an AI accelerator card. In some embodiments, the chip further includes a controller 802 for configuring at least one of the following information: first quantity information of the registers included in the first register group, second quantity information of the registers included in the second register group, the number of first register groups, and the number of second register groups.
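The four configurable quantities can be pictured as a small configuration record. The sketch below is purely illustrative and assumes that all first register groups share one size and all second register groups share one size; the `RegisterFileConfig` type, its field names, and the capacity check are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFileConfig:
    regs_per_first_group: int   # first quantity information
    regs_per_second_group: int  # second quantity information
    num_first_groups: int       # number of first register groups
    num_second_groups: int      # number of second register groups

    def total_registers(self) -> int:
        """Registers consumed by all private and shared groups."""
        return (self.regs_per_first_group * self.num_first_groups
                + self.regs_per_second_group * self.num_second_groups)

    def fits(self, capacity: int) -> bool:
        """Check the configuration against a (hypothetical) register file capacity."""
        return 0 < self.total_registers() <= capacity


# Nine threads with 3 private registers each, plus 3 shared groups of 5 registers.
config = RegisterFileConfig(3, 5, 9, 3)
print(config.total_registers())  # prints 42
print(config.fits(48))           # prints True
```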

For details of these embodiments, please refer to the foregoing embodiments of the processor, which will not be described again here.

An embodiment of the present disclosure also provides an electronic device, including the chip described in any of the above embodiments.

Referring to Figure 9, an embodiment of the present disclosure also provides a data processing method, which is applied to the processing unit in the processor according to any embodiment of the present disclosure. The method includes:

Step 901: Schedule each thread in the plurality of threads;

Step 902: In response to a data access request of a target thread among the plurality of threads, access the target register in the register group allocated to the target thread, wherein the register group includes the first register group or the second register group.

In some embodiments, the data access request of the target thread carries the logical address of the target register; accessing, in response to the data access request of the target thread among the plurality of threads, the target register in the register group allocated to the target thread includes: mapping the logical address to the physical address of the target register; and accessing the target register based on the physical address.

In some embodiments, mapping the logical address to the physical address of the target register includes: when the target register is a register in the first register group, determining the physical address based on the total number of registers allocated to each preceding thread and the logical address, where the preceding threads include each thread with a thread number smaller than that of the target thread; and/or, when the target register is a register in the second register group, determining the physical address based on the total number of registers allocated to each thread and the logical address.

In some embodiments, the first register file is divided into at least one storage unit, and each storage unit includes at least one first register group and at least one second register group; different storage units are physically isolated from one another and correspond to different threads, the first register group included in a storage unit is allocated to a thread corresponding to that storage unit, and the second register group included in a storage unit is allocated to at least two threads corresponding to that storage unit. Mapping the logical address to the physical address of the target register includes: when the target register is a register in the first register group, determining the storage unit where the target register is located based on the thread number of the target thread and the number of storage units, and determining the physical address based on the total number of register groups allocated to each preceding thread, the number of storage units, and the logical address, where the preceding threads include each thread with a thread number smaller than that of the target thread; and/or, when the register accessed by the data access request is a register in the second register group, determining the storage unit where the target register is located based on the logical address and the number of storage units, and determining the physical address based on the total number of register groups allocated to each thread, the number of storage units, and the logical address.

In some embodiments, the target register is a register in the first register group; accessing, in response to a data access request of a target thread among the plurality of threads, the target register in the register group allocated to the target thread includes: using the data read from the target register as index information, and accessing a register in the second register group based on the index information.

In some embodiments, accessing, in response to a data access request of a target thread among the plurality of threads, the target register in the register group allocated to the target thread includes: obtaining index information read from a second register file; and accessing the target register based on the index information read from the second register file.

In some embodiments, the data access request includes an indication bit, and the indication bit is used to indicate whether to use the data read from the second register file as index information for accessing the target register.

In some embodiments, the method further includes: sending a data access request to the target register through an instruction path; and obtaining, through an execution path, the data transmitted by the target register in response to the data access request, and performing operation processing on the obtained data.

In some embodiments, sending a data access request to the target register through an instruction path includes: reading the data access request sent by the target thread through an instruction reading unit in the instruction path; decoding the data access request read by the instruction reading unit through an instruction decoding unit in the instruction path; and sending the decoded data access request to the target register through an instruction issuing unit in the instruction path; and/or, obtaining, through the execution path, the data transmitted by the target register in response to the data access request and performing operation processing on the obtained data includes: performing operation processing on the obtained data through an operation unit in the execution path; and outputting the operation results to the memory through a memory access unit in the execution path, and/or outputting the data stored in the memory to the operation unit for operation processing.

In some embodiments, each first register group includes the same number of registers.

An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the method described in any of the foregoing embodiments is implemented.

Computer-readable media include persistent and non-persistent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of this specification.

The systems, devices, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Each embodiment in this specification is described in a progressive manner. The same and similar parts of the various embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, please refer to the corresponding description of the method embodiment. The device embodiments described above are only illustrative. The modules described as separate components may or may not be physically separated. When implementing the embodiments of this specification, the functions of the modules may be implemented in the same one or in multiple pieces of software and/or hardware. Some or all of the modules can also be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the embodiments without creative effort.

The above are only specific implementations of the embodiments of this specification. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the embodiments of this specification, and such improvements and modifications should also be regarded as falling within the protection scope of the embodiments of this specification.

Claims

A processor, characterized in that the processor includes:

a first register file, the first register file including at least one first register group and at least one second register group, each first register group and each second register group including at least one register, each first register group being used for allocation to one of a plurality of threads, and each second register group being used for allocation to at least two of the plurality of threads; and

a processing unit configured to schedule each of the plurality of threads and, in response to a data access request from a target thread of the plurality of threads, access a target register in a register group allocated to the target thread, wherein the register group includes the first register group or the second register group.
The processor according to claim 1, wherein the data access request of the target thread carries the logical address of the target register; the processing unit is configured to:

map the logical address to the physical address of the target register; and

access the target register based on the physical address.
The processor of claim 2, wherein when mapping the logical address to the physical address of the target register, the processing unit is configured to:

in the case where the target register is a register in the first register group, determine the physical address based on the total number of registers allocated to each preceding thread and the logical address, the preceding threads including each thread with a thread number smaller than that of the target thread; or

in the case where the target register is a register in the second register group, determine the physical address based on the total number of registers allocated to each thread and the logical address.
The processor according to claim 2 or 3, characterized in that the first register file is divided into at least one storage unit, each storage unit includes at least one first register group and at least one second register group; different storage units are physically isolated from each other, and different storage units correspond to different threads; the first register group included in a storage unit is used for allocation to a thread corresponding to the storage unit, and the second register group included in a storage unit is used for allocation to at least two threads corresponding to the storage unit; the processing unit is used to:

in the case where the target register is a register in the first register group, determine the storage unit where the target register is located based on the thread number of the target thread and the number of storage units, and determine the physical address based on the total number of register groups allocated to each preceding thread, the number of storage units and the logical address, the preceding threads including each thread with a thread number smaller than that of the target thread; or

in the case where the register accessed by the data access request is a register in the second register group, determine the storage unit where the target register is located based on the logical address and the number of storage units, and determine the physical address based on the total number of register groups allocated to each thread, the number of storage units, and the logical address.
The processor according to claim 1, wherein the target register is a register in the first register group; the processing unit is configured to:

use the data read from the target register as index information, and

access registers in the second register group based on the index information.
The processor according to claim 1, characterized in that the data access request includes an indication bit, the indication bit being used to indicate whether to use the data read from the second register file as index information for accessing the target register; in the case where the indication bit indicates that the data read from the second register file is to be used as index information for accessing the target register, the processing unit is configured to:

obtain index information read from the second register file; and

access the target register based on the index information read from the second register file.
The processor according to any one of claims 1 to 6, characterized in that the processor further includes:

an instruction path for sending a data access request to the target register; and

an execution path for obtaining the data transmitted by the target register in response to the data access request, and performing operation processing on the obtained data.
The processor of claim 7, wherein the instruction path includes:

an instruction reading unit, used to read the data access request sent by the target thread;

an instruction decoding unit, used to decode the data access request read by the instruction reading unit; and

an instruction issuing unit, used to send the decoded data access request to the target register.
The processor according to claim 7 or 8, characterized in that the execution path includes:

an operation unit, used to perform operation processing on the acquired data; and

a memory access unit, used to output operation results to the memory, and/or output data stored in the memory to the operation unit for operation processing.
The processor according to any one of claims 1 to 9, wherein each of the first register groups includes the same number of registers.
A chip, characterized in that the chip includes the processor according to any one of claims 1 to 10.
The chip according to claim 11, characterized in that the chip further includes a controller, the controller is used to configure at least one of the following information:

first quantity information of the registers included in the first register group,

second quantity information of the registers included in the second register group,

the number of first register groups,

the number of second register groups.
An electronic device, characterized in that the electronic device includes the chip according to claim 11 or 12.
A data processing method, characterized in that it is applied to the processing unit in the processor according to any one of claims 1 to 10, and the method includes:

Scheduling each of multiple threads;

in response to a data access request of a target thread in the plurality of threads, accessing a target register in a register group allocated to the target thread, wherein the register group includes the first register group or the second register group.
A computer-readable storage medium on which a computer program is stored, characterized in that when the program is executed by a processor, the method of claim 14 is implemented.