WO2022068673A1 - 指令处理设备以及指令处理方法 - Google Patents

指令处理设备以及指令处理方法

Info

Publication number
WO2022068673A1
WO2022068673A1, PCT/CN2021/119965, CN2021119965W
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
address
value
unit
physical address
Prior art date
Application number
PCT/CN2021/119965
Other languages
English (en)
French (fr)
Inventor
王伟
周建
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2022068673A1 publication Critical patent/WO2022068673A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32 Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Definitions

  • The present disclosure relates to the technical field of virtual-to-physical address translation, and in particular to an instruction processing device and an instruction processing method.
  • A Graphics Processing Unit (GPU) is a common device for executing cloud-side artificial intelligence (AI) inference and training tasks. Due to requirements such as resource virtualization, a current GPU needs to use a virtual-to-physical address translation function when executing data processing tasks, converting the virtual addresses in instructions into actual physical addresses.
  • Embodiments of the present disclosure provide at least an instruction processing device and an instruction processing method.
  • In a first aspect, an embodiment of the present disclosure provides an instruction processing device, including: an instruction processor, an address translation unit, and a plurality of execution units. The instruction processor is connected to the address translation unit, and the address translation unit is connected to the plurality of execution units. The instruction processor is configured to acquire a first instruction and transmit the first instruction to the address translation unit. The address translation unit is configured to receive the first instruction transmitted by the instruction processor, convert the virtual address carried in the first instruction into a physical address to obtain a second instruction, and transmit the second instruction to a target execution unit among the plurality of execution units, wherein the target execution unit is at least one execution unit among the plurality of execution units. The target execution unit is configured to execute the second instruction to obtain an instruction execution result.
  • In this way, before an instruction is distributed to a specific execution unit, the address translation unit converts the virtual address carried in the instruction into a physical address. By the time the instruction reaches the execution unit, its virtual address has already been translated into a physical address, so the execution unit does not need to perform the address translation process itself. Consequently, there is no need to provide the circuits corresponding to the TLB and PTW in each execution unit, which reduces the chip area occupied by the execution units and lowers the hardware cost of the GPU. Further, since multiple execution units no longer query the page table concurrently, the bandwidth pressure on global memory can be reduced.
  • In addition, because a unified address translation unit translates the instructions distributed to the execution units, instead of each of the multiple execution units having its own TLB and PTW circuits, the address translation unit can be provided with a cache unit larger than the per-execution-unit TLB would be, improving the efficiency of address translation.
  • In a possible implementation, the device further includes an instruction distribution unit. The address translation unit is connected to the plurality of execution units through the instruction distribution unit and is further configured to transmit the second instruction to the instruction distribution unit. The instruction distribution unit is configured to, after receiving the second instruction transmitted by the address translation unit, determine the target execution unit among the plurality of execution units for the second instruction and transmit the second instruction to the target execution unit.
  • In this way, the address translation unit converts the first instruction into the second instruction carrying the physical address, and the instruction distribution unit distributes the second instruction to the specific target execution unit.
  • In a possible implementation, when executing the second instruction, the target execution unit is configured to parse the value of the physical address carried in the second instruction and access the memory corresponding to that value.
  • In a possible implementation, the address translation unit includes an instruction parsing subunit, an address conversion subunit, and an instruction conversion subunit. The instruction parsing subunit is connected to the instruction processor and to the address conversion subunit, the address conversion subunit is further connected to the instruction conversion subunit, and the instruction conversion subunit is further connected to the plurality of execution units. The instruction parsing subunit is configured to parse the value of the virtual address from the first instruction; the address conversion subunit is configured to determine the value of the physical address corresponding to the value of the virtual address obtained by the instruction parsing subunit; and the instruction conversion subunit is configured to generate the second instruction based on the value of the physical address determined by the address conversion subunit and the first instruction.
  • In a possible implementation, the address translation unit includes a cache unit configured to store the mapping relationships between values of virtual addresses and values of physical addresses. The address translation unit is connected to the memory and is further configured to query the mapping relationships in the cache unit according to the value of the virtual address. If the cache unit contains a mapping relationship corresponding to the value of the virtual address, the value of the corresponding physical address is obtained from the cache unit; if no such mapping relationship exists in the cache unit, the value of the physical address corresponding to the value of the virtual address is obtained from the memory through a page table query.
  • the address translation unit is further configured to store the mapping relationship between the value of the physical address and the value of the virtual address obtained from the memory into the cache unit.
  • In a possible implementation, the address translation unit is configured to change a target flag bit in the first instruction from a first value to a second value and replace the value of the virtual address with the value of the physical address, thereby obtaining the second instruction; the first value indicates that the address carried in the instruction is a virtual address, and the second value indicates that the address carried in the instruction is a physical address.
  • In a second aspect, an embodiment of the present disclosure further provides an instruction processing method, including: an instruction processor acquires a first instruction and transmits the first instruction to an address translation unit; the address translation unit receives the first instruction transmitted by the instruction processor, converts the virtual address carried in the first instruction into a physical address to obtain a second instruction, and transmits the second instruction to a target execution unit among a plurality of execution units, wherein the target execution unit is at least one execution unit among the plurality of execution units; and the target execution unit executes the second instruction to obtain an instruction execution result.
  • In a possible implementation, the method further includes: the address translation unit transmits the second instruction to an instruction distribution unit; after receiving the second instruction transmitted by the address translation unit, the instruction distribution unit determines the target execution unit among the plurality of execution units for the second instruction and transmits the second instruction to the target execution unit.
  • In a possible implementation, executing the second instruction by the target execution unit includes: the target execution unit parses the value of the physical address carried in the second instruction and accesses the memory corresponding to that value.
  • In a possible implementation, the address translation unit receiving the first instruction transmitted by the instruction processor and converting the virtual address carried in the first instruction into a physical address to obtain the second instruction includes: the address translation unit parses the value of the virtual address from the first instruction, determines the value of the physical address corresponding to the value of the virtual address, and generates the second instruction based on the value of the physical address and the first instruction.
  • In a possible implementation, the method further includes: the address translation unit queries, according to the value of the virtual address, the mapping relationships between values of virtual addresses and values of physical addresses stored in the cache unit; if the cache unit contains a mapping relationship corresponding to the value of the virtual address, the value of the corresponding physical address is obtained from the cache unit; if no such mapping relationship exists in the cache unit, the value of the physical address corresponding to the value of the virtual address is obtained from the memory through a page table query.
  • the method further includes: the address translation unit stores the mapping relationship between the value of the physical address and the value of the virtual address obtained from the memory into the cache unit.
  • In a possible implementation, the address translation unit converting the virtual address carried in the first instruction into a physical address to obtain the second instruction includes: changing a target flag bit in the first instruction from a first value to a second value, and replacing the value of the virtual address with the value of the physical address to obtain the second instruction; wherein the first value indicates that the address carried in the instruction is a virtual address, and the second value indicates that the address carried in the instruction is a physical address.
  • FIG. 1 shows a schematic diagram of an instruction processing device provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of an instruction structure provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of another instruction processing device provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of an example of an instruction processing device architecture provided by an embodiment of the present disclosure
  • FIG. 5 shows a flowchart of an instruction processing method provided by an embodiment of the present disclosure.
  • A GPU usually includes an instruction processor (Command Processor, CP) and a plurality of execution units (Compute Unit, CU) connected to the instruction processor. The instruction processor can be used to obtain an instruction stream from the host central processing unit (host CPU, hereinafter referred to as the host), determine a target execution unit for each instruction in the instruction stream, and then send each instruction to the corresponding target execution unit. After receiving an instruction, a CU can convert the virtual address carried in the instruction into a physical address and then perform the operation corresponding to the instruction, such as reading or writing data at that physical address. When converting the virtual address carried in an instruction into a physical address, a CU needs to determine the corresponding physical address by querying a translation lookaside buffer (Translation Lookaside Buffer, TLB) and/or the logic circuits of a page table walk (Page Table Walk, PTW).
  • In this process, because the number of execution units in a GPU is large and each execution unit must be provided with the memories and logic circuits for the TLB and PTW, the chip area required by each execution unit is large, which makes the overall chip area required by the GPU excessive. At the same time, when the GPU executes a data processing task it divides the task into multiple subtasks assigned to different execution units, and when these execution units execute their subtasks multiple concurrent page table queries occur, which may cause bandwidth problems for the global memory.
  • the present disclosure provides an instruction processing device that uses an address translation unit to convert a virtual address carried in an instruction into a physical address before the instruction is dispatched to a specific execution unit.
  • After an instruction has been dispatched to a specific execution unit, the virtual address it carries has already been translated into a physical address by the address translation unit, so the execution unit does not need to perform the address translation process. There is therefore no need to provide the circuits corresponding to the TLB and PTW in each execution unit, which reduces the chip area occupied by the execution units and lowers the hardware cost of the GPU. Further, since multiple execution units no longer query the page table concurrently, the bandwidth pressure on global memory can be reduced.
  • the instruction processing device may be used for a central processing unit (Central Processing Unit, CPU), a GPU, or other instruction processing devices including an instruction processor and an execution unit.
  • The instruction processing device provided by the embodiments of the present disclosure is described below by taking its application to a GPU as an example, but it can also be applied to other types of instruction processing devices.
  • Referring to FIG. 1, a schematic structural diagram of an instruction processing device provided by an embodiment of the present disclosure includes: an instruction processor 10, an address translation unit 20, and a plurality of execution units 30.
  • The instruction processor 10 is connected to the address translation unit 20, and the address translation unit 20 is connected to the plurality of execution units 30.
  • The instruction processor 10 is configured to acquire the first instruction and transmit it to the address translation unit 20. The address translation unit 20 is configured to receive the first instruction transmitted by the instruction processor 10, convert the virtual address carried in the first instruction into a physical address to obtain a second instruction, and transmit the second instruction to a target execution unit among the plurality of execution units 30, wherein the target execution unit is at least one execution unit among the plurality of execution units 30. The target execution unit is configured to execute the second instruction to obtain an instruction execution result.
  • The instruction processor 10, the address translation unit 20, and the execution units 30 are each described in detail below.
  • In the case where the instruction processing device provided by the embodiments of the present disclosure is applied in a GPU, the host converts the source program of the software into machine instructions and stores the machine instructions in the instruction memory of the GPU. When acquiring the first instruction, the instruction processor 10 may read the instruction directly from this instruction memory. After reading instructions from the instruction memory, the instruction processor 10 resolves the dependencies between instructions and then transmits the first instruction to the address translation unit 20.
  • In the case where the instruction processing device is applied in a CPU, a compiler deployed in the computer converts the software program into machine instructions and stores the machine instructions in the instruction memory of the CPU. When acquiring the first instruction, the instruction processor 10 may read the instruction directly from this instruction memory.
  • After receiving the first instruction transmitted by the instruction processor 10, the address translation unit 20 is responsible for converting the virtual address in the first instruction into a physical address to generate the second instruction.
  • As shown in part (a) of FIG. 2, a schematic structural diagram of an instruction provided by an embodiment of the present disclosure includes an instruction header, a first flag bit, a first address bit, a second flag bit, a second address bit, and an instruction tail.
  • the instruction structure is only an example, and the present disclosure does not limit other information included in the instruction, nor does the present disclosure limit the specific positions of flag bits and address bits in the instruction structure.
  • the present disclosure also does not limit the number of flag bits and address bits included in an instruction.
  • the first flag bit is used to indicate whether the address stored in the first address bit is a virtual address or a physical address; the second flag bit is used to indicate whether the address stored in the second address bit is a virtual address or a physical address.
  • Exemplarily, when a flag bit is a first value, it indicates that the address stored in the corresponding address bit is a virtual address; when the flag bit is a second value, it indicates that the address stored in the corresponding address bit is a physical address.
  • the first value may be 0, and the second value may be 1.
  • As shown in part (b) of FIG. 2, which is a schematic diagram of the first instruction, the first instruction includes two virtual addresses, and the corresponding first flag bit and second flag bit are both 0, indicating that the addresses included in the first instruction are virtual addresses; the addresses stored in the first address bit and the second address bit are virtual address 0 and virtual address 1, respectively.
  • As shown in part (c) of FIG. 2, which is a schematic diagram of the second instruction, the first flag bit and the second flag bit are both 1; the address stored in the first address bit is physical address 0 corresponding to virtual address 0, and the address stored in the second address bit is physical address 1 corresponding to virtual address 1.
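  • For illustration only, the following C++ sketch models the instruction layout described above as a plain structure with two flag bits and two address fields; the field widths and the example opcode and address values are assumptions and are not specified by the present disclosure.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Illustrative sketch of the instruction layout in FIG. 2 (field widths are assumed).
struct Instruction {
    uint32_t header;                 // instruction header
    std::array<uint8_t, 2> flags;    // 0 = the address bit holds a virtual address, 1 = physical
    std::array<uint64_t, 2> addrs;   // first and second address bits
    uint32_t tail;                   // instruction tail
};

int main() {
    // First instruction (FIG. 2, part b): both flags are 0, both addresses are virtual.
    Instruction first{0xA0, {0, 0}, {0x1000, 0x2000}, 0x0F};
    // Second instruction (FIG. 2, part c): both flags are 1, virtual values replaced by physical values.
    Instruction second{first.header, {1, 1}, {0x8000, 0x9000}, first.tail};

    std::printf("first flags: %d %d, second flags: %d %d\n",
                first.flags[0], first.flags[1], second.flags[0], second.flags[1]);
    return 0;
}
```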
  • To convert the first instruction into the second instruction, referring to FIG. 3, the specific structure of the address translation unit 20 includes: an instruction parsing subunit 21, an address conversion subunit 22, and an instruction conversion subunit 23.
  • The instruction parsing subunit 21 is connected to the instruction processor 10 and is used to parse the values of one or more virtual addresses from the first instruction; the address conversion subunit 22 is used to determine the value of the physical address corresponding to the value of each of the one or more virtual addresses obtained by the instruction parsing subunit 21; and the instruction conversion subunit 23 is used to generate the second instruction based on the values of the physical addresses determined by the address conversion subunit 22 and the first instruction.
  • The instruction parsing subunit 21 is connected to the instruction processor 10 and the address conversion subunit 22; the address conversion subunit 22 is further connected to the instruction conversion subunit 23; the instruction conversion subunit 23 is connected to the plurality of execution units 30; and the address conversion subunit 22 is also connected to the cache unit 24 and the memory 50.
  • The cache unit 24 is used to store the mapping relationships between values of virtual addresses and values of physical addresses. The memory 50 is the memory of the electronic device, which may be memory inside the GPU or memory outside the GPU, and the memory 50 stores data in units of pages.
  • The first instruction includes an instruction header, flag bits, address bits, and an instruction tail, and these items are stored in different data bits of the instruction. When parsing the virtual address from the first instruction, the instruction parsing subunit 21 can determine, by means of an internal look-up table, the length of the instruction, the number of address bits and flag bits, and the position of each address bit and flag bit in the instruction. The instruction parsing subunit 21 first determines, according to the specific position of each flag bit in the first instruction, whether the address stored in the address bit corresponding to that flag bit is a virtual address; if it is, the value of the virtual address corresponding to the flag bit is read from the corresponding address bit.
  • Here, the instruction parsing subunit 21 may include a first circuit; the number of signal inputs of the first circuit may be equal to the number of data bits included in the first instruction. After a flag bit enters the circuit, a logical operation is performed on the flag bit to output a control signal; when the flag bit is 0, the control signal causes the data of the address bit corresponding to that flag bit to be transmitted to the address conversion subunit 22, thereby completing the parsing of the virtual address and the transfer of the value of the virtual address to the address conversion subunit 22.
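  • As a purely software analogy of the instruction parsing subunit (the disclosure describes a hardware circuit), the sketch below walks an assumed format table of flag/address field positions and collects the values of the addresses whose flag bit is 0; the descriptor layout and names are invented for the example.

```cpp
#include <cstdint>
#include <vector>

// Assumed per-field descriptor: where a flag bit and its address field sit in the instruction.
struct FieldDesc {
    int flag_index;   // index of the flag bit within the instruction's flag array
    int addr_index;   // index of the matching address field
};

struct Instruction {
    std::vector<uint8_t>  flags;   // 0 = virtual address, 1 = physical address
    std::vector<uint64_t> addrs;   // address values, one per flag
};

// Collect the values of all virtual addresses carried by the instruction,
// mirroring the role of the instruction parsing subunit.
std::vector<uint64_t> parse_virtual_addresses(const Instruction& insn,
                                              const std::vector<FieldDesc>& format) {
    std::vector<uint64_t> virtual_values;
    for (const FieldDesc& f : format) {
        if (insn.flags[f.flag_index] == 0) {          // flag 0: address bit holds a virtual address
            virtual_values.push_back(insn.addrs[f.addr_index]);
        }
    }
    return virtual_values;
}

int main() {
    Instruction insn{{0, 1}, {0x1000, 0x8000}};        // one virtual address, one physical address
    std::vector<FieldDesc> format{{0, 0}, {1, 1}};
    auto vals = parse_virtual_addresses(insn, format); // yields {0x1000}
    return vals.size() == 1 ? 0 : 1;
}
```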
  • When determining the value of the physical address corresponding to the value of any virtual address, the address conversion subunit 22 may, for example, access the cache unit 24 and, according to the value of the virtual address, query the mapping relationships between values of virtual addresses and values of physical addresses stored in the cache unit 24. If the cache unit 24 contains a mapping relationship corresponding to the value of the virtual address, the value of the corresponding physical address is acquired from the cache unit 24. If no such mapping relationship exists in the cache unit 24, the value of the physical address corresponding to the value of the virtual address is obtained from the memory 50 through a page table query. The acquired value of the physical address is then transmitted to the instruction conversion subunit 23.
  • Here, since the translation of the instructions distributed to the execution units is implemented through a unified address translation unit, instead of providing the corresponding TLB and PTW circuits in each of the multiple execution units, the address translation unit can be given a cache unit larger than the TLB of each individual execution unit would be, improving the efficiency of address translation.
  • In addition, since only one cache unit needs to be provided for the address translation unit, a multi-process context switch only needs to handle the cache in this cache unit rather than clearing the caches of all execution units, which is more favorable for context switching among multiple users and processes.
  • The address conversion subunit 22 may first query the cache unit 24 for a target virtual address whose value is the same as the value of the virtual address obtained from the instruction parsing subunit 21. If such a target virtual address exists, the value of the target physical address is read according to the queried relationship between the target virtual address and the target physical address, and the value of the target physical address is used as the value of the physical address corresponding to the value of the virtual address in the first instruction. If no target virtual address exists, the circuit corresponding to the PTW performs a page table query to obtain the value of the physical address corresponding to the value of the virtual address from the memory.
  • Here, the instruction processor 10 can read multiple instructions at the same time and transmit them concurrently to the address translation unit. When the address translation unit translates multiple instructions, the physical addresses corresponding to the virtual addresses in several instructions may be read through the same PTW; translation requests at the same level can therefore be merged, and the translations of different instructions can be executed out of order, which reduces memory accesses, lowers bandwidth pressure, and improves translation efficiency.
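  • Purely as an illustration of this merging idea (the data structures and page size are assumptions, not the disclosed hardware), the sketch below groups concurrently issued translation requests by virtual page so that a single page walk serves every request targeting the same page.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

constexpr uint64_t kPageShift = 12;  // assume 4 KiB pages for the example

struct Request { int instruction_id; uint64_t virtual_addr; };

// Placeholder for the real page table walk (mapping is made up for the example).
uint64_t walk_page_table(uint64_t virtual_page) { return virtual_page + 0x80000; }

int main() {
    // Several concurrently issued instructions referencing two distinct pages.
    std::vector<Request> requests = {
        {0, 0x1008}, {1, 0x1FF0}, {2, 0x5020},   // requests 0 and 1 share virtual page 0x1
    };

    // Group requests by virtual page so each page is walked only once.
    std::unordered_map<uint64_t, std::vector<int>> by_page;
    for (const Request& r : requests)
        by_page[r.virtual_addr >> kPageShift].push_back(r.instruction_id);

    for (const auto& [vpage, ids] : by_page) {
        uint64_t ppage = walk_page_table(vpage);   // one walk serves all merged requests
        for (int id : ids)
            std::printf("instruction %d: vpage 0x%llx -> ppage 0x%llx\n",
                        id, (unsigned long long)vpage, (unsigned long long)ppage);
    }
    return 0;
}
```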
  • In another embodiment of the present disclosure, the address translation unit 20 is further configured to store the mapping relationship between the value of the physical address acquired from the memory and the value of the virtual address into the cache unit.
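  • A minimal software sketch of this lookup path, under assumed 4 KiB pages and a map-based stand-in for the cache unit, might look as follows: check the cached mappings first, fall back to a page table walk on a miss, and then store the new mapping in the cache unit. The function names and the page-walk stub are illustrative only.

```cpp
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kPageShift = 12;                 // assumed 4 KiB pages
constexpr uint64_t kOffsetMask = (1ULL << kPageShift) - 1;

// Cache unit: virtual page number -> physical page number (stands in for the TLB).
std::unordered_map<uint64_t, uint64_t> cache_unit;

// Placeholder for the page table walk against memory 50 (mapping is made up).
uint64_t page_table_walk(uint64_t virtual_page) { return virtual_page + 0x40000; }

// Translate one virtual address value to a physical address value.
uint64_t translate(uint64_t virtual_addr) {
    uint64_t vpage = virtual_addr >> kPageShift;
    auto hit = cache_unit.find(vpage);
    if (hit == cache_unit.end()) {                  // miss: walk the page table in memory,
        uint64_t ppage = page_table_walk(vpage);    // then store the new mapping in the cache unit
        hit = cache_unit.emplace(vpage, ppage).first;
    }
    return (hit->second << kPageShift) | (virtual_addr & kOffsetMask);
}

int main() { return translate(0x1234) == translate(0x1234) ? 0 : 1; }
```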
  • Here, the address conversion subunit 22 may include a second circuit whose number of signal inputs is, for example, equal to the number of data bits included in the first instruction. After obtaining the value of the virtual address from the first circuit of the instruction parsing subunit 21, the second circuit determines through logical operations, according to the values of the virtual addresses stored in the cache unit 24, whether the cache unit 24 contains a target virtual address with the same value as the virtual address in the first instruction. If no such target virtual address exists in the cache unit 24, the address conversion subunit 22 obtains the value of the physical address corresponding to the value of the virtual address from the memory 50 through the circuit corresponding to the PTW.
  • When converting the first instruction into the second instruction, the instruction conversion subunit 23 may, for example, change the target flag bit in the first instruction from the first value to the second value and replace the value of the virtual address with the value of the physical address, thereby obtaining the second instruction.
  • the first numerical value indicates that the address carried in the instruction is a virtual address; the second numerical value indicates that the address carried in the instruction is a physical address.
  • When the instruction includes multiple flag bits and address bits, the first value indicates that the address corresponding to the target flag bit is a virtual address, and the second value indicates that the address corresponding to the target flag bit is a physical address.
  • a third circuit may be included in the instruction conversion sub-unit 23 .
  • the third circuit can receive the value of the physical address transmitted by the second circuit in the address translation subunit 22 .
  • the second circuit also transmits the instruction header, instruction tail, and flag bits in the first instruction to the instruction conversion subunit 23 .
  • the third circuit changes the flag bit from the first value to the second value through logical operation, and generates and outputs the second instruction based on the instruction header, the value-modified flag bit, the value of the physical address, and the instruction tail.
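  • A minimal software sketch of this conversion step follows; it reuses an illustrative instruction structure similar to the earlier example, and the field names and sizes remain assumptions rather than the disclosed format.

```cpp
#include <cstdint>
#include <vector>

struct Instruction {
    uint32_t header;
    std::vector<uint8_t>  flags;   // 0 = virtual address, 1 = physical address
    std::vector<uint64_t> addrs;   // address values, one per flag
    uint32_t tail;
};

// Build the second instruction from the first: flip each virtual-address flag to 1
// and substitute the translated physical-address value supplied for that slot.
Instruction convert_to_second(const Instruction& first,
                              const std::vector<uint64_t>& physical_values) {
    Instruction second = first;                        // header and tail are kept as-is
    size_t next = 0;
    for (size_t i = 0; i < second.flags.size(); ++i) {
        if (second.flags[i] == 0) {                    // target flag bit: first value (virtual)
            second.flags[i] = 1;                       // change to second value (physical)
            second.addrs[i] = physical_values[next++]; // replace virtual value with physical value
        }
    }
    return second;
}

int main() {
    Instruction first{0xA0, {0, 0}, {0x1000, 0x2000}, 0x0F};
    Instruction second = convert_to_second(first, {0x8000, 0x9000});
    return (second.flags[0] == 1 && second.addrs[1] == 0x9000) ? 0 : 1;
}
```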
  • Each execution unit 30 can execute instructions. The execution unit that executes the second instruction is referred to as the target execution unit, and the target execution unit is at least one execution unit among the plurality of execution units 30. When executing the second instruction, the target execution unit can parse the value of the physical address carried in the second instruction and access the memory corresponding to that value.
  • For example, the operand corresponding to the second instruction is stored at the storage location corresponding to the value of the physical address; when accessing the physical address, the target execution unit reads the operand stored at that location and then uses the operand to perform the specific operation indicated by the second instruction, obtaining the instruction execution result.
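  • Purely as an illustration of this step, the sketch below reads an operand at the translated physical address and applies the operation encoded by the instruction; the toy memory model, opcode meaning, and values are invented for the example.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy physical memory for the example: physical address -> stored operand.
std::unordered_map<uint64_t, int64_t> physical_memory = {{0x8000, 21}};

struct SecondInstruction {
    uint32_t opcode;        // assumed: 1 = double the operand and write it back
    uint64_t physical_addr; // flag already set to "physical" by the address translation unit
};

// Target execution unit: access memory at the physical address and execute the op.
int64_t execute(const SecondInstruction& insn) {
    int64_t operand = physical_memory[insn.physical_addr];  // read operand at the physical address
    int64_t result = (insn.opcode == 1) ? operand * 2 : operand;
    physical_memory[insn.physical_addr] = result;            // write back the execution result
    return result;
}

int main() { return execute({1, 0x8000}) == 42 ? 0 : 1; }
```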
  • As shown in FIG. 3, another instruction processing device provided by an embodiment of the present disclosure further includes an instruction distribution unit 40.
  • The address translation unit 20 is connected to the plurality of execution units 30 through the instruction distribution unit 40 and is further configured to transmit the second instruction to the instruction distribution unit 40.
  • The instruction distribution unit 40 is configured to, after receiving the second instruction transmitted by the address translation unit 20, determine the target execution unit among the plurality of execution units 30 for the second instruction and transmit the second instruction to the target execution unit.
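  • The disclosure does not specify how the target execution unit is chosen; as one assumed policy for illustration only, the sketch below dispatches each translated instruction to the execution unit with the shortest pending queue.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct SecondInstruction { int id; };

// One pending-instruction queue per execution unit (two units, as in FIG. 4).
std::vector<std::deque<SecondInstruction>> execution_units(2);

// Instruction distribution unit: pick a target execution unit and hand the
// second instruction to it. Shortest-queue selection is an assumption here.
size_t dispatch(const SecondInstruction& insn) {
    size_t target = 0;
    for (size_t i = 1; i < execution_units.size(); ++i)
        if (execution_units[i].size() < execution_units[target].size()) target = i;
    execution_units[target].push_back(insn);
    return target;
}

int main() {
    dispatch({0});
    dispatch({1});   // lands on the other, shorter queue
    return execution_units[0].size() == 1 && execution_units[1].size() == 1 ? 0 : 1;
}
```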
  • Referring to FIG. 4, an embodiment of the present disclosure further provides a specific example of an instruction processing device architecture, including: an instruction processor (Command Processor, CP) 10, an address translation unit 20, a Peripheral Component Interconnect Express (PCIE) interface, execution unit 30#0, execution unit 30#1, an instruction distribution unit 40, a memory 50, and a bus.
  • The instruction processor 10 obtains an instruction from the host through the bus and the PCIE interface. After resolving its dependencies and before sending the instruction to the instruction distribution unit 40, the instruction processor 10 transmits the instruction to the instruction parsing subunit in the address translation unit 20; the instruction parsing subunit parses the instruction, obtains the values of the virtual addresses contained in the instruction, and passes the values of the virtual addresses to the address conversion subunit in the address translation unit.
  • The instruction format is similar to that shown in FIG. 2 and can contain multiple addresses, each indicated by a flag bit as being either a virtual address or a physical address; a flag bit of 0 can indicate a virtual address, and a flag bit of 1 can indicate a physical address.
  • The address conversion subunit first queries the built-in TLB, that is, the cache unit. If a corresponding physical address exists in the TLB, it sends the value of the queried physical address to the instruction conversion subunit in the address translation unit, and the instruction conversion subunit replaces the value of the virtual address in the instruction with the value of the physical address and sets the corresponding flag bit to 1, completing the translation of the instruction.
  • If no corresponding physical address exists in the TLB, the address conversion subunit needs to query the page table in the memory through the bus to obtain the mapping relationship between the value of the virtual address and the value of the physical address. While the page table in memory is being accessed to obtain the physical address, the instruction processor 10 can still deliver new instructions to the address translation unit 20 for parsing and further cache queries.
  • Here, because the instruction processor contains multiple instruction queues and, limited by the execution efficiency of the execution units, multiple instructions are pending in those queues, the latency of the physical memory query performed during address translation is "hidden" by the latency of instruction issue, which mitigates the high-latency problem caused by a TLB miss (the case where no mapping between the virtual address and the physical address exists in the cache unit).
  • After obtaining the physical address through the page table query, the address conversion subunit adds to the TLB the mapping relationship between the value of the virtual address in the instruction and the value of the queried physical address, or replaces an old mapping relationship, and sends the value of the queried physical address to the instruction conversion subunit.
  • The instruction conversion subunit then performs the following operations: it sets the flag bit whose value is 0 to 1 and replaces the address bit corresponding to that flag bit with the value of the physical address, completing the translation of the instruction.
  • After completing the instruction translation, the instruction conversion subunit transfers the translated instruction to the instruction distribution unit, and the instruction distribution unit distributes the instruction to the target execution unit, for example execution unit 30#0 or execution unit 30#1.
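  • To tie the walkthrough above together, the compact sketch below strings the steps of FIG. 4 into software form: the instruction arrives carrying virtual addresses, the address conversion step resolves them through cached mappings with a page-walk fallback, the flag bits are flipped, and the translated instruction is handed to an execution unit. All names, the map-based TLB stand-in, and the page-walk stub are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Instruction {
    std::vector<uint8_t>  flags;   // 0 = virtual, 1 = physical
    std::vector<uint64_t> addrs;
};

std::unordered_map<uint64_t, uint64_t> tlb;                    // cache unit (virtual page -> physical page)
uint64_t page_walk(uint64_t vpage) { return vpage + 0x40000; } // stand-in for the page table in memory 50

uint64_t translate(uint64_t vaddr) {                           // address conversion subunit
    uint64_t vpage = vaddr >> 12;
    auto it = tlb.find(vpage);
    if (it == tlb.end()) it = tlb.emplace(vpage, page_walk(vpage)).first;
    return (it->second << 12) | (vaddr & 0xFFF);
}

void dispatch(const Instruction& insn, int target_cu) {        // instruction distribution unit
    std::printf("CU#%d receives instruction with %zu physical address(es)\n",
                target_cu, insn.addrs.size());
}

int main() {
    Instruction first{{0, 0}, {0x1008, 0x2010}};               // first instruction from the CP
    Instruction second = first;
    for (size_t i = 0; i < second.flags.size(); ++i) {         // instruction conversion subunit
        if (second.flags[i] == 0) {
            second.addrs[i] = translate(second.addrs[i]);
            second.flags[i] = 1;
        }
    }
    dispatch(second, 0);                                       // hand off to execution unit 30#0
    return 0;
}
```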
  • In the embodiments of the present disclosure, before an instruction is distributed to a specific execution unit, an address translation unit is used to convert the virtual address carried in the instruction into a physical address. After the instruction has been distributed to an execution unit, the virtual address it carries has already been translated into a physical address by the address translation unit, so address translation does not need to be performed in each execution unit; there is therefore no need to provide the logic circuits corresponding to the TLB and PTW in each execution unit, which ultimately reduces the area of the chip. At the same time, since multiple execution units no longer perform multi-way concurrent page table queries, the bandwidth pressure on global memory can be reduced.
  • Some or all of the units and subunits of the above device can be implemented in the form of integrated circuits embedded on a chip, and they can be implemented individually or integrated together.
  • the integrated circuit may include sequential circuits, logic operation circuits, memories, and the like. This disclosure does not limit this.
  • Based on the same inventive concept, an embodiment of the present disclosure also provides an instruction processing method corresponding to the instruction processing device; since the principle by which the method solves the problem is similar to that of the instruction processing device described above, reference may be made between the implementations of the method and the device, and repeated descriptions are omitted.
  • Referring to FIG. 5, a flowchart of an instruction processing method provided by an embodiment of the present disclosure, the instruction processing method includes the following steps:
  • S501: The instruction processor acquires a first instruction and transmits the first instruction to the address translation unit.
  • S502: The address translation unit receives the first instruction transmitted by the instruction processor, converts the virtual address carried in the first instruction into a physical address to obtain a second instruction, and transmits the second instruction to a target execution unit among a plurality of execution units, wherein the target execution unit is at least one execution unit among the plurality of execution units.
  • S503: The target execution unit executes the second instruction to obtain an instruction execution result.
  • In the embodiments of the present disclosure, the instruction processing device acquires the first instruction and transmits the first instruction to the address translation unit; after receiving the first instruction, the address translation unit converts the virtual address in the first instruction into a physical address to obtain the second instruction and transmits the second instruction to the target execution unit among the plurality of execution units; the target execution unit executes the second instruction to obtain the instruction execution result.
  • Before an instruction is dispatched to a specific execution unit, the address translation unit converts the virtual address carried in the instruction into a physical address; once the instruction has been dispatched to the execution unit, the virtual address it carries has already been translated by the address translation unit into a physical address, so translation does not need to be performed in each execution unit. There is therefore no need to provide the logic circuits corresponding to the TLB and PTW in each execution unit, which reduces the area of the chip; at the same time, since multiple execution units no longer perform multi-way concurrent page table queries, the bandwidth pressure on global memory can be reduced.
  • In a possible implementation, the method further includes: the address translation unit transmits the second instruction to the instruction distribution unit; after receiving the second instruction transmitted by the address translation unit, the instruction distribution unit determines the target execution unit among the plurality of execution units for the second instruction and transmits the second instruction to the target execution unit.
  • In a possible implementation, executing the second instruction by the target execution unit includes: the target execution unit parses the value of the physical address carried in the second instruction and accesses the memory corresponding to that value.
  • In a possible implementation, the address translation unit receiving the first instruction transmitted by the instruction processor and converting the virtual address carried in the first instruction into a physical address to obtain the second instruction includes: the address translation unit parses the value of the virtual address from the first instruction, determines the value of the physical address corresponding to the value of the virtual address, and generates the second instruction based on the value of the physical address and the first instruction.
  • In a possible implementation, the method further includes: the address translation unit queries, according to the value of the virtual address, the mapping relationships between values of virtual addresses and values of physical addresses stored in the cache unit; if the cache unit contains a mapping relationship corresponding to the value of the virtual address, the value of the corresponding physical address is obtained from the cache unit; if no such mapping relationship exists in the cache unit, the value of the physical address corresponding to the value of the virtual address is obtained from the memory through a page table query.
  • the method further includes: the address translation unit stores the mapping relationship between the value of the physical address and the value of the virtual address obtained from the memory into the cache unit.
  • In a possible implementation, the address translation unit converting the virtual address carried in the first instruction into a physical address to obtain the second instruction includes: changing the target flag bit in the first instruction from the first value to the second value, and replacing the value of the virtual address with the value of the physical address to obtain the second instruction; wherein the first value indicates that the address carried in the instruction is a virtual address, and the second value indicates that the address carried in the instruction is a physical address.
  • Embodiments of the present disclosure further provide a computer program, which implements any one of the methods in the foregoing embodiments when the computer program is executed by a processor.
  • the computer program product can be specifically implemented by hardware, software or a combination thereof.
  • In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • The division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or subunits may be combined or integrated into another system, or some features may be omitted or not performed.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some communication interfaces, indirect coupling or communication connection of devices or units, which may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a processor-executable non-volatile computer-readable storage medium.
  • Based on such an understanding, the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

一种指令处理设备以及指令处理方法。其中,该指令处理设备包括:指令处理器(10)、地址翻译单元(20)、以及多个执行单元(30);所述指令处理器(10)连接所述地址翻译单元(20),所述地址翻译单元(20)连接所述多个执行单元(30);指令处理器(10),用于获取第一指令,并向地址翻译单元(20)传输第一指令;地址翻译单元(20),用于接收指令处理器(10)传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,并向所述多个执行单元(30)中的目标执行单元传输所述第二指令,其中,所述目标执行单元为多个执行单元(30)中的至少一个执行单元;所述目标执行单元,用于执行所述第二指令,得到指令执行结果。

Description

指令处理设备以及指令处理方法
相关申请的交叉引用
本申请要求2020年9月30日提交的题为“指令处理设备以及指令处理方法”、申请号为202011064561.X的中国专利申请的优先权,以上申请的全部内容通过引用并入本文。
技术领域
本公开涉及虚实地址转换技术领域,具体而言,涉及一种指令处理设备以及指令处理方法。
背景技术
图形处理器(Graphics Processing Unit,GPU)是执行云侧人工智能(Artificial Intelligence,AI)推理和训练任务的常用设备。由于资源虚拟化等需要,当前GPU在执行数据处理任务时,需要使用虚实地址转换的功能,将指令中的虚拟地址转换为实际的物理地址。
发明内容
本公开实施例至少提供一种指令处理设备以及指令处理方法。
第一方面,本公开实施例提供了一种指令处理设备,包括:指令处理器、地址翻译单元、以及多个执行单元;所述指令处理器连接所述地址翻译单元,所述地址翻译单元连接所述多个执行单元;所述指令处理器,用于获取第一指令,并向所述地址翻译单元传输所述第一指令;所述地址翻译单元,用于接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,并向所述多个执行单元中的目标执行单元传输所述第二指令,其中,所述目标执行单元为多个执行单元中的至少一个执行单元;所述目标执行单元,用于执行所述第二指令,得到指令执行结果。
这样,在指令被分发至具体的执行单元之前,利用地址翻译单元将指令中携带的虚拟地址转换为物理地址;当指令被分发至达具体的执行单元后,指令中携带的虚拟地址已经被地址翻译单元翻译为物理地址,从而不需要执行单元进行地址的翻译过程,因而无需在每个执行单元中设置TLB和PTW相应的电路,减小执行单元的占用的芯片面积, 降低GPU的硬件成本。进一步的,由于不存在多个执行单元多路并发页表查询的情况,从而可以减小全局内存的带宽压力。
同时,本公开实施例由通过一个统一的地址翻译单元实现对分发至执行单元的指令的翻译,而不是在多个执行单元中的每个执行单元中设置TLB和PTW相应的电路,可以为地址翻译单元设置比每个执行单元中的TLB更大的缓存单元,提升地址翻译的效率。
另外,本公开实施例中,由于多条指令通过同一PTW读取物理地址,因此可以合并同层次的翻译申请,乱序执行对不同指令的翻译,从而可以减少对内存的访问,降低带宽压力,提升翻译的效率。
另外,由于只需要为地址翻译单元设置一个缓存单元,因此在多进程的上下文切换时,只需要处理该缓存单元中的缓存,而不必清理所有执行单元的缓存,因此更有利于多用户多进程间的上下文切换。
另外,由于指令处理器中,存在多个指令队列,受限于执行单元的执行效率,指令队列中会存在多条指令,因此在通过地址翻译单元进行地址翻译,物理内存查询的延时会被指令下发的延时“隐藏”,减少TLB miss(在缓存单元中不存在虚拟地址和物理地址映射关系)导致的高时延问题。
一种可能的实施方式中,还包括:指令分发单元;所述地址翻译单元,通过所述指令分发单元与所述多个执行单元相连,还用于向所述指令分发单元传输所述第二指令;所述指令分发单元,用于在接收到所述地址翻译单元传输的所述第二指令后,为所述第二指令确定所述多个执行单元中的所述目标执行单元,并将所述第二指令向所述目标执行单元传输。
这样,通过地址翻译单元将第一指令转换为携带物理地址的第二指令,并通过指令分发单元将第二指令分发至具体的目标执行单元。
一种可能的实施方式中,所述目标执行单元在执行所述第二指令时,用于:解析所述第二指令中携带的物理地址的数值,并访问所述物理地址的数值对应的内存。
一种可能的实施方式中,所述地址翻译单元包括:指令解析子单元、地址转换子单元以及指令转换子单元;其中,所述指令解析子单元与所述指令处理器和所述地址转换子单元相连接,所述地址转换子单元还与所述指令转换子单元相连接,所述指令转换子单元还与所述多个执行单元相连接;所述指令解析子单元,用于从所述第一指令中解析 所述虚拟地址的数值;所述地址转换子单元,用于确定所述指令解析子单元解析得到的所述虚拟地址的数值对应的物理地址的数值;所述指令转换子单元,用于基于所述地址转换子单元确定的所述物理地址的数值以及所述第一指令,生成所述第二指令。
一种可能的实施方式中,所述地址翻译单元包括:缓存单元,所述缓存单元用于存储虚拟地址的数值与物理地址的数值之间的映射关系;所述地址翻译单元,与内存相连接,所述地址翻译单元还用于根据所述虚拟地址的数值,查询所述缓存单元中的所述映射关系;在所述缓存单元中存在所述虚拟地址的数值对应的映射关系的情况下,从所述缓存单元中获取所述虚拟地址的数值对应的所述物理地址的数值;在所述缓存单元中不存在所述虚拟地址的数值对应的映射关系的情况下,通过页表查询从所述内存中获取所述虚拟地址的数值对应的所述物理地址的数值。
一种可能的实施方式中,所述地址翻译单元还用于将从所述内存中获取的所述物理地址的数值和所述虚拟地址的数值之间的映射关系存储到所述缓存单元中。
一种可能的实施方式中,所述地址翻译单元用于:将所述第一指令中的目标标志位由第一数值更改为第二数值,并将所述虚拟地址的数值替换为所述物理地址的数值,得到所述第二指令;其中,所述第一数值指示指令中携带的地址为虚拟地址;所述第二数值指示指令中携带的地址为物理地址。
第二方面,本公开实施例还提供一种指令处理方法,包括;指令处理器获取第一指令,并向地址翻译单元传输所述第一指令;所述地址翻译单元接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,并向多个执行单元中的目标执行单元传输所述第二指令,其中,所述目标执行单元为多个执行单元中的至少一个执行单元;所述目标执行单元执行所述第二指令,得到指令执行结果。
一种可能的实施方式中,还包括:所述地址翻译单元向指令分发单元传输所述第二指令;所述指令分发单元在接收到所述地址翻译单元传输的所述第二指令后,为所述第二指令确定所述多个执行单元中的所述目标执行单元,并将所述第二指令向所述目标执行单元传输。
一种可能的实施方式中,所述目标执行单元执行所述第二指令,包括:所述目标执行单元解析所述第二指令中携带的物理地址的数值,并访问所述物理地址的数值对应的内存。
一种可能的实施方式中,所述地址翻译单元接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,包括:所述地址翻译单元从所述第一指令中解析所述虚拟地址的数值;确定与所述虚拟地址的数值对应的物理地址的数值;基于所述物理地址的数值以及所述第一指令,生成所述第二指令。
一种可能的实施方式中,还包括:所述地址翻译单元根据所述虚拟地址的数值,查询缓存单元中存储的虚拟地址的数值与物理地址的数值之间的映射关系;在所述缓存单元中存在所述虚拟地址的数值对应的映射关系的情况下,从所述缓存单元中获取所述虚拟地址的数值对应的所述物理地址的数值;在所述缓存单元中不存在所述虚拟地址的数值对应的映射关系的情况下,通过页表查询从内存中获取所述虚拟地址的数值对应的所述物理地址的数值。
一种可能的实施方式中,还包括:所述地址翻译单元将从所述内存中获取的所述物理地址的数值和所述虚拟地址的数值之间的映射关系存储到所述缓存单元中。
一种可能的实施方式中,所述地址翻译单元将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,包括:将所述第一指令中的目标标志位由第一数值更改为第二数值,并将所述虚拟地址的数值替换为所述物理地址的数值,得到所述第二指令;其中,所述第一数值指示指令中携带的地址为虚拟地址;所述第二数值指示指令中携带的地址为物理地址。
为使本公开的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附附图,作详细说明如下。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,此处的附图被并入说明书中并构成本说明书中的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。应当理解,以下附图仅示出了本公开的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。
图1示出了本公开实施例所提供的一种指令处理设备的示意图;
图2示出了本公开实施例所提供的指令结构的示意图;
图3示出了本公开实施例所提供的另一种指令处理设备的示意图;
图4示出了本公开实施例所提供的一种指令处理设备架构示例的示意图;
图5示出了本公开实施例所提供的一种指令处理方法的流程图。
具体实施方式
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例中附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。通常在此处描述和示出的本公开实施例的组件可以以各种不同的配置来布置和设计。因此,以下对本公开的实施例的详细描述并非旨在限制要求保护的本公开的范围,而是仅仅表示本公开的选定实施例。基于本公开的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。
GPU中通常包括:指令处理器(Command Processor,CP)、以及与指令处理器连接的多个执行单元(Compute Unit,CU)。指令处理器可以用于从主机中央处理器(Host Central Processing Unit,host CPU)(以下简称主机)获取指令流,并为指令流中的各个指令确定目标执行单元,然后将指令发给对应的目标执行单元;CU在接收到指令后,可以将指令中携带的虚拟地址转换为物理地址,然后进行与指令对应的操作,如从该物理地址读写数据等。CU在将指令中携带的虚拟地址转换为物理地址的时候,需要通过查询翻译后备缓存器(Translation Lookaside Buffer,TLB)和/或页表查询(Page Table Walk,PTW)的逻辑电路确定对应的物理地址。在该过程中,由于GPU中的执行单元数量较多,每个执行单元中都会需要设置用于TLB和PTW的存储器和逻辑电路,每个执行单元所需的芯片面积较大,从而造成GPU所需的芯片面积过大的问题。同时,由于GPU在执行数据处理任务时,会将数据处理任务划分为多个子任务分配至不同的执行单元执行,多个执行单元在执行对应子任务时,会存在多路并发页表查询,可能导致全局内存的带宽问题。
基于上述研究,本公开提供了一种指令处理设备,在指令被分发至具体的执行单元之前,利用地址翻译单元将指令中携带的虚拟地址转换为物理地址。当指令被分发至达具体的执行单元后,指令中携带的虚拟地址已经被地址翻译单元翻译为物理地址,从而不需要执行单元进行地址的翻译过程,因而无需在每个执行单元中设置TLB和PTW相应的电路,减小执行单元的占用的芯片面积,降低GPU的硬件成本。进一步的,由于 不存在多个执行单元多路并发页表查询的情况,从而可以减小全局内存的带宽压力。
相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
为便于对本实施例进行理解,首先对本公开实施例所公开的一种指令处理设备进行详细介绍。
本公开实施例提供的指令处理设备,可以用于中央处理器(Central Processing Unit,CPU)、GPU,或者其他包括指令处理器、执行单元的指令处理设备。
下面以将本公开实施例提供的指令处理设备应用于GPU为例对本公开实施例提供的指令处理设备加以说明,但也可以应用于其他类型的指令处理设备。
参见图1所示,为本公开实施例提供的指令处理设备的结构示意图,包括:指令处理器10、地址翻译单元20、以及多个执行单元30。指令处理器10连接地址翻译单元20,地址翻译单元连接多个执行单元30。
其中,所述指令处理器10,用于获取第一指令,并向所述地址翻译单元20传输所述第一指令;地址翻译单元20,用于接收所述指令处理器10传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,并向所述多个执行单元30中的目标执行单元传输所述第二指令,其中,所述目标执行单元为多个执行单元30中的至少一个执行单元;所述目标执行单元,用于执行所述第二指令,得到指令执行结果。
下面对分别指令处理器10、地址翻译单元20、以及执行单元30加以详细描述。
针对本公开实施例提供的指令处理设备应用在GPU中的情况,主机将软件的源程序转换为机器指令,并将机器指令存储至GPU的指令存储器中。在获取第一指令时,指令处理器10可以直接从该指令存储器中读取指令。此处,指令处理器10在从指令存储器中读取指令之后,会对具有依赖关系的指令进行解耦,之后向地址翻译单元20传输第一指令。
针对本公开实施例提供的指令处理设备应用在CPU中的情况,计算机内部署的编译器负责将软件的软件程序转换为机器指令,并将机器指令存储至CPU的指令存储器中。在获取第一指令时,指令处理器10可以直接从该指令存储器中读取指令。
地址翻译单元20,在接收到指令处理器10传输的第一指令后,负责将第一指令中 的虚拟地址转换为物理地址,生成第二指令。
如图2中的a部分所示,为本公开实施提供的一种指令的结构示意图,包括指令头、第一标志位、第一地址位、第二标志位、第二地址位以及指令尾。需要注意的是,该指令结构只是一个示例,本公开不限制指令包括的其他信息,本公开也不限制标志位和地址位在指令结构中的具体位置。本公开也不限制一个指令中所包括的标志位和地址位的个数。其中,第一标志位用于指示在第一地址位存储的地址是虚拟地址还是物理地址;第二标志位用于指示在第二地址位存储的地址是虚拟地址还是物理地址。示例性的,当标志位为第一数值的时候,指示对应地址位存储的地址为虚拟地址;在标志位为第二数值的时候,指示对应地址位存储的地址为物理地址。示例性的,第一数值可以为0,第二数值可以为1。
如图2中的b部分所示,为第一指令的示意图。该第一指令中包括了两个虚拟地址,对应的第一标志位和第二标志位均为0,表征第一指令中所包括的地址均为虚拟地址;第一地址位和第二地址位存储的地址分别为虚拟地址0和虚拟地址1。
如图2中的c部分所示,为第二指令的示意图。其中第一标志位和第二标识为均为1,第一地址位存储的地址为与虚拟地址0对应的物理地址0;第二地址位存储的地址为与虚拟地址1对应的物理地址1。
为了实现将第一指令转化为第二指令,参见图3所示,地址翻译单元20的具体结构包括:指令解析子单元21、地址转换子单元22以及指令转换子单元23。
其中,所述指令解析子单元21与所述指令处理器10连接,用于从所述第一指令中解析出一个或多个虚拟地址的数值;所述地址转换子单元22,用于确定所述指令解析子单元21解析得到的一个或多个虚拟地址的数值中每个虚拟地址的数值对应的物理地址的数值;所述指令转换子单元23,用于基于所述地址转换子单元22确定的每个物理地址的数值以及所述第一指令,生成第二指令。
指令解析子单元21与指令处理器10和地址转换子单元22相连接,地址转换子单元22还与指令转换子单元23相连接,指令转换子单元23与多个执行单元30相连接。地址转换子单元22还与缓存单元24已经内存50相连接。其中,缓存单元24用于存储虚拟地址的数值与物理地址的数值之间的映射关系,内存50是电子设备中的存储器,可以是GPU内部的存储器,也可以是GPU外部的存储器,内存50以页为单位进行存储。
第一指令中包括了指令头、标志位、地址位、指令尾,且上述几种数据存储在指令 的不同数据位。指令解析子单元21在从第一指令中解析虚拟地址的时候,可以通过内部查表的方式确定指令的长度以及地址位和标志位的个数、以及每个地址位和标志位在指令中的位置。指令解析单元21首先根据每个标志位在第一指令中的具体位置,确定该标志位对应的地址位存储的地址是否为虚拟地址;在该标志位对应的地址位存储的地址为虚拟地址的情况下,从对应的地址位读取该标志位对应的虚拟地址的数值。
这里,指令解析子单元21可以包括第一电路;该第一电路所包括的信号输入端,可以与第一指令中包括的数据位的数量相等;其中,标志位进入到电路结构内部后,通过对标志位进行逻辑运算输出一控制信号,该控制信号的作用位在标志位为0的情况下,将与标志位对应的地址位的数据,传输给地址转换子单元22,完成虚拟地址解析、以及将虚拟地址的数值向地址转换子单元22传输的过程。
地址转换子单元22,在确定任一虚拟地址的数值对应的物理地址的数值时,例如可以访问缓存单元24,根据该虚拟地址的数值,查询缓存单元24中存储的虚拟地址的数值与物理地址的数值之间的映射关系。在缓存单元24中存在该虚拟地址的数值对应的映射关系的情况下,从缓存单元24中获取该虚拟地址的数值对应的物理地址的数值。在缓存单元24中不存在所述虚拟地址的数值对应的映射关系的情况下,通过页表查询方式从内存50中获取所述虚拟地址的数值对应的物理地址的数值。将获取的物理地址的数值向指令转换子单元23传输。
此处,本公开实施例由于通过一个统一的地址翻译单元实现对分发至执行单元的指令的翻译,而不是在多个执行单元中的每个执行单元中设置TLB和PTW相应的电路,可以为地址翻译单元设置比每个执行单元中的TLB更大的缓存单元,提升地址翻译的效率。
另外,由于只需要为地址翻译单元设置一个缓存单元,因此在多进程的上下文切换时,只需要处理该缓存单元中的缓存,而不必清理所有执行单元的缓存,因此更有利于多用户多进程间的上下文切换。
地址转换子单元22可以首先从缓存单元24中查询是否存在与从指令解析子单元21得到的虚拟地址的数值相同的目标虚拟地址,若存在目标虚拟地址,则基于查询到的目标虚拟地址与目标物理地址的逻辑关系,读取目标物理地址的数值,将目标物理地址的数值作为第一指令中虚拟地址的数值对应的物理地址的数值。若未存在目标虚拟地址,则通过页表查询PTW相应的电路从内存中获取与虚拟地址的数值对应的物理地址的数值。
这里,由于指令处理器10可以同时读取多条指令,并将多条指令并发传输给地址翻译单元。因而地址翻译单元在对多条指令进行翻译时,由于可能存在通过同一PTW读取多条指令中虚拟地址对应的物理地址的情况,因此可以合并同层次的翻译申请,乱序执行对不同指令的翻译,从而可以减少对内存的访问,降低带宽压力,提升翻译的效率。
本公开另一实施例中,所述地址翻译单元20,还用于将从内存中获取的所述物理地址和所述虚拟地址之间的映射关系存储到所述缓存单元中。
此处,地址转换子单元22可以包括第二电路,该第二电路的信号输入端的数量,例如与第一指令中包括的数据位的数量相等。在从上述指令解析子单元21包括的第一电路中,得到虚拟地址的数值后,第二电路从缓存单元24中,按照缓存单元24中存储的各虚拟地址的数值,通过逻辑运算,确定缓存单元24中是否存在与第一指令中虚拟地址的数值相同的目标虚拟地址。在缓存单元24中不存在与第一指令中虚拟地址的数值相同的目标虚拟地址的情况下,地址转换子单元22通过PTW相应的电路从内存50中获取与虚拟地址的数值对应的物理地址的数值。
指令转换子单元23,在将第一指令转换为第二指令时,例如可以将所述第一指令中的目标标志位由第一数值更改为第二数值,并将所述虚拟地址的数值替换为所述物理地址的数值,得到所述第二指令。
其中,所述第一数值指示指令中携带的地址为虚拟地址;所述第二数值指示指令中携带的地址为物理地址。当指令中包括多个标志位和地址位时,第一数值指示该目标标志位对应的地址为虚拟地址,第二数值指示该目标标志位对应的地址为物理地址。
指令转换子单元23中可以包括第三电路。该第三电路能够接收到地址转换子单元22中的第二电路传输的物理地址的数值。另外,第二电路还会将第一指令中的指令头、指令尾、以及标志位传输给指令转换子单元23。第三电路通过逻辑运算,将标志位由第一数值更改为第二数值,基于指令头、数值更改的标志位、以及物理地址的数值、以及指令尾,生成并输出第二指令。
执行单元30,每个执行单元30均可以执行指令。可以将执行第二指令的执行单元称为目标执行单元,目标执行单元为多个执行单元30中的至少一个执行单元,在执行第二指令时,可以解析所述第二指令中携带的物理地址的数值,并访问所述物理地址的数值对应的内存。
例如,在物理地址的数值对应的存储位置,存储有与第二指令对应的操作数;目标 执行单元在访问物理地址时,能够读取到物理地址对应的存储位置所存储的操作数;然后利用操作数,执行第二指令所指示的具体操作,得到指令执行结果。
如图3所示,本公开实施例还提供的另外一种指令处理设备还包括:指令分发单元40。所述地址翻译单元20,通过指令分发单元40与多个执行单元30相连,还用于向所述指令分发单元40传输所述第二指令。
所述指令分发单元40,用于在接收到所述地址翻译单元20传输的第二指令后,为所述第二指令确定所述多个执行单元30中的所述目标执行单元,并将所述第二指令向所述目标执行单元传输。
参见图4所示,本公开实施例还提供一种指令处理器的架构的具体示例,包括:指令处理器(Command Processor,CP)10、地址翻译单元20、外围组件互联(Peripheral Component Interconnect Express,PCIE)接口、执行单元30#0、执行单元30#1、指令分发单元40、内存50、以及总线。
指令处理器10通过总线以及PCIE从主机获得一条指令,解除依赖后,在送给指令下发单元40之前,将指令传输给地址翻译单元20中的指令解析子单元,指令解析子单元解析该指令,获得该指令内含有的虚拟地址的数值,并将虚拟地址的数值传给地址翻译单元中的地址转换子单元。
指令形式类如图2所示,可以含有多个地址,由标志位来指示是虚拟地址或者物理地址,其中,标志位为0可以指示虚拟地址,标志位为1可以指示物理地址。
地址转换子单元首先查询内置的TLB,也就是存储单元,若在TLB中存在对应的物理地址,将查询到的物理地址的数值送至地址翻译单元中的指令转换子单元,指令转换子单元将指令中的虚拟地址的数值替换为物理地址的数值,并将相应的标志位置为1,完成对指令的翻译过程。
在TLB中如果不存在对应的物理地址,地址转换子单元需要通过总线从内存中的页表查询获得该虚拟地址的数值和物理地址的数值之间的映射关系,在访问内存中页表得到物理地址的过程中,指令处理器10依然可以将新的指令下发到地址翻译单元20进行解析并进行进一步的缓冲查询。
此处,由于指令处理器中,存在多个指令队列,受限于执行单元的执行效率,指令队列中会存在多条指令,因此在通过地址翻译单元进行地址翻译,物理内存查询的延时会被指令下发的延时“隐藏”,减少TLB miss(在缓存单元中不存在虚拟地址和物理地 址映射关系)导致的高时延问题。
地址转换子单元在页表查询得到物理地址后,在TLB中增加指令中该虚拟地址的数值和查询到的物理地址的数值之间的映射关系或者替换旧的映射关系,并将查询到的物理地址的数值送入指令转换子单元。
指令转换子单元执行如下操作,将数值为0的标志位置1,并将该标志位对应的地址位替换为物理地址的数值,完成对指令的翻译过程。
指令转换子单元完成指令翻译后,将翻译后的指令传递给指令分发单元,指令分发单元将指令分发给目标执行单元,例如可以是执行单元30#0或执行单元30#1。
本公开实施例在指令被分发至具体的执行单元之前,利用地址翻译单元将指令中携带的虚拟地址转换为物理地址。当指令被分发至达执行单元后,指令中携带的虚拟地址已经被地址翻译单元翻译为物理地址,从而不需要在每个执行单元进行地址的翻译过程,因而无需在每个执行单元中设置TLB和PTW相应的逻辑电路,从而最终减小芯片的面积;同时,由于不存在多个执行单元多路并发页表查询的情况,从而可以减小全局内存的带宽压力。
以上装置的各个单元、子单元的部分或全部可以通过集成电路的形式内嵌于某一个芯片上来实现。且它们可以单独实现,也可以集成在一起。该集成电路可以包括时序电路、逻辑运算电路、存储器等等。本公开对此不做限制。
基于同一发明构思,本公开实施例中还提供了与指令处理设备对应的指令处理方法,由于本公开实施例中的装置解决问题的原理与本公开实施例上述指令处理设备相似,因此装置的实施可以参见方法的实施,重复之处不再赘述。
参照图5所示,为本公开实施例提供的一种指令处理方法的流程图,所述指令处理方法包括以下步骤
S501:指令处理器获取第一指令,并向所述地址翻译单元传输所述第一指令。
S502:地址翻译单元接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,并向多个执行单元中的目标执行单元传输所述第二指令,其中,所述目标执行单元为多个执行单元中的至少一个执行单元。
S503:目标执行单元执行所述第二指令,得到指令执行结果。
本公开实施例通过指令处理设备获取第一指令,并向地址翻译单元传输该第一指令; 地址翻译单元在接收到第一指令后,将第一指令中的虚拟地址转换为物理地址,得到第二指令,并向多个执行单元中的目标执行单元传输所述第二指令;目标执行单元执行该第二指令,得到指令执行结果。在指令被分发至具体的执行单元之前,利用地址翻译单元将指令中携带的虚拟地址转换为物理地址;当指令被分发至达执行单元后,指令中携带的虚拟地址已经被地址翻译单元翻译为物理地址,从而不需要在每个执行单元进行地址的翻译过程,因而无需在每个执行单元中设置TLB和PTW相应的逻辑电路,从而最终减小芯片的面积;同时,由于不存在多个执行单元多路并发页表查询的情况,从而可以减小全局内存的带宽压力。
一种可能的实施方式中,还包括:所述地址翻译单元向指令分发单元传输所述第二指令;所述指令分发单元在接收到所述地址翻译单元传输的第二指令后,为所述第二指令确定多个执行单元中的所述目标执行单元,并将所述第二指令向所述目标执行单元传输。
一种可能的实施方式中,所述目标执行单元执行所述第二指令,包括:所述目标执行单元解析所述第二指令中携带的物理地址的数值,并访问所述物理地址的数值对应的内存。
一种可能的实施方式中,地址翻译单元接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,包括:所述地址翻译单元从所述第一指令中解析所述虚拟地址的数值,确定与所述虚拟地址的数值对应的物理地址的数值,基于所述物理地址的数值以及所述第一指令,生成第二指令。
一种可能的实施方式中,还包括:所述地址翻译单元根据所述虚拟地址的数值,查询缓存单元中存储的虚拟地址的数值与物理地址的数值之间的映射关系;在所述缓存单元中存在所述虚拟地址的数值对应的映射关系的情况下,从所述缓存单元中获取所述虚拟地址的数值对应的所述物理地址的数值;在所述缓存单元中不存在所述虚拟地址的数值对应的映射关系的情况下,通过页表查询从内存中获取所述虚拟地址的数值对应的所述物理地址的数值。
一种可能的实施方式中,还包括:地址翻译单元将从内存中获取的所述物理地址的数值和所述虚拟地址的数值之间的映射关系存储到所述缓存单元中。
一种可能的实施方式中,地址翻译单元将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,包括:将所述第一指令中的目标标志位由第一数值更改为第二 数值,并将所述虚拟地址的数值替换为所述物理地址的数值,得到所述第二指令;其中,所述第一数值指示指令中携带的地址为虚拟地址;所述第二数值指示指令中携带的地址为物理地址。
关于方法中的各步骤的处理流程、以及各步骤之间的交互流程的描述可以参照上述指令处理设备实施例中的相关说明,这里不再详述。
本公开实施例还提供一种计算机程序,该计算机程序被处理器执行时实现前述实施例的任意一种方法。该计算机程序产品可以具体通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品具体体现为计算机存储介质,在另一个可选实施例中,计算机程序产品具体体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。
在本公开所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,又例如,多个单元或子单元可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些通信接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个处理器可执行的非易失的计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、 磁碟或者光盘等各种可以存储程序代码的介质。
最后应说明的是:以上所述实施例,仅为本公开的具体实施方式,用以说明本公开的技术方案,而非对其限制,本公开的保护范围并不局限于此,尽管参照前述实施例对本公开进行了详细的说明,本领域的普通技术人员应当理解:任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,其依然可以对前述实施例所记载的技术方案进行修改或可轻易想到变化,或者对其中部分技术特征进行等同替换;而这些修改、变化或者替换,并不使相应技术方案的本质脱离本公开实施例技术方案的精神和范围,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应所述以权利要求的保护范围为准。

Claims (14)

  1. 一种指令处理设备,其特征在于,包括:指令处理器、地址翻译单元、以及多个执行单元;
    所述指令处理器连接所述地址翻译单元,所述地址翻译单元连接所述多个执行单元;
    所述指令处理器,用于获取第一指令,并向所述地址翻译单元传输所述第一指令;
    所述地址翻译单元,用于接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,并向所述多个执行单元中的目标执行单元传输所述第二指令,其中,所述目标执行单元为多个执行单元中的至少一个执行单元;
    所述目标执行单元,用于执行所述第二指令,得到指令执行结果。
  2. 根据权利要求1所述的指令处理设备,其特征在于,还包括:指令分发单元;
    所述地址翻译单元,通过所述指令分发单元与所述多个执行单元相连,还用于向所述指令分发单元传输所述第二指令;
    所述指令分发单元,用于在接收到所述地址翻译单元传输的所述第二指令后,为所述第二指令确定所述多个执行单元中的所述目标执行单元,并将所述第二指令向所述目标执行单元传输。
  3. 根据权利要求1或2所述的指令处理设备,其特征在于,所述目标执行单元在执行所述第二指令时,用于:
    解析所述第二指令中携带的物理地址的数值,并访问所述物理地址的数值对应的内存。
  4. 根据权利要求1-3中任一项所述的指令处理设备,其特征在于,所述地址翻译单元包括:指令解析子单元、地址转换子单元以及指令转换子单元;
    其中,所述指令解析子单元与所述指令处理器和所述地址转换子单元相连接,所述地址转换子单元还与所述指令转换子单元相连接,所述指令转换子单元还与所述多个执行单元相连接;
    所述指令解析子单元,用于从所述第一指令中解析所述虚拟地址的数值;
    所述地址转换子单元,用于确定所述指令解析子单元解析得到的所述虚拟地址的数值对应的物理地址的数值;
    所述指令转换子单元,用于基于所述地址转换子单元确定的所述物理地址的数值以及所述第一指令,生成所述第二指令。
  5. 根据权利要求1-4中任一项所述的指令处理设备,其特征在于,所述地址翻译 单元包括:缓存单元,所述缓存单元用于存储虚拟地址的数值与物理地址的数值之间的映射关系;
    所述地址翻译单元,与内存相连接,所述地址翻译单元还用于
    根据所述虚拟地址的数值,查询所述缓存单元中的所述映射关系;
    在所述缓存单元中存在所述虚拟地址的数值对应的映射关系的情况下,从所述缓存单元中获取所述虚拟地址的数值对应的所述物理地址的数值;
    在所述缓存单元中不存在所述虚拟地址的数值对应的映射关系的情况下,通过页表查询从所述内存中获取所述虚拟地址的数值对应的所述物理地址的数值。
  6. 根据权利要求5所述的指令处理设备,其特征在于,所述地址翻译单元还用于将从所述内存中获取的所述物理地址的数值和所述虚拟地址的数值之间的映射关系存储到所述缓存单元中。
  7. 根据权利要求1-6中任一项所述的指令处理设备,其特征在于,所述地址翻译单元用于:
    将所述第一指令中的目标标志位由第一数值更改为第二数值,并将所述虚拟地址的数值替换为所述物理地址的数值,得到所述第二指令;
    其中,所述第一数值指示指令中携带的地址为虚拟地址;所述第二数值指示指令中携带的地址为物理地址。
  8. 一种指令处理方法,其特征在于,包括;
    指令处理器获取第一指令,并向地址翻译单元传输所述第一指令;
    所述地址翻译单元接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,并向多个执行单元中的目标执行单元传输所述第二指令,其中,所述目标执行单元为多个执行单元中的至少一个执行单元;
    所述目标执行单元执行所述第二指令,得到指令执行结果。
  9. 根据权利要求8所述的指令处理方法,其特征在于,还包括:
    所述地址翻译单元向指令分发单元传输所述第二指令;
    所述指令分发单元在接收到所述地址翻译单元传输的所述第二指令后,为所述第二指令确定所述多个执行单元中的所述目标执行单元,并将所述第二指令向所述目标执行单元传输。
  10. 根据权利要求8或9所述的指令处理方法,其特征在于,所述目标执行单元执行所述第二指令,包括:
    所述目标执行单元解析所述第二指令中携带的物理地址的数值,并访问所述物理地 址的数值对应的内存。
  11. 根据权利要求8-10中任一项所述的指令处理方法,其特征在于,所述地址翻译单元接收所述指令处理器传输的所述第一指令,将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,包括:
    所述地址翻译单元从所述第一指令中解析所述虚拟地址的数值;
    确定与所述虚拟地址的数值对应的物理地址的数值;
    基于所述物理地址的数值以及所述第一指令,生成所述第二指令。
  12. 根据权利要求8-11中任一项所述的指令处理方法,其特征在于,还包括:
    所述地址翻译单元根据所述虚拟地址的数值,查询缓存单元中存储的虚拟地址的数值与物理地址的数值之间的映射关系;
    在所述缓存单元中存在所述虚拟地址的数值对应的映射关系的情况下,从所述缓存单元中获取所述虚拟地址的数值对应的所述物理地址的数值;
    在所述缓存单元中不存在所述虚拟地址的数值对应的映射关系的情况下,通过页表查询从内存中获取所述虚拟地址的数值对应的所述物理地址的数值。
  13. 根据权利要求12所述的指令处理方法,其特征在于,还包括:
    所述地址翻译单元将从所述内存中获取的所述物理地址的数值和所述虚拟地址的数值之间的映射关系存储到所述缓存单元中。
  14. 根据权利要求8-13中任一项所述的指令处理方法,其特征在于,所述地址翻译单元将所述第一指令中携带的虚拟地址转换为物理地址,得到第二指令,包括:
    将所述第一指令中的目标标志位由第一数值更改为第二数值,并将所述虚拟地址的数值替换为所述物理地址的数值,得到所述第二指令;
    其中,所述第一数值指示指令中携带的地址为虚拟地址;所述第二数值指示指令中携带的地址为物理地址。
PCT/CN2021/119965 2020-09-30 2021-09-23 指令处理设备以及指令处理方法 WO2022068673A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011064561.X 2020-09-30
CN202011064561.XA CN114327632A (zh) 2020-09-30 2020-09-30 指令处理设备以及指令处理方法

Publications (1)

Publication Number Publication Date
WO2022068673A1 true WO2022068673A1 (zh) 2022-04-07

Family

ID=80949594

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119965 WO2022068673A1 (zh) 2020-09-30 2021-09-23 指令处理设备以及指令处理方法

Country Status (2)

Country Link
CN (1) CN114327632A (zh)
WO (1) WO2022068673A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456862A (zh) * 2022-11-09 2022-12-09 深流微智能科技(深圳)有限公司 一种用于图像处理器的访存处理方法及设备

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827076B (zh) * 2022-06-30 2022-09-13 沐曦集成电路(上海)有限公司 一种基于地址翻译单元的地址返回方法及系统
CN117971722B (zh) * 2024-03-28 2024-07-02 北京微核芯科技有限公司 一种取数指令的执行方法及其装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739358A (zh) * 2009-12-21 2010-06-16 东南大学 利用虚存机制对片上异构存储资源动态分配的方法
CN104572313A (zh) * 2013-10-22 2015-04-29 华为技术有限公司 一种进程间的通信方法及装置
US20160140046A1 (en) * 2014-11-13 2016-05-19 Via Alliance Semiconductor Co., Ltd. System and method for performing hardware prefetch tablewalks having lowest tablewalk priority
CN105989758A (zh) * 2015-02-05 2016-10-05 龙芯中科技术有限公司 地址翻译方法和装置
CN110688330A (zh) * 2019-09-23 2020-01-14 北京航空航天大学 一种基于内存映射相邻性的虚拟内存地址翻译方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5313577A (en) * 1991-08-21 1994-05-17 Digital Equipment Corporation Translation of virtual addresses in a computer graphics system
US9672159B2 (en) * 2015-07-02 2017-06-06 Arm Limited Translation buffer unit management
GB2571539B (en) * 2018-02-28 2020-08-19 Imagination Tech Ltd Memory interface
CN113722246B (zh) * 2021-11-02 2022-02-08 超验信息科技(长沙)有限公司 处理器中物理内存保护机制的实现方法及装置
CN115456862B (zh) * 2022-11-09 2023-03-24 深流微智能科技(深圳)有限公司 一种用于图像处理器的访存处理方法及设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739358A (zh) * 2009-12-21 2010-06-16 东南大学 利用虚存机制对片上异构存储资源动态分配的方法
CN104572313A (zh) * 2013-10-22 2015-04-29 华为技术有限公司 一种进程间的通信方法及装置
US20160140046A1 (en) * 2014-11-13 2016-05-19 Via Alliance Semiconductor Co., Ltd. System and method for performing hardware prefetch tablewalks having lowest tablewalk priority
CN105989758A (zh) * 2015-02-05 2016-10-05 龙芯中科技术有限公司 地址翻译方法和装置
CN110688330A (zh) * 2019-09-23 2020-01-14 北京航空航天大学 一种基于内存映射相邻性的虚拟内存地址翻译方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456862A (zh) * 2022-11-09 2022-12-09 深流微智能科技(深圳)有限公司 一种用于图像处理器的访存处理方法及设备
CN115456862B (zh) * 2022-11-09 2023-03-24 深流微智能科技(深圳)有限公司 一种用于图像处理器的访存处理方法及设备

Also Published As

Publication number Publication date
CN114327632A (zh) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2022068673A1 (zh) 指令处理设备以及指令处理方法
US11663135B2 (en) Bias-based coherency in an interconnect fabric
US8797332B2 (en) Device discovery and topology reporting in a combined CPU/GPU architecture system
JP5815712B2 (ja) マルチプルプロセッサ計算プラットフォームにおけるプロセッサ間通信技法
US11204867B2 (en) PCIe controller with extensions to provide coherent memory mapping between accelerator memory and host memory
US8171230B2 (en) PCI express address translation services invalidation synchronization with TCE invalidation
US6813653B2 (en) Method and apparatus for implementing PCI DMA speculative prefetching in a message passing queue oriented bus system
CN102326153B (zh) 以复制写入请求用于一致性存储器拷贝的方法及设备
US20220283975A1 (en) Methods and apparatus for data descriptors for high speed data systems
KR101352721B1 (ko) 온다이 시스템 패브릭 블록의 제어 장치, 방법 및 시스템
US8904045B2 (en) Opportunistic improvement of MMIO request handling based on target reporting of space requirements
EP3163452B1 (en) Efficient virtual i/o address translation
WO2014090087A1 (en) Translation management instructions for updating address translation data structures in remote processing nodes
JP2009037610A (ja) 入出力(i/o)仮想化動作のプロセッサへのオフロード
KR101900436B1 (ko) 결합된 cpu/gpu 아키텍처 시스템에서의 디바이스의 발견 및 토폴로지 보고
US20230132931A1 (en) Hardware management of direct memory access commands
KR20220130518A (ko) PCIe 디바이스 및 그 동작 방법
CN114546896A (zh) 系统内存管理单元、读写请求处理方法、电子设备和片上系统
US20230273792A1 (en) Vector processing
US20230052808A1 (en) Hardware Interconnect With Memory Coherence
TW200304071A (en) USB host controller
US6957319B1 (en) Integrated circuit with multiple microcode ROMs
CN116745754A (zh) 一种访问远端资源的系统及方法
EP4134827B1 (en) Hardware interconnect with memory coherence
CN117493236B (zh) Fpga加速器以及加速器系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874334

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.09.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21874334

Country of ref document: EP

Kind code of ref document: A1