WO2023065748A1 - Accelerator and electronic device - Google Patents

Accelerator and electronic device

Info

Publication number
WO2023065748A1
Authority
WO
WIPO (PCT)
Prior art keywords
complement
bits
memory
accelerator
code
Prior art date
Application number
PCT/CN2022/107417
Other languages
English (en)
French (fr)
Inventor
葛建明
侯红朝
许飞翔
袁红岗
李甲
姚飞
仇小钢
Original Assignee
海飞科(南京)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海飞科(南京)信息技术有限公司 filed Critical 海飞科(南京)信息技术有限公司
Publication of WO2023065748A1 publication Critical patent/WO2023065748A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/781 On-chip cache; Off-chip memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F 15/8007 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data [SIMD] multiprocessors
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present disclosure generally relate to the field of electronics, and more specifically relate to an accelerator and an electronic device including the accelerator.
  • Parallel high-performance multi-threaded multi-core processing systems, such as graphics processing units (GPUs), process data much faster than before; they decompose complex computations into smaller tasks that multiple cores process in parallel, increasing efficiency and reducing processing time.
  • Tensor data usually represents one-dimensional or multi-dimensional array data in the computer field.
  • Image data, for example, is a typical two-dimensional tensor, which can be represented by a two-dimensional array.
  • When processing image data, different parts of the image can be processed in parallel by a multi-core processor to reduce processing time.
  • When tensor data is stored in memory, it may in some cases be stored as original code (sign-magnitude) and in other cases as complement code (two's complement); both situations occur. However, when the GPU's processing engine processes the data, original-code data is required for correct processing. Conventional solutions use additional instructions to convert complement-code data into original-code data for correct operation of the processing engine. The extra instructions tend to reduce GPU processing efficiency and lengthen processing time.
  • Embodiments of the present disclosure provide an accelerator and an electronic device capable of converting between original code and complement code.
  • In a first aspect, an accelerator includes a processing engine unit, a memory, and a complement converter.
  • The complement converter is coupled in the data transmission path between the processing engine unit and the memory and is configured to: convert a first complement code from the memory into a first original code and transmit the first original code to the processing engine unit; and convert a second original code from the processing engine unit into a second complement code and transmit the second complement code to the memory.
  • Although memory accesses take many different forms of read and write instructions, all of these instructions can be automatically format-converted by defining data segment attributes.
  • Segment attributes are storage settings and are independent of the kernel, so the same kernel executes correctly for both storage formats.
  • In other words, the kernel can perform memory accesses without any modification, regardless of whether the data format is original code or complement code.
  • When a program starts, the stream processor command for a data segment can dynamically declare the segment's attributes according to the storage format of its input and output.
  • the accelerator further includes a bypass circuit.
  • The bypass circuit is coupled to the complement converter and is configured to selectively bypass the complement converter, based on segment attribute data, so as to couple the memory directly with the processing engine unit.
  • The complement converter includes a first complement conversion circuit and a second complement conversion circuit.
  • The first complement conversion circuit is configured to selectively convert the values of a first plurality of remaining bits of the first complement code based on the value of the first leading bit of the first complement code.
  • The second complement conversion circuit is configured to selectively convert the values of a second plurality of remaining bits of the second original code based on the value of the second leading bit of the second original code.
  • the first complement conversion circuit includes a first plurality of inverters, a first adder, and a first multiplexer.
  • the first plurality of inverters is configured to respectively invert the first plurality of remaining bits to generate a first plurality of inverted bits of the first plurality of remaining bits.
  • the first adder is configured to add 1 to the first plurality of inverted bits to generate the first plurality of converted bits.
  • The first multiplexer includes a first input terminal, a second input terminal, and a first control terminal. The first input terminal is configured to receive the first plurality of remaining bits, and the second input terminal is configured to receive the first plurality of converted bits.
  • The first control terminal is configured to receive the first leading bit.
  • The first multiplexer is configured to selectively output the first plurality of remaining bits or the first plurality of converted bits based on the value of the first leading bit.
  • the second complement conversion circuit includes a second plurality of inverters, a second adder, and a second multiplexer.
  • the second plurality of inverters is configured to respectively invert the second plurality of remaining bits to generate a second plurality of inverted bits of the second plurality of remaining bits.
  • the second adder is configured to add 1 to the second plurality of inverted bits to generate the second plurality of converted bits.
  • The second multiplexer includes a third input terminal, a fourth input terminal, and a second control terminal. The third input terminal is configured to receive the second plurality of remaining bits, and the fourth input terminal is configured to receive the second plurality of converted bits.
  • The second control terminal is configured to receive the second leading bit.
  • The second multiplexer is configured to selectively output the second plurality of remaining bits or the second plurality of converted bits based on the value of the second leading bit.
  • the bypass circuit includes a third multiplexer.
  • the third multiplexer includes a fifth input terminal, a sixth input terminal and a third control terminal.
  • the fifth input terminal is configured to receive the first complement code or the first original code.
  • The sixth input terminal is configured to receive the converted original code or complement code.
  • the third control terminal is configured to receive the bypass enable signal in the segment attribute data.
  • the third multiplexer is configured to selectively bypass the complement converter based on the bypass enable signal.
  • the accelerator includes a graphics processor, and the memory includes a first-level cache or a second-level cache.
  • the accelerator further includes a stream processor.
  • the stream processor is configured to transmit at least a portion of the segment attribute data to the bypass circuit.
  • the processing engine unit is further configured to receive multi-dimensional tensor data from the memory.
  • In a second aspect of the present disclosure, an electronic device includes a power supply unit and the accelerator according to the first aspect.
  • the accelerator is powered by a power supply unit.
  • programmers can convert between original code and complement code without using additional instructions, which improves the execution speed and efficiency of the program.
  • Figure 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • Fig. 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure
  • Fig. 3 shows a schematic block diagram of a three-dimensional tensor according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of page allocation of image data according to an embodiment of the present disclosure.
  • Fig. 5 shows a schematic diagram of a complement conversion subsystem according to an embodiment of the present disclosure.
  • The term “comprise” and its variants denote open-ended inclusion, i.e., “including but not limited to”.
  • the term “or” means “and/or” unless otherwise stated.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one further embodiment”.
  • the terms “first”, “second”, etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.
  • The conversion between original code and complement code can be realized in hardware, without using additional instructions for the conversion.
  • Instruction-based conversion usually requires multiple clock cycles.
  • Hardware conversion needs no extra instruction cycles: the data is converted directly during signal transmission, which greatly reduces the time required for conversion and thereby reduces the program's computational overhead and processing time.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer.
  • The example environment 100 includes, for example, a central processing unit (CPU) 20, a system memory 10, a north bridge/memory bridge 30, an accelerator subsystem 40, a device memory 50, and a south bridge/input-output (IO) bridge 60.
  • System memory 10 may be, for example, a volatile memory such as dynamic random access memory (DRAM).
  • the north bridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 20 and the high-speed interface and bridging the CPU 20 and the south bridge/IO bridge 60.
  • The south bridge/IO bridge 60 is used for the computer's low-speed interfaces, such as a Serial Advanced Technology Attachment (SATA) controller.
  • the accelerator subsystem 40 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video.
  • Device memory 50 may be, for example, a volatile memory such as DRAM that is external to accelerator subsystem 40 .
  • device memory 50 is also referred to as off-chip memory, ie, memory located outside the chip of accelerator subsystem 40 .
  • The chip of the accelerator subsystem 40 also has on-chip volatile memory, such as a level-1 (L1) cache and an optional level-2 (L2) cache.
  • Although FIG. 1 shows an example environment 100 in which embodiments of the disclosure can be implemented, the disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in application environments that have an accelerator subsystem such as a GPU, for example the ARM and RISC-V architectures.
  • The example environment 100 may also include other components or devices not shown, such as a power supply unit for powering the accelerator subsystem 40.
  • The present disclosure is not limited in this regard.
  • FIG. 2 shows a schematic block diagram of an accelerator subsystem 200 according to one embodiment of the present disclosure.
  • the accelerator subsystem 200 may be, for example, a specific implementation of the chip of the accelerator subsystem 40 in FIG. 1 .
  • the accelerator subsystem 200 is, for example, an accelerator subsystem chip such as a GPU.
  • The accelerator subsystem 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, a complement converter 270, an L1 cache 260, and an L2 cache 250.
  • the accelerator subsystem 200 is controlled by a host device such as the CPU 20, and receives instructions from the CPU 20.
  • the SP 210 analyzes instructions from the CPU 20, and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 is used to manage the on-chip virtual storage of the accelerator subsystem 200 .
  • L2 cache 250 and off-chip memory such as device memory 50 in FIG. 1 constitute a virtual memory system.
  • the page table device 220 is jointly maintained by the SP 210, the PE unit 230 and the DMA controller 240.
  • the PE unit 230 includes a plurality of processing engines (processing engine, PE) PE_1, PE_2...PE_N, where N represents an integer greater than 1.
  • Each PE in PE unit 230 may be a single instruction multiple thread (SIMT) device.
  • each thread can have its own register file (register file), and all threads of each PE also share a unified register file (uniform register file).
  • Multiple PEs can perform the same or different processing tasks in parallel, and can perform in parallel the address conversion described below and the access to target data in memory, thereby reducing processing time. It can be understood that the target elements processed by the multiple PEs are not the same, and that the segment, page, and cache line where a target element resides, as well as the element's attributes, size, and dimension ordering, may differ, as described in detail below.
  • Each thread can perform thread-level data exchange between its own register file and the memory subsystem.
  • Each thread has its own arithmetic-logic execution unit and uses its own memory address, adopting a typical load-store architecture.
  • Each execution unit includes a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit.
  • The accelerator subsystem 200 of FIG. 2 may, for example, perform the following operations: 1) build page table entry contents and initial state; 2) move data from off-chip memory, such as the device memory 50 in FIG. 1, to on-chip memory, such as the L2 cache 250; 3) start and execute a program; 4) define each segment and describe the attributes of the tensor and its storage; 5) when program execution is complete, write the data of the execution result to off-chip memory.
  • The storage attributes may include segment attributes, dimension attributes, and page attributes, among others.
  • Segment attributes include a bypass enable signal that indicates whether a complement converter is used to perform conversions between original and complement code.
  • the segment attributes may also include status flags used by pages in the segment, element size, element data encoding type and conversion flags, replacement rules for cache lines in the segment, and so on.
  • Dimension attributes can be used to set the attributes of each dimension independently, including information such as long mode, streaming mode, the sign attribute of addresses, and the bit width of reverse interleaved addressing within a cache line.
  • Long mode indicates that the tensor's size in one dimension is significantly larger than its size in the other dimensions.
  • Streaming mode means that computation on infinitely long tensor data can be supported without stopping the kernel.
  • The sign attribute of an address indicates that the coordinate offset relative to a reference point can be positive or negative; in other words, the offset in the same dimension can be positive or negative.
  • the properties of the page include page ID, physical base address, status field and dimension information, etc.
  • the page identifier is used to index the corresponding page table entry.
  • The physical base address describes the page's physical start address in on-chip memory, such as the L2 cache, or in off-chip memory.
  • the status field indicates whether the page is occupied or available.
  • Dimension information mainly includes the number of dimensions and the size of each dimension, and this field can be defined by a segment. Attributes of pages may be stored within page table device 220, for example.
  • the data processed by the accelerator subsystem 200 is mainly for multi-dimensional tensors.
  • The tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the tensor may have a different size in each dimension.
  • the tensor may be a one-dimensional, two-dimensional, three-dimensional or more dimensional tensor, which is not limited in the present disclosure.
  • A tensor may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and other custom element types, which the present disclosure does not limit.
  • For tensor addressing, the element is the basic unit. For example, if the element type is int8, the basic addressing unit is one byte; if the element type is int16, the basic addressing unit is two bytes, and so on.
  • The multi-dimensional tensor data may undergo the complement-to-original conversion described below while being transmitted from the memory to the PE unit, or the original-to-complement conversion described below while being transmitted from the PE unit to the memory, without using additional instructions for the conversion.
  • Tensors may be divided into at least one segment. In the case where a tensor contains only one segment, the tensor is the segment. In the case where a tensor contains multiple segments, a segment is part of the tensor.
  • the CPU 20 can specify which PE processes each part of the segment by an instruction.
  • The complement conversion circuit 270 is located between the L1 cache 260 and the PE unit 230, so that when complement-form data in the L1 cache 260 is transmitted to the PE unit 230, the first complement circuit 271 in the complement conversion circuit 270 can convert it into original code, and when original-code data generated by the PE unit 230 is transmitted to the L1 cache 260, the second complement circuit 272 in the complement conversion circuit 270 can convert it into complement code.
  • the complement conversion circuit 270 is shown in FIG. 2 as being between the L1 cache 260 and the PE unit 230 , this is for illustration only and does not limit the scope of the present disclosure.
  • the complement conversion circuit 270 can also be located between the L1 cache 260 and the L2 cache 250, or between the DMA 240 and the L2 cache 250.
  • The accelerator subsystem 200 may further include a bypass circuit (not shown) to bypass, when necessary, the complement conversion circuit 270 in the signal transmission path between the L1 cache 260 and the PE unit 230, thereby directly connecting the L1 cache 260 and the PE unit 230. The present disclosure is not limited in this regard either.
  • FIG. 3 shows a schematic block diagram of a three-dimensional tensor 300 according to an embodiment of the present disclosure.
  • the three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3.
  • CPU 20 may specify that the tensor elements of segment S1 be processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8.
  • the CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1-PE_4.
  • each segment may have a different size, so programmers can flexibly configure segments based on design needs.
  • page division can be implemented on any one or more dimensions, and the number of pages divided on each dimension is independent of each other.
  • tensor data may be stored in on-chip high-speed memory, such as L2 cache 250 .
  • the kernel program (kernel) can be started multiple times, and each time the DMA controller 240 moves a segment of the tensor from the off-chip storage to the on-chip storage in advance for kernel operation. After starting the kernel multiple times, all the segments contained in the tensor are processed, and the entire running process ends.
  • When the on-chip high-speed memory is sufficient to accommodate all the tensors to be accessed by the kernel, a tensor needs only one segment description, and the kernel needs to be started only once.
  • At least one page may also be set to further subdivide the tensor.
  • In the first segment S1, there are 4 pages P[1], P[2], P[3], and P[4].
  • the second segment S2 has only one page.
  • the number of pages in each segment can be different, so programmers can flexibly configure the size of pages in a segment based on design needs. For example, pages are configured to fit into L2 cache 250 in their entirety.
  • a page can usually contain multiple elements.
  • the page where the target element is located is referred to as a "target element page" herein.
  • a page may include multiple cache lines.
  • It takes a PE only a few clock cycles to read data from the L1 cache 260, but it may take the L1 cache 260 dozens or even hundreds of clock cycles to read data from the L2 cache 250. Therefore, it is desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250.
  • Although "cache line" is used here to describe the minimum unit of data transferred from the L2 cache 250 to the L1 cache 260, in this disclosure this portion of data is not necessarily arranged in rows or columns; the data inside one "cache line" is distributed over multiple dimensions, and the size of the data on each dimension is not limited to 1.
  • The PEs process the data within a segment in parallel, and the allocation of PEs is carried out in the logical address space of the data, independent of the segment's physical storage structure, as described below.
  • the first group of cache lines in the first page P[1] is designated to be processed by PE_1, and the second group of cache lines is designated to be processed by PE_2.
  • Although the tensor is shown as processed by the PEs in order, the processing of tensor data is independent of PE order; for example, PE_2 in FIG. 3 indicates that that part of the tensor data may be processed by PE_M, where M represents any integer not greater than N.
  • FIG. 4 shows a schematic diagram of page allocation of image data 400 according to an embodiment of the present disclosure.
  • Image data is typically a two-dimensional tensor.
  • the image data 400 is, for example, 8*8 pixels.
  • the image data 400 has 8 pixels in the first dimension D1 and also has 8 pixels in the second dimension D2. Therefore, the image data 400 has pixels P00, P01...P77.
  • the image data 400 has only one segment, but is divided into four pages P[1], P[2], P[3] and P[4] in two dimensions.
  • The four pages can be divided along the second dimension D2 and allocated to PE_1 and PE_2 for processing, or divided along the first dimension D1 and allocated to PE_1 and PE_2 for processing. In addition, division along the diagonal is also possible. This disclosure is not limited in this regard.
  • FIG. 5 shows a schematic diagram of a complement code conversion subsystem 500 according to an embodiment of the present disclosure.
  • The complement conversion subsystem 500 may be a specific implementation of at least a part of the accelerator subsystem 200 in FIG. 2, so the aspects described above with respect to FIGS. 1-4 may selectively apply to the complement conversion subsystem 500.
  • the complement conversion subsystem 500 includes a bypass circuit 540 and a complement conversion circuit 550 .
  • The complement conversion circuit 550 may be used, for example, to convert a first complement code from a memory such as the L1 cache 260 into a first original code and transmit the first original code to the processing engine PE_1, or to convert a second original code from the processing engine unit PE_1 into a second complement code and transmit the second complement code to a memory such as the L1 cache 260.
  • The complement converter in the accelerator may include multiple complement conversion circuits.
  • For example, each processing engine can be equipped with two complement conversion circuits, to handle the conversion from original code to complement code and the conversion from complement code to original code, respectively.
  • Alternatively, only one complement conversion circuit may be provided for each processing engine, with both conversion directions realized through multiplexing.
  • In another embodiment, only two complement conversion circuits may be provided between the PE unit 230 and the L1 cache 260 to handle the two conversion directions, respectively. This disclosure is not limited in this regard.
  • The complement conversion circuit 550 can be used to convert int8 byte data between original code and complement code. Although described here for int8 byte data, the present disclosure is not limited thereto; without departing from the principle, spirit, and scope of the present disclosure, the timing scheme of the complement conversion circuit can be modified to suit other types of data.
  • When the complement conversion circuit 550 is implemented as the first complement conversion circuit, it is configured to selectively convert the values of the first plurality of remaining bits in[6:0] of the first complement code based on the value of the first leading bit in[7] of the first complement code. For example, if in[7] is 1, the remaining bits in[6:0] are provided to the seven inverters 510, 511, 512, 513, 514, 515, and 516 to be inverted, generating the first plurality of inverted bits.
  • The first plurality of inverted bits is provided to the adder 520.
  • The adder 520 is in this case the first adder; it adds 1 to the data formed by the first plurality of inverted bits to generate the 7-bit first plurality of converted bits.
  • The first plurality of converted bits is provided to the multiplexer 530.
  • The multiplexer 530 is in this case the first multiplexer. Since its first control terminal receives in[7] = 1 as the control input, the multiplexer 530 outputs the 7 output bits out[6:0], which in this case are the first plurality of converted bits.
  • The 7 output bits out[6:0] are combined with the leading bit in[7] to generate the first original code; that is, in[7] remains the leading bit of the original code, and the converted bits out[6:0] form the remaining 7 bits.
  • If the value of the first leading bit in[7] is 0, the first plurality of remaining bits in[6:0] is provided directly to the multiplexer 530. Since the first control terminal receives in[7] = 0 as the control input, the multiplexer 530 outputs the 7 output bits out[6:0], which in this case are simply the remaining bits in[6:0]. These are combined with the leading bit in[7] to generate the first original code; in[7] remains the leading bit, and out[6:0] forms the remaining 7 bits.
  • When the complement conversion circuit 550 is implemented as the second complement conversion circuit, it is configured to selectively convert the values of the second plurality of remaining bits of the second original code based on the value of the second leading bit of the second original code.
  • the operation of the second complement conversion circuit is basically the same as that of the first complement conversion circuit, and will not be repeated here.
  • the complement conversion subsystem 500 also includes a bypass circuit 540 .
  • the bypass circuit 540 is coupled to the complement conversion circuit 550 and is configured to selectively bypass the complement conversion circuit 550 to directly couple the memory with the processing engine unit based on the segment attribute data.
  • the bypass circuit may include a plurality of sub-bypass circuits to respectively bypass a plurality of complementary code conversion circuits.
  • the bypass circuit can also bypass the complementary code converter as a whole.
  • bypass circuit 540 may be a third multiplexer.
  • the bypass circuit 540 includes a fifth input terminal, a sixth input terminal and a third control terminal.
  • the fifth input terminal is configured to receive the first complement code or the first original code.
  • The sixth input terminal is configured to receive the converted original code or complement code.
  • the third control terminal is configured to receive the bypass enable signal Bypass_En in the segment attribute data.
  • the third multiplexer is configured to selectively bypass the complementary code conversion circuit 550 based on the bypass enable signal Bypass_En.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Advance Control (AREA)
  • Power Sources (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Described herein are an accelerator and an electronic device. The accelerator includes a processing engine unit, a memory, and a complement converter. The complement converter is coupled in the data transmission path between the processing engine unit and the memory, and the complement converter is configured to: convert a first complement code from the memory into a first original code and transmit the first original code to the processing engine unit; and convert a second original code from the processing engine unit into a second complement code and transmit the second complement code to the memory. By placing a hardware complement converter in the data transmission path between the processing engine unit and the memory, no additional instructions are needed to convert between original code and complement code, which improves program execution speed and efficiency.

Description

Accelerator and electronic device
This application claims priority to Chinese patent application No. 202111214262.4, entitled "Accelerator and electronic device" and filed with the Chinese Patent Office on October 19, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate generally to the field of electronics, and more specifically to an accelerator and an electronic device including the accelerator.
Background
Parallel high-performance multi-threaded multi-core processing systems, such as graphics processing units (GPUs), process data much faster than before. These processing systems can decompose complex computations into smaller tasks and process them in parallel on multiple cores, increasing processing efficiency and reducing processing time.
In some situations, multi-core processors such as GPUs are particularly advantageous for processing tensors containing large amounts of data of identical or similar form. In the computer field, tensor data generally denotes data in a one-dimensional or multi-dimensional array; image data, for example, is a typical two-dimensional tensor that can be represented by a two-dimensional array. When image data is processed, different parts of it can be processed in parallel by a multi-core processor to reduce processing time.
When tensor data is stored in memory, it may in some cases be stored as original code (sign-magnitude) and in other cases as complement code (two's complement); both situations occur. However, when a GPU's processing engine processes the data, original-code data is required for correct processing. Conventional solutions use additional instructions to convert complement-code data into original-code data so that the processing engine operates correctly. The extra instructions tend to reduce the GPU's processing efficiency and lengthen processing time.
Summary
Embodiments of the present disclosure provide an accelerator and an electronic device capable of converting between original code and complement code.
According to a first aspect of the present disclosure, an accelerator is provided. The accelerator includes a processing engine unit, a memory, and a complement converter. The complement converter is coupled in the data transmission path between the processing engine unit and the memory and is configured to: convert a first complement code from the memory into a first original code and transmit the first original code to the processing engine unit; and convert a second original code from the processing engine unit into a second complement code and transmit the second complement code to the memory. By placing a hardware complement converter in the data transmission path between the processing engine unit and the memory, no additional instructions are needed to convert between original code and complement code, which improves program execution speed and efficiency. Moreover, although memory accesses take many different forms of read and write instructions, all of these instructions can be automatically format-converted by defining data segment attributes. Segment attributes are storage settings and are independent of the kernel, so the same kernel executes correctly for both storage formats. In other words, regardless of whether the data format is original code or complement code, the kernel can perform memory accesses without any modification and without any impact. When a program starts, the stream processor command for a data segment can dynamically declare the segment's attributes according to the storage format of its input and output.
In one possible implementation of the first aspect, the accelerator further includes a bypass circuit. The bypass circuit is coupled to the complement converter and is configured to selectively bypass the complement converter based on segment attribute data, so as to couple the memory directly with the processing engine unit. By using a bypass circuit, the accelerator can not only perform automatic format conversion but also remain compatible with conventional programs that perform format conversion using instructions.
In one possible implementation of the first aspect, the complement converter includes a first complement conversion circuit and a second complement conversion circuit. The first complement conversion circuit is configured to selectively convert the values of a first plurality of remaining bits of the first complement code based on the value of a first leading bit of the first complement code. The second complement conversion circuit is configured to selectively convert the values of a second plurality of remaining bits of the second original code based on the value of a second leading bit of the second original code.
In one possible implementation of the first aspect, the first complement conversion circuit includes a first plurality of inverters, a first adder, and a first multiplexer. The first plurality of inverters is configured to invert each of the first plurality of remaining bits to generate a first plurality of inverted bits. The first adder is configured to add 1 to the first plurality of inverted bits to generate a first plurality of converted bits. The first multiplexer includes a first input terminal, a second input terminal, and a first control terminal. The first input terminal is configured to receive the first plurality of remaining bits. The second input terminal is configured to receive the first plurality of converted bits. The first control terminal is configured to receive the first leading bit, and the first multiplexer is configured to selectively output the first plurality of remaining bits or the first plurality of converted bits based on the value of the first leading bit. By using a plurality of inverters, an adder, and a multiplexer, the complement converter can be implemented with a simple circuit structure, reducing cost and simplifying design.
In one possible implementation of the first aspect, the second complement conversion circuit includes a second plurality of inverters, a second adder, and a second multiplexer. The second plurality of inverters is configured to invert each of the second plurality of remaining bits to generate a second plurality of inverted bits. The second adder is configured to add 1 to the second plurality of inverted bits to generate a second plurality of converted bits. The second multiplexer includes a third input terminal, a fourth input terminal, and a second control terminal. The third input terminal is configured to receive the second plurality of remaining bits. The fourth input terminal is configured to receive the second plurality of converted bits. The second control terminal is configured to receive the second leading bit, and the second multiplexer is configured to selectively output the second plurality of remaining bits or the second plurality of converted bits based on the value of the second leading bit. By using a plurality of inverters, an adder, and a multiplexer, the complement converter can be implemented with a simple circuit structure, reducing cost and simplifying design.
In one possible implementation of the first aspect, the bypass circuit includes a third multiplexer. The third multiplexer includes a fifth input terminal, a sixth input terminal, and a third control terminal. The fifth input terminal is configured to receive the first complement code or the first original code. The sixth input terminal is configured to receive the converted original code or complement code. The third control terminal is configured to receive a bypass enable signal located in the segment attribute data. The third multiplexer is configured to selectively bypass the complement converter based on the bypass enable signal. By using a multiplexer, the bypass circuit can be implemented with a simple circuit structure, reducing cost and simplifying design.
In one possible implementation of the first aspect, the accelerator includes a graphics processor, and the memory includes a level-1 cache or a level-2 cache.
In one possible implementation of the first aspect, the accelerator further includes a stream processor. The stream processor is configured to transmit at least a portion of the segment attribute data to the bypass circuit.
In one possible implementation of the first aspect, the processing engine unit is further configured to receive multi-dimensional tensor data from the memory.
According to a second aspect of the present disclosure, an electronic device is provided. The electronic device includes a power supply unit and the accelerator according to the first aspect, the accelerator being powered by the power supply unit. By placing a hardware complement converter in the data transmission path between the processing engine unit and the memory, no additional instructions are needed to convert between original code and complement code, which improves program execution speed and efficiency. Moreover, although memory accesses take many different forms of read and write instructions, all of these instructions can be automatically format-converted by defining data segment attributes. Segment attributes are storage settings and are independent of the kernel, so the same kernel executes correctly for both storage formats. In other words, regardless of whether the data format is original code or complement code, the kernel can perform memory accesses without any modification and without any impact. When a program starts, the stream processor command for a data segment can dynamically declare the segment's attributes according to the storage format of its input and output.
With the method and electronic device according to embodiments of the present disclosure, programmers can convert between original code and complement code without using additional instructions, which improves program execution speed and efficiency.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure, taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented;
Fig. 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure;
Fig. 3 shows a schematic block diagram of a three-dimensional tensor according to an embodiment of the present disclosure;
Fig. 4 shows a schematic diagram of page allocation of image data according to an embodiment of the present disclosure; and
Fig. 5 shows a schematic diagram of a complement conversion subsystem according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.
As used herein, the term "comprise" and its variants denote open-ended inclusion, i.e., "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other definitions, explicit and implicit, may also be included below.
As mentioned above, conventional solutions use additional instructions to convert complement-code data into original-code data so that the processing engine operates correctly. The extra instructions tend to reduce the GPU's processing efficiency and lengthen processing time.
In some embodiments of the present disclosure, by placing a complement conversion circuit in the path between the memory and the processing engines in the accelerator, the conversion between original code and complement code can be implemented in hardware, without additional conversion instructions. Compared with instruction-based conversion, which usually takes multiple clock cycles, hardware conversion requires no extra instruction cycles: the data is converted directly during signal transmission. This greatly reduces the time required for conversion, thereby reducing the program's computational overhead and processing time.
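For concreteness, the following small C example (illustrative only, not part of the patent) shows the two int8 encodings at issue, 原码 (original code, i.e. sign-magnitude) and 补码 (complement code, i.e. two's complement), using -42 as a worked value:

```c
#include <stdio.h>
#include <stdint.h>

/* Worked example: the same value -42 in the two int8 storage formats
 * the disclosure distinguishes. */
int main(void) {
    uint8_t complement_code = 0xD6; /* -42 as complement code (two's complement):
                                       invert 0101010 and add 1 -> 1101 0110 */
    uint8_t original_code   = 0xAA; /* -42 as original code (sign-magnitude):
                                       sign bit 1, magnitude 0101010 = 42 */
    printf("complement code: 0x%02X\n", complement_code);
    printf("original code:   0x%02X\n", original_code);
    return 0;
}
```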
Fig. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. The example environment 100 may be, for example, an electronic device with computing capability, such as a computer. In one embodiment, the example environment 100 includes, for example, a central processing unit (CPU) 20, a system memory 10, a north bridge/memory bridge 30, an accelerator subsystem 40, a device memory 50, and a south bridge/input-output (IO) bridge 60. The system memory 10 may be, for example, a volatile memory such as a dynamic random access memory (DRAM). The north bridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, and the like; it is responsible for data exchange between the CPU 20 and the high-speed interfaces and bridges the CPU 20 and the south bridge/IO bridge 60. The south bridge/IO bridge 60 is used for the computer's low-speed interfaces, such as a Serial Advanced Technology Attachment (SATA) controller. The accelerator subsystem 40 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video. The device memory 50 may be, for example, a volatile memory such as DRAM located outside the accelerator subsystem 40. In the present disclosure, the device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of the accelerator subsystem 40. Correspondingly, the chip of the accelerator subsystem 40 also has on-chip volatile memory, such as a level-1 (L1) cache and an optional level-2 (L2) cache, which will be described in detail below in connection with some embodiments of the present disclosure. Although Fig. 1 shows one example environment 100 in which embodiments of the present disclosure can be implemented, the present disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in application environments that have an accelerator subsystem such as a GPU, for example the ARM and RISC-V architectures. The example environment 100 may also include other components or devices not shown, such as a power supply unit for powering the accelerator subsystem 40. The present disclosure is not limited in this regard.
Fig. 2 shows a schematic block diagram of an accelerator subsystem 200 according to an embodiment of the present disclosure. The accelerator subsystem 200 may be, for example, one specific implementation of the chip of the accelerator subsystem 40 in Fig. 1. The accelerator subsystem 200 is, for example, an accelerator subsystem chip such as a GPU. In one embodiment, the accelerator subsystem 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, a complement converter 270, an L1 cache 260, and an L2 cache 250.
The accelerator subsystem 200 is controlled by a host device such as the CPU 20 and receives instructions from the CPU 20. The SP 210 analyzes the instructions from the CPU 20 and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing. The page table device 220 is used to manage the on-chip virtual storage of the accelerator subsystem 200. In the present disclosure, the L2 cache 250 and off-chip memory such as the device memory 50 in Fig. 1 constitute a virtual memory system. The page table device 220 is jointly maintained by the SP 210, the PE unit 230, and the DMA controller 240.
The PE unit 230 includes a plurality of processing engines (PEs) PE_1, PE_2, ..., PE_N, where N represents an integer greater than 1. Each PE in the PE unit 230 may be a single-instruction multiple-thread (SIMT) device. In a PE, each thread can have its own register file, and all threads of each PE also share a uniform register file. Multiple PEs can perform the same or different processing work in parallel, and can perform in parallel the address conversion described below and the access to target data in memory, thereby reducing processing time. It can be understood that the target elements processed by the multiple PEs are not the same, and that the segment, page, and cache line where a target element resides, as well as the element's attributes, size, and dimension ordering, may differ, as described in detail below.
Each thread can perform thread-level data exchange between its own register file and the memory subsystem. Each thread has its own arithmetic-logic execution unit and uses its own memory address, adopting a typical load-store architecture. Each execution unit includes a floating-point/fixed-point unit supporting multiple data types and an arithmetic-logic unit.
Most instructions perform arithmetic and logic operations, for example addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, NOT, and the like. Operands come from registers. Memory read/write instructions can provide data exchange between registers and on-chip/off-chip memory. In general, all execution units in a PE can execute the same instruction synchronously. By using predicate registers, some of the execution units can be masked off, thereby implementing the function of branch instructions.
In one embodiment, the accelerator subsystem 200 of Fig. 2 may, for example, perform the following operations: 1) build page table entry contents and initial state; 2) move data from off-chip memory, such as the device memory 50 in Fig. 1, to on-chip memory, such as the L2 cache 250; 3) start and execute a program; 4) define each segment and describe the attributes of the tensor and its storage; 5) when program execution is complete, write the data of the execution result to off-chip memory. The storage attributes may include segment attributes, dimension attributes, and page attributes, among others.
The segment attributes include a bypass enable signal indicating whether the complement converter is used to perform conversion between original code and complement code. In addition, the segment attributes may also include status flags for the pages used in the segment, element size, element data encoding type and conversion flags, replacement rules for the cache lines within the segment, and so on. The dimension attributes can be used to set the attributes of each dimension independently, including information such as long mode, streaming mode, the sign attribute of addresses, and the bit width of reverse interleaved addressing within a cache line. Long mode indicates that a tensor's size in one dimension is significantly larger than its size in the other dimensions. Streaming mode means that computation on infinitely long tensor data can be supported without stopping the kernel. The sign attribute of an address indicates that the coordinate offset relative to a reference point can be positive or negative; in other words, the offset in the same dimension can be in the positive or the negative direction. The page attributes include a page identifier, a physical base address, a status field, dimension information, and so on. The page identifier is used to index the corresponding page table entry. The physical base address describes the page's physical start address in on-chip memory, such as the L2 cache, or in off-chip memory. The status field indicates whether the page is occupied or available. The dimension information mainly includes the number of dimensions and the size of each dimension, and this field can be defined by the segment. The page attributes may be stored in the page table device 220, for example.
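As an illustration only, the segment attributes named above could be modeled in C as follows; the field names, widths, and encodings are assumptions made for exposition, not the patent's actual layout:

```c
#include <stdint.h>

/* Hypothetical encoding of the segment attributes described above.
 * Field names and widths are illustrative assumptions only. */
typedef struct {
    uint8_t bypass_en;         /* 1: bypass the complement converter; 0: convert */
    uint8_t element_size;      /* element size in bytes, e.g. 1 for int8 */
    uint8_t encoding_type;     /* e.g. 0 = original code, 1 = complement code */
    uint8_t convert_flag;      /* whether format conversion applies to this segment */
    uint8_t cacheline_replace; /* replacement rule for cache lines in the segment */
} segment_attr_t;
```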
It can be understood that in the disclosed embodiments, the data processed by the accelerator subsystem 200 is mainly multi-dimensional tensors. For example, in one embodiment, a tensor may be a four-dimensional tensor with four dimensions D1, D2, D3, and D4, and the tensor may have a different size in each dimension. In other embodiments, a tensor may be a one-dimensional, two-dimensional, three-dimensional, or higher-dimensional tensor, which the present disclosure does not limit. Furthermore, in embodiments of the present disclosure, a tensor may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and other custom element types, which the present disclosure also does not limit. For tensor addressing, the element is the basic unit. For example, if the element type is int8, the element is one byte; if the element type is int16, the basic addressing unit is two bytes, and so on. In the following, int8 is used as the reference for the description, but it can be understood that the present disclosure is not limited thereto; other data element types are equally applicable. In other words, in the present disclosure, multi-dimensional tensor data may undergo the complement-to-original conversion described below while being transmitted from the memory to the PE unit, or the original-to-complement conversion described below while being transmitted from the PE unit to the memory, without using additional instructions for the conversion.
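The addressing rule just stated can be summarized in a short sketch; the type-to-size mapping is the standard one for the types listed, while the function itself is illustrative, not an API from the patent:

```c
#include <stddef.h>
#include <string.h>

/* Basic addressing unit (bytes per element) for the element types listed
 * above. Addressing is element-based: byte offset = element index * size. */
static size_t element_size(const char *type) {
    if (!strcmp(type, "uint8") || !strcmp(type, "int8")) return 1;
    if (!strcmp(type, "uint16") || !strcmp(type, "int16") ||
        !strcmp(type, "bfloat16") || !strcmp(type, "float16")) return 2;
    if (!strcmp(type, "uint32") || !strcmp(type, "int32") ||
        !strcmp(type, "float32")) return 4;
    return 0; /* custom types are configured separately */
}
```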
In some cases, the amount of data contained in a tensor may be large, while the capacity of the L2 cache 250 is limited, so the tensor as a whole cannot be loaded into the on-chip L2 cache 250. In some embodiments of the present disclosure, to facilitate parallel processing of a tensor, the tensor may be divided into at least one segment. In the case where the tensor contains only one segment, the tensor is the segment. In the case where the tensor contains multiple segments, a segment is part of the tensor. The CPU 20 can specify, by an instruction, which PE processes each part of a segment.
In one embodiment, the complement conversion circuit 270 is located between the L1 cache 260 and the PE unit 230, so that when complement-form data in the L1 cache 260 is transmitted to the PE unit 230, the first complement circuit 271 in the complement conversion circuit 270 can convert it into original code, and when original-code data generated by the PE unit 230 is transmitted to the L1 cache 260, the second complement circuit 272 in the complement conversion circuit 270 can convert it into complement code. Although the complement conversion circuit 270 is shown in Fig. 2 between the L1 cache 260 and the PE unit 230, this is illustrative only and does not limit the scope of the present disclosure. The complement conversion circuit 270 may also be located between the L1 cache 260 and the L2 cache 250, or between the DMA controller 240 and the L2 cache 250. In addition, in some embodiments, the accelerator subsystem 200 may further include a bypass circuit (not shown) to bypass, when necessary, the complement conversion circuit 270 in the signal transmission path between the L1 cache 260 and the PE unit 230, thereby directly connecting the L1 cache 260 and the PE unit 230. The present disclosure is not limited in this regard either.
Fig. 3 shows a schematic block diagram of a three-dimensional tensor 300 according to an embodiment of the present disclosure. The three-dimensional tensor 300 has three dimensions D1, D2, and D3 and includes a first segment S1, a second segment S2, and a third segment S3. The CPU 20 may specify that the tensor elements of segment S1 be processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. In addition, the CPU 20 also specifies that the tensor elements of the second segment S2 be processed by PE_1 through PE_4. In embodiments of the present disclosure, each segment may have a different size, so programmers can flexibly configure segments based on design needs. In fact, the division into pages can be carried out on any one or more dimensions, and the numbers of pages divided on different dimensions are independent of one another.
In one embodiment, tensor data may be stored in on-chip high-speed memory, such as the L2 cache 250. However, because the capacity of on-chip high-speed memory is small, when the tensor is large the programmer may divide the tensor into multiple segments, each describing a part of the tensor. The kernel can be launched multiple times; each time, the DMA controller 240 moves one segment of the tensor from off-chip storage to on-chip storage in advance for the kernel to use. After the kernel has been launched multiple times, all segments contained in the tensor have been processed, and the whole run ends. When the on-chip high-speed memory is sufficient to hold all the tensors the kernel will access, a tensor needs only one segment description, and the kernel needs to be launched only once.
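A minimal host-side sketch of the launch flow just described follows; `dma_move_segment` and `launch_kernel` are hypothetical placeholder functions, not an API from the patent:

```c
/* Hypothetical host-side flow: for each segment, the DMA controller 240
 * moves it on-chip in advance, then the kernel is launched to process it. */
extern void dma_move_segment(int segment_id); /* placeholder */
extern void launch_kernel(int segment_id);    /* placeholder */

void run_tensor(int num_segments) {
    for (int s = 0; s < num_segments; s++) {
        dma_move_segment(s); /* off-chip storage -> on-chip storage */
        launch_kernel(s);    /* kernel operates on the resident segment */
    }
    /* If the whole tensor fits on chip, num_segments == 1 and the kernel
     * is launched only once. */
}
```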
Further, in some embodiments of the present disclosure, at least one page may also be set within a segment to further subdivide the tensor. For example, in the first segment S1 there are four pages P[1], P[2], P[3], and P[4], while the second segment S2 has only one page. In embodiments of the present disclosure, the number of pages in each segment may differ, so programmers can flexibly configure the size of the pages within a segment based on design needs, for example configuring pages so that each fits into the L2 cache 250 in its entirety.
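For illustration, assuming a uniform grid split (the disclosure allows the page counts on different dimensions to be chosen independently), the number of pages in a segment follows by ceiling division per dimension:

```c
/* Illustrative page count for a segment of seg[d] elements per dimension,
 * split into pages of page[d] elements per dimension (uniform grid assumed). */
static int page_count(const int *seg, const int *page, int ndims) {
    int total = 1;
    for (int d = 0; d < ndims; d++)
        total *= (seg[d] + page[d] - 1) / page[d]; /* ceiling division */
    return total;
}
/* Example matching Fig. 4: an 8x8 segment with 4x4 pages -> 2*2 = 4 pages. */
```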
As mentioned above, when addressing a tensor, the smallest addressing unit is the element. A page can usually contain multiple elements. The page where the target element is located is referred to herein as the "target element page". In some embodiments of the present disclosure, a page may include multiple cache lines. While a target element page is located in the L2 cache 250, if a PE reads the target element via the L1 cache 260, the L2 cache 250 needs to transmit to the L1 cache 260, as a whole, a small physically contiguous portion of the data in the L2 cache 250 that includes the target element. This small portion of data is also called cache line data, and this caching mechanism is based on the principle of spatial locality. It takes a PE only a few clock cycles to read data from the L1 cache 260, whereas it may take the L1 cache 260 dozens or even hundreds of clock cycles to read data from the L2 cache 250. It is therefore desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250. Although "cache line" is used here to describe the minimum unit of data transferred from the L2 cache 250 to the L1 cache 260, in the present disclosure this portion of data is not necessarily arranged in rows or columns: the data inside one "cache line" is distributed over multiple dimensions, and the size of the data on each dimension is not limited to 1. The PEs process the data within a segment in parallel, and the allocation of PEs is carried out in the logical address space of the data, independent of the segment's physical storage structure, as described below.
In Fig. 3, the first group of cache lines in the first page P[1] is designated to be processed by PE_1, and the second group of cache lines is designated to be processed by PE_2. Although the tensor is shown here as processed sequentially by the PEs in order, it can be understood that the processing of tensor data is independent of the PE order, which the present disclosure does not limit. For example, PE_2 in Fig. 3 indicates that that part of the tensor data may be processed by PE_M, where M represents any integer not greater than N.
Fig. 4 shows a schematic diagram of page allocation of image data 400 according to an embodiment of the present disclosure. Image data is a typical two-dimensional tensor. In one embodiment, the image data 400 is, for example, 8*8 pixels. In other words, the image data 400 has 8 pixels in the first dimension D1 and 8 pixels in the second dimension D2, and thus has pixels P00, P01, ..., P77. In the embodiment of Fig. 4, the image data 400 has only one segment but is divided along two dimensions into four pages P[1], P[2], P[3], and P[4]. The four pages may be divided along the second dimension D2 and allocated to PE_1 and PE_2 for processing, or divided along the first dimension D1 and allocated to PE_1 and PE_2 for processing. In addition, division along the diagonal is also possible. The present disclosure is not limited in this regard.
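As a worked example for Fig. 4, assuming the four pages form a 2x2 grid of 4x4-pixel tiles (one possible split; the division is configurable), a pixel coordinate maps to a page as follows:

```c
/* Illustrative mapping for Fig. 4: 8x8 pixels, four 4x4 pages P[1]..P[4]
 * laid out as a 2x2 grid (an assumption; the split is configurable). */
static int page_of_pixel(int d1, int d2) {
    int row = d1 / 4;         /* 0 or 1 along D1 */
    int col = d2 / 4;         /* 0 or 1 along D2 */
    return 1 + row * 2 + col; /* P[1], P[2], P[3], or P[4] */
}
/* Example: pixel P00 -> P[1]; pixel P77 -> P[4]. */
```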
Fig. 5 shows a schematic diagram of a complement conversion subsystem 500 according to an embodiment of the present disclosure. In one embodiment, the complement conversion subsystem 500 may be a specific implementation of at least part of the accelerator subsystem 200 in Fig. 2, so the aspects described above with respect to Figs. 1-4 may selectively apply to the complement conversion subsystem 500.
The complement conversion subsystem 500 includes a bypass circuit 540 and a complement conversion circuit 550. The complement conversion circuit 550 may be used, for example, to convert a first complement code from a memory such as the L1 cache 260 into a first original code and transmit the first original code to the processing engine PE_1, or, for example, to convert a second original code from the processing engine unit PE_1 into a second complement code and transmit the second complement code to a memory such as the L1 cache 260. By placing a hardware complement converter in the data transmission path between the processing engine unit and the memory, no additional instructions are needed to convert between original code and complement code, which improves program execution speed and efficiency. Moreover, although memory accesses take many different forms of read and write instructions, all of these instructions can be automatically format-converted by defining data segment attributes. Segment attributes are storage settings and are independent of the kernel, so the same kernel executes correctly for both storage formats. In other words, regardless of whether the data format is original code or complement code, the kernel can perform memory accesses without any modification and without any impact. When a program starts, the stream processor command for a data segment can dynamically declare the segment's attributes according to the storage format of its input and output.
It can be understood that the complement converter in the accelerator may include multiple complement conversion circuits. For example, each processing engine can be equipped with two complement conversion circuits, to handle the conversion from original code to complement code and the conversion from complement code to original code, respectively. Alternatively, only one complement conversion circuit may be provided per processing engine, with both conversion directions realized through multiplexing. In another embodiment, only two complement conversion circuits may be provided between the PE unit 230 and the L1 cache 260 to handle the two conversion directions, respectively. The present disclosure is not limited in this regard. Below, a single complement conversion circuit 550 is used to describe one implementation of a complement conversion circuit in the complement converter of the present disclosure. The complement conversion circuit 550 can be used to convert int8 byte data between original code and complement code. Although described here for int8 byte data, the present disclosure is not limited thereto; without departing from the principle, spirit, and scope of the present disclosure, the timing scheme of the complement conversion circuit can be modified to suit other types of data.
When the complement conversion circuit 550 is implemented as the first complement conversion circuit, it is configured to selectively convert the values of the first plurality of remaining bits in[6:0] of the first complement code based on the value of the first leading bit in[7] of the first complement code. For example, if the value of the first leading bit in[7] is 1, the values of the first plurality of remaining bits in[6:0] are provided to the seven inverters 510, 511, 512, 513, 514, 515, and 516 to be inverted, generating the first plurality of inverted bits. The first plurality of inverted bits is provided to the adder 520. The adder 520, in this case the first adder, adds 1 to the data formed by the first plurality of inverted bits to generate the 7-bit first plurality of converted bits. The first plurality of converted bits is provided to the multiplexer 530. The multiplexer 530 is in this case the first multiplexer. Since the first control terminal of the multiplexer 530 receives in[7] = 1 as the control input, the multiplexer 530 outputs the 7 output bits out[6:0], which in this case are the first plurality of converted bits. The 7 output bits out[6:0] are combined with the leading bit in[7] to generate the first original code; that is, the first leading bit in[7] remains the leading bit of the original code, and the converted bits out[6:0] form the remaining 7 bits.
If the value of the first leading bit in[7] is 0, the values of the first plurality of remaining bits in[6:0] are provided directly to the multiplexer 530. Since the first control terminal of the multiplexer 530 receives in[7] = 0 as the control input, the multiplexer 530 outputs the 7 output bits out[6:0], which in this case are simply the remaining bits in[6:0]. These are combined with the leading bit in[7] to generate the first original code; the first leading bit in[7] remains the leading bit of the original code, and the output bits out[6:0] form the remaining 7 bits.
When the complement conversion circuit 550 is implemented as the second complement conversion circuit, it is configured to selectively convert the values of the second plurality of remaining bits of the second original code based on the value of the second leading bit of the second original code. The operation of the second complement conversion circuit is essentially the same as that of the first complement conversion circuit and is not repeated here. By using a plurality of inverters, an adder, and a multiplexer, the complement converter can be implemented with a simple circuit structure, reducing cost and simplifying design.
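The datapath just described can be modeled bit-for-bit in software. The following C sketch mirrors the inverters 510-516, the +1 adder 520, and the in[7]-controlled multiplexer 530 for int8 data; it is an illustrative model, not the patent's hardware description. Because inverting the low 7 bits and adding 1 is self-inverse, the same routine also models the second conversion circuit, consistent with it operating in the same way:

```c
#include <stdint.h>
#include <assert.h>

/* Software model of the Fig. 5 datapath (illustrative, not the RTL):
 * complement code -> original code for one int8 value. */
static uint8_t complement_to_original(uint8_t in) {
    uint8_t sign      = in >> 7;                      /* leading bit in[7] */
    uint8_t remaining = in & 0x7F;                    /* in[6:0] */
    uint8_t inverted  = (uint8_t)(~remaining) & 0x7F; /* inverters 510..516 */
    uint8_t converted = (inverted + 1) & 0x7F;        /* adder 520: +1 */
    uint8_t out       = sign ? converted : remaining; /* multiplexer 530 */
    return (uint8_t)((sign << 7) | out);              /* in[7] kept as sign bit */
}

int main(void) {
    assert(complement_to_original(0xD6) == 0xAA); /* -42: complement -> original */
    assert(complement_to_original(0xAA) == 0xD6); /* -42: original -> complement */
    assert(complement_to_original(0x2A) == 0x2A); /* +42: sign 0, passed through */
    return 0;
}
```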
The complement conversion subsystem 500 also includes the bypass circuit 540. The bypass circuit 540 is coupled to the complement conversion circuit 550 and is configured to selectively bypass the complement conversion circuit 550 based on segment attribute data, so as to couple the memory directly with the processing engine unit. In one embodiment, the bypass circuit may include multiple sub-bypass circuits to bypass multiple complement conversion circuits respectively. Alternatively, the bypass circuit may bypass the complement converter as a whole. By using a bypass circuit, the accelerator can not only perform automatic format conversion but also remain compatible with conventional programs that perform format conversion using instructions. In one embodiment, the bypass circuit 540 may be a third multiplexer. The bypass circuit 540 includes a fifth input terminal, a sixth input terminal, and a third control terminal. The fifth input terminal is configured to receive the first complement code or the first original code. The sixth input terminal is configured to receive the converted original code or complement code. The third control terminal is configured to receive the bypass enable signal Bypass_En located in the segment attribute data. The third multiplexer is configured to selectively bypass the complement conversion circuit 550 based on the bypass enable signal Bypass_En. By using a multiplexer, the bypass circuit can be implemented with a simple circuit structure, reducing cost and simplifying design.
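Extending the same software model, the bypass multiplexer's behavior (again illustrative only, not the RTL) reduces to selecting between the raw and the converted data based on Bypass_En:

```c
#include <stdbool.h>
#include <stdint.h>

extern uint8_t complement_to_original(uint8_t); /* model from the previous sketch */

/* Software model of bypass circuit 540 on the memory -> PE path:
 * Bypass_En = 1 passes the data through unconverted (fifth input terminal);
 * Bypass_En = 0 routes it through the conversion stage (sixth input terminal). */
static uint8_t memory_to_pe(uint8_t data, bool bypass_en) {
    return bypass_en ? data : complement_to_original(data);
}
```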
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the discussion above, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations, separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (10)

  1. An accelerator, comprising:
    a processing engine unit;
    a memory; and
    a complement converter coupled in a data transmission path between the processing engine unit and the memory, the complement converter being configured to:
    convert a first complement code from the memory into a first original code and transmit the first original code to the processing engine unit; and
    convert a second original code from the processing engine unit into a second complement code and transmit the second complement code to the memory.
  2. The accelerator according to claim 1, further comprising a bypass circuit coupled to the complement converter and configured to:
    selectively bypass the complement converter based on segment attribute data, so as to couple the memory directly with the processing engine unit.
  3. The accelerator according to claim 1 or 2, wherein the complement converter comprises:
    a first complement conversion circuit configured to selectively convert values of a first plurality of remaining bits of the first complement code based on a value of a first leading bit of the first complement code; and
    a second complement conversion circuit configured to selectively convert values of a second plurality of remaining bits of the second original code based on a value of a second leading bit of the second original code.
  4. The accelerator according to claim 3, wherein the first complement conversion circuit comprises:
    a first plurality of inverters configured to invert each of the first plurality of remaining bits to generate a first plurality of inverted bits of the first plurality of remaining bits;
    a first adder configured to add 1 to the first plurality of inverted bits to generate a first plurality of converted bits; and
    a first multiplexer comprising:
    a first input terminal configured to receive the first plurality of remaining bits;
    a second input terminal configured to receive the first plurality of converted bits; and
    a first control terminal configured to receive the first leading bit, the first multiplexer being configured to selectively output the first plurality of remaining bits or the first plurality of converted bits based on the value of the first leading bit.
  5. The accelerator according to claim 4, wherein the second complement conversion circuit comprises:
    a second plurality of inverters configured to invert each of the second plurality of remaining bits to generate a second plurality of inverted bits of the second plurality of remaining bits;
    a second adder configured to add 1 to the second plurality of inverted bits to generate a second plurality of converted bits; and
    a second multiplexer comprising:
    a third input terminal configured to receive the second plurality of remaining bits;
    a fourth input terminal configured to receive the second plurality of converted bits; and
    a second control terminal configured to receive the second leading bit, the second multiplexer being configured to selectively output the second plurality of remaining bits or the second plurality of converted bits based on the value of the second leading bit.
  6. The accelerator according to claim 2, wherein the bypass circuit comprises a third multiplexer, the third multiplexer comprising:
    a fifth input terminal configured to receive the first complement code or the first original code;
    a sixth input terminal configured to receive the converted original code or complement code; and
    a third control terminal configured to receive a bypass enable signal located in the segment attribute data;
    the third multiplexer being configured to selectively bypass the complement converter based on the bypass enable signal.
  7. The accelerator according to claim 1 or 2, wherein the accelerator is a graphics processor, and the memory comprises a level-1 cache or a level-2 cache.
  8. The accelerator according to claim 2, further comprising:
    a stream processor configured to transmit at least a portion of the segment attribute data to the bypass circuit.
  9. The accelerator according to claim 1 or 2, wherein the processing engine unit is further configured to receive multi-dimensional tensor data from the memory.
  10. An electronic device, comprising:
    a power supply unit; and
    the accelerator according to any one of claims 1-7, powered by the power supply unit.
PCT/CN2022/107417 2021-10-19 2022-07-22 Accelerator and electronic device WO2023065748A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111214262.4 2021-10-19
CN202111214262.4A CN113961506B (zh) 2021-10-19 2021-10-19 Accelerator and electronic device

Publications (1)

Publication Number Publication Date
WO2023065748A1 true WO2023065748A1 (zh) 2023-04-27

Family

ID=79465129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107417 WO2023065748A1 (zh) 2021-10-19 2022-07-22 加速器和电子装置

Country Status (2)

Country Link
CN (1) CN113961506B (zh)
WO (1) WO2023065748A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961506B (zh) 2021-10-19 2023-08-29 海飞科(南京)信息技术有限公司 Accelerator and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3045914A (en) * 1958-02-14 1962-07-24 Ibm Arithmetic circuit
CN104202053A (zh) * 2014-07-17 2014-12-10 南京航空航天大学 Fast N-bit original-code-to-complement-code conversion device and conversion method
CN106940638A (zh) * 2017-03-10 2017-07-11 南京大学 Fast, low-power, area-saving hardware architecture for a binary original-code addition/subtraction unit
CN110033086A (zh) * 2019-04-15 2019-07-19 北京异构智能科技有限公司 Hardware accelerator for neural network convolution operations
CN111340201A (zh) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolution operations therewith
CN113961506A (zh) * 2021-10-19 2022-01-21 海飞科(南京)信息技术有限公司 Accelerator and electronic device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0680495B2 (ja) * 1989-06-01 1994-10-12 三菱電機株式会社 Microprocessor
US6615338B1 (en) * 1998-12-03 2003-09-02 Sun Microsystems, Inc. Clustered architecture in a VLIW processor
CN100425000C (zh) * 2006-09-30 2008-10-08 东南大学 Dual-turbo-structure low-density parity-check code decoder and decoding method
CN102122240B (zh) 2011-01-20 2013-04-17 东莞市泰斗微电子科技有限公司 Data type conversion circuit
CN202475439U (zh) * 2011-11-28 2012-10-03 中国电子科技集团公司第五十四研究所 Hardware simulation and verification platform based on configurable QC-LDPC encoding/decoding algorithms
US9329872B2 (en) * 2012-04-27 2016-05-03 Esencia Technologies Inc. Method and apparatus for the definition and generation of configurable, high performance low-power embedded microprocessor cores
CN112711441A (zh) * 2019-10-25 2021-04-27 安徽寒武纪信息科技有限公司 Converter, chip, and electronic device for converting data types, and method therefor

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3045914A (en) * 1958-02-14 1962-07-24 Ibm Arithmetic circuit
CN104202053A (zh) * 2014-07-17 2014-12-10 南京航空航天大学 Fast N-bit original-code-to-complement-code conversion device and conversion method
CN106940638A (zh) * 2017-03-10 2017-07-11 南京大学 Fast, low-power, area-saving hardware architecture for a binary original-code addition/subtraction unit
CN111340201A (zh) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolution operations therewith
CN110033086A (zh) * 2019-04-15 2019-07-19 北京异构智能科技有限公司 Hardware accelerator for neural network convolution operations
CN113961506A (zh) * 2021-10-19 2022-01-21 海飞科(南京)信息技术有限公司 Accelerator and electronic device

Also Published As

Publication number Publication date
CN113961506B (zh) 2023-08-29
CN113961506A (zh) 2022-01-21

Similar Documents

Publication Publication Date Title
JP2966085B2 Microprocessor with a last-in first-out stack, microprocessor system, and method of operating a last-in first-out stack
EP2725498B1 DMA vector buffer
KR100938942B1 Method and apparatus for list transfer using DMA transfer in a multiprocessor system
WO2023040460A1 Memory access method and electronic device
US11403104B2 (en) Neural network processor, chip and electronic device
US20220043770A1 (en) Neural network processor, chip and electronic device
WO2006098499A1 (en) Methods and apparatus for dynamic linking program overlay
WO2023142403A1 Method for determining out-of-bounds state of tensor elements, and electronic device
CN114579929B Accelerator execution method and electronic device
WO2023173642A1 Instruction scheduling method, processing circuit, and electronic device
US9594395B2 (en) Clock routing techniques
WO2023065748A1 (zh) 加速器和电子装置
WO2023103392A1 (zh) 用于存储管理的方法、介质、程序产品、系统和装置
WO2015017129A1 (en) Multi-threaded gpu pipeline
US20080288728A1 (en) multicore wireless and media signal processor (msp)
WO2023103391A1 Stream processing method, processing circuit, and electronic device
CN114510271B Method and apparatus for loading data in a single-instruction multiple-thread computing system
CN114035980B Method for sharing data based on scratchpad memory, and electronic device
WO2023077875A1 Method and apparatus for executing kernel programs in parallel
WO2023103397A1 Method, medium, program product, system, and apparatus for storage management
CN114651249A Techniques to minimize the negative impact of cache conflicts caused by incompatible leading dimensions in matrix multiplication and convolution kernels without dimension padding
JP4024271B2 Method and apparatus for processing instructions in a multiprocessor system
CN114586002A Interleaved data conversion to change data formats
US11886737B2 (en) Devices and systems for in-memory processing determined
WO2009004628A2 (en) Multi-core cpu

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882370

Country of ref document: EP

Kind code of ref document: A1