WO2023241417A1 - Processor and method, device and storage medium for data processing - Google Patents

Processor and method, device and storage medium for data processing

Info

Publication number
WO2023241417A1
WO2023241417A1 (PCT/CN2023/098714)
Authority
WO
WIPO (PCT)
Prior art keywords
data
processor core
processor
processed
instruction
Prior art date
Application number
PCT/CN2023/098714
Other languages
English (en)
French (fr)
Inventor
曹宇辉
施云峰
王剑
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023241417A1 publication Critical patent/WO2023241417A1/zh

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/177 - Initialisation or configuration control
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/82 - Architectures of general purpose stored program computers: data or demand driven
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3836 - Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3867 - Concurrent instruction execution using instruction pipelines
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the CPU, to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals

Definitions

  • Example embodiments of the present disclosure relate generally to the field of computers, and in particular to processors and methods, devices, and computer-readable storage media for data processing.
  • In a first aspect of the present disclosure, a processor includes a plurality of processor cores, each of the plurality of processor cores including a data cache for reading and writing data and an instruction cache, separate from the data cache, for reading instructions.
  • the processor also includes a distributor communicatively coupled to the plurality of processor cores.
  • the dispatcher is configured to distribute data to be processed to a corresponding data cache of at least one of the plurality of processor cores, and to distribute instructions associated with the data to be processed to a corresponding instruction cache of the at least one processor core for execution.
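  • The division of labor described above can be sketched as a toy software model. This is only an illustration of the structure the patent describes in hardware; the class and method names (Core, Distributor, distribute) are hypothetical, not from the publication.

    ```python
    # Toy model of the first aspect: each core has a private instruction cache
    # separate from its data cache; a distributor, coupled to every core,
    # fills both. All names here are illustrative assumptions.

    class Core:
        def __init__(self):
            self.instruction_cache = []  # read-only from the core's perspective
            self.data_cache = []         # readable and writable by the core

    class Distributor:
        def __init__(self, cores):
            self.cores = cores  # communicatively coupled to all cores

        def distribute(self, data, instructions, targets):
            """Place data in the data caches and instructions in the
            instruction caches of the selected cores."""
            for i in targets:
                self.cores[i].data_cache.extend(data)
                self.cores[i].instruction_cache.extend(instructions)

    cores = [Core() for _ in range(4)]  # N = 4, as in one described embodiment
    dist = Distributor(cores)
    dist.distribute(data=[10, 20], instructions=["add"], targets=[0, 1])
    ```

  • The point of the model is that data and instructions travel through one central component rather than each core fetching for itself.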
  • In a second aspect of the present disclosure, a method for data processing includes distributing, by a distributor of a processor, data to be processed to a corresponding data cache of at least one processor core among a plurality of processor cores of the processor.
  • the distributor is communicatively coupled to multiple processor cores.
  • the method also includes distributing instructions associated with the data to be processed to a corresponding instruction cache of at least one processor core for execution.
  • In a third aspect of the present disclosure, an electronic device includes at least a processor according to the first aspect.
  • In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; the computer program can be executed by a processor to implement the method of the second aspect.
  • FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented
  • Figure 2 shows a schematic diagram of an architecture for instruction and data distribution in accordance with some embodiments of the present disclosure
  • FIG. 3 illustrates a schematic diagram for broadcasting instructions to multiple processor cores according to some embodiments of the present disclosure
  • Figure 4 shows a schematic diagram of cyclically writing, executing and reading data by a processor core according to some embodiments of the present disclosure
  • Figure 5 illustrates a flowchart of a process for data processing in accordance with some embodiments of the present disclosure.
  • FIG. 6 illustrates a block diagram of an electronic device in which a processor may be included in accordance with one or more embodiments of the present disclosure.
  • each processor core of a multi-core processor is an independent and complete instruction execution unit.
  • instruction or data conflicts may occur.
  • a conventional solution is to use a mailbox strategy for instructions: for example, multiple processor cores send instructions to each other through mailboxes when they need to perform operations simultaneously.
  • another conventional solution is to use a cache (Cache) coherence strategy, such as the MESI (Modified/Exclusive/Shared/Invalid) protocol, to ensure that the data in the cache is consistent with the data in main memory.
  • Multi-core processor architectures that use the above conventional strategies include the big.LITTLE architecture, among others.
  • SIMD: single instruction, multiple data
  • an improved solution for a processor is proposed.
  • a distributor is provided in the processor for distributing data and/or instructions to individual processor cores of the processor.
  • Many cache coherency issues of multi-core processors are avoided by using a dispatcher to centrally schedule the distribution of data and/or instructions without using conventional cache coherence designs.
  • In conventional designs, a processor core usually actively initiates a data transmission request. A single processor core cannot know the data requests of other cores, so such designs are difficult to make compatible with broadcast transmission; in other words, data cannot be transmitted in the form of a broadcast.
  • This solution uses a centralized data scheduling mechanism, which can easily use broadcasting to transmit data, thereby improving data transmission efficiency. In this way, this solution can make full use of limited bandwidth resources, thereby improving the efficiency of vector calculations, especially neural network vector calculations.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented.
  • the processor 101 is a multi-core processor, which includes processor core 120-1, processor core 120-2, ..., processor core 120-N, where N is an integer greater than 1.
  • the processor cores 120-1, 120-2, ..., 120-N are hereinafter referred to collectively or individually as the processor cores 120.
  • Each processor core 120 may be a SIMD processor core.
  • processor 101 may include four processor cores 120 (i.e., the value of N may be 4).
  • Processor 101 also includes a distributor 110 .
  • Distributor 110 is communicatively coupled to various processor cores 120 . That is, the distributor 110 and each processor core 120 may communicate with each other according to appropriate data transmission protocols and/or standards. In operation, the dispatcher 110 can distribute data 140 and/or instructions 130 to various processor cores 120 .
  • the distributor 110 may also sometimes be referred to as a "scheduler", and the two terms may be used interchangeably in this context.
  • distributor 110 may be implemented as hardware circuitry. This hardware circuitry may be integrated or embedded into processor 101. Alternatively or additionally, the distributor 110 may also be implemented in whole or in part by software modules, for example implemented as executable instructions and stored in a memory (not shown).
  • Dispatcher 110 is configured to distribute instructions 130 and/or data 140 received by processor 101 from other devices (not shown) in environment 100 to various processor cores 120 .
  • the device that sends instructions 130 and/or data 140 to the distributor 110 is also referred to as the originating device of the data processing request.
  • processor 101 may receive data 140 and/or instructions 130 transmitted by a storage device or other external device in environment 100 via, for example, a bus, and distribute the received data 140 and/or instructions 130 to the various processor cores 120 via the distributor 110. The distribution process of the data 140 and/or the instructions 130 will be described below in conjunction with FIG. 2.
  • processor 101 may be implemented in a variety of existing or future computing platforms or computing systems.
  • the processor 101 can be implemented in various embedded applications (for example, data processing systems of mobile network base stations, etc.) to provide services such as large-scale vector calculations.
  • the processor 101 can also be integrated or embedded into various electronic devices or computing devices to provide various computing services.
  • the application environment and application scenarios of the processor 101 are not limited here.
  • FIG. 2 shows a schematic diagram of an example architecture 200 for instruction 130 and data 140 distribution in accordance with some embodiments of the present disclosure.
  • architecture 200 will be described with reference to environment 100 of FIG. 1 .
  • each of the plurality of processor cores 120 includes a data cache for reading and writing data and an instruction cache for reading instructions that is separate from the data cache.
  • the processor core 120-1 includes an instruction cache 220-1 and a data cache 230-1;
  • the processor core 120-2 includes an instruction cache 220-2 and a data cache 230-2; ...;
  • the processor core 120-N includes instruction cache 220-N and data cache 230-N.
  • the instruction caches 220-1, 220-2, ..., 220-N are hereinafter referred to collectively or individually as the instruction cache 220, and the data caches 230-1, 230-2, ..., 230-N are referred to collectively or individually as the data cache 230.
  • the instruction cache 220 is typically not implemented as a conventional hardware-managed cache (Cache). From the perspective of the processor 101, the instruction cache 220 is read-only.
  • the data cache 230 may include Vector Closely-coupled Memory (VCCM).
  • the data cache 230 is likewise typically not implemented as a conventional hardware-managed cache (Cache). However, unlike the instruction cache 220, the data cache 230 is both readable and writable.
  • the dispatcher 110 is configured to distribute the received instructions 130 and/or data 140 to at least one processor core 120 of the plurality of processor cores 120 .
  • dispatcher 110 may be configured to dispatch received instructions 130 and/or data 140 to only first processor core 120-1.
  • the distributor 110 may distribute the received instructions 130 and/or data 140 to the first processor core 120-1 and the second processor core 120-2.
  • distributor 110 may distribute instructions 130 and/or data 140 to each of multiple processor cores 120 .
  • the distributor 110 may receive configuration information 210.
  • the configuration information 210 may instruct the distributor 110 to distribute instructions 130 and/or data 140 to one or some of the plurality of processor cores 120 .
  • configuration information 210 may instruct dispatcher 110 to dispatch instructions 130 and/or data 140 to only first processor core 120-1.
  • the configuration information 210 may instruct the distributor 110 to distribute instructions 130 and/or data 140 to each of the plurality of processor cores 120 .
  • the distributor 110 may be configured to distribute instructions 130 and/or data 140 to one or some processor cores 120 , or to all processor cores 120 .
  • distributor 110 may receive a set of data and instructions to be processed by processor 101 .
  • the configuration information 210 received by the distributor 110 may indicate at least an association between the data to be processed (also referred to as data 140) in the data set and the instructions 130 in the instruction set.
  • an association between data 140 and instructions 130 may indicate that data 140 in the data set is to be processed according to instructions 130 in the instruction set.
  • Dispatcher 110 may distribute data 140 and instructions 130 depending at least in part on this association.
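  • The association carried by the configuration information 210 can be sketched as follows. The dict-based encoding and the function name plan_dispatch are assumptions for illustration only; the patent does not specify a concrete format for the configuration information.

    ```python
    # Sketch: configuration information associating data in a data set with
    # instructions in an instruction set, and the cores each pair targets.
    # The encoding below is a hypothetical illustration.

    data_set = {"d0": [1, 2], "d1": [3, 4]}
    instruction_set = {"i0": "scale", "i1": "offset"}

    # association: which instruction processes which data, on which cores
    config = [
        {"data": "d0", "instruction": "i0", "cores": [0, 1]},
        {"data": "d1", "instruction": "i1", "cores": [2]},
    ]

    def plan_dispatch(config):
        """Expand the association into per-core (data, instruction) loads."""
        plan = {}
        for entry in config:
            for core in entry["cores"]:
                plan.setdefault(core, []).append(
                    (entry["data"], entry["instruction"])
                )
        return plan

    plan = plan_dispatch(config)
    ```

  • The distributor can then fill each core's data cache and instruction cache directly from such a plan, which is what "distributing depending at least in part on this association" amounts to.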
  • the distributor 110 distributes the data 140 to a corresponding data cache 230 of at least one of the plurality of processor cores 120, and distributes the instructions 130 associated with the data 140 to a corresponding instruction cache 220 of the at least one processor core 120 for execution.
  • the dispatcher 110 may broadcast the same instructions 130 to at least one processor core.
  • Figure 3 illustrates a schematic diagram for broadcasting instructions 130 to multiple processor cores in accordance with some embodiments of the present disclosure.
  • the instruction 130 may include instruction 0, instruction 1, ..., and instruction M (where M is an integer greater than or equal to 1).
  • Instructions 130 may be broadcast to processor cores 120-1, 120-2, ..., 120-J via dispatcher 110 (where J is an integer greater than 1).
  • instructions 310-1, 310-2, ..., 310-J that are the same as instruction 130 are broadcast to processor cores 120-1, 120-2, ..., 120-J.
  • the instructions 310-1, 310-2, ..., 310-J are hereinafter referred to collectively or individually as the instructions 310. It should be understood that there may be a delay 320 between the time the instructions 130 are received at the dispatcher 110 and the time the instructions 310 are received at the processor cores 120.
  • the dispatcher 110 may broadcast the instructions 130 to more or fewer of the plurality of processor cores 120.
  • Dispatcher 110 may broadcast instructions 130 to only one processor core 120.
  • dispatcher 110 may broadcast instructions 130 to all processor cores 120 (ie, J equals N).
  • the processor core 120 or processor cores 120 to which the dispatcher 110 broadcasts the instruction 130 may be set in advance or based on the received configuration information 210 .
  • the distributor 110 may also use other transmission methods to distribute the instructions 130 to each processor core 120 .
  • the dispatcher 110 may first send the instruction 130 to the processor core 120-1, and then send the instruction 130 to the processor core 120-2, and so on.
  • using broadcasting to distribute instructions can reduce the cost of repeatedly reading instructions, thereby greatly saving instruction transmission overhead.
  • the distributor 110 may send different data to be processed associated with the instruction to different processor cores 120 respectively.
  • the distributor 110 may broadcast or send first data in the data 140 to the first processor core 120-1 for processing, and broadcast or send second data in the data 140, different from the first data, to the second processor core 120-2 for processing.
  • although FIG. 3 only shows an example of broadcasting the same instructions 130 to each processor core 120, alternatively or additionally, in some embodiments a process similar to that of FIG. 3 may also be used to broadcast the same data to each processor core 120.
  • data 140 may be broadcast to at least one processor core 120 of multiple processor cores 120 .
  • different instructions associated with data 140 may be distributed to different processor cores 120 .
  • a first instruction in the instructions 130 may be sent or broadcast to the processor core 120-1, so that the processor core 120-1 processes the data 140 based on the first instruction, and a second instruction in the instructions 130, different from the first instruction, may be sent or broadcast to the processor core 120-2, so that the processor core 120-2 processes the data 140 based on the second instruction.
  • This method of distributing the same data and different instructions to each processor core 120 is suitable for many data processing scenarios, such as processing with a small data volume but a complex processing procedure. For example, in neural network calculations, there are often scenarios where different calculation processes are applied to the same data.
  • the above-described method of distributing the same data and different instructions to each processor core 120 can be well adapted to such a scenario. In this way, the overhead of such data and instruction transmission processes can be greatly saved, thereby improving data processing efficiency.
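  • A toy version of the same-data/different-instructions pattern is sketched below. Modeling each core's instruction sequence as a Python callable is an illustrative assumption; in the patent the "instructions" are machine instructions executed by each core's pipeline.

    ```python
    # Sketch: broadcast the same data to every core, while each core runs
    # a different program on its private copy. Names are illustrative.

    def broadcast_data_scatter_instructions(data, programs):
        """Core k runs programs[k] on its own copy of the same input data."""
        return [program(list(data)) for program in programs]

    shared = [1.0, 2.0, 3.0]
    programs = [
        lambda d: [x * 2 for x in d],   # core 0: scale
        lambda d: [x + 1 for x in d],   # core 1: offset
        lambda d: sum(d),               # core 2: reduce
    ]
    results = broadcast_data_scatter_instructions(shared, programs)
    ```

  • The input is transmitted once, yet three different computations run on it, which is exactly the neural-network scenario the text mentions.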
  • each processor core 120 is configured to execute instructions via its associated execution pipeline 240-1, 240-2, ..., 240-N.
  • the execution pipelines 240-1, 240-2, ..., 240-N are hereinafter referred to collectively or individually as the execution pipeline 240.
  • Execution pipeline 240 is configured to utilize instructions in instruction cache 220 to process data in data cache 230 .
  • the execution pipeline 240 may process data written in the data cache 230 according to instructions in the instruction cache 220 and send the processing results back to the data cache 230 .
  • each processor core 120 may send processing results from the data cache 230 to the distributor 110 .
  • the distributor 110 may receive the processing results from the corresponding data cache 230 of at least one processor core 120 respectively. Additionally, the distributor 110 may send the received processing results of the data to be processed to other devices, such as the originating device of the data processing request. In this way, the distributor 110 can be responsible for the exchange of external data and the data in the data cache 230 in the processor core 120, thereby reducing the external data exchange of the processor core 120.
  • dispatcher 110 may distribute instructions 130 and data 140 associated with the instructions to various processor cores 120 .
  • Each processor core 120 may process the received data 140 according to the instructions 130 and send the processing results to the distributor 110 .
  • the distributor 110 may read and write data to the data cache 230 in a round-robin manner. For example, the distributor 110 may distribute third data in the data to be processed (e.g., the data 140) to the first processor core 120-1 of the at least one processor core 120 for processing. In response to receiving from the processor core 120-1 a first result obtained by processing the third data, the distributor 110 distributes fourth data in the data to be processed, different from the third data, to the first processor core 120-1 for processing.
  • FIG. 4 shows a schematic diagram of the processor core 120 cyclically writing data, executing instructions, and reading data according to some embodiments of the present disclosure.
  • the distributor 110 sends data 410-1 to the processor core 120.
  • the processor core 120 performs data writing 430-1 on the received data 410-1 to write the data 410-1 into the data cache 230.
  • the processor core 120 performs instruction execution 440-1 based on the received instruction associated with the data 410-1 using an execution pipeline 240 such as that shown in FIG. 2 .
  • the processor core 120 then performs data reading 450-1 to read, from the data cache 230, the result obtained by the instruction execution 440-1.
  • the processor core 120 sends the read data 420-1 to the distributor 110.
  • in response to receiving the data 420-1 from the processor core 120, the distributor 110 sends data 410-2 to the processor core 120.
  • the processor core 120 then performs data writing 430-2, instruction execution 440-2, and data reading 450-2 on the data 410-2, and sends the read processing result, data 420-2, to the distributor 110.
  • the distributor 110 may send data 410-K (where K may be an integer greater than 1) to the processor core 120.
  • the processor core 120 then performs data writing 430-K, instruction execution 440-K, and data reading 450-K on the data 410-K, and sends the read processing result, data 420-K, to the distributor 110.
  • the instructions associated with the data 410-1, 410-2, ..., 410-K may be the same instruction, and the instruction may be transmitted to the processor core 120 only once by the dispatcher 110.
  • the number of times K of loops to read, write and process data may be preset. Alternatively or additionally, the number of times K of loops to read, write and process data may be set according to the configuration information 210 received by the distributor 110 .
  • FIG. 4 only shows the process of cyclically reading, writing, and processing data for one of the processor cores 120, a similar process can be used for other processor cores 120 to cyclically read, write, and process data.
  • This looping method facilitates loading a large amount of data into the data cache 230 of the processor core 120 . In this way, the instruction can be distributed only once, and the data associated with the instruction can be read, written, and processed in a loop to further reduce instruction and data transmission overhead.
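  • The cyclic write, execute, read flow of FIG. 4 can be sketched as follows. Modeling the core as a single Python loop is an assumption for illustration; the run_cycles name and callable "instruction" are hypothetical.

    ```python
    # Sketch of the FIG. 4 loop: the instruction is transmitted once, then
    # the distributor feeds data chunk by chunk; chunk k is only sent after
    # the result for chunk k-1 has been returned.

    def run_cycles(chunks, instruction):
        """Model one core processing K data chunks with a single instruction."""
        results = []
        for chunk in chunks:                 # distributor sends data 410-k
            data_cache = list(chunk)         # data writing 430-k
            processed = [instruction(x) for x in data_cache]  # execution 440-k
            results.append(processed)        # data reading 450-k, returned as 420-k
        return results

    # K = 3 rounds with one instruction ("square each element"), sent once
    out = run_cycles([[1, 2], [3, 4], [5, 6]], instruction=lambda x: x * x)
    ```

  • The key property is visible in the loop: the instruction argument is passed once, while K data chunks flow through, so instruction transmission overhead stays constant as K grows.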
  • the processes described above, such as data distribution and instruction distribution, may be performed in any suitable order.
  • the above-described embodiments of data and instruction distribution and various embodiments of reading, writing, and processing data can be implemented in combination.
  • each processor core 120 can use a large-capacity data cache and minimize data exchange outside the processor core.
  • embodiments of this solution use a centralized data scheduling or distribution mechanism, and can easily use broadcasting to transmit data and/or instructions, thereby improving the transmission efficiency of data and instructions. In this way, this solution can make full use of limited bandwidth resources, thereby improving the efficiency of vector calculations.
  • the bandwidth required by the underlying computing unit is often several times or even dozens of times the bandwidth that can be provided externally.
  • the solution of the present disclosure can make full use of limited bandwidth resources, thereby improving computing efficiency such as neural network accelerators.
  • FIG. 5 illustrates a flow diagram of a process 500 for data processing in accordance with some embodiments of the present disclosure.
  • Process 500 may be implemented at distributor 110 of processor 101 .
  • process 500 will be described with reference to environment 100 of FIG. 1 .
  • the distributor 110 distributes data to be processed (eg, data 140 ) to corresponding data caches 230 of at least one of the plurality of processor cores 120 .
  • the distributor 110 may distribute the data 140 to each of the plurality of processor cores 120 or to one or more of the plurality of processor cores 120 .
  • the dispatcher 110 dispatches instructions 130 associated with the data to be processed (eg, data 140) to corresponding instruction caches 220 of at least one processor core 120 for execution.
  • the dispatcher 110 may distribute the instructions 130 to each of the plurality of processor cores 120 or to one or more of the plurality of processor cores 120 .
  • the distributor 110 may distribute the instructions 130 to the at least one processor core 120 by broadcasting the instructions 130.
  • the distributor 110 may send first data in the data 140 to a first processor core 120-1 of the at least one processor core 120 for processing, and send second data in the data 140, different from the first data, to a second processor core 120-2 of the at least one processor core 120 for processing.
  • distributor 110 may distribute data 140 to at least one processor core 120 by broadcasting data 140 to at least one processor core 120 .
  • the dispatcher 110 may send a first instruction in the instructions 130 to a first processor core 120-1 of the at least one processor core 120, so that the first processor core 120-1 processes the data 140 based on the first instruction.
  • the dispatcher 110 may also send a second instruction in the instructions 130, different from the first instruction, to a second processor core 120-2 of the at least one processor core 120, so that the second processor core 120-2 processes the data 140 based on the second instruction.
  • the distributor 110 is further configured to respectively receive the processing results from the corresponding data cache 230 of the at least one processor core 120.
  • the processing result is obtained by at least one processor core 120 processing the received data 140 according to the instruction 130 respectively.
  • the distributor 110 may distribute the third data in the data 140 to the first processor core 120-1 of the at least one processor core 120 for processing.
  • in response to receiving a first result obtained by processing the third data, the distributor 110 distributes fourth data in the data 140, different from the third data, to the first processor core 120-1 for processing.
  • distributor 110 is further configured to receive a set of data and instructions to be processed by processor 101 .
  • Distributor 110 is also configured to receive configuration information.
  • the configuration information at least indicates an association between the data to be processed (eg, data 140) in the data set and the instructions 130 in the instruction set.
  • an association between data 140 and instructions 130 may indicate that data 140 is to be processed by instructions 130 .
  • the distribution of the data 140 and the instructions 130 may depend, at least in part, on the association described above.
  • FIG. 6 shows a block diagram of an electronic device 600 in which the processor 101 may be included in accordance with one or more embodiments of the present disclosure. It should be understood that the electronic device 600 shown in FIG. 6 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein.
  • electronic device 600 is in the form of a general electronic device or computing device.
  • the components of electronic device 600 may include, but are not limited to, one or more processors 101 , memory 620 , storage devices 630 , one or more communication units 640 , one or more input devices 650 , and one or more output devices 660 .
  • the processor 101 may perform various processes according to programs stored in the memory 620 .
  • Each processor core 120 in the processor 101 can execute computer-executable instructions in parallel to improve the parallel processing capability of the electronic device 600 .
  • Electronic device 600 typically includes a plurality of computer storage media. Such media may be any available media that is accessible to electronic device 600, including, but not limited to, volatile and nonvolatile media, removable and non-removable media.
  • Memory 620 may be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • Storage device 630 may be a removable or non-removable medium, and may include machine-readable media such as a flash drive, a magnetic disk, or any other medium that is capable of storing information and/or data (such as training data for training) and that can be accessed within electronic device 600.
  • Electronic device 600 may further include additional removable/non-removable, volatile/non-volatile storage media.
  • a disk drive may be provided for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disc drive may be provided for reading from or writing to a removable, nonvolatile optical disc.
  • each drive may be connected to the bus (not shown) by one or more data media interfaces.
  • Memory 620 may include a computer program product 625 having one or more program modules configured to perform the various methods or actions of the various embodiments of the present disclosure. For example, these program modules may be configured to implement the various functions or actions of the distributor 110.
  • the communication unit 640 implements communication with other electronic devices or computing devices through communication media. Additionally, the functionality of the components of electronic device 600 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over a communications connection. Accordingly, electronic device 600 may operate in a networked environment using a logical connection to one or more other servers, a networked personal computer (PC), or another network node.
  • Input device 650 may be one or more input devices, such as a mouse, keyboard, trackball, etc.
  • Output device 660 may be one or more output devices, such as a display, speakers, printer, etc.
  • Electronic device 600 may also communicate, as needed via the communication unit 640, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with electronic device 600, or with any device (e.g., a network card, a modem) that enables electronic device 600 to communicate with one or more other electronic devices or computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • According to example implementations of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the method described above.
  • According to example implementations of the present disclosure, a computer program product is also provided; the computer program product is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the method described above.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions cause a computer, a programmable data processing apparatus, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operational steps to be performed thereon to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Example 1 describes a processor including a plurality of processor cores.
  • Each of the plurality of processor cores includes a data cache for reading and writing data and an instruction cache, separate from the data cache, for reading instructions.
  • the processor also includes a distributor communicatively coupled to the plurality of processor cores.
  • The distributor is configured to: distribute data to be processed to the corresponding data caches of at least one processor core among the plurality of processor cores; and distribute instructions associated with the data to be processed to the corresponding instruction caches of the at least one processor core for execution.
  • Example 2 includes the processor as described in Example 1, wherein distributing the instruction to the at least one processor core includes broadcasting the instruction to the at least one processor core.
  • Example 3 includes the processor according to Example 2, wherein distributing the data to be processed to the at least one processor core includes: sending first data in the data to be processed to a first processor core for processing; and sending second data in the data to be processed to a second processor core for processing.
  • the second data is different from the first data.
  • Example 4 includes the processor described in Example 1, wherein distributing the data to be processed to the at least one processor core includes broadcasting the data to be processed to the at least one processor core.
  • Example 5 includes the processor according to Example 4, wherein distributing the instructions to the at least one processor core includes: sending a first instruction to the first processor core, so that the first processor core processes the data to be processed based on the first instruction; and sending a second instruction to the second processor core, so that the second processor core processes the data to be processed based on the second instruction.
  • the first instruction is different from the second instruction.
  • Example 6 includes the processor according to Example 1, wherein the distributor is further configured to respectively receive processing results from the corresponding data caches of the at least one processor core. The processing results are obtained by the at least one processor core processing the received data to be processed according to the instructions.
  • Example 7 includes the processor according to Example 6, wherein distributing the data to be processed to the at least one processor core includes: distributing third data in the data to be processed to a first processor core among the at least one processor core for processing; and, in response to receiving from the first processor core a first result obtained by processing the third data, distributing fourth data in the data to be processed to the first processor core for processing, the third data being different from the fourth data.
  • Example 8 includes the processor as described in Example 1, wherein the distributor is further configured to: receive a data set and an instruction set to be processed by the processor; and receive configuration information.
  • the configuration information at least indicates the association between the data to be processed in the data set and the instructions in the instruction set. Distribution of data and instructions to be processed depends, at least in part, on this association.
  • Example 9 describes a method of data processing.
  • the method includes: distributing, by a distributor of the processor, data to be processed to a corresponding data cache of at least one processor core among a plurality of processor cores of the processor.
  • the distributor is communicatively coupled to multiple processor cores.
  • the method also includes distributing instructions associated with the data to be processed to a corresponding instruction cache of at least one processor core for execution.
  • Example 10 includes the method described in Example 9, wherein distributing the instruction to the at least one processor core includes broadcasting the instruction to the at least one processor core.
  • Example 11 includes the method according to Example 10, wherein distributing the data to be processed to the at least one processor core includes: distributing first data in the data to be processed to the first processor core for processing; and distributing second data in the data to be processed to the second processor core for processing. The first data is different from the second data.
  • Example 12 includes the method described according to Example 9, wherein distributing the data to be processed to at least one processor core includes broadcasting the data to be processed to the at least one processor core.
  • Example 13 includes the method according to Example 12, wherein distributing the instructions to the at least one processor core includes: sending a first instruction to the first processor core, so that the first processor core processes the data to be processed based on the first instruction; and sending a second instruction to the second processor core, so that the second processor core processes the data to be processed based on the second instruction.
  • the first instruction is different from the second instruction.
  • Example 14 includes the method described in accordance with Example 9, in accordance with one or more embodiments of the present disclosure.
  • the method also includes respectively receiving processing results from corresponding data caches of at least one processor core.
  • the processing result is obtained by at least one processor core processing the received data to be processed according to instructions.
  • Example 15 includes the method according to Example 14, wherein distributing the data to be processed to the at least one processor core includes: distributing third data in the data to be processed to a first processor core among the at least one processor core for processing; and, in response to receiving from the first processor core a first result obtained by processing the third data, distributing fourth data in the data to be processed to the first processor core for processing.
  • the third data is different from the fourth data.
  • Example 16 includes the method described in accordance with Example 9, in accordance with one or more embodiments of the present disclosure.
  • the method also includes receiving a data set and a set of instructions to be processed by the processor; and receiving configuration information.
  • the configuration information at least indicates the association between the data to be processed in the data set and the instructions in the instruction set. Distribution of data and instructions to be processed depends, at least in part, on this association.
  • Example 17 describes an electronic device, which at least includes the processor described in any one of Examples 1 to 8.
  • Example 18 describes a computer-readable storage medium having a computer program stored thereon.
  • the computer program is executed by the processor to implement the method described in any one of Examples 9 to 16.
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Advance Control (AREA)

Abstract

According to embodiments of the present disclosure, a processor and a method, device, and storage medium for data processing are provided. The processor includes a plurality of processor cores, each of which includes a data cache for reading and writing data and an instruction cache, separate from the data cache, for reading instructions. The processor further includes a distributor communicatively coupled to the plurality of processor cores. The distributor is configured to distribute data to be processed to the corresponding data caches of at least one processor core among the plurality of processor cores, and to distribute instructions associated with the data to be processed to the corresponding instruction caches of the at least one processor core for execution. In this way, the efficiency of data transmission and vector computation can be significantly improved.

Description

Processor and Method, Device, and Storage Medium for Data Processing
This application claims priority to Chinese invention patent application No. 202210674851.9, filed on June 14, 2022 and entitled "Processor and Method, Device, and Storage Medium for Data Processing," the entire contents of which are incorporated herein by reference.
Technical Field
Example embodiments of the present disclosure relate generally to the field of computers, and in particular to processors and to methods, devices, and computer-readable storage media for data processing.
Background
With the development of information technology, various data processing services place increasingly high demands on the computing power and computing resources of computing systems. Multi-core processors have been proposed to improve the overall computing power and throughput of a system through parallel computing. For vector computations with highly repetitive instructions and large data volumes, how to enable a multi-core processor to make full use of limited bandwidth to handle such computations is a problem worthy of attention.
Summary
In a first aspect of the present disclosure, a processor is provided. The processor includes a plurality of processor cores, each of which includes a data cache for reading and writing data and an instruction cache, separate from the data cache, for reading instructions. The processor further includes a distributor communicatively coupled to the plurality of processor cores. The distributor is configured to distribute data to be processed to the corresponding data caches of at least one processor core among the plurality of processor cores, and to distribute instructions associated with the data to be processed to the corresponding instruction caches of the at least one processor core for execution.
In a second aspect of the present disclosure, a method for data processing is provided. The method includes distributing, by a distributor of a processor, data to be processed to the corresponding data caches of at least one processor core among a plurality of processor cores of the processor, the distributor being communicatively coupled to the plurality of processor cores. The method further includes distributing instructions associated with the data to be processed to the corresponding instruction caches of the at least one processor core for execution.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes at least the processor according to the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program executable by a processor to implement the method of the second aspect.
It should be understood that the content described in this Summary is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. In the drawings, identical or similar reference numerals denote identical or similar elements, in which:
FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic diagram of an architecture for instruction and data distribution according to some embodiments of the present disclosure;
FIG. 3 shows a schematic diagram of broadcasting instructions to multiple processor cores according to some embodiments of the present disclosure;
FIG. 4 shows a schematic diagram of a processor core cyclically writing, executing, and reading data according to some embodiments of the present disclosure;
FIG. 5 shows a flowchart of a process for data processing according to some embodiments of the present disclosure; and
FIG. 6 shows a block diagram of an electronic device that may include a processor according to one or more embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term "including" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to." The term "based on" should be understood as "based at least in part on." The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment." The term "some embodiments" should be understood as "at least some embodiments." Other explicit and implicit definitions may also be included below.
It should be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and relevant provisions.
As mentioned above, with the development of information technology, various data processing services place increasingly high demands on the computing power and computing resources of computing systems. Multi-core processors have been proposed to improve the overall computing power and throughput of a system through parallel computing. In general, each core of a multi-core processor is an independent and complete instruction execution unit. When multiple processor cores work cooperatively, for example when multiple processor cores need to access the same memory address, instruction or data conflicts may arise.
To address the above instruction or data conflicts, one conventional approach for instructions is a mailbox strategy. For example, when multiple processor cores need to execute operations synchronously, they send instructions to one another via mailboxes. For data, one conventional approach is a cache coherence strategy, such as the Modified-Exclusive-Shared-Invalid (MESI) technique, to ensure that the data in the cache is consistent with the data in main memory. Multi-core processor architectures using these conventional strategies include the big.LITTLE architecture, among others.
Studies have found that multi-core single instruction multiple data (SIMD) processors are often used for vector computations with highly repetitive instructions and large data volumes. On the one hand, using a conventional cache-coherence-based multi-core architecture results in a small level-one cache for the SIMD processor. Moreover, due to the large amount of computation data and its poor locality, a large number of cache misses occur, making data reads and writes inefficient. On the other hand, such regular vector computations rarely require complex thread-switching tasks, so the conventional multi-core control model appears bloated and redundant.
In summary, for vector computations with highly repetitive instructions and large data volumes, how to enable a multi-core processor to make full use of limited bandwidth to handle such computations is a problem worthy of attention.
According to embodiments of the present disclosure, an improved scheme for a processor is proposed. In this scheme, a distributor is provided in the processor to distribute data and/or instructions to the individual processor cores. By using the distributor to centrally schedule the distribution of data and/or instructions, the conventional cache coherence design is not needed, thereby avoiding the many cache coherence problems of multi-core processors.
On the one hand, conventional cache-coherence-based processors tend to have high complexity in data transmission and processing, which limits data storage capacity and makes it difficult to raise the clock frequency. In this scheme, the distributor distributes data to the data cache of each processor core, and each processor core's data cache is used for reading and writing data. By having each core's own data cache exchange data directly with the distributor, large-capacity data caches can be used, and external data exchange can be minimized.
On the other hand, in conventional multi-core scheduling, a processor typically initiates data transfer requests actively; a single processor cannot know the data requests of other processors, making it difficult to support data broadcasting. That is, data cannot be transmitted in broadcast form. This scheme uses a centralized data scheduling mechanism, so broadcasting can easily be applied to transmit data, thereby improving data transmission efficiency. In this way, the scheme can make full use of limited bandwidth resources, thereby improving the efficiency of vector computation, especially neural network vector computation.
Some example embodiments of the present disclosure will be described below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the example environment 100, the processor 101 is a multi-core processor including processor cores 120-1, 120-2, ..., 120-N, where N is an integer greater than 1. For ease of discussion, the processor cores 120-1, 120-2, ..., 120-N are hereinafter collectively or individually referred to as processor cores 120. Each processor core 120 may be a SIMD processor core. In some embodiments, the processor 101 may include four processor cores 120 (i.e., the value of N may be 4).
Of course, it should be understood that, unless otherwise specified, any specific numerical values appearing here and elsewhere herein are exemplary. For example, in other embodiments, a correspondingly different number of processor cores 120 may be provided, depending on factors such as the process node and line width of the tape-out.
The processor 101 further includes a distributor 110. The distributor 110 is communicatively coupled to each processor core 120. That is, the distributor 110 and each processor core 120 can communicate with each other according to appropriate data transmission protocols and/or standards. In operation, the distributor 110 can distribute data 140 and/or instructions 130 to each processor core 120. Note that the distributor 110 may sometimes also be referred to as a "scheduler"; the two terms are used interchangeably in this context.
In some embodiments, the distributor 110 may be implemented as a hardware circuit. The hardware circuit may be integrated or embedded in the processor 101. Alternatively or additionally, the distributor 110 may also be implemented, in whole or in part, by a software module, which is implemented, for example, as executable instructions stored in a memory (not shown).
The distributor 110 is configured to distribute the instructions 130 and/or data 140, received by the processor 101 from other devices (not shown) in the environment 100, to the processor cores 120. The device that sends the instructions 130 and/or data 140 to the distributor 110 is also referred to as the initiating device of a data processing request. In some embodiments, the processor 101 may receive, for example via a bus, the data 140 and/or instructions 130 transmitted by a storage device or another external device in the environment 100, and distribute the received data 140 and/or instructions 130 to the processor cores 120 via the distributor 110. The distribution process of the data 140 and/or instructions 130 will be described below with reference to FIG. 2.
It should be understood that the structure and functions of the environment 100 are described for exemplary purposes only, without implying any limitation on the scope of the present disclosure. For example, the processor 101 may be applied to various existing or future computing platforms or computing systems. The processor 101 may be implemented in various embedded applications (e.g., data processing systems of mobile network base stations) to provide services such as large-scale vector computation. The processor 101 may also be integrated or embedded in various electronic or computing devices to provide various computing services. The application environment and application scenarios of the processor 101 are not limited here.
FIG. 2 shows a schematic diagram of an example architecture 200 for distributing the instructions 130 and the data 140 according to some embodiments of the present disclosure. For ease of discussion, the architecture 200 will be described with reference to the environment 100 of FIG. 1.
As shown in FIG. 2, each of the plurality of processor cores 120 includes a data cache for reading and writing data and an instruction cache, separate from the data cache, for reading instructions. For example, the processor core 120-1 includes an instruction cache 220-1 and a data cache 230-1; the processor core 120-2 includes an instruction cache 220-2 and a data cache 230-2; ...; and the processor core 120-N includes an instruction cache 220-N and a data cache 230-N. For ease of discussion, the instruction caches 220-1, 220-2, ..., 220-N are hereinafter collectively or individually referred to as instruction caches 220, and the data caches 230-1, 230-2, ..., 230-N are hereinafter collectively or individually referred to as data caches 230.
Note that the instruction cache 220 is generally not implemented as a hardware-coherent cache. From the perspective of the processor 101, the instruction cache 220 is read-only. The data cache 230 may include a Vector Closely-Coupled Memory (VCCM). Like the instruction cache 220, the data cache 230 is also generally not implemented as a hardware-coherent cache. Unlike the instruction cache 220, however, the data cache 230 is both readable and writable. By using VCCM as the data cache 230 instead of a hardware-coherent cache, the design complexity of the processor 101 can be reduced, and the processor's cache capacity and clock frequency can be increased. In this way, the data transmission efficiency of the processor 101 can be improved.
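The split between a read-only instruction buffer and a readable/writable VCCM-style data buffer described above can be sketched in a few lines of Python. This is only an illustrative model, not the patent's implementation; all names (`Core`, `vccm`, `double_in_place`) are assumptions introduced for the example.

```python
# Illustrative model of one processor core 120: the instruction buffer is
# written only by the distributor and is read-only from the core's point of
# view, while the VCCM-style data buffer is both readable and writable.

def double_in_place(mem):
    # An example SIMD-style "instruction" that mutates the data buffer.
    for i in range(len(mem)):
        mem[i] *= 2

class Core:
    def __init__(self, vccm_words: int = 8):
        self._icache = []                # filled only by the distributor
        self.vccm = [0] * vccm_words     # read/write data buffer (VCCM)

    def load_instructions(self, instructions):
        # Called by the distributor; the core itself never writes here.
        self._icache = list(instructions)

    def run(self):
        for instr in self._icache:       # execute instructions in order
            instr(self.vccm)
        return list(self.vccm)           # result read back from the VCCM

core = Core(vccm_words=4)
core.vccm[:] = [1, 2, 3, 4]              # distributor writes the data
core.load_instructions([double_in_place])
result = core.run()
print(result)  # [2, 4, 6, 8]
```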
In some embodiments, the distributor 110 is configured to distribute the received instructions 130 and/or data 140 to at least one processor core 120 among the plurality of processor cores 120. For example, the distributor 110 may be configured to distribute the received instructions 130 and/or data 140 to only the first processor core 120-1. As another example, the distributor 110 may distribute the received instructions 130 and/or data 140 to the first processor core 120-1 and the second processor core 120-2. Alternatively, the distributor 110 may distribute the instructions 130 and/or data 140 to every processor core 120 among the plurality of processor cores 120. In some embodiments, the distributor 110 may receive configuration information 210. The configuration information 210 may instruct the distributor 110 to distribute the instructions 130 and/or data 140 to one or some of the plurality of processor cores 120. For example, the configuration information 210 may instruct the distributor 110 to distribute the instructions 130 and/or data 140 to only the first processor core 120-1. As another example, the configuration information 210 may instruct the distributor 110 to distribute the instructions 130 and/or data 140 to every processor core 120 among the plurality of processor cores 120. Alternatively or additionally, the distributor 110 may be preset to distribute the instructions 130 and/or data 140 to one or some processor cores 120, or to all processor cores 120.
In some embodiments, the distributor 110 may receive a data set and an instruction set to be processed by the processor 101. The configuration information 210 received by the distributor 110 may at least indicate an association between the data to be processed in the data set (also referred to as data 140) and the instructions 130 in the instruction set. For example, the association between the data 140 and the instructions 130 may indicate that the data 140 in the data set is to be processed according to the instructions 130 in the instruction set. The distributor 110 may distribute the data 140 and the instructions 130 depending, at least in part, on this association. For example, the distributor 110 distributes the data 140 to the corresponding data caches 230 of at least one processor core 120 among the plurality of processor cores 120, and distributes the instructions 130 associated with the data 140 to the corresponding instruction caches 220 of the at least one processor core 120 for execution.
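The configuration-driven dispatch described above can be sketched as follows. The dict-based configuration format and all names are illustrative assumptions; the patent only requires that the configuration indicate the association between data and instructions, not any particular encoding.

```python
# Sketch of configuration-driven dispatch: the configuration associates each
# pending data item with an instruction and a target core, and the distributor
# routes the data to that core's data cache and the associated instruction to
# the same core's instruction cache.

def dispatch(data_set, instruction_set, config, cores):
    """config maps data index -> (instruction index, target core index)."""
    for data_idx, (instr_idx, core_idx) in config.items():
        core = cores[core_idx]
        core["data_cache"].append(data_set[data_idx])
        core["instr_cache"].append(instruction_set[instr_idx])

cores = [{"data_cache": [], "instr_cache": []} for _ in range(2)]
data_set = ["d0", "d1"]
instruction_set = ["add", "mul"]
config = {0: (0, 0), 1: (1, 1)}   # d0 with "add" -> core 0; d1 with "mul" -> core 1
dispatch(data_set, instruction_set, config, cores)
print(cores[0])  # {'data_cache': ['d0'], 'instr_cache': ['add']}
```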
In some embodiments, the distributor 110 may broadcast the same instruction 130 to at least one processor core. FIG. 3 shows a schematic diagram of broadcasting the instruction 130 to multiple processor cores according to some embodiments of the present disclosure. As shown in FIG. 3, the instruction 130 may include instruction 0, instruction 1, ..., instruction M (where M is an integer greater than or equal to 1). The instruction 130 may be broadcast via the distributor 110 to the processor cores 120-1, 120-2, ..., 120-J (where J is an integer greater than 1). For example, instructions 310-1, 310-2, ..., 310-J, identical to the instruction 130, are broadcast to the processor cores 120-1, 120-2, ..., 120-J. For ease of discussion, the instructions 310-1, 310-2, ..., 310-J are hereinafter collectively or individually referred to as instructions 310. It should be understood that there may be a delay 320 between the time an instruction 310 is received at a processor core 120 and the time the instruction 130 is received at the distributor 110.
It should be understood that although FIG. 3 shows the instruction 130 being broadcast to the processor cores 120-1, 120-2, ..., 120-J, in some embodiments the distributor 110 may broadcast the instruction 130 to more or fewer of the plurality of processor cores 120. For example, the distributor 110 may broadcast the instruction 130 to only one processor core 120. As another example, the distributor 110 may broadcast the instruction 130 to all processor cores 120 (i.e., J equals N). Which processor core or cores 120 the distributor 110 broadcasts the instruction 130 to may be preset, or may be set based on the received configuration information 210.
Alternatively, in some embodiments, the distributor 110 may also use other transmission modes to distribute the instruction 130 to the processor cores 120. For example, the distributor 110 may first send the instruction 130 to the processor core 120-1, then send the instruction 130 to the processor core 120-2, and so on. Compared with such sequential distribution, distributing instructions by broadcast reduces the overhead of repeatedly reading instructions, thereby greatly saving instruction transmission overhead.
In the above example of broadcasting the same instruction to the processor cores, the distributor 110 may send different data to be processed, associated with that instruction, to different processor cores 120. For example, the distributor 110 may broadcast or send first data in the data 140 to the first processor core 120-1 for processing, and broadcast or send second data in the data 140, different from the first data, to the second processor core 120-2 for processing.
Such an arrangement is beneficial. For example, in many computing scenarios such as neural network inference, the same instruction is frequently used to operate on different data. In such scenarios, broadcasting the same instruction and different data to different processor cores 120 can greatly save instruction and data transmission overhead, thereby improving data processing efficiency.
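The "same instruction, different data" pattern described above can be sketched as follows: the instruction is transmitted once, and the pending data is split into per-core shards. The core count, the sharding rule, and the squaring "instruction" are all illustrative assumptions.

```python
# Sketch of broadcasting one instruction while sharding the data: every core
# receives the identical instruction, but only its own slice of the data.

def broadcast_instruction_shard_data(instruction, data, num_cores):
    shard_size = (len(data) + num_cores - 1) // num_cores  # ceiling division
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_cores)]
    # Each element below models one core applying the broadcast instruction
    # to its own shard; a real processor would run these in parallel.
    return [instruction(shard) for shard in shards]

def square_all(shard):          # the broadcast "instruction"
    return [x * x for x in shard]

partials = broadcast_instruction_shard_data(square_all, [1, 2, 3, 4], num_cores=2)
result = [y for part in partials for y in part]   # gather per-core results
print(result)  # [1, 4, 9, 16]
```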
It should be understood that although FIG. 3 only shows an example of broadcasting the same instruction 130 to the processor cores 120, alternatively or additionally, in some embodiments a process similar to that of FIG. 3 may be used to broadcast the same data to the processor cores 120. For example, the data 140 may be broadcast to at least one processor core 120 among the plurality of processor cores 120. In such an example, different instructions associated with the data 140 may be distributed to different processor cores 120. For example, a first instruction in the instructions 130 may be sent or broadcast to the processor core 120-1, so that the processor core 120-1 processes the data 140 based on the first instruction, and a second instruction in the instructions 130 may be sent or broadcast to the processor core 120-2, so that the processor core 120-2 processes the data 140 based on the second instruction.
This manner of distributing the same data and different instructions to the processor cores 120 is applicable to many data processing scenarios, for example data processing with a small data volume but a complex processing flow. For instance, neural network computation often involves applying different computation flows to the same data. The approach described above of distributing the same data and different instructions to the processor cores 120 is well suited to such scenarios. In this way, the overhead of such data and instruction transmission can be greatly reduced, thereby improving data processing efficiency.
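The complementary "same data, different instructions" pattern can be sketched the same way: the data is broadcast once, and each core receives its own instruction. The two example "instructions" (`relu` and `negate`) are illustrative assumptions standing in for different computation flows over one input.

```python
# Sketch of broadcasting the data while splitting the instructions: every
# core receives the identical data, but applies a different instruction,
# e.g. different processing flows over the same input in a neural network.

def broadcast_data_split_instructions(data, instructions):
    # Each element models one core running its own instruction on the
    # broadcast data; a real processor would run these in parallel.
    return [instr(data) for instr in instructions]

def relu(vec):
    return [max(0, x) for x in vec]

def negate(vec):
    return [-x for x in vec]

results = broadcast_data_split_instructions([-1, 2, -3], [relu, negate])
print(results)  # [[0, 2, 0], [1, -2, 3]]
```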
Continuing with FIG. 2, each processor core 120 is configured to execute instructions according to its associated execution pipeline 240-1, 240-2, ..., 240-N. For ease of discussion, the execution pipelines 240-1, 240-2, ..., 240-N are hereinafter collectively or individually referred to as execution pipelines 240. The execution pipeline 240 is configured to process the data in the data cache 230 using the instructions in the instruction cache 220. For example, the execution pipeline 240 may process the data written into the data cache 230 according to the instructions in the instruction cache 220, and send the processing results back to the data cache 230. Alternatively or additionally, each processor core 120 may send the processing results from the data cache 230 to the distributor 110.
In some embodiments, the distributor 110 may respectively receive the processing results from the corresponding data caches 230 of the at least one processor core 120. Additionally, the distributor 110 may send the received processing results of the data to be processed to other devices, such as the initiating device of the data processing request. In this way, the distributor 110 can be responsible for exchanging external data with the data in the data caches 230 of the processor cores 120, thereby reducing the external data exchange of the processor cores 120.
In some embodiments, the distributor 110 may distribute the instructions 130 and the data 140 associated with the instructions to the processor cores 120. Each processor core 120 may process the received data 140 according to the instructions 130 and send the processing results to the distributor 110.
In some embodiments, the distributor 110 may read and write data to the data caches 230 in a cyclic manner. For example, the distributor 110 may distribute third data in the data to be processed (e.g., the data 140) to a first processor core 120-1 among the at least one processor core 120 for processing. In response to receiving from the processor core 120-1 a first result obtained by processing the third data, the distributor 110 distributes fourth data in the data to be processed, different from the third data, to the first processor core 120-1.
FIG. 4 shows a schematic diagram of a processor core 120 cyclically writing data, executing instructions, and reading data according to some embodiments of the present disclosure. As shown in FIG. 4, the distributor 110 sends data 410-1 to the processor core 120. The processor core 120 performs a data write 430-1 on the received data 410-1 to write the data 410-1 into the data cache 230. The processor core 120 performs instruction execution 440-1 according to the received instruction associated with the data 410-1, using, for example, the execution pipeline 240 shown in FIG. 2. The processor core 120 then performs a data read 450-1 of the result of the instruction execution 440-1 from the data cache 230, and sends the read data 420-1 to the distributor 110.
In response to receiving the data 420-1 from the processor core 120, the distributor 110 sends data 410-2 to the processor core 120. The processor core 120 in turn performs a data write 430-2, instruction execution 440-2, and data read 450-2 on the data 410-2, and sends the read data 420-2, corresponding to the processing result of the data 410-2, to the distributor 110.
Similarly, in response to receiving the previous data processing result from the processor core 120, the distributor 110 may send data 410-K (where K may be an integer greater than 1) to the processor core 120. The processor core 120 in turn performs a data write 430-K, instruction execution 440-K, and data read 450-K on the data 410-K, and sends the read data 420-K, corresponding to the processing result of the data 410-K, to the distributor 110. The instructions associated with the data 410-1, 410-2, ..., 410-K may be one and the same instruction, which may be transmitted from the distributor 110 to the processor core 120 only once. In some embodiments, the number K of cyclic read/write and processing rounds may be preset. Alternatively or additionally, the number K of cyclic read/write and processing rounds may be set according to the configuration information 210 received by the distributor 110.
It should be understood that although FIG. 4 only shows the cyclic read/write and processing of data for one processor core 120, a similar process may be used to cyclically read, write, and process data for the other processor cores 120. This cyclic manner facilitates loading large amounts of data into the data caches 230 of the processor cores 120. In this way, an instruction can be distributed only once, while the data associated with the instruction is read, written, and processed cyclically, further reducing instruction and data transmission overhead.
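The cyclic write/execute/read flow of FIG. 4 can be sketched from the distributor's point of view: chunk k+1 is sent only after the result of chunk k has been read back, while the instruction itself is transmitted only once. The chunk contents and the summing "instruction" are illustrative assumptions.

```python
# Sketch of the cyclic flow of FIG. 4, seen from the distributor: for each
# chunk (data 410-1 ... 410-K), write it to the core's data cache, let the
# core execute the (single, already-distributed) instruction, then read the
# result back before sending the next chunk.

def run_cyclically(instruction, chunks):
    results = []
    for chunk in chunks:                    # data 410-1 ... 410-K
        core_buffer = list(chunk)           # data write (430-k)
        result = instruction(core_buffer)   # instruction execution (440-k)
        results.append(result)              # data read (450-k) back to distributor
    return results

outputs = run_cyclically(sum, [[1, 2], [3, 4], [5, 6]])
print(outputs)  # [3, 7, 11]
```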
It should be understood that, unless explicitly indicated otherwise, the data distribution, instruction distribution, and other processes described above may be performed in any appropriate order. The embodiments of data and instruction distribution and the embodiments of reading, writing, and processing data described above may be implemented in combination.
Various embodiments of using the distributor 110 to distribute instructions and/or data to the processor cores 120 have been described above with reference to FIGS. 2 to 4. By adopting the embodiments of the present disclosure, on the one hand, with the distributor distributing data to the data cache of each processor core and each core's own data cache exchanging data directly with the distributor, each processor core can use a large-capacity data cache while the external data exchange of the processor cores is minimized.
On the other hand, the embodiments of this scheme use a centralized data scheduling or distribution mechanism, so broadcasting can easily be applied to transmit data and/or instructions, thereby improving the transmission efficiency of data and instructions. In this way, the scheme can make full use of limited bandwidth resources, thereby improving the efficiency of computations such as vector computation.
For computations such as neural network training and/or inference, the bandwidth required by the underlying computing units is often several or even tens of times the bandwidth that can be provided externally. The scheme of the present disclosure can make full use of limited bandwidth resources, thereby improving the computing efficiency of, for example, neural network accelerators.
FIG. 5 shows a flowchart of a process 500 for data processing according to some embodiments of the present disclosure. The process 500 may be implemented at the distributor 110 of the processor 101. For ease of discussion, the process 500 will be described with reference to the environment 100 of FIG. 1.
At block 510, the distributor 110 distributes the data to be processed (e.g., the data 140) to the corresponding data caches 230 of at least one processor core 120 among the plurality of processor cores 120. For example, the distributor 110 may distribute the data 140 to every processor core 120 among the plurality of processor cores 120, or to one or more of the plurality of processor cores 120. At block 520, the distributor 110 distributes the instructions 130 associated with the data to be processed (e.g., the data 140) to the corresponding instruction caches 220 of the at least one processor core 120 for execution. For example, the distributor 110 may distribute the instructions 130 to every processor core 120 among the plurality of processor cores 120, or to one or more of the plurality of processor cores 120.
In some embodiments, the distributor 110 may distribute the instructions 130 to the at least one processor core 120 by broadcasting the instructions 130 to the at least one processor core 120. In such examples, the distributor 110 may send first data in the data 140 to a first processor core 120-1 among the at least one processor core 120 for processing, and send second data in the data 140, different from the first data, to a second processor core 120-2 among the at least one processor core 120 for processing.
Additionally or alternatively, in some embodiments, the distributor 110 may distribute the data 140 to the at least one processor core 120 by broadcasting the data 140 to the at least one processor core 120. In such examples, the distributor 110 may send a first instruction in the instructions 130 to a first processor core 120-1 among the at least one processor core 120, so that the first processor core 120-1 processes the data 140 based on the first instruction. The distributor 110 may also send a second instruction in the instructions 130, different from the first instruction, to a second processor core 120-2 among the at least one processor core 120, so that the second processor core 120-2 processes the data 140 based on the second instruction.
In some embodiments, at block 530, the distributor 110 is further configured to respectively receive processing results from the corresponding data caches 230 of the at least one processor core 120. The processing results are obtained by the at least one processor core 120 processing the received data 140 according to the instructions 130. For example, the distributor 110 may distribute third data in the data 140 to a first processor core 120-1 among the at least one processor core 120 for processing. In response to receiving from the first processor core 120-1 a first result obtained by processing the third data, the distributor 110 distributes fourth data in the data 140, different from the third data, to the first processor core 120-1 for processing.
In some embodiments, the distributor 110 is further configured to receive a data set and an instruction set to be processed by the processor 101. The distributor 110 is further configured to receive configuration information. The configuration information at least indicates an association between the data to be processed (e.g., the data 140) in the data set and the instructions 130 in the instruction set. For example, the association between the data 140 and the instructions 130 may indicate that the data 140 is to be processed by the instructions 130. In such examples, the distribution of the data 140 and the instructions 130 may depend at least in part on the association.
It should be understood that although the steps are shown in a particular order in FIG. 5, some or all of these steps may be performed in other orders or in parallel. For example, block 510 in FIG. 5 may be performed before or after block 520. The scope of the present disclosure is not limited in this respect.
FIG. 6 shows a block diagram of an electronic device 600 in which a processor 101 according to one or more embodiments of the present disclosure may be included. It should be understood that the electronic device 600 shown in FIG. 6 is merely exemplary and should not constitute any limitation on the functions and scope of the embodiments described herein.
As shown in FIG. 6, the electronic device 600 is in the form of a general-purpose electronic device or computing device. The components of the electronic device 600 may include, but are not limited to, one or more processors 101, a memory 620, a storage device 630, one or more communication units 640, one or more input devices 650, and one or more output devices 660. In some embodiments, the processor 101 may perform various processing according to programs stored in the memory 620. The processor cores 120 in the processor 101 may execute computer-executable instructions in parallel to improve the parallel processing capability of the electronic device 600.
The electronic device 600 typically includes multiple computer storage media. Such media may be any available media accessible to the electronic device 600, including but not limited to volatile and nonvolatile media, and removable and non-removable media. The memory 620 may be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 630 may be a removable or non-removable medium, and may include machine-readable media such as a flash drive, a magnetic disk, or any other medium that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the electronic device 600.
The electronic device 600 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in FIG. 6, a disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disc drive for reading from or writing to a removable, nonvolatile optical disc may be provided. In these cases, each drive may be connected to the bus (not shown) by one or more data media interfaces. The memory 620 may include a computer program product 625 having one or more program modules configured to perform the various methods or actions of the various embodiments of the present disclosure. For example, these program modules may be configured to implement the various functions or actions of the distributor 110.
The communication unit 640 communicates with other electronic devices or computing devices through communication media. Additionally, the functionality of the components of the electronic device 600 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over communication connections. Accordingly, the electronic device 600 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.
The input device 650 may be one or more input devices, such as a mouse, a keyboard, or a trackball. The output device 660 may be one or more output devices, such as a display, speakers, or a printer. The electronic device 600 may also communicate, as needed via the communication unit 640, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the electronic device 600, or with any device (e.g., a network card, a modem) that enables the electronic device 600 to communicate with one or more other electronic devices or computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to example implementations of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions are stored, and the computer-executable instructions are executed by a processor to implement the method described above. According to example implementations of the present disclosure, a computer program product is also provided; the computer program product is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, devices, and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operational steps to be performed thereon to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
According to one or more embodiments of the present disclosure, Example 1 describes a processor including a plurality of processor cores. Each of the plurality of processor cores includes a data cache for reading and writing data and an instruction cache, separate from the data cache, for reading instructions. The processor further includes a distributor communicatively coupled to the plurality of processor cores. The distributor is configured to: distribute data to be processed to the corresponding data caches of at least one processor core among the plurality of processor cores; and distribute instructions associated with the data to be processed to the corresponding instruction caches of the at least one processor core for execution.
According to one or more embodiments of the present disclosure, Example 2 includes the processor according to Example 1, wherein distributing the instructions to the at least one processor core includes: broadcasting the instructions to the at least one processor core.
According to one or more embodiments of the present disclosure, Example 3 includes the processor according to Example 2, wherein distributing the data to be processed to the at least one processor core includes: sending first data in the data to be processed to a first processor core for processing; and sending second data in the data to be processed to a second processor core for processing. The second data is different from the first data.
According to one or more embodiments of the present disclosure, Example 4 includes the processor according to Example 1, wherein distributing the data to be processed to the at least one processor core includes: broadcasting the data to be processed to the at least one processor core.
According to one or more embodiments of the present disclosure, Example 5 includes the processor according to Example 4, wherein distributing the instructions to the at least one processor core includes: sending a first instruction to the first processor core, so that the first processor core processes the data to be processed based on the first instruction; and sending a second instruction to the second processor core, so that the second processor core processes the data to be processed based on the second instruction. The first instruction is different from the second instruction.
According to one or more embodiments of the present disclosure, Example 6 includes the processor according to Example 1, wherein the distributor is further configured to: respectively receive processing results from the corresponding data caches of the at least one processor core. The processing results are obtained by the at least one processor core processing the received data to be processed according to the instructions.
According to one or more embodiments of the present disclosure, Example 7 includes the processor according to Example 6, wherein distributing the data to be processed to the at least one processor core includes: distributing third data in the data to be processed to a first processor core among the at least one processor core for processing; and, in response to receiving from the first processor core a first result obtained by processing the third data, distributing fourth data in the data to be processed to the first processor core for processing, the third data being different from the fourth data.
According to one or more embodiments of the present disclosure, Example 8 includes the processor according to Example 1, wherein the distributor is further configured to: receive a data set and an instruction set to be processed by the processor; and receive configuration information. The configuration information at least indicates an association between the data to be processed in the data set and the instructions in the instruction set. The distribution of the data to be processed and the instructions depends at least in part on the association.
According to one or more embodiments of the present disclosure, Example 9 describes a method for data processing. The method includes: distributing, by a distributor of a processor, data to be processed to the corresponding data caches of at least one processor core among a plurality of processor cores of the processor. The distributor is communicatively coupled to the plurality of processor cores. The method further includes distributing instructions associated with the data to be processed to the corresponding instruction caches of the at least one processor core for execution.
According to one or more embodiments of the present disclosure, Example 10 includes the method according to Example 9, wherein distributing the instructions to the at least one processor core includes: broadcasting the instructions to the at least one processor core.
According to one or more embodiments of the present disclosure, Example 11 includes the method according to Example 10, wherein distributing the data to be processed to the at least one processor core includes: distributing first data in the data to be processed to the first processor core for processing; and distributing second data in the data to be processed to the second processor core for processing. The first data is different from the second data.
According to one or more embodiments of the present disclosure, Example 12 includes the method according to Example 9, wherein distributing the data to be processed to the at least one processor core includes: broadcasting the data to be processed to the at least one processor core.
According to one or more embodiments of the present disclosure, Example 13 includes the method according to Example 12, wherein distributing the instructions to the at least one processor core includes: sending a first instruction to the first processor core, so that the first processor core processes the data to be processed based on the first instruction; and sending a second instruction to the second processor core, so that the second processor core processes the data to be processed based on the second instruction. The first instruction is different from the second instruction.
According to one or more embodiments of the present disclosure, Example 14 includes the method according to Example 9. The method further includes: respectively receiving processing results from the corresponding data caches of the at least one processor core. The processing results are obtained by the at least one processor core processing the received data to be processed according to the instructions.
According to one or more embodiments of the present disclosure, Example 15 includes the method according to Example 14, wherein distributing the data to be processed to the at least one processor core includes: distributing third data in the data to be processed to a first processor core among the at least one processor core for processing; and, in response to receiving from the first processor core a first result obtained by processing the third data, distributing fourth data in the data to be processed to the first processor core for processing. The third data is different from the fourth data.
According to one or more embodiments of the present disclosure, Example 16 includes the method according to Example 9. The method further includes: receiving a data set and an instruction set to be processed by the processor; and receiving configuration information. The configuration information at least indicates an association between the data to be processed in the data set and the instructions in the instruction set. The distribution of the data to be processed and the instructions depends at least in part on the association.
According to one or more embodiments of the present disclosure, Example 17 describes an electronic device that includes at least the processor according to any one of Examples 1 to 8.
According to one or more embodiments of the present disclosure, Example 18 describes a computer-readable storage medium having a computer program stored thereon. The computer program is executed by a processor to implement the method according to any one of Examples 9 to 16.
The flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple implementations of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of instructions that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.
The implementations of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terminology used herein is chosen to best explain the principles of the implementations, their practical applications, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (18)

  1. A processor, comprising:
    a plurality of processor cores, each of the plurality of processor cores including a data cache for reading and writing data and an instruction cache, separate from the data cache, for reading instructions; and
    a distributor communicatively coupled to the plurality of processor cores and configured to:
    distribute data to be processed to a corresponding data cache of at least one processor core among the plurality of processor cores; and
    distribute instructions associated with the data to be processed to a corresponding instruction cache of the at least one processor core for execution.
  2. The processor of claim 1, wherein distributing the instructions to the at least one processor core comprises:
    broadcasting the instructions to the at least one processor core.
  3. The processor of claim 2, wherein distributing the data to be processed to the at least one processor core comprises:
    sending first data in the data to be processed to the first processor core for processing; and
    sending second data in the data to be processed to the second processor core for processing, the second data being different from the first data.
  4. The processor of claim 1, wherein distributing the data to be processed to the at least one processor core comprises:
    broadcasting the data to be processed to the at least one processor core.
  5. The processor of claim 4, wherein distributing the instructions to the at least one processor core comprises:
    sending a first instruction to the first processor core, so that the first processor core processes the data to be processed based on the first instruction; and
    sending a second instruction to the second processor core, so that the second processor core processes the data to be processed based on the second instruction, the first instruction being different from the second instruction.
  6. The processor of claim 1, wherein the distributor is further configured to:
    respectively receive processing results from the corresponding data caches of the at least one processor core, the processing results being obtained by the at least one processor core processing the received data to be processed according to the instructions.
  7. The processor of claim 6, wherein distributing the data to be processed to the at least one processor core comprises:
    distributing third data in the data to be processed to a first processor core among the at least one processor core for processing; and
    in response to receiving, from the first processor core, a first result obtained by processing the third data, distributing fourth data in the data to be processed to the first processor core for processing, the third data being different from the fourth data.
  8. The processor of claim 1, wherein the distributor is further configured to:
    receive a data set and an instruction set to be processed by the processor; and
    receive configuration information, the configuration information at least indicating an association between the data to be processed in the data set and the instructions in the instruction set,
    wherein the distribution of the data to be processed and the instructions depends at least in part on the association.
  9. A method for data processing, comprising:
    distributing, by a distributor of a processor, data to be processed to a corresponding data cache of at least one processor core among a plurality of processor cores of the processor, the distributor being communicatively coupled to the plurality of processor cores; and
    distributing instructions associated with the data to be processed to a corresponding instruction cache of the at least one processor core for execution.
  10. The method of claim 9, wherein distributing the instructions to the at least one processor core comprises:
    broadcasting the instructions to the at least one processor core.
  11. The method of claim 10, wherein distributing the data to be processed to the at least one processor core comprises:
    distributing first data in the data to be processed to the first processor core for processing; and
    distributing second data in the data to be processed to the second processor core for processing, the first data being different from the second data.
  12. The method of claim 9, wherein distributing the data to be processed to the at least one processor core comprises:
    broadcasting the data to be processed to the at least one processor core.
  13. The method of claim 12, wherein distributing the instructions to the at least one processor core comprises:
    sending a first instruction to the first processor core, so that the first processor core processes the data to be processed based on the first instruction; and
    sending a second instruction to the second processor core, so that the second processor core processes the data to be processed based on the second instruction, the first instruction being different from the second instruction.
  14. The method of claim 9, further comprising:
    respectively receiving processing results from the corresponding data caches of the at least one processor core, the processing results being obtained by the at least one processor core processing the received data to be processed according to the instructions.
  15. The method of claim 14, wherein distributing the data to be processed to the at least one processor core comprises:
    distributing third data in the data to be processed to a first processor core among the at least one processor core for processing; and
    in response to receiving, from the first processor core, a first result obtained by processing the third data, distributing fourth data in the data to be processed to the first processor core for processing, the third data being different from the fourth data.
  16. The method of claim 9, further comprising:
    receiving a data set and an instruction set to be processed by the processor; and
    receiving configuration information, the configuration information at least indicating an association between the data to be processed in the data set and the instructions in the instruction set,
    wherein the distribution of the data to be processed and the instructions depends at least in part on the association.
  17. An electronic device, comprising at least the processor according to any one of claims 1 to 8.
  18. A computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the method according to any one of claims 9 to 16.
PCT/CN2023/098714 2022-06-14 2023-06-06 Processor and method, device and storage medium for data processing WO2023241417A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210674851.9A CN117271427A (zh) 2022-06-14 2022-06-14 处理器以及用于数据处理的方法、设备和存储介质
CN202210674851.9 2022-06-14

Publications (1)

Publication Number Publication Date
WO2023241417A1 true WO2023241417A1 (zh) 2023-12-21

Family

ID=89192161

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098714 WO2023241417A1 (zh) 2022-06-14 2023-06-06 处理器以及用于数据处理的方法、设备和存储介质

Country Status (2)

Country Link
CN (1) CN117271427A (zh)
WO (1) WO2023241417A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375801A (zh) * 2011-08-23 2012-03-14 孙瑞琛 Multi-core processor storage system apparatus and method
US20190294570A1 (en) * 2018-06-15 2019-09-26 Intel Corporation Technologies for dynamic multi-core network packet processing distribution
CN112558861A (zh) * 2020-09-29 2021-03-26 北京清微智能科技有限公司 Data loading and storage system and method for multi-core processor arrays


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIN-GUANG LIU, LIANG MAN-GUI: "The Development and the Software System Architecture of Multi-core Multi-threading Processor ", MICROPROCESSORS, vol. 1, 28 February 2007 (2007-02-28), pages 1 - 3, 7, XP093118107 *

Also Published As

Publication number Publication date
CN117271427A (zh) 2023-12-22

Similar Documents

Publication Publication Date Title
US11334262B2 (en) On-chip atomic transaction engine
US8473681B2 (en) Atomic-operation coalescing technique in multi-chip systems
US8368701B2 (en) Metaprocessor for GPU control and synchronization in a multiprocessor environment
US20070106827A1 (en) Centralized interrupt controller
US20020083373A1 (en) Journaling for parallel hardware threads in multithreaded processor
JP6580307B2 (ja) Multi-core device and job scheduling method for a multi-core device
US10078879B2 (en) Process synchronization between engines using data in a memory location
JP2005507115A (ja) Multi-core, multi-threaded processor
CN108351783A (zh) Method and apparatus for processing tasks in a multi-core digital signal processing system
WO2016082800A1 (zh) Memory management method and apparatus, and memory controller
CN111404931B (zh) Remote data transmission method based on persistent memory
CN116414464B (zh) Method and apparatus for scheduling tasks, electronic device, and computer-readable medium
CN112527729A (zh) Tightly-coupled heterogeneous multi-core processor architecture and processing method thereof
CN103019655A (zh) Memory copy acceleration method and apparatus for multi-core microprocessors
CN103324599A (zh) Inter-processor communication method and system-on-chip
CN102681890A (zh) Restrictive value-passing method and apparatus applied to thread-level speculative parallelism
JP4584935B2 (ja) Operation-model-based multithreaded architecture
WO2023241417A1 (zh) Processor and method, device and storage medium for data processing
CN102736949B (zh) Improving scheduling of tasks to be executed by a non-coherent device
US20220067536A1 (en) Processor system and method for increasing data-transfer bandwidth during execution of a scheduled parallel process
CN101305353B (zh) Centralized interrupt controller
US12026628B2 (en) Processor system and method for increasing data-transfer bandwidth during execution of a scheduled parallel process
TWI823655B (zh) Task processing system and task processing method suitable for intelligent processors
CN103257943A (zh) Centralized interrupt controller
CN106294276B (zh) Issue module suitable for a coarse-grained multi-core computing system and operating method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23822986

Country of ref document: EP

Kind code of ref document: A1