CN117311812A - Method for reordering buffer and related products thereof - Google Patents


Info

Publication number
CN117311812A
CN117311812A (Application CN202210707245.2A)
Authority
CN
China
Prior art keywords
instruction
corresponding data
read request
reorder buffer
data
Prior art date
Legal status
Pending
Application number
CN202210707245.2A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202210707245.2A
Publication of CN117311812A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a method for reorder buffering and related products, where the related products include reorder buffers, artificial intelligence processors, devices, board cards, and computer-readable storage media. The apparatus may be included in a computing processing device of a combined processing device, and the computing processing device may include one or more data processing devices. The combined processing device may also include an interface device and other processing devices. The computing processing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device connected to the device and the other processing devices, respectively, for storing data of the device and the other processing devices. With the disclosed scheme, data access efficiency can be significantly improved.

Description

Method for reordering buffer and related products thereof
Technical Field
The present disclosure relates generally to the field of computers. More particularly, the present disclosure relates to a method for reorder buffering, a reorder buffer device for performing the foregoing method, an artificial intelligence processor, a board card, a device, and a computer-readable storage medium.
Background
In conventional processor design, out-of-order instruction processing is commonly adopted: instructions that have no dependency on one another are processed out of order. To implement out-of-order processing, conventional processors typically use a reorder buffer (ROB) to record the program order of all instructions and then retire the instructions in that order once they have finished executing. For an efficient artificial intelligence (AI) processor, because the instructions in its instruction set are tailored to the needs of AI algorithms, the amount of data a single instruction may process, the functions it implements, and/or the computation time it requires can far exceed those of a conventional processor. Therefore, how to use a reorder buffer efficiently in AI computing scenarios so as to accelerate processing has become an important research topic.
Disclosure of Invention
In view of the technical problems mentioned in the Background section, the present disclosure proposes an efficient reorder buffering scheme. With the disclosed scheme, the data access speed of an artificial intelligence processor can be significantly improved, thereby accelerating instruction execution and the overall performance of the processor. To this end, the present disclosure provides a reorder buffering scheme in the following aspects.
In a first aspect, the present disclosure provides a reorder buffer device disposed in an artificial intelligence processor, the artificial intelligence processor further comprising a storage device and an execution device, the reorder buffer device comprising: a reorder buffer circuit configured to retain, as resident data, data read from the storage device in response to a previous read request; a receiving circuit configured to receive a current read request, wherein the current read request is for reading corresponding data from a storage address of the storage device; a determination circuit configured to determine not to send the current read request when the storage address read by the current read request is the same as the storage address read by the previous read request; and a sending circuit configured to read, from the reorder buffer circuit, the resident data corresponding to the storage address and send it to the execution device as the data read by the current read request.
In a second aspect, the present disclosure provides an artificial intelligence processor comprising: a storage device for storing data; an execution device comprising one or more processing cores for executing tasks and configured to send read requests and/or write requests directed to the storage device; and a reorder buffer device as described in the first aspect above, which is arranged between the execution device and the storage device to at least implement a reorder buffer operation.
In a third aspect, the present disclosure provides a method for reorder buffering for use in a reorder buffer device included in an artificial intelligence processor, the artificial intelligence processor further comprising a storage device and an execution device, the method comprising: retaining, as resident data, data read from the storage device in response to a previous read request; receiving a current read request for reading corresponding data from a storage address of the storage device; when the storage address read by the current read request is the same as the storage address read by the previous read request, determining not to send the current read request; and reading the resident data corresponding to the storage address and sending it to the execution device as the data read by the current read request.
In a fourth aspect, the present disclosure provides a device for reorder buffering, comprising: a processor; and a memory storing program instructions for reorder buffering which, when executed by the processor, cause the method according to the third aspect to be implemented.
In a fifth aspect, the present disclosure provides a computer-readable storage medium storing program instructions for reorder buffering which, when executed by a processor, cause the method according to the third aspect to be implemented.
With the arrangements provided in the aspects above, the reorder buffer device of the present disclosure can reduce the number of requests handled by a ROB device shared by a plurality of execution devices, thereby saving memory-access bandwidth. Further, in some embodiments, the return of data to the execution device can be accelerated by having data reside in the reorder buffer. In other embodiments, merging the current read request with a previous read request based on the storage address, i.e., returning the corresponding data of the previous read request instead of servicing the current read request separately, can speed up read-request processing and make full use of the data caching capability of the ROB device.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure, will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:
FIG. 1 is a simplified block diagram schematically illustrating an artificial intelligence processor according to an embodiment of the present disclosure;
FIG. 2 is a detailed structural block diagram schematically illustrating a reorder buffer according to an embodiment of the present disclosure;
FIG. 3 is a simplified flowchart schematically illustrating a method for reorder buffering according to the present disclosure;
FIG. 4 is a flow chart schematically illustrating details of a method for reorder buffering according to an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a software and hardware architecture for data stream programming according to an embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating the internal structure of a computing device according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating the internal architecture of a processor core according to an embodiment of the present disclosure; and
FIG. 10 is a schematic diagram illustrating a data write process between processor cores of different clusters according to an embodiment of the present disclosure.
Detailed Description
Technical aspects of the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings; it is apparent that the described embodiments are some, but not all, embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments herein without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification, and drawings of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in this specification and claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in this disclosure and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
To improve data access efficiency in an artificial intelligence computing platform, the present disclosure innovatively proposes providing a reorder buffer ("ROB") device supporting vector operations between an execution device in an artificial intelligence processor and a network on chip ("NoC"). With such a ROB device, access requests (including read requests and/or write requests) from the execution device can be processed uniformly. In addition, a NoC typically completes transactions out of order, and the ROB device of the present disclosure supports this out-of-order operation by recording read requests and/or write requests from the execution device in order.
In the context of the present disclosure, the above-described network on chip refers to an on-chip interconnect that integrates a large number of computing resources on a single chip and connects these resources. Various devices in the chip (e.g., execution devices or control devices) may access the network on chip through respective interfaces and communicate with their destination modules using shared network resources. In aspects of the present disclosure, a processor core in an execution device may be coupled, by means of a ROB device, to an off-chip storage device via the NoC in order to read corresponding data from, or write corresponding data to, the off-chip storage device. The off-chip storage device may be, for example, a dynamic random access memory ("DRAM").
As previously described, to achieve efficient data access by the execution device, the present disclosure proposes to "merge" read requests at the ROB device to avoid frequent reads of the off-chip storage device. Specifically, for multiple read requests targeting the same storage address, the scheme of the present disclosure causes the corresponding data obtained from the storage device for the initial read request to reside at the ROB device. On this basis, a subsequently received read request for the same storage address can be "merged" with the previous read request; that is, for the subsequently received read request, the scheme of the present disclosure no longer reads the corresponding data from the storage device. Instead, the ROB device returns the corresponding data directly from the resident data to the execution device. In some implementations, such a "merge" operation may be initiated or triggered in different ways, such as by hardware or software instructions. Likewise, such "merge" operations may also be cancelled through corresponding mechanisms, making the residency of data at the ROB device more flexible and controllable.
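Purely as an illustration of the behavior just described, the following minimal Python sketch models the read-request "merge" at a ROB device. It is a hypothetical behavioral model, not the patented circuit; all names (ReorderBufferModel, memory, the sample address) are invented for this example.

    # A minimal behavioral sketch of the read-request "merge" described above.
    # Hypothetical model only; all names are invented for illustration.
    class ReorderBufferModel:
        def __init__(self, memory):
            self.memory = memory      # stands in for the off-chip storage device
            self.resident = {}        # storage address -> resident data

        def read(self, address):
            # If a previous read request already fetched this address,
            # "merge": serve the resident copy without touching memory.
            if address in self.resident:
                return self.resident[address]
            # Otherwise forward the request and make the result resident.
            data = self.memory[address]
            self.resident[address] = data
            return data

    memory = {0x1000: "tensor-block-A"}
    rob = ReorderBufferModel(memory)
    assert rob.read(0x1000) == "tensor-block-A"  # first read goes to memory
    assert rob.read(0x1000) == "tensor-block-A"  # merged: served from residency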
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 is a simplified block diagram schematically illustrating an artificial intelligence ("AI") processor 100 according to an embodiment of the disclosure. It will be appreciated that the artificial intelligence processor here may be, for example, the chip 601 described below in connection with FIG. 6, and has one or more processor cores so that multiple tasks may be performed in parallel.
As shown in FIG. 1, the artificial intelligence processor 100 may include an execution device 102, a storage device 104, and a reorder buffer device 106 disposed between them. In the present disclosure, the execution device may be an operator, an arithmetic unit, or a processor core in the artificial intelligence processor, and may be used to perform various computation tasks, including computation tasks on tensor- or vector-type data. In some scenarios, the execution device here may be, for example, the computing device 701 shown in FIG. 7 and may comprise a plurality of processor cores, where each processor core may independently initiate data access requests, such as read requests and write requests, to the storage device 104 for data in the storage device. The storage device 104 may be various types of off-chip memory, such as dynamic random access memory ("DRAM"), which stores various data or intermediate results related to the computing tasks executed by the execution device.
To achieve efficient data access, the ROB device 106 of the present disclosure is disposed between the execution device 102 and the storage device 104 to uniformly handle data interactions between them. To implement the "merge" operation on read requests, the ROB device 106 may record a number of different read requests that have been issued, forming a history read-request record, where each read request is associated with a certain storage address of the storage device from which data is to be read. In one implementation, a dedicated storage space may be provided in the ROB device 106 for storing the storage addresses of these read requests (i.e., storage addresses of the storage device), so that previous read requests can be distinguished and recorded by their storage addresses.
Further, the ROB device 106 may cause the data corresponding to these read requests to reside in the ROB device, such as in the reorder buffer circuit mentioned below. On this basis, when a read request is received whose storage address is the same as that of a read request in the history record, the ROB device 106 may "merge" the received read request with the previously recorded read request. Stated another way, the ROB device 106 will not forward the received read request to the storage device; instead, it directly feeds back to the execution device 102 the corresponding data of the historical read request matching the received read request. In some application scenarios, to control the ROB device flexibly, the scheme of the present disclosure also proposes using various instructions to enable or disable the "merge" operation and "data residency" operation described above. Thus, the disclosed ROB device can operate in a more user-friendly and efficient manner, improving the data access efficiency of the execution device.
FIG. 2 is a detailed block diagram schematically illustrating the reorder buffer device 106 according to an embodiment of the present disclosure. Since the reorder buffer device 106 shown in FIG. 2 is the same as the reorder buffer device shown in FIG. 1, the foregoing description of the reorder buffer device applies equally to the description below.
As shown in FIG. 2, the reorder buffer device 106 may include a reorder buffer circuit 108, a receiving circuit 110, a determination circuit 112, and a sending circuit 114. In one embodiment, the reorder buffer circuit 108 is configured to retain, as resident data, data read from the storage device in response to a previous read request. The previous read requests here may include a plurality of different read requests recorded since the reorder buffer circuit 108 was enabled, with the plurality of read requests directed to data at different storage addresses of the storage device.
To achieve efficient control over the data residing in the reorder buffer circuit, the reorder buffer circuit 108 may also be configured to keep the corresponding data resident during execution of a first instruction containing one or more read requests for the same storage address, and to release the corresponding data residing in the reorder buffer after execution of the first instruction ends. It can be appreciated that, in this application scenario, the ROB device of the present disclosure may reside and release data in units of instructions. On this basis, when the execution device is the cluster 805 of FIG. 8 of the present disclosure, the data resulting from multiple read requests of the same instruction may be shared by multiple (e.g., four, as shown in FIG. 8) processor cores 806 in the same cluster 805. When execution of the instruction ends, the ROB device may release the data corresponding to those read requests.
In some application scenarios, when it is detected during execution of the first instruction that a second instruction reads the corresponding data in the reorder buffer, the reorder buffer will keep the corresponding data resident rather than releasing it when the first instruction finishes, as described above. This residence-extension mechanism improves the utilization of resident data and avoids fetching the same data from the storage device repeatedly.
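As a sketch of this instruction-scoped residency (including the residence extension just described), the following hypothetical Python model keeps data resident while any instruction that read it is still executing, and releases it only when the last such instruction finishes. The class name and the use of instruction identifiers are assumptions made for illustration only.

    # Hypothetical sketch of instruction-scoped residency and its extension.
    class InstructionScopedResidency:
        def __init__(self):
            self.users = {}   # storage address -> ids of instructions still executing

        def on_read(self, instr_id, address):
            # A read by an instruction keeps the data at `address` resident.
            self.users.setdefault(address, set()).add(instr_id)

        def on_instruction_done(self, instr_id):
            # Release addresses that no executing instruction still needs.
            released = []
            for address, ids in list(self.users.items()):
                ids.discard(instr_id)
                if not ids:
                    del self.users[address]
                    released.append(address)
            return released

    res = InstructionScopedResidency()
    res.on_read("first", 0x1000)
    res.on_read("second", 0x1000)                    # second instruction extends residency
    assert res.on_instruction_done("first") == []    # still held by "second"
    assert res.on_instruction_done("second") == [0x1000]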
As the receiving end of the reorder buffer device 106, the receiving circuit 110 may be configured to receive a current read request for reading corresponding data from a storage address of the storage device. Depending on the scenario, the current read request here may be a read request from the execution device or from the control device. The control device is used to perform control operations on the execution device and, in some scenarios, forwards a read request of the execution device to the receiving circuit 110. In other scenarios, the control device may send a read request to the receiving circuit 110 on its own, while the corresponding data fed back for that read request may still be sent directly from the reorder buffer to the execution device.
To facilitate efficient control of the ROB device, the present disclosure also proposes using hardware instructions to cause the ROB device to perform data residency and release operations. In view of this, the receiving circuit 110 may be configured to receive a third instruction for starting residency of corresponding data in the reorder buffer circuit. In response to the receiving circuit 110 receiving the third instruction, the reorder buffer circuit may be configured to initiate residency of the corresponding data at the storage address of the storage device, so that the execution device directly reads the corresponding data from the reorder buffer circuit via a read request. In the scenario where the corresponding data resides in the ROB device due to the third instruction, the receiving circuit may be further configured to receive a write request for the storage address, and the reorder buffer circuit may be further configured to automatically release the resident corresponding data in response to receiving the write request for that storage address.
In other words, in the above application scenario, data residency in the ROB device is started by the third instruction, and the corresponding data is released upon detection of a corresponding write request. The release is performed because a write request for the storage address will overwrite or update the data at that address, at which point the resident data in the ROB device would differ from the data in the storage device; it is therefore released from the ROB device. As for the third instruction, in one embodiment it may be a synchronization instruction, one of whose fields may indicate its use as an instruction to start data residency in the ROB device. In addition, in the context of the present disclosure, the release operation may have different implementations, such as deleting or removing the corresponding data directly from the reorder buffer circuit, or marking it invalid.
Depending on the application scenario, the above write request and read request may come from the same execution device. Additionally or alternatively, they may come from different execution devices performing the same task. Thus, when read requests for the same storage address arrive from multiple different execution devices, the scheme of the present disclosure only needs to process one of them, i.e., the read requests are "merged", thereby reducing the number of read requests to be processed. Similarly, a write request from one execution device for the same storage address can release, from the ROB device, resident data previously shared by multiple execution devices.
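The automatic release on write described above can be pictured with a small sketch: a write request to a resident address updates the storage device and invalidates the resident copy, since the copy would otherwise diverge from the stored data. This is an illustrative model under those assumptions, not the hardware mechanism itself; all names are hypothetical.

    # Sketch of the automatic release on write described above (illustrative).
    def on_write(resident, memory, address, data):
        memory[address] = data       # the write updates the storage device
        resident.pop(address, None)  # release any now-stale resident copy

    memory = {0x2000: "old"}
    resident = {0x2000: "old"}
    on_write(resident, memory, 0x2000, "new")
    assert 0x2000 not in resident    # stale copy released
    assert memory[0x2000] == "new"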
To implement a "merge" operation of a read request, the determination circuit 112 may be configured to determine not to send the current read request to the storage device when the storage address read by the current read request is the same as the storage address read by a previous read request. It is determined here that the current read request is not sent, i.e. is combined with the corresponding previous read request, such that the current read request is replaced by the corresponding previous read request and the read request forwarding operation is no longer performed to the storage device this time.
In response to the determination not to send the current read request to the storage device, the sending circuit 114 may be configured to read, from the reorder buffer circuit, the resident data corresponding to the storage address and send it to the execution device as the data read by the current read request. It can be seen that, since data corresponding to the same storage address already resides in the reorder buffer circuit, the operation of reading the data for the current read request from the storage device is no longer performed. Instead, the resident data previously fetched for that storage address is read directly from the reorder buffer circuit. Because both requests target the same storage address, the data that would be obtained from the storage device for the current read request is identical to that obtained for the previous read request. On this basis, the scheme of the present disclosure directly uses the resident data previously stored in the ROB device as the feedback for the current read request.
In some application scenarios, to enable flexible manipulation of data in the ROB device, the receiving circuit is further configured to receive a fourth instruction for residing corresponding data in the reorder buffer, and the reorder buffer circuit is further configured to reside the corresponding data in response to receiving the fourth instruction. Correspondingly, the receiving circuit is further configured to receive a fifth instruction for releasing the corresponding data from the reorder buffer, and the reorder buffer circuit is further configured to release the corresponding data in response to receiving the fifth instruction. It will be appreciated that the fourth instruction enables a prefetch of resident data: the execution device may cause the corresponding data to reside in the ROB device in advance without it being returned immediately. In view of this, the fourth instruction may be regarded as a prefetch instruction for the corresponding data, and it may be included in an IO instruction to the storage device, thereby reducing instruction-processing latency. In addition, the fifth instruction may be a data release instruction sent by the execution device to the ROB device; the ordering guarantee between the execution device and the ROB device then ensures that the ROB device releases resident data in the proper order.
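The fourth/fifth-instruction pair can likewise be sketched as a prefetch followed by an explicit release. Since the instruction encodings are not specified in this description, plain method calls stand in for the instructions; all names are hypothetical.

    # Sketch of prefetch (fourth instruction) and explicit release (fifth).
    class PrefetchableROB:
        def __init__(self, memory):
            self.memory = memory
            self.resident = {}

        def prefetch(self, address):     # stands in for the fourth instruction
            self.resident[address] = self.memory[address]

        def release(self, address):      # stands in for the fifth instruction
            self.resident.pop(address, None)

    rob = PrefetchableROB({0x3000: "weights"})
    rob.prefetch(0x3000)                 # data resides before any read arrives
    assert rob.resident[0x3000] == "weights"
    rob.release(0x3000)                  # execution device releases it
    assert 0x3000 not in rob.resident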
While the reorder buffer device 106 of the present disclosure has been described in detail above with respect to FIG. 2, it is to be understood that the above description is illustrative only and not limiting, and that those skilled in the art may modify the implementation of the reorder buffer device in light of the teachings of the present disclosure. For example, although the reorder buffer device is depicted as an arrangement of circuits, it may also be implemented by software modules or units. For example, the receiving circuit and the sending circuit may be replaced by a transceiving unit implemented in software to receive read requests and/or write requests and transfer resident data.
FIG. 3 is a simplified flowchart schematically illustrating a method 300 for reorder buffering according to the present disclosure. It will be appreciated that the method 300 may be performed by the reorder buffer device illustrated in FIGS. 1 and 2; the above description of the reorder buffer device therefore applies equally to the description of the method 300 below.
As shown in FIG. 3, at step S302, data read from the storage device in response to a previous read request is made resident. As previously described, the previous read request here may be a data read request previously issued by the execution device to the storage device, and residency means storing the read data in the reorder buffer. Next, at step S304, a current read request for reading corresponding data from a storage address of the storage device is received.
At step S306, when the storage address read by the current read request is the same as the storage address read by the previous read request, it is determined that the current read request is not to be sent. By checking whether the storage addresses are the same, it can be determined whether the previous read request can substitute for the current read request, i.e., whether the resident data of the previous read request can serve the current read request. Finally, at step S308, because the storage addresses are the same, the resident data corresponding to the storage address is read and sent to the execution device as the data read by the current read request. Clearly, with such a "merge" operation, the scheme of the present disclosure avoids repeated accesses to the storage device for identical read requests (for the same storage address), thereby improving the access efficiency of the data (in particular tensor or vector data).
FIG. 4 is a detailed flowchart schematically illustrating a method 400 for reorder buffering according to an embodiment of the present disclosure. It will be appreciated that the method 400 may be performed by the reorder buffer device illustrated in FIGS. 1 and 2 and may be considered one possible implementation of the method 300. Accordingly, the description made above in connection with FIGS. 1-3 applies equally to the description below.
As shown in FIG. 4, at step S402, an instruction for residing data is received. As previously mentioned, the instruction here may be the third or fourth instruction described above. In some scenarios, it may be a hardware or software instruction that enables data residency in the ROB device, thereby enabling the ROB device to perform the read-request "merge" operation of the present disclosure. Next, at step S404, in response to receiving the aforementioned instruction, the corresponding data is made resident, thereby accumulating a history record (including historical read requests and the corresponding resident data). For example, residency at this point involves receiving a read request for the storage device from the control device or the execution device, residing the data read from the storage device in the ROB device, and returning it to the execution device. Since data residency is still in its initial accumulation stage at this point, the comparison of read-request storage addresses may not yet be performed, or may be performed only rarely. In some scenarios, when multiple read requests for the same storage address all reach the ROB device, the ROB device may also directly select one of them to perform the read of the corresponding data from the storage device, ignoring the remaining read requests so that no further read operations are performed.
At step S406, a current read request for reading corresponding data from a storage address of the storage device is received. At step S408, in response to receiving the current read request, a previous read request for the same storage address is looked up in the history read-request record. At step S410, in response to finding a previous read request for the same storage address, it is determined not to send the current read request to the storage device. In other words, since the previous read request and the current read request target the same storage address, and the corresponding data for that address already resides at the ROB device, the current read request need not be forwarded to the storage device. Accordingly, at step S412, the resident corresponding data of the previous read request is read directly and sent to the execution device as the corresponding data of the current read request.
As previously described, in some scenarios, at step S414, a write request for the storage address is received. Depending on the application, the write request may come from the same execution device as the read request or from a different one. At step S416, the resident corresponding data is automatically released in response to receiving the write request for the storage address. As previously described, the release operation may also be performed in units of instructions, i.e., the resident data remains resident during execution of an instruction, and the ROB device automatically releases it when the instruction finishes.
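Putting the steps of method 400 together, the following hypothetical walk-through exercises steps S402-S416, reusing the ReorderBufferModel and on_write helpers sketched earlier in this description (those sketches must be in scope for it to run; the step mapping is illustrative, not normative):

    memory = {0x4000: "activations"}
    rob = ReorderBufferModel(memory)     # from the earlier sketch

    data1 = rob.read(0x4000)             # S402-S404: read data is made resident
    data2 = rob.read(0x4000)             # S406-S412: merged, served from the ROB
    assert data1 == data2 == "activations"

    on_write(rob.resident, memory, 0x4000, "updated")   # S414-S416
    assert 0x4000 not in rob.resident    # resident copy auto-released
    assert rob.read(0x4000) == "updated" # a later read re-fetches from memory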
The method for reorder buffering of the present disclosure has been described in detail above in connection with FIG. 4. Through the method shown in FIG. 4, the scheme of the present disclosure reduces the number of read requests passing through the ROB device based on the storage-address-based read-request "merge" policy, thereby saving access bandwidth. In addition, the scheme of the present disclosure supports flexible control of the resident data in the ROB device through data residency and release.
FIG. 5 shows the design of a software and hardware architecture in an embodiment of the disclosure. As can be seen from the figure, the software and hardware architecture in this embodiment may include an AI processor 501, a driver and operating system 502, a compiler and programming language 503, libraries 504, a framework layer 505, and an application layer 506. It will be appreciated that the software and hardware architecture here may be applied to the artificial intelligence computing system or computing platform of the present disclosure.
Specifically, the AI processor 501 (which may be included, for example, in a board card described below) considers both operation optimization and data-transfer optimization in its hardware design. To this end, it employs customized arithmetic units to accelerate operations and uses on-chip storage to accelerate data transfer, thereby achieving extremely high performance and energy-efficiency ratios. In addition, to support various algorithmic optimizations, the AI processor 501 may have customized arithmetic units and a customized instruction set, where the instruction set may provide operation instructions of different granularities (scalar, vector, and/or matrix). Further, when factors such as the access characteristics of the algorithm, hardware cost, and verification difficulty are considered, on-chip storage may be adopted and data transfer optimized. In actual operation, the AI processor of the present disclosure may achieve speeds tens of times higher than a mainstream GPU (graphics processing unit).
The driver and operating system 502 is primarily responsible for scheduling tasks on the AI processor 501. The scheduling operation may, for example, schedule according to task priorities and handle communication and synchronization between multiple devices. For compiled programs, it can schedule the execution of tasks on a particular processor through the operating system and driver, including but not limited to the following operations: allocating and releasing device memory, implementing data transfer between devices, maintaining task queues, and dispatching tasks according to priority, so as to achieve synchronization and cooperation among multiple devices.
The compiler and programming language 503 may be an assembly-language suite developed for the instruction set of the AI processor 501. In applications, it may translate deep learning algorithms developed for the AI processor 501 into combinations of processor instructions, so that the AI processor 501 can be invoked and used efficiently. In some application scenarios, the compiler may perform optimization in the intermediate-representation stage of compilation.
The libraries 504 may include a runtime library 514 and a machine learning library 524. In one implementation scenario, the libraries 504 may use the instruction set of the AI processor 501 and be partially optimized for it to increase operator execution speed. The runtime library 514 may be a set of high-performance operator libraries developed specifically for the AI processor 501 and may be used to handle interactions between general-purpose processors and artificial intelligence processors; it may also provide a set of interfaces to the artificial intelligence processor. The machine learning library 524 may be used to accelerate various machine learning or deep learning algorithms on the artificial intelligence processor. Specifically, the machine learning library 524 may provide a set of efficient, general-purpose, flexible, and extensible programming interfaces; upper-level machine learning applications may use the programming interfaces of various programming frameworks (e.g., PyTorch, TensorFlow, Caffe, MXNet, etc.) or may be programmed directly against the interfaces provided by the machine learning library 524. In addition, the machine learning library 524 of the present disclosure facilitates invocation of the hardware platform, while the runtime library 514 may implement basic common operators, such as various operations of convolution, pooling, and the like.
The framework layer 505 may add encapsulation for operators developed for the AI processor, primarily the operators of the runtime library 514. In addition, the framework layer 505 may modify related task scheduling or memory management. In one application scenario, the framework layer 505 may adopt the architecture of a framework such as TensorFlow.
The device side in the embodiments of the present disclosure may be an artificial intelligence chip, a board card, or the like. FIG. 6 shows a schematic structural diagram of a board card 600 according to an embodiment of the disclosure. As shown in FIG. 6, the board card 600 includes a chip (or "processing chip") 601, which is a system on chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit supporting various deep learning and machine learning algorithms and meeting intelligent processing demands in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning is widely applied in cloud intelligence, and one notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the platform's storage and computing capacity. The board card 600 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, large on-chip storage, and substantial computing power.
The chip 601 is connected to an external device 603 via an external interface device 602. The external device 603 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a WIFI interface. Data to be processed may be transferred from the external device 603 to the chip 601 via the external interface device 602, and the computation results of the chip 601 may be transferred back to the external device 603 via the external interface device 602. Depending on the application scenario, the external interface device 602 may take different interface forms, such as a PCIe interface.
The board card 600 also includes a storage device 604 for storing data, which includes one or more storage units 605. The storage device 604 is connected to the control device 606 and the chip 601 via a bus for data transfer. The control device 606 on the board card 600 is configured to regulate the state of the chip 601. To this end, in one application scenario, the control device 606 may include a microcontroller unit (MCU).
FIG. 7 is a block diagram showing the combined processing device 700 in the chip 601 of this embodiment. As shown in FIG. 7, the combined processing device 700 includes a computing device 701 (which may correspond to the execution device described previously in the present disclosure), an interface device 702, a processing device 703, and a DRAM 704.
The computing device 701 is configured to perform user-specified operations and is primarily implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computations; it may interact with the processing device 703 through the interface device 702 to jointly complete the user-specified operations.
The interface device 702 is used to transfer data and control instructions between the computing device 701 and the processing device 703. For example, the computing device 701 may obtain input data from the processing device 703 via the interface device 702 and write it to an on-chip storage device of the computing device 701. Further, the computing device 701 may obtain control instructions from the processing device 703 via the interface device 702 and write them into an on-chip control cache of the computing device 701. Alternatively or additionally, the interface device 702 may also read data from a storage device of the computing device 701 and transmit it to the processing device 703.
The processing device 703 is a general-purpose processing device that performs basic control, including but not limited to data transfer and starting and/or stopping the computing device 701. Depending on the implementation, the processing device 703 may be one or more types of general-purpose and/or special-purpose processors, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and the like, and their number may be determined according to actual needs. As previously mentioned, the computing device 701 of the present disclosure, considered on its own, may be viewed as having a single-core or homogeneous multi-core structure. However, when the computing device 701 and the processing device 703 are considered together, they form a heterogeneous multi-core structure.
The DRAM 704 is used to store data to be processed; it is typically DDR memory, commonly 16 GB or larger in size, and stores data for the computing device 701 and/or the processing device 703. Although not shown in FIG. 7, the ROB device of the present disclosure may be disposed between the computing device 701 and the DRAM 704 to perform the operational steps described with respect to FIGS. 1-4.
FIG. 8 shows a schematic internal structure of the computing device 701. The computing device 701 is used to process input data in fields such as computer vision, speech, natural language, and data mining. It is designed as a multi-core hierarchical structure: the computing device 701 is a system on chip that includes a plurality of clusters, each cluster in turn containing a plurality of processor cores, which may be used to perform the tasks issued in the present disclosure. In other words, the computing device 701 is organized in a system-on-chip - cluster - processor-core hierarchy.
At the system-on-chip level, as shown in fig. 8, a computing device 701 includes an external storage controller 801, a peripheral communication module 802, an on-chip interconnection module 803, a synchronization module 804, and a plurality of clusters 805.
There may be a plurality of external memory controllers 801 (two are illustratively shown) for accessing an external storage device, such as the DRAM 704 in FIG. 7 (which may correspond to the storage device of the present disclosure), in response to access requests issued by the processor cores, in order to read data from or write data to off-chip memory. The peripheral communication module 802 is configured to receive control signals from the processing device 703 through the interface device 702 and start the computing device 701 to perform tasks. The on-chip interconnect module 803 connects the external memory controllers 801, the peripheral communication module 802, and the plurality of clusters 805, and transfers data and control signals between the modules. The synchronization module 804 is a global barrier controller (GBC) that coordinates the working progress of each cluster to ensure synchronization of information. The plurality of clusters 805 are the computing cores of the computing device 701; four are illustratively shown in the figure, and as hardware evolves the computing device 701 of the present disclosure may also include 8, 16, 64, or even more clusters 805.
At the cluster level, as shown in FIG. 8, each cluster 805 includes a plurality of processor cores (IPU cores) 806 and a memory core (MEM core) 807. Four processor cores 806 are illustratively shown in the figure; the present disclosure does not limit their number. The internal architecture of a processor core is shown in FIG. 9: each processor core 806 includes three major modules, namely a control module 91, an operation module 92, and a storage module 93.
The control module 91 is used to coordinate and control the operation module 92 and the storage module 93 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 911 and an instruction decode unit (IDU) 912. The instruction fetch unit 911 fetches instructions from the processing device 703, and the instruction decode unit 912 decodes the fetched instructions and sends the decoded results as control information to the operation module 92 and the storage module 93.
The operation module 92 includes a vector operation unit 921 and a matrix operation unit 922. The vector operation unit 921 is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 922 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.
The storage module 93 is used to store or transfer related data, and includes a neuron storage unit (NRAM) 931, a weight storage unit (WRAM) 932, an input/output direct memory access module (IODMA) 933, and a transfer direct memory access module (MVDMA) 934. The NRAM 931 stores input data, output data, and intermediate results for computation by the processor core 806; the WRAM 932 stores the weights of the deep learning network; the IODMA 933 controls access between the NRAM 931/WRAM 932 and the DRAM 704 via the broadcast bus 809; and the MVDMA 934 controls access between the NRAM 931/WRAM 932 and the SRAM 808.
Returning to FIG. 8, the memory core 807 is primarily used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 806, and carrying out communication between the clusters 805 and the DRAM 704, among the clusters 805, among the processor cores 806, and so on. In other embodiments, the memory core 807 has scalar computation capability for performing scalar operations.
The memory core 807 includes a shared memory unit (SRAM) 808, a broadcast bus 809, a cluster direct memory access module (CDMA) 810, and a global direct memory access module (GDMA) 811. The SRAM 808 plays the role of a high-performance data relay station: data reused between different processor cores 806 within the same cluster 805 need not be fetched from the DRAM 704 by each processor core 806 individually but is instead relayed among the processor cores 806 through the SRAM 808. The memory core 807 only needs to distribute the reused data quickly from the SRAM 808 to the multiple processor cores 806, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 809, the CDMA 810, and the GDMA 811 are used, respectively, for communication among the processor cores 806, communication among the clusters 805, and data transfer between the clusters 805 and the DRAM 704. Each is described in turn below.
The broadcast bus 809 is used to accomplish high-speed communication among the processor cores 806 within a cluster 805. The broadcast bus 809 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point data transfer (i.e., from a single processor core to a single processor core); multicast is a communication mode that transfers one piece of data from the SRAM 808 to a specific set of processor cores 806; and broadcast, a special case of multicast, transfers one piece of data from the SRAM 808 to all processor cores 806.
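The three communication modes can be summarized in a short sketch, in which broadcast is simply multicast addressed to every core in the cluster. The core names and data values are invented for illustration; this is a behavioral summary, not the bus implementation.

    # Illustrative sketch of the three inter-core communication modes above.
    ALL_CORES = ["core0", "core1", "core2", "core3"]

    def multicast(sram_data, targets):
        # Deliver one piece of data from SRAM to a chosen set of cores.
        return {core: sram_data for core in targets}

    def unicast(sram_data, target):
        return multicast(sram_data, [target])    # point-to-point transfer

    def broadcast(sram_data):
        return multicast(sram_data, ALL_CORES)   # multicast to all cores

    assert set(broadcast("x")) == set(ALL_CORES)
    assert list(unicast("x", "core2")) == ["core2"]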
The CDMA 810 is used to control access to the SRAM 808 between different clusters 805 within the same computing device 701. FIG. 10 shows a schematic diagram of one processor core writing data to a processor core of another cluster, illustrating the operation of the CDMA 810. In this application scenario, the same computing device includes multiple clusters; for convenience of illustration, only cluster 0 and cluster 1 are shown in the figure, each including multiple processor cores. Likewise for convenience, the figure shows only processor core 0 in cluster 0 and only processor core 1 in cluster 1. Processor core 0 is to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master end and CDMA 1 as the slave end; the master sends a write request to the slave, that is, the master sends a write address AW and write data W, transferring the data to SRAM 1 of cluster 1. The slave then sends a write response B as acknowledgment. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
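This write sequence can be condensed into a small behavioral sketch: the master CDMA issues the write address AW and write data W, the data lands in the remote SRAM, and the slave answers with the write response B. The model deliberately ignores the actual bus handshake and timing; all names are hypothetical.

    # Simplified model of the inter-cluster write sequence described above.
    def cdma_write(master_sram, slave_sram, address):
        aw, w = address, master_sram[address]   # master sends AW and W
        slave_sram[aw] = w                      # data lands in the remote SRAM
        return "B"                              # slave returns write response B

    sram0 = {0x10: "payload"}                   # cluster 0 (written by core 0)
    sram1 = {}                                  # cluster 1
    assert cdma_write(sram0, sram1, 0x10) == "B"
    assert sram1[0x10] == "payload"             # core 1 can now read it locally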
Returning to FIG. 8, the GDMA 811 cooperates with the external memory controller 801 to control access from the SRAM 808 of a cluster 805 to the DRAM 704, or to read data from the DRAM 704 into the SRAM 808. From the foregoing, communication between the DRAM 704 and the NRAM 931 or WRAM 932 can be achieved via two channels. The first channel connects the DRAM 704 directly with the NRAM 931 or WRAM 932 through the IODMA 933; the second channel transfers data between the DRAM 704 and the SRAM 808 via the GDMA 811, and then between the SRAM 808 and the NRAM 931 or WRAM 932 via the MVDMA 934. Although the second channel seemingly involves more components and a longer data path, in practice, in some embodiments, its bandwidth is much greater than that of the first channel, so communication between the DRAM 704 and the NRAM 931 or WRAM 932 may be more efficient via the second channel. Embodiments of the present disclosure may select the data transfer channel according to the hardware conditions.
In other embodiments, the functionality of the GDMA 811 and that of the IODMA 933 may be integrated in the same component. The GDMA 811 and the IODMA 933 are treated as different components here for convenience of description; implementations by those skilled in the art fall within the protection scope of the present disclosure as long as the functions realized and the technical effects achieved are similar to those of the present disclosure. Further, the functions of the GDMA 811, the IODMA 933, the CDMA 810, and the MVDMA 934 may also be implemented by the same component; such implementations likewise fall within the protection scope of the present disclosure as long as the functions realized and the technical effects achieved are similar to those of the present disclosure.
The software and hardware architecture of the present disclosure and its internal structure have been described in detail above in connection with FIGS. 5-10. It is to be understood that the above description is intended to be illustrative and not restrictive. Depending on the application scenario and hardware specifications, those skilled in the art may also make changes to the board card (or artificial intelligence device) of the present disclosure and its internal structure while still falling within the scope of the present disclosure.
Based on the foregoing, those skilled in the art will appreciate that the present disclosure also discloses a device comprising a processor and a memory. In particular, the memory may store program instructions for reorder buffering which, when executed by the processor, implement the operational steps described herein in connection with FIGS. 1-4. Additionally, since aspects of the present disclosure may be implemented by computer program instructions, the present disclosure also discloses a computer-readable storage medium or computer program product storing a computer program/instructions for reorder buffering, thereby implementing the operational steps described in connection with FIGS. 1-4.
Aspects of the present disclosure have been described in detail above with reference to the accompanying drawings. Depending on the application scenario, the devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, intelligent terminals, PC devices, Internet-of-Things terminals, mobile terminals, cell phones, dashboard cameras, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vision terminals, autonomous-driving terminals, vehicles, household appliances, and/or medical devices. The vehicles include aircraft, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance apparatuses, B-mode ultrasound scanners, and/or electrocardiographs. The devices or apparatuses of the present disclosure may also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, healthcare, and so on.
Further, the device or apparatus of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a high-power device or apparatus according to aspects of the present disclosure may be deployed on a cloud device (e.g., a cloud server), while a low-power device or apparatus may be deployed on a terminal device and/or an edge device (e.g., a smartphone or camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, appropriate resources can be matched from the hardware resources of the cloud device to simulate those of the terminal device and/or edge device, thereby achieving unified management, scheduling, and collaborative work across terminal-cloud or edge-cloud integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the actions described. Thus, in light of the disclosure or teachings herein, certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered optional embodiments, i.e., the actions or modules involved are not necessarily required for the implementation of one or more aspects of this disclosure. In addition, depending on the scenario, the descriptions of different embodiments emphasize different aspects. In view of this, those skilled in the art may refer to other embodiments for portions of one embodiment that are not described in detail.
In particular implementations, based on the disclosure and teachings herein, one of ordinary skill in the art will appreciate that the embodiments disclosed herein may also be implemented in ways not described here. For example, the division of units in the foregoing embodiments of the apparatus or device is based on logical function, and another division manner may be used in an actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As to the connection relationships between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The foregoing components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected, as needed, to achieve the objectives of the embodiments of the present disclosure. Further, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist alone.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a standalone product, the integrated unit may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device) to perform some or all of the steps of a method described by an embodiment of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("ROM"), a random access memory ("RAM"), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing program code.
In other implementation scenarios, the integrated units may also be implemented in the form of hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, a resistive random access memory ("RRAM"), a dynamic random access memory ("DRAM"), a static random access memory ("SRAM"), an enhanced dynamic random access memory ("EDRAM"), a high bandwidth memory ("HBM"), a hybrid memory cube ("HMC"), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
Clause A1, a reorder buffer configured in an artificial intelligence processor, the artificial intelligence processor further comprising a storage device and an execution device, the reorder buffer comprising:
a reorder buffer circuit configured to reside data read from the storage device in response to a previous read request;
a receiving circuit configured to receive a current read request, wherein the current read request is to read corresponding data from a memory address of the storage device;
a determination circuit configured to determine not to transmit the current read request when a memory address read by the current read request is the same as a memory address read by the previous read request; and
a transmitting circuit configured to read, from the reorder buffer circuit, resident data corresponding to the storage address and transmit it to the execution device as the data read by the current read request.
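As a non-limiting aid to understanding clause A1, the following C++ sketch models in software what the four circuits accomplish together: data fetched by a previous read request resides in the buffer, and a current read request to the same storage address is served from the resident copy instead of being forwarded to the storage device. All names here (ReorderBufferModel, reside, handle_read) are hypothetical illustrations, not the disclosed circuit design:

#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using Address = std::uint64_t;
using Data = std::vector<std::uint8_t>;

// Hypothetical software model of the reorder buffer: resident data keyed by
// the storage address of the previous read request.
class ReorderBufferModel {
public:
    // Reorder buffer circuit: reside data returned for a previous read request.
    void reside(Address addr, Data data) { resident_[addr] = std::move(data); }

    // Receiving + determination + transmitting circuits combined: if the
    // current read request targets the same address as a previous one, do not
    // forward it; return the resident data directly for the execution device.
    std::optional<Data> handle_read(Address addr) const {
        auto it = resident_.find(addr);
        if (it != resident_.end()) return it->second;  // served from buffer
        return std::nullopt;  // miss: the request must go to the storage device
    }

private:
    std::unordered_map<Address, Data> resident_;
};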
Clause A2, the reorder buffer device of clause A1, wherein the decision circuit is further configured to:
in response to receiving the current read request, looking up a previous read request for the same memory address from a history read request record in the reorder buffer circuit; and
directly reading the corresponding data of the previous read request from the reorder buffer circuit and sending it to the execution device as the corresponding data of the current read request.
Clause A3, the reorder buffer device of clause A1, wherein the reorder buffer circuit is configured to:
maintaining the residency of the corresponding data during execution of a first instruction comprising one or more read requests for the same memory address; and
and after the execution of the first instruction is finished, releasing the corresponding data residing in the reorder buffer.
Clause A4, the reorder buffer device of clause A3, wherein the reorder buffer circuit is configured to:
and during execution of the first instruction by the execution device, in response to detecting that a second instruction reads the corresponding data on the reorder buffer circuit, keeping the corresponding data resident until the second instruction finishes execution and then releasing it.
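One plausible software analogue of the lifecycle in clauses A3 and A4 is a per-address reference count: each instruction that reads the resident data raises the count, and the entry is released only when the last such instruction finishes. The sketch below is written under that assumption and uses hypothetical names; it is not a mandated implementation:

#include <cstdint>
#include <unordered_map>

using Address = std::uint64_t;

// Hypothetical residency tracker: each resident entry stays alive while at
// least one in-flight instruction (e.g., the first or second instruction of
// clauses A3/A4) is still using it.
class ResidencyTracker {
public:
    // An instruction begins reading the resident data at `addr`.
    void on_instruction_read(Address addr) { ++refs_[addr]; }

    // An instruction using `addr` finishes; release the entry only when no
    // other instruction still depends on it.
    void on_instruction_done(Address addr) {
        auto it = refs_.find(addr);
        if (it != refs_.end() && --it->second == 0) {
            refs_.erase(it);  // residency released after the last user completes
        }
    }

    bool is_resident(Address addr) const { return refs_.count(addr) != 0; }

private:
    std::unordered_map<Address, int> refs_;
};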
Clause A5, the reorder buffer of clause A1, wherein the receiving circuit is further configured to receive a third instruction indicating that residency of the corresponding data within the reorder buffer is to begin, and, in response to receiving the third instruction, the reorder buffer is configured to initiate a resident operation on the corresponding data at a storage address of the storage device, such that the execution device directly reads the corresponding data from the reorder buffer via the read request.
Clause A6, the reorder buffer of clause A5, wherein the receiving circuit is further configured to receive a write request for the memory address, and the reorder buffer is further configured to automatically release the resident corresponding data in response to receiving the write request for the memory address.
Clause A7, the reorder buffer of clause A6, wherein the write request and the read request are received from the same execution device.
Clause A8, the reorder buffer according to clause A6, wherein the write request and the read request are received from different execution devices that execute the same task.
Clause A9, the reorder buffer of clause A5, wherein the third instruction is a synchronization instruction, and a field of the synchronization instruction is used to indicate its use as a residency-initiating instruction.
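Clauses A5-A9 can be pictured as follows: a synchronization instruction carrying a field that marks it as residency-initiating begins parking the data for an address, and a later write request to the same address automatically releases the stale resident copy. The instruction layout and names in this C++ sketch are illustrative assumptions only:

#include <cstdint>
#include <unordered_set>

using Address = std::uint64_t;

// Hypothetical encoding: one field of a synchronization instruction marks it
// as a residency-initiating instruction (clause A9).
struct SyncInstruction {
    Address addr;
    bool start_residency;  // illustrative field, not a documented format
};

class ResidencyController {
public:
    void on_sync(const SyncInstruction& insn) {
        if (insn.start_residency) resident_.insert(insn.addr);  // clause A5
    }

    // Clause A6: a write to a resident address would make the parked copy
    // stale, so the entry is released automatically.
    void on_write(Address addr) { resident_.erase(addr); }

    bool is_resident(Address addr) const { return resident_.count(addr) != 0; }

private:
    std::unordered_set<Address> resident_;
};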
Clause A10, the reorder buffer of clause A1, wherein the receiving circuit is further configured to receive a fourth instruction for residing the corresponding data in the reorder buffer, and the reorder buffer is further configured to reside the corresponding data in response to receiving the fourth instruction; and
The receiving circuit is further configured to receive a fifth instruction to release the corresponding data from the reorder buffer, and the reorder buffer is further configured to release the corresponding data in response to receiving the fifth instruction.
Clause A11, the reorder buffer of clause A10, wherein the fourth instruction is contained in an IO instruction directed to the storage device.
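By contrast, clauses A10 and A11 describe a fully explicit lifecycle driven by a fourth (reside) and a fifth (release) instruction, where the fourth may travel inside an IO instruction directed at the storage device. A minimal sketch, with hypothetical names:

#include <cstdint>
#include <unordered_map>
#include <vector>

using Address = std::uint64_t;
using Data = std::vector<std::uint8_t>;

// Hypothetical explicit reside/release interface (fourth/fifth instructions).
class ExplicitResidency {
public:
    // Fourth instruction: reside the corresponding data. Per clause A11 this
    // request may be embedded in an IO instruction targeting the storage device.
    void reside(Address addr, Data data) { resident_[addr] = std::move(data); }

    // Fifth instruction: release the resident data for the given address.
    void release(Address addr) { resident_.erase(addr); }

private:
    std::unordered_map<Address, Data> resident_;
};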
Clause A12, an artificial intelligence processor comprising:
a storage device for storing data;
an execution device comprising one or more processing cores for executing tasks and configured to send read requests and/or write requests directed to the storage device; and
the reorder buffer device according to any of clauses A1-A11, arranged between the execution device and the storage device to at least implement a reorder buffer operation.
Clause A13, a method for reorder buffering, applied to a reorder buffer device included in an artificial intelligence processor, the artificial intelligence processor further comprising a storage device and an execution device, the method comprising:
residing data read from the storage device in response to a previous read request;
receiving a current read request, wherein the current read request is used for reading corresponding data from a storage address of the storage device;
when the storage address read by the current read request is the same as the storage address read by the previous read request, determining not to send the current read request; and
reading the resident data corresponding to the storage address and sending it to the execution device as the data read by the current read request.
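Assuming the ReorderBufferModel sketched after clause A1, the method steps of clause A13 compose as in the following hypothetical usage:

// Hypothetical usage of ReorderBufferModel (sketched after clause A1): a
// previous read resides its data; the current read to the same address is
// then served from the buffer instead of being forwarded.
int main() {
    ReorderBufferModel rob;
    rob.reside(0x1000, Data{1, 2, 3});   // step 1: reside data of a previous read
    auto hit = rob.handle_read(0x1000);  // steps 2-3: same address, request not sent
    // step 4: hit now holds the resident data to send to the execution device
    return hit.has_value() ? 0 : 1;
}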
Clause A14, the method of clause A13, further comprising:
in response to receiving the current read request, looking up a previous read request for the same memory address from a history of read requests; and
and directly reading the corresponding data of the previous read request and sending it to the execution device as the corresponding data of the current read request.
Clause A15, the method of clause A13, further comprising:
maintaining the residency of the corresponding data during execution of a first instruction comprising one or more read requests for the same memory address; and
and after the execution of the first instruction is finished, releasing the resident corresponding data.
Clause A16, the method of clause A15, further comprising:
and during execution of the first instruction by the execution device, in response to detecting that a second instruction reads the corresponding data, keeping the corresponding data resident until the second instruction finishes execution and then releasing it.
Clause A17, the method of clause A13, further comprising:
receiving a third instruction indicating that residency of the corresponding data is to begin;
and in response to receiving the third instruction, initiating a resident operation on the corresponding data at a storage address of the storage device, so that the execution device can directly read the corresponding data from the reorder buffer device via the read request.
Clause A18, the method of clause A17, further comprising:
receiving a write request for the storage address; and
in response to receiving the write request for the storage address, automatically releasing the resident corresponding data.
Clause A19, the method of clause A18, wherein the write request and the read request are received from the same execution device.
Clause A20, the method of clause A18, wherein the write request and the read request are received from different execution devices.
Clause A21, the method of clause A17, wherein the third instruction is a synchronization instruction and the read request is a field in the synchronization instruction.
Clause A22, the method of clause A13, further comprising:
receiving a fourth instruction for residing the corresponding data;
in response to receiving the fourth instruction, residing the corresponding data;
receiving a fifth instruction for releasing the corresponding data; and
and in response to receiving the fifth instruction, releasing the corresponding data.
Clause A23, the method of clause A22, wherein the fourth instruction is included in an IO instruction directed to the storage device.
Clause A24, an apparatus for reorder buffering, comprising:
a processor; and
a memory storing program instructions for reorder buffering which, when executed by the processor, cause the method according to any of clauses A13-A23 to be implemented.
Clause A25, a computer-readable storage medium storing program instructions for reorder buffering which, when executed by a processor, cause the method according to any of clauses A13-A23 to be implemented.
Clause A26, a board card comprising the artificial intelligence processor according to clause A12 or the apparatus according to clause A24.
While the embodiments of the present disclosure are described above, the descriptions are merely examples intended to facilitate understanding of the present disclosure and are not intended to limit its scope or applicability. Any person skilled in the art may make modifications and variations in form and detail without departing from the spirit and scope of the present disclosure, but the scope of protection of the present disclosure remains as defined by the appended claims.

Claims (26)

1. A reorder buffer disposed in an artificial intelligence processor, the artificial intelligence processor further comprising a storage device and an execution device, the reorder buffer comprising:
a reorder buffer circuit configured to reside data read from the storage device in response to a previous read request;
a receiving circuit configured to receive a current read request, wherein the current read request is to read corresponding data from a memory address of the storage device;
a determination circuit configured to determine not to send the current read request when a memory address read by the current read request is the same as a memory address read by the previous read request; and
and a transmitting circuit configured to read, from the reorder buffer circuit, resident data corresponding to the memory address and transmit it to the execution device as the data read by the current read request.
2. The reorder buffer of claim 1, wherein the decision circuit is further configured to:
in response to receiving the current read request, looking up a previous read request for the same memory address from a history read request record in the reorder buffer circuit; and
directly reading the corresponding data of the previous read request from the reorder buffer circuit and sending it to the execution device as the corresponding data of the current read request.
3. The reorder buffer of claim 1, wherein the reorder buffer circuit is configured to:
maintaining the residency of the corresponding data during execution of a first instruction containing one or more read requests for the same memory address; and
and after the execution of the first instruction is finished, releasing the corresponding data residing in the reorder buffer.
4. The reorder buffer of claim 3, wherein said reorder buffer circuit is configured to:
and during execution of the first instruction by the execution device, in response to detecting that a second instruction reads the corresponding data on the reorder buffer circuit, keeping the corresponding data resident until the second instruction finishes execution and then releasing it.
5. The reorder buffer of claim 1, wherein the receiving circuit is further configured to receive a third instruction indicating that residency of the corresponding data within the reorder buffer is to begin, and, in response to receiving the third instruction, the reorder buffer is configured to initiate a resident operation on the corresponding data at a storage address of the storage device, such that the execution device directly reads the corresponding data from the reorder buffer via the read request.
6. The reorder buffer of claim 5, wherein the receiving circuit is further configured to receive a write request for the memory address, and the reorder buffer circuit is further configured to automatically release the resident corresponding data in response to receiving the write request for the memory address.
7. The reorder buffer of claim 6, wherein the write request and the read request are received from the same execution device.
8. The reorder buffer of claim 6, wherein the write request and the read request are received from different execution devices that execute the same task.
9. The reorder buffer of claim 5, wherein the third instruction is a synchronization instruction, and a field of the synchronization instruction is used to indicate its use as a residency-initiating instruction.
10. The reorder buffer of claim 1, wherein the receiving circuit is further configured to receive a fourth instruction for residing the corresponding data in the reorder buffer, and the reorder buffer circuit is further configured to reside the corresponding data in response to receiving the fourth instruction; and
the receiving circuit is further configured to receive a fifth instruction to release the corresponding data from the reorder buffer, and the reorder buffer is further configured to release the corresponding data in response to receiving the fifth instruction.
11. The reorder buffer of claim 10, wherein the fourth instruction is included in an IO instruction directed to the storage device.
12. An artificial intelligence processor comprising:
a storage device for storing data;
an execution device comprising one or more processing cores for executing tasks and configured to send read requests and/or write requests directed to the storage device; and
the reorder buffer device according to any of claims 1-11, arranged between the execution device and the storage device to at least implement a reorder buffer operation.
13. A method for reorder buffering, applied to a reorder buffer device included in an artificial intelligence processor, the artificial intelligence processor further comprising a storage device and an execution device, the method comprising:
residing data read from the storage device in response to a previous read request;
receiving a current read request for reading corresponding data from a storage address of the storage device;
when the storage address read by the current read request is the same as the storage address read by the previous read request, determining not to send the current read request; and
reading the resident data corresponding to the storage address and sending it to the execution device as the data read by the current read request.
14. The method of claim 13, further comprising:
in response to receiving the current read request, looking up a previous read request for the same memory address from a history of read requests; and
and directly reading the corresponding data of the previous read request and sending it to the execution device as the corresponding data of the current read request.
15. The method of claim 13, further comprising:
maintaining the residency of the corresponding data during execution of a first instruction containing one or more read requests for the same memory address; and
and after the execution of the first instruction is finished, releasing the resident corresponding data.
16. The method of claim 15, further comprising:
and during execution of the first instruction by the execution device, in response to detecting that a second instruction reads the corresponding data, keeping the corresponding data resident until the second instruction finishes execution and then releasing it.
17. The method of claim 13, further comprising:
receiving a third instruction indicating that residency of the corresponding data is to begin;
and in response to receiving the third instruction, initiating a resident operation on the corresponding data at a storage address of the storage device, so that the execution device can directly read the corresponding data from the reorder buffer device via the read request.
18. The method of claim 17, further comprising:
receiving a write request for the storage address; and
in response to receiving the write request for the storage address, automatically releasing the resident corresponding data.
19. The method of claim 18, wherein the write request and the read request are received from a same execution device.
20. The method of claim 18, wherein the write request and the read request are received from different execution devices.
21. The method of claim 17, wherein the third instruction is a synchronization instruction and the read request is a field in the synchronization instruction.
22. The method of claim 13, further comprising:
receiving a fourth instruction for residing the corresponding data;
in response to receiving the fourth instruction, residing the corresponding data;
receiving a fifth instruction for releasing the corresponding data; and
and in response to receiving the fifth instruction, releasing the corresponding data.
23. The method of claim 22, wherein the fourth instruction is included in an IO instruction directed to the storage device.
24. An apparatus for reordering buffers, comprising:
a processor; and
a memory storing program instructions for reorder buffering which, when executed by a processor, cause the method according to any of claims 13-23 to be implemented.
25. A computer-readable storage medium storing program instructions for reorder buffering which, when executed by a processor, cause the method according to any of claims 13-23 to be implemented.
26. A board comprising the artificial intelligence processor of claim 12 or the apparatus of claim 24.