CN117667211A - Instruction synchronization control method, synchronization controller, processor, chip and board card

Publication number: CN117667211A
Authority: CN (China)
Prior art keywords: instruction, synchronous, synchronization, queue, range
Legal status: Pending
Application number: CN202211067771.3A
Other languages: Chinese (zh)
Inventor: name withheld at the inventor's request
Current Assignee: Anhui Cambricon Information Technology Co Ltd
Original Assignee: Anhui Cambricon Information Technology Co Ltd
Application filed by Anhui Cambricon Information Technology Co Ltd
Priority claimed from CN202211067771.3A
Publication of CN117667211A

Abstract

The present disclosure provides an instruction synchronization control method, a synchronization controller, a processor, a chip, and a board card. The processor may be included as a computing device in a combined processing device, which may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The disclosed scheme provides an instruction synchronization control method that can avoid unnecessary waiting and improve processing efficiency.

Description

Instruction synchronization control method, synchronization controller, processor, chip and board card
Technical Field
The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to an instruction synchronization control method, a synchronization controller, a processor, a chip, and a board card.
Background
With the development of modern processors, boosting single-core performance by raising the clock frequency has run into the power wall, and multi-core processors have gradually gained market popularity. In a multi-core processor, tasks may be issued to different cores for processing to improve the parallelism of a program. Therefore, efficiently synchronizing the different cores of a multi-core processor, as well as different processes within the same processing core, has become a key development issue.
Traditional CPUs and GPUs perform synchronization between processes or cores directly at a coarse granularity. For example, when an instruction accesses a certain memory, subsequent instructions that access that memory are blocked. In this case, even if a subsequent instruction has no data dependency on the currently executing instruction, it is still blocked indiscriminately, which reduces the parallelism of the program.
In view of this, a scheme capable of efficiently completing synchronization processing between processes or cores is needed.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present disclosure proposes, among other aspects, an instruction synchronization scheme. With this scheme, the waiting overhead between processes or cores can be compressed as much as possible, greatly reducing waiting during synchronization.
In a first aspect, the present disclosure provides a synchronization controller comprising: an instruction issue unit configured to extract, in response to a synchronization instruction, a synchronization scope field of the synchronization instruction, the synchronization scope field indicating the scope of the instructions that need to be synchronized; and an address comparator configured to determine, based on the scope, whether an address dependency exists between the instructions governed by the synchronization instruction and the synchronization instruction; wherein the instruction issue unit is further configured to determine, according to the result of the address comparator, whether to release the governance of the instructions governed by the synchronization instruction in advance.
In a second aspect, the present disclosure provides an instruction synchronization control method, comprising: extracting, by an instruction issue unit in response to a synchronization instruction, a synchronization scope of the synchronization instruction, the synchronization scope indicating the scope of the instructions that need to be synchronized; determining, by an address comparator based on the scope, whether an address dependency exists between the instructions governed by the synchronization instruction and the synchronization instruction; and determining, by the instruction issue unit based on the result of the address comparator, whether to release the governance of the instructions governed by the synchronization instruction in advance.
In a third aspect, the present disclosure provides a processor comprising the synchronization controller of the first aspect described above.
In a fourth aspect, the present disclosure provides a chip comprising the processor of the foregoing third aspect.
In a fifth aspect, the present disclosure provides a board comprising the chip of the fourth aspect.
Through the instruction synchronization control method, synchronization controller, processor, chip, and board card provided by the embodiments of the present disclosure, a synchronization scope is added to the synchronization instruction, and synchronization governance of instructions is performed according to the indication of the synchronization scope, so that subsequent instructions that do not need to be synchronized can be issued for execution in advance during synchronization, avoiding idle waiting of the execution units.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic internal architecture of a processor core of a single core computing device of an embodiment of the present disclosure;
FIG. 4 illustrates a simplified schematic diagram of the internal structure of a multi-core computing device of an embodiment of the present disclosure;
FIG. 5 illustrates an internal architecture diagram of a synchronous controller implementing an instruction synchronous control scheme in accordance with some embodiments of the present disclosure;
FIG. 6 illustrates an exemplary flowchart of an instruction synchronization control method according to an embodiment of the present disclosure.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings. It is evident that the embodiments described are some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary Hardware Environment
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in Fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial-intelligence computing unit used to support various deep learning and machine learning algorithms, so as to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field, where a notable characteristic is the large amount of input data, which places high demands on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, offering huge off-chip storage, on-chip storage, and strong computing power.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred from the external device 103 to the chip 101 through the external interface device 102, and the computation results of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different interface forms, such as a PCIe interface, depending on the application scenario.
The board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transfer. The control device 106 on the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a block diagram of the combined processing device in the chip 101 of this embodiment. As shown in Fig. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor that performs deep learning or machine learning computations. It may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and their number may be determined according to actual needs. As mentioned above, when considered on its own, the computing device 201 of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, typically a DDR memory with a capacity of 16 GB or more, and is used to store data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of a processing core when the computing device 201 in Fig. 2 is a single-core device. The computing device 301 is configured to process input data in fields such as computer vision, speech, natural language, and data mining, and includes three modules: a control module 31 (also referred to as a controller), an operation module 32 (also referred to as an operator), and a storage module 33 (also referred to as a memory).
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 311 and an instruction decode unit (instruction decode unit, IDU) 312. The instruction fetching unit 311 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 312 decodes the fetched instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate computation results; WRAM 332 is used to store the convolution kernels, i.e., the weights, of the deep learning network; and DMA 333 is coupled to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204. It should be noted that the NRAM and WRAM here may be two storage areas formed by dividing the same memory in logical memory space, or may be two independent memories; this is not specifically limited herein.
Fig. 4 shows a simplified schematic diagram of the internal architecture of the computing device 201 in Fig. 2 when it is multi-core. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device 400 is a system-on-chip that includes at least one computing cluster, and each computing cluster in turn includes a plurality of processor cores; in other words, the multi-core computing device 400 is organized in a system-on-chip / computing-cluster / processor-core hierarchy.
At the system-on-chip level, as shown, the multi-core computing device 400 includes an external memory controller 41, a peripheral communication module 42, an on-chip interconnect module 43, a global synchronization module 44, and a plurality of computing clusters 45.
There may be a plurality of external memory controllers 41 (two are shown by way of example), which are used to access external memory devices (e.g., the DRAM 204 in Fig. 2) in response to access requests issued by the processor cores, so as to read data from or write data to off-chip memory. The peripheral communication module 42 is configured to receive control signals from the processing device (203 of Fig. 2) via the interface device (202 of Fig. 2) and to start the computing device (201 of Fig. 2) to perform tasks. The on-chip interconnect module 43 connects the external memory controllers 41, the peripheral communication module 42, and the plurality of computing clusters 45, and is used to transmit data and control signals between the modules. The global synchronization module 44 is, for example, a global synchronization barrier controller (GBC) used to coordinate the working progress of the computing clusters and ensure synchronization of information. The plurality of computing clusters 45 are the computing cores of the multi-core computing device 400; four are shown in the figure by way of example, and as hardware evolves, the multi-core computing device 400 of the present disclosure may also include 8, 16, 64, or even more computing clusters 45. The computing clusters 45 are used to efficiently execute deep learning algorithms.
At the computing-cluster level, as shown, each computing cluster 45 includes a plurality of processor cores 406 serving as control and computation units, and a storage core 407 serving as a storage unit. Further, each computing cluster may include a local synchronization module 412 configured to coordinate the working progress of the processor cores within the computing cluster and ensure synchronization of information. Four processor cores 406 are shown by way of example; the present disclosure does not limit their number.
The storage core 407 is mainly used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406, and performing communication between the computing cluster 45 and the DRAM 204, between the computing clusters 45, and between the processor cores 406, etc. In other embodiments, the storage core 407 has scalar computation capability and can perform scalar operations.
The storage core 407 includes a shared memory unit (SMEM) 408, a broadcast bus 409, a cluster direct memory access module (CDMA) 410, and a global direct memory access module (GDMA) 411. The SMEM 408 serves as a high-performance data transfer station: data reused among different processor cores 406 in the same computing cluster 45 does not need to be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 through the SMEM 408. The storage core 407 only needs to quickly distribute the reused data from the SMEM 408 to the plurality of processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses. The broadcast bus 409, CDMA 410, and GDMA 411 are used for communication among the processor cores 406, communication among the computing clusters 45, and data transfer between the computing clusters 45 and the DRAM 204, respectively.
At the level of the processor cores, the structure of a single processor core may be similar to the block diagram of a single core computing device shown in FIG. 3 and will not be described in detail herein.
Synchronization Instructions
When a sequence of instructions is executed in a single processing core, the instructions are typically buffered by type into different instruction queues before being issued. One instruction queue corresponds to one instruction stream. The instructions within each instruction queue are issued and executed in order, while instructions in different instruction queues may be issued in parallel, so that instructions are issued out of order as a whole. The dependencies between different instruction queues need to be guaranteed by synchronization instructions.
Similarly, when instruction sequences are executed on multiple processing cores, the dependencies between instruction streams on different processing cores also need to be guaranteed by synchronization instructions.
One classical problem in data synchronization is the producer-consumer problem, in which the producer and the consumer share the same memory space during the same period of time: the producer writes data into the space, and the consumer takes the data away. A producer-consumer model can likewise be introduced for instruction stream synchronization events, except that what is involved is not the transfer of data but the execution of instructions. Specifically, the producer side of an instruction stream synchronization event must execute certain instructions first, and the consumer side cannot execute its subsequent instructions until the producer side has finished executing those instructions.
A synchronization instruction carries the identifiers of the producer side and the consumer side. When the synchronization instruction is executed, it waits for the preceding instructions at the producer side indicated by the instruction to finish executing before releasing the subsequent instructions at the consumer side.
As mentioned in the background, current synchronization instructions synchronize between processes (between instruction streams) or between cores directly in a coarse-grained manner (e.g., by memory type).
For example, if a preceding instruction is accessing an NRAM (e.g., NRAM 331 shown in Fig. 3), the entire NRAM is occupied, blocking subsequent instructions that access the NRAM. Even if the NRAM is relatively large, e.g., 0.75 MB, and the preceding instruction accesses only 1 B of data, subsequent instructions still cannot access other, unrelated data on the NRAM. This coarse-grained approach is therefore relatively inefficient and results in unnecessary waiting.
Instruction synchronization control scheme
In view of this, embodiments of the present disclosure provide an instruction synchronization control scheme that adds a synchronization scope to the synchronization instruction, indicating the scope of the instructions that need to be synchronized, and then performs synchronization governance based on that scope. The scope may be expressed at a granularity finer than a memory type, such as an address range or a register number. By defining the scope of a synchronization instruction at a finer granularity, instructions that do not overlap the scope can be issued and executed in advance, reducing unnecessary waiting latency.
Fig. 5 illustrates an internal architecture diagram of a synchronous controller implementing an instruction synchronous control scheme in accordance with some embodiments of the present disclosure.
As shown, the synchronization controller 500 includes an instruction issue unit 510. When a synchronization instruction processed by a preceding component (e.g., a decoder in the instruction decode unit IDU) reaches the instruction issue unit 510, the instruction issue unit 510 extracts the synchronization scope of the synchronization instruction in response to the synchronization instruction. The synchronization scope indicates the scope of the instructions that need to be synchronized. Specifically, the instruction issue unit 510 may first register the synchronization instruction, for example, recording the scope of the synchronization instruction, identifying its valid information, and the like.
The scope may be represented at a granularity finer than a memory type, such as an address range or a register number.
The address range may be, for example, a range of storage space on memory for operands to which the instruction relates. The address range may have different manifestations depending on the granularity used.
In some embodiments, the address range may be characterized, for example, by the minimum and maximum address values of the storage space occupied by the operand; this space may be contiguous or may contain unused gaps.
In other embodiments, when the operand is multi-dimensional data whose dimensions are stored discontiguously, the address range may be the minimum and maximum address values of each dimension of the multi-dimensional data, or the minimum and maximum address values of a plurality of space blocks divided within a dimension according to storage contiguity.
In still other embodiments, the address range may be accurate down to the byte addresses of the data in the operand.
It will be appreciated that the more precise the address range, the more information needs to be stored, and thus the larger the storage overhead. Therefore, the granularity at which the address range is characterized may be determined based on operational performance requirements and/or storage overhead. For example, in the extreme case where storage overhead is not a concern, the address range can be accurate to the byte; judging the data dependencies of subsequent instructions on that basis allows instructions to be executed in parallel to the greatest extent and improves performance. When the storage area is limited, the address range can be characterized using only the minimum and maximum address values of the storage space, so that the storage area is not excessively occupied while certain performance requirements are still met.
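By way of illustration only, the following minimal software sketch models a scope as a list of (min, max) byte ranges and tests two scopes for overlap; the names Scope and overlaps, and the example ranges, are illustrative assumptions rather than the patent's data format. It shows both the coarse whole-block form and the finer per-dimension form, together with the interval-overlap test that the later comparisons rely on.

    # Minimal sketch: a scope as one or more (min_addr, max_addr) byte ranges.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Scope:
        # One entry models the coarse whole-block form; several entries model
        # the per-dimension (or per-contiguous-segment) form described above.
        ranges: List[Tuple[int, int]]

    def overlaps(a: Scope, b: Scope) -> bool:
        """True if any segment of a intersects any segment of b."""
        return any(lo1 <= hi2 and lo2 <= hi1
                   for (lo1, hi1) in a.ranges
                   for (lo2, hi2) in b.ranges)

    coarse = Scope([(0, 125)])              # whole block spanning 0 B..125 B
    per_dim = Scope([(0, 63), (512, 575)])  # 2-D operand with discontiguous rows
    print(overlaps(coarse, Scope([(512, 1023)])))   # False: no dependency
    print(overlaps(per_dim, Scope([(512, 1023)])))  # True: second row intersects

The more entries a scope carries, the more precise the dependency test, at the cost of storing and comparing more ranges, which mirrors the trade-off described above.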
Instructions may use data in various ways. In a register-based instruction system, an operand includes a register, in which case the address range may be characterized using the register number. Registers are commonly used for scalar data operations, since the size of a register is typically small, e.g., less than 1 KB.
Synchronization instructions may be inserted into the instruction sequence to be executed in a number of ways. For example, a synchronization instruction may be inserted directly by a user (e.g., a programmer) at programming time: the programmer inserts the synchronization instruction where synchronization is required according to the dependencies of the program instructions. As another example, synchronization instructions may be inserted by a compiler at compile time: when compiling a program, the compiler analyzes the logical relationships between program instructions, inserts synchronization instructions where they are needed, and thereby generates a machine-executable instruction sequence containing the synchronization instructions. As yet another example, synchronization instructions may also be inserted by the hardware when dynamic splitting is performed. For example, in dynamic splitting there is only one memory-access io stream visible to the software, while in hardware execution this io instruction stream is split into an io0 stream and an io1 stream to improve parallelism. In this case, if there is a synchronization instruction in the io instruction stream, the hardware must copy the synchronization instruction into both streams; otherwise the io1 stream would not be governed and errors would occur.
When a synchronization instruction is inserted, its scope can be calculated from the operand ranges of the instructions that need to be synchronized, and the synchronization scope is inserted accordingly. As described above, the granularity of the scope can be chosen differently, and the way the scope is calculated varies correspondingly. For example, when an address range is used to characterize the scope, the operand may be treated as a whole block of data, with its minimum and maximum address values calculated as the scope; or the operand may be divided by dimension, with the minimum and maximum address values of each dimension calculated separately. Those skilled in the art can perform the corresponding address calculation according to the chosen granularity, which is not described in detail here.
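As an illustration, the sketch below shows how such a scope might be computed from a simple operand descriptor when the synchronization instruction is inserted; the descriptor fields (base address, dimension sizes, strides, element size) and the function names are assumptions made for this example only.

    # Hypothetical scope calculation for a 2-D operand descriptor.
    def whole_block_scope(base, dim_sizes, dim_strides, elem_bytes):
        """Treat the operand as one block: a single (min, max) byte range."""
        last = base
        for n, stride in zip(dim_sizes, dim_strides):
            last += (n - 1) * stride
        return [(base, last + elem_bytes - 1)]

    def per_dim_scope(base, dim_sizes, dim_strides, elem_bytes):
        """One (min, max) range per row when rows are stored discontiguously."""
        row_span = (dim_sizes[1] - 1) * dim_strides[1] + elem_bytes
        return [(base + i * dim_strides[0],
                 base + i * dim_strides[0] + row_span - 1)
                for i in range(dim_sizes[0])]

    # A 2 x 16 float32 operand whose two rows start 512 B apart.
    print(whole_block_scope(0, [2, 16], [512, 4], 4))  # [(0, 575)]
    print(per_dim_scope(0, [2, 16], [512, 4], 4))      # [(0, 63), (512, 575)]

The coarse form appears to cover the unused bytes between the two rows, while the per-dimension form exposes the gap, which is exactly why a finer granularity lets more unrelated instructions proceed.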
Continuing with Fig. 5, the synchronization controller 500 further includes an address comparator 520 configured to determine, based on the extracted scope, whether an address dependency exists between the synchronization instruction and the instructions governed by it. For example, the address comparator may compare the operand range of a subsequent instruction with the scope of the registered synchronization instruction to determine whether the two overlap.
Further, the instruction issue unit 510 may determine, based on the result of the address comparator 520, whether to release in advance the governance of the instructions governed by the synchronization instruction.
As previously described, a producer-consumer model is introduced for instruction stream synchronization events. Accordingly, synchronization instructions may have different synchronization modes, such as a producer synchronization mode and a consumer synchronization mode.
In the producer synchronization mode, when the producer (e.g., a first instruction stream) executes the synchronization instruction, the consumer (e.g., a second instruction stream) associated with the synchronization event can execute its subsequent instructions only after the preceding instructions in the instruction stream containing the synchronization instruction have finished executing. In other words, when the first instruction stream executes a producer-mode synchronization instruction, it must wait for the preceding instructions in that instruction stream to finish; during this time, the execution of subsequent instructions in the first instruction stream is not affected.
In the consumer synchronization mode, when the consumer (e.g., the second instruction stream) executes the synchronization instruction, it must block the issue of subsequent instructions in the instruction stream containing the synchronization instruction; those subsequent instructions cannot be executed until the producer (e.g., the first instruction stream) associated with the synchronization event is confirmed to have finished executing its preceding instructions. In other words, when the second instruction stream executes a consumer-mode synchronization instruction, the issue of subsequent instructions in the second instruction stream must be blocked.
Thus, the governance applied to instructions may differ depending on the synchronization mode of the synchronization instruction.
In some embodiments, if the synchronization instruction is a consumer-side instruction, i.e., in the consumer synchronization mode, whether to issue the subsequent instructions after the synchronization instruction in advance is determined based on the synchronization scope. That is, the governance scope of a consumer-side synchronization instruction is the instructions following the synchronization instruction.
As shown in Fig. 5, the synchronization controller 500 may further include one or more instruction issue queues 530 for buffering the to-be-executed instructions issued by the instruction issue unit.
In hardware synchronization within a single processing core, each instruction stream has an instruction queue, and synchronization (sync) instructions are issued to all related instruction queues and then processed by a unified synchronization table (sync table) module in the control unit. The synchronization table module may be included in the instruction decode unit IDU. When an instruction queue encounters a sync.producer instruction of the current stream, it waits for the preceding instructions to commit and then sends a ready signal to the sync table, without blocking the execution of subsequent instructions. When it encounters a sync.consumer instruction of the current stream, it blocks the execution of subsequent instructions until a go issue signal is received from the sync table. After the sync table receives the ready signals of all producers, it sends go signals to the corresponding consumers.
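The ready/go handshake described above can be modeled in software as follows; the class and method names are illustrative assumptions, and the real sync table is a hardware module rather than code.

    # Toy model of the sync-table handshake: producers report ready once their
    # predecessors commit; consumers get go only after all producers are ready.
    class SyncTable:
        def __init__(self, producers, consumers):
            self.pending = set(producers)   # producer queues still owed a ready
            self.consumers = consumers      # queues blocked on this sync event

        def ready(self, producer_queue):
            self.pending.discard(producer_queue)
            if not self.pending:            # all producers done -> release
                for queue in self.consumers:
                    queue.receive_go()

    class Queue:
        def __init__(self, name):
            self.name = name
        def receive_go(self):
            print(f"{self.name}: go received, resume issuing")

    table = SyncTable(producers={"io0", "io1"}, consumers=[Queue("compute")])
    table.ready("io0")   # compute stays blocked, io1 still pending
    table.ready("io1")   # prints: compute: go received, resume issuing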
When the synchronization instruction is a consumer-side instruction, the subsequent instructions must be governed according to the registered scope. For example, the instruction issue unit 510 may further determine, based on the result of the address comparator 520, whether to issue a subsequent instruction following the synchronization instruction into the instruction issue queue 530 in advance.
Specifically, the address comparator 520 may traverse each subsequent instruction after the synchronization instruction in the current instruction stream to determine whether the operand range of the subsequent instruction overlaps the aforementioned scope. The instruction issue unit 510 then issues the subsequent instruction into the corresponding instruction issue queue when the result of the address comparator 520 indicates that there is no overlap, and otherwise blocks the subsequent instruction.
For example, assume that the scope of a consumer-side synchronization instruction is 0 B to 125 B and the operand range of a subsequent instruction lies at addresses of 512 B and above; the consumer-side synchronization instruction then does not block the issue of that subsequent instruction. When it is determined that the subsequent instruction can be issued, it is issued into the corresponding instruction issue queue according to its type. If the operand range of a subsequent instruction falls within or overlaps the scope of the synchronization instruction, that subsequent instruction is blocked. The judgment then continues with the next instruction, which is issued or blocked accordingly.
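A software sketch of this consumer-side decision is given below: an instruction after the consumer synchronization instruction is released into its issue queue early only when its operand range does not intersect the registered scope. The Instr type, field names, and example addresses are assumptions for illustration.

    from dataclasses import dataclass
    from typing import List, Tuple

    Range = Tuple[int, int]

    def intersects(a: List[Range], b: List[Range]) -> bool:
        return any(l1 <= h2 and l2 <= h1 for l1, h1 in a for l2, h2 in b)

    @dataclass
    class Instr:
        name: str
        operand_ranges: List[Range]

    def consumer_filter(pending: List[Instr], sync_scope: List[Range]):
        """Traverse the instructions after the consumer sync; split them into
        instructions issued early and instructions that stay blocked."""
        issued, blocked = [], []
        for instr in pending:
            (blocked if intersects(instr.operand_ranges, sync_scope)
             else issued).append(instr)
        return issued, blocked

    # Consumer sync scope is 0 B..125 B; an instruction touching >= 512 B goes ahead.
    issued, blocked = consumer_filter(
        [Instr("load_a", [(512, 1023)]), Instr("store_b", [(64, 127)])],
        sync_scope=[(0, 125)])
    print([i.name for i in issued], [i.name for i in blocked])
    # ['load_a'] ['store_b']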
In other embodiments, if the synchronization instruction is a producer-side instruction, i.e., in the producer synchronization mode, whether it is necessary to wait for the completion of the preceding instructions before the synchronization instruction is determined based on the synchronization scope. That is, the governance scope of a producer-side synchronization instruction is the instructions preceding the synchronization instruction.
As shown in Fig. 5, the synchronization controller 500 may further include a commit queue 540 for maintaining the instruction execution order and dependencies within the instruction issue queues 530. When the synchronization instruction is a producer-side instruction, the commit queue 540 may further determine, based on the result of the address comparator 520, the preceding instructions before the synchronization instruction whose completion must be waited for.
Specifically, at registration time, the address comparator 520 may traverse each preceding instruction in the current instruction stream that has been issued to the instruction issue queue 530 but has not yet completed (resolved), to determine whether the operand range of that preceding instruction overlaps the scope. Here, issued but not yet completed means that the instruction has been issued and is executing in a downstream execution unit; its instruction lifecycle is still valid. If a preceding instruction has been issued and completed, its lifecycle has ended; it is not within the governance scope of the producer-side synchronization instruction and need not be traversed. For example, for an IO instruction of the data transfer class, its lifecycle ends once the data has been transferred.
During the traversal, the address comparator 520 may compare the operand range of each issued-but-uncompleted preceding instruction with the scope of the registered synchronization instruction. When the two overlap, an identification bit may be generated to mark that preceding instruction.
Specifically, the commit queue 540 may mark a preceding instruction, so as to wait for its completion, when the result of the address comparator 520 indicates that there is overlap, and otherwise leave the preceding instruction unmarked.
For example, assume that the scope of a synchronization instruction is 0 B to 125 B and the operand range of a preceding instruction lies at 512 B and above; the synchronization instruction then does not wait for the completion of that preceding instruction. Assume that the traversal finds four preceding instructions that overlap the scope of the current synchronization instruction and have not yet finished executing; these four preceding instructions need to be marked.
When all the preceding instructions that overlap the scope of the current producer-side synchronization instruction have finished executing, the dependency can be released immediately, without waiting for the completion of the preceding instructions that do not overlap the scope. Releasing the dependency may, for example, return a ready signal to the sync table module. For example, when all four of the preceding instructions marked above have finished executing, a ready signal may be returned to the synchronization table module. After the sync table module receives the ready signals of all producers, it sends go signals to the corresponding consumers.
In other words, as long as all the preceding instructions that overlap the scope of the synchronization instruction have completed, i.e., all the marked preceding instructions have completed, a ready signal may be sent to the control unit, without waiting for all preceding instructions to complete before sending ready.
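The producer-side bookkeeping can be sketched as follows: at registration, only the issued-but-uncompleted preceding instructions whose operand ranges overlap the scope are tagged, and ready is reported as soon as those tagged instructions retire, without waiting for unrelated preceding instructions. The class and callback names are illustrative assumptions.

    from typing import Dict, List, Tuple

    Range = Tuple[int, int]

    def intersects(a: List[Range], b: List[Range]) -> bool:
        return any(l1 <= h2 and l2 <= h1 for l1, h1 in a for l2, h2 in b)

    class ProducerSync:
        def __init__(self, in_flight: Dict[int, List[Range]],
                     scope: List[Range], send_ready):
            # Tag only the in-flight predecessors with an address dependency.
            self.waiting = {instr_id for instr_id, ranges in in_flight.items()
                            if intersects(ranges, scope)}
            self.send_ready = send_ready
            if not self.waiting:      # nothing to wait for: release immediately
                self.send_ready()

        def on_complete(self, instr_id):
            self.waiting.discard(instr_id)
            if not self.waiting:
                self.send_ready()

    # Predecessors 1 and 3 overlap the 0 B..125 B scope; 2 (>= 512 B) does not.
    sync = ProducerSync({1: [(0, 63)], 2: [(512, 575)], 3: [(64, 127)]},
                        scope=[(0, 125)], send_ready=lambda: print("ready"))
    sync.on_complete(1)
    sync.on_complete(3)   # prints "ready" even though instruction 2 is still running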
It will be appreciated that, in one implementation, when a producer-side synchronization instruction is registered, if the traversal finds no preceding instruction that both overlaps the scope of the synchronization instruction and has not yet finished executing, the producer-side synchronization instruction need not be registered. If there are preceding instructions that overlap the scope and have not yet finished executing, the producer-side synchronization instruction needs to be registered, those preceding instructions are marked through registration, and the registration information is updated as the instruction states change.
In other implementations, the producer synchronization instruction may be registered first, and then determined by traversal and address comparison. In particular implementations, one skilled in the art may adjust the order of the relevant steps as desired without departing from the spirit of the disclosed embodiments.
Considering that a consumer-side synchronization instruction itself blocks subsequent instructions, in some implementations the consumer-side synchronization instruction need not be recorded in the commit queue 540. This saves commit queue space, leaving an instruction slot free for other contents. In some implementations, recording of only one valid consumer-side synchronization instruction is supported.
A producer-side synchronization instruction does not block subsequent instructions, so it needs to be recorded in the commit queue 540. Functional instructions following the producer-side synchronization instruction, or subsequent producer-side synchronization instructions, may still continue to be recorded in the commit queue. That is, there may be multiple producer-side synchronization instructions in the commit queue, but at most one consumer-side synchronization instruction.
Therefore, by adding a synchronization scope to the synchronization instruction, finer-grained synchronization governance of instructions can be supported, and the governance of unrelated instructions can be released as early as possible, improving processing efficiency. For example, if a consumer-mode synchronization instruction and the instructions following it need to access the same block of memory (e.g., an NRAM), the following instructions would be blocked under prior techniques. According to embodiments of the present disclosure, however, if the address range of a following instruction does not overlap the scope of the synchronization instruction, that following instruction can still be issued for execution without being blocked.
In some embodiments, the synchronization controller described above may be located within a single processing core. In this scenario, the synchronization controller is, for example, the instruction decode unit (e.g., instruction decode unit 312 of Fig. 3) within a single processing core, used for synchronization governance between different instruction queues within the same processing core. The synchronization instruction is, for example, a sync instruction used for synchronization governance between different instruction queues in the same processing core, where the consumer side and the producer side indicated in the synchronization instruction each correspond to one or more instruction queues or instruction streams. For example, the synchronization instruction may be expressed as sync.Producer(io, compute), Consumer(io, compute), where the producer side and the consumer side come from the io instruction queue (memory-access instruction queue) and the compute instruction queue (operation instruction queue).
In some embodiments, the io instruction queue may be further divided into, for example, a move (data transfer) stream, an io0 stream, and an io1 stream based on a ping-pong pipelining scheme. In the ping-pong scheme, the storage space may be configured with at least two buffers, so that data can be exchanged between one buffer and the external storage circuit while data is being exchanged between the other buffer and the processing circuit. These two buffers may be referred to as the ping buffer space and the pong buffer space, hence ping-pong pipelining. In some embodiments, the operation instruction queue may be further divided into a SIMD (single instruction, multiple data) stream and a SIMT (single instruction, multiple threads) stream.
In other embodiments, the synchronization controller may be a local synchronization controller (e.g., local synchronization module 412 of Fig. 4) in a multi-core processor, used to coordinate synchronization between different processing cores within the same computing cluster; the synchronization controller may also be a global synchronization controller (e.g., global synchronization module 44 of Fig. 4) in a multi-core processor, used to coordinate synchronization between processing cores of different computing clusters. In this scenario, the synchronization instruction is, for example, a barrier instruction or a semaphore, used for synchronization governance between the instruction sequences of different processing cores, where the consumer side and the producer side indicated in the synchronization instruction each correspond to one or more processing cores. For example, the synchronization instruction may be expressed as barrier.Producer(core0, core1), Consumer(core0, core1), where both the producer side and the consumer side include the two processing cores core0 and core1. The scope of such a synchronization instruction is, for example, an address range of a memory shared between different processing cores (e.g., SMEM 408 in the storage core of Fig. 4).
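The sketch below contrasts the two forms in a hypothetical encoding, each extended with the scope field proposed by this disclosure; the field names and layout are assumptions and do not reflect the actual instruction format.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SyncInstr:
        kind: str                     # "sync" (intra-core) or "barrier" (inter-core)
        producers: List[str]          # instruction queues, or processing cores
        consumers: List[str]
        scope: List[Tuple[int, int]]  # synchronization scope as byte address ranges

    # Intra-core: the io and compute queues synchronize over an NRAM region.
    intra = SyncInstr("sync", ["io"], ["compute"], [(0, 125)])
    # Inter-core: cores core0/core1 synchronize over a shared SMEM region.
    inter = SyncInstr("barrier", ["core0", "core1"], ["core0", "core1"],
                      [(0x1000, 0x1FFF)])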
Fig. 6 illustrates an exemplary flow chart of an instruction synchronization control method according to an embodiment of the disclosure. The instruction synchronization control method may be executed by the aforementioned synchronization controller, for example.
As shown, in step 610, the instruction issue unit extracts the synchronization scope of a synchronization instruction in response to the synchronization instruction, where the synchronization scope indicates the scope of the instructions that need to be synchronized.
Next, in step 620, the address comparator determines, based on the extracted scope, whether an address dependency exists between the instructions governed by the synchronization instruction and the synchronization instruction.
Finally, in step 630, the instruction issue unit determines, based on the result of the address comparator, whether to release in advance the governance of the instructions governed by the synchronization instruction.
In some implementations, if the synchronization instruction is a consumer-side instruction, the instruction issue unit determines, based on the result of the address comparator, whether to issue the subsequent instructions following the synchronization instruction into the instruction issue queue in advance.
Specifically, the address comparator may traverse each subsequent instruction to determine whether its operand range overlaps the aforementioned scope. When the result of the address comparator indicates that there is no overlap, the instruction issue unit issues the subsequent instruction into the corresponding instruction issue queue; otherwise, the subsequent instruction is blocked.
In other implementations, if the synchronization instruction is a producer-side instruction, the preceding instructions before the synchronization instruction whose completion must be waited for are determined in the commit queue according to the result of the address comparator.
Specifically, the address comparator traverses each preceding instruction that has been issued to the instruction issue queue but has not yet completed, and determines whether the operand range of the preceding instruction overlaps the aforementioned scope. When the result of the address comparator indicates that there is overlap, the preceding instruction is marked in the commit queue so as to wait for its completion; otherwise, the preceding instruction is not marked.
Those skilled in the art will appreciate that the various features of the synchronous controller described above in connection with fig. 5 may be similarly applied to the synchronous control method of fig. 6 and are therefore not repeated here.
In summary, by adding a synchronization scope to the synchronization instruction, the synchronization governance of instructions can be released in advance based on the indication of the synchronization scope, thereby reducing unnecessary waiting time and improving processing efficiency.
The disclosed embodiments also provide a processor including the aforementioned synchronization controller for executing the instruction synchronization control method. The disclosed embodiments also provide a chip that may include a processor of any of the embodiments described above in connection with the accompanying drawings. Further, the present disclosure also provides a board that may include the foregoing chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the described actions. Thus, in light of the present disclosure or teachings, those of ordinary skill in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as alternative embodiments, i.e., the actions or modules involved are not necessarily required for the implementation of one or more aspects of this disclosure. In addition, depending on the scenario, the descriptions of different embodiments of the present disclosure have different emphases. In view of this, those skilled in the art will appreciate that, for portions not described in detail in one embodiment of the disclosure, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are split in consideration of the logic function, and there may be another splitting manner when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected as needed to achieve the objectives of the embodiments of the disclosure. Moreover, in some scenarios, multiple units in the embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as central processing units, GPU, FPGA, DSP, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, variable resistance memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
The foregoing may be better understood in light of the following clauses; illustrative code sketches follow the clauses.
Clause 1, a synchronization controller, comprising: an instruction issue unit configured to, in response to a synchronization instruction, extract a synchronization scope field of the synchronization instruction, the synchronization scope field indicating the scope of the instructions that need to be synchronized; an address comparator configured to determine, based on the scope, whether an address dependency exists between the synchronization instruction and an instruction governed by it; and the instruction issue unit further configured to determine, according to the result of the address comparator, whether to release early the hold that the synchronization instruction places on the instructions it governs.
Clause 2, the synchronization controller of Clause 1, further comprising: an instruction issue queue for buffering instructions to be executed; wherein, when the synchronization instruction is a consumer-side instruction, the instruction issue unit is further configured to determine, according to the result of the address comparator, whether to issue subsequent instructions that follow the synchronization instruction to the instruction issue queue ahead of time.
Clause 3, the synchronization controller of Clause 2, wherein the address comparator is further configured to traverse each subsequent instruction and determine whether the operand range of the subsequent instruction overlaps the scope; and the instruction issue unit is further configured to issue the subsequent instruction to the corresponding instruction issue queue when the result of the address comparator indicates no overlap, and to block the subsequent instruction otherwise.
Clause 4, the synchronization controller of any of Clauses 2-3, further comprising: a commit queue for maintaining the execution order of, and the dependencies among, instructions in the instruction issue queue, the commit queue not recording consumer-side instructions.
Clause 5, the synchronization controller of Clause 4, wherein, when the synchronization instruction is a producer-side instruction, the commit queue is further configured to determine, according to the result of the address comparator, which preceding instructions ahead of the synchronization instruction the synchronization instruction must wait to complete.
Clause 6, the synchronization controller of Clause 5, wherein the address comparator is further configured to traverse each preceding instruction that has been issued to the instruction issue queue but has not yet completed, and to determine whether the operand range of the preceding instruction overlaps the scope; and the commit queue is further configured to mark the preceding instruction to be waited for when the result of the address comparator indicates an overlap, and not to mark it otherwise.
Clause 7, the synchronization controller of any of Clauses 5-6, wherein the commit queue is further configured to record producer-side instructions.
Clause 8, the synchronization controller of any of Clauses 1-7, wherein the scope is calculated from the scope of the instructions that need to be synchronized and written into the synchronization scope field when the synchronization instruction is inserted.
Clause 9, the synchronization controller of any of Clauses 1-8, wherein the synchronization instruction is inserted by a user at programming time, by a compiler at compile time, or by hardware when instructions are dynamically split.
Clause 10, the synchronization controller of any of Clauses 1-9, wherein the scope is represented by either or both of: an address range and a register number.
Clause 11, the synchronization controller of any of Clauses 1-10, wherein the synchronization controller is an instruction decode unit within a single processing core, and the synchronization instruction is used for synchronization between different instruction queues within that processing core; or the synchronization controller is a local or global synchronization controller in a multi-core processor, and the synchronization instruction is used for synchronization between the instruction sequences of different processing cores.
Clause 12, an instruction synchronization control method, comprising: an instruction issue unit, in response to a synchronization instruction, extracting a synchronization scope field of the synchronization instruction, the synchronization scope field indicating the scope of the instructions that need to be synchronized; an address comparator determining, based on the scope, whether an address dependency exists between the synchronization instruction and an instruction governed by it; and the instruction issue unit determining, based on the result of the address comparator, whether to release early the hold that the synchronization instruction places on the instructions it governs.
Clause 13, the method of Clause 12, wherein the determining whether to release the hold early comprises: if the synchronization instruction is a consumer-side instruction, determining, based on the result of the address comparator, whether to issue subsequent instructions that follow the synchronization instruction to an instruction issue queue ahead of time.
Clause 14, the method of Clause 13, further comprising: the address comparator traversing each subsequent instruction and determining whether the operand range of the subsequent instruction overlaps the scope; and the instruction issue unit issuing the subsequent instruction to the corresponding instruction issue queue when the result of the address comparator indicates no overlap, and blocking the subsequent instruction otherwise.
Clause 15, the method of any of Clauses 13-14, further comprising: not recording consumer-side instructions in a commit queue.
Clause 16, the method of Clause 15, further comprising: if the synchronization instruction is a producer-side instruction, determining in the commit queue, according to the result of the address comparator, which preceding instructions ahead of the synchronization instruction must be waited for.
Clause 17, the method of Clause 16, further comprising: the address comparator traversing each preceding instruction that has been issued to the instruction issue queue but has not yet completed, and determining whether the operand range of the preceding instruction overlaps the scope; and marking the preceding instruction in the commit queue to be waited for when the result of the address comparator indicates an overlap, and not marking it otherwise.
Clause 18, the method of any of Clauses 16-17, further comprising: recording the producer-side instruction in the commit queue.
Clause 19, the method of any of Clauses 12-18, wherein the scope is calculated from the scope of the instructions that need to be synchronized and written into the synchronization scope field when the synchronization instruction is inserted.
Clause 20, the method of any of Clauses 12-19, wherein the synchronization instruction is inserted by a user at programming time, by a compiler at compile time, or by hardware when instructions are dynamically split.
Clause 21, the method of any of Clauses 12-20, wherein the scope is represented by either or both of: an address range and a register number.
Clause 22, the method of any of Clauses 12-21, wherein the synchronization instruction is used for synchronization between different instruction queues within the same processing core or between the instruction streams of different processing cores.
Clause 23, a processor comprising a synchronization controller according to any of clauses 1-11.
Clause 24, a chip comprising the processor of clause 23.
Clause 25, a board card comprising the chip of clause 24.
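By way of illustration only, and not as part of the claimed hardware, the following minimal Python sketch mirrors the control flow of Clauses 1-7: an address comparator checks whether an instruction's operand range overlaps the synchronization scope carried by a synchronization instruction, the issue unit releases non-overlapping subsequent instructions early on the consumer side, and the commit queue marks only the overlapping, still-outstanding predecessors on the producer side. The names (SyncController, Instruction, ranges_overlap) and the list-based queues are assumptions made for the sketch and do not appear in the clauses.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed representation of a synchronization scope: a half-open address range.
AddressRange = Tuple[int, int]

def ranges_overlap(a: AddressRange, b: AddressRange) -> bool:
    """Address-comparator primitive: do two half-open address ranges intersect?"""
    return a[0] < b[1] and b[0] < a[1]

@dataclass
class Instruction:
    name: str
    operand_range: Optional[AddressRange] = None  # None: no memory operand to compare
    completed: bool = False

@dataclass
class SyncController:
    issue_queue: List[Instruction] = field(default_factory=list)   # buffers instructions to be executed
    commit_queue: List[Instruction] = field(default_factory=list)  # producer-side bookkeeping only

    def handle_consumer_sync(self, sync_scope: AddressRange,
                             subsequent: List[Instruction]) -> List[Instruction]:
        """Consumer-side sync: issue early any subsequent instruction whose operand
        range does not overlap the scope; return the ones that stay blocked."""
        blocked = []
        for instr in subsequent:
            if instr.operand_range is not None and ranges_overlap(instr.operand_range, sync_scope):
                blocked.append(instr)            # address dependency: keep it behind the barrier
            else:
                self.issue_queue.append(instr)   # no dependency: release it ahead of the barrier
        return blocked

    def handle_producer_sync(self, sync_scope: AddressRange) -> List[Instruction]:
        """Producer-side sync: mark only the issued-but-unfinished predecessors whose
        operand ranges overlap the scope; the sync waits for exactly those."""
        must_wait = []
        for instr in self.issue_queue:
            if (not instr.completed and instr.operand_range is not None
                    and ranges_overlap(instr.operand_range, sync_scope)):
                must_wait.append(instr)
        # The producer-side sync itself is recorded in the commit queue.
        self.commit_queue.append(Instruction(name="producer_sync", operand_range=sync_scope))
        return must_wait
```

For instance, if an issued copy into addresses 0x1000-0x1400 has not finished, a producer-side sync whose scope is 0x2000-0x2100 marks nothing and can complete at once, while a sync scoped to 0x1000-0x1400 waits for that copy alone. In real hardware the comparator and the queues would be circuits rather than Python lists; the sketch only traces the decision logic.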
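Similarly, Clauses 8-10 (and 19-21) leave open how the scope is computed when the synchronization instruction is inserted, whether by the programmer, the compiler, or hardware. The fragment below is one possible, deliberately simple policy, assuming an address-range representation of the scope (Clause 10 equally permits register numbers): take the smallest range covering every operand range of the instructions to be synchronized and write it into the scope field. The function name compute_sync_scope is an assumption for this sketch.

```python
from typing import Iterable, Tuple

AddressRange = Tuple[int, int]  # assumed half-open [start, end) representation

def compute_sync_scope(operand_ranges: Iterable[AddressRange]) -> AddressRange:
    """Cover every operand range of the instructions to be synchronized with one
    enclosing address range, to be written into the synchronization scope field."""
    ranges = list(operand_ranges)
    if not ranges:
        raise ValueError("a synchronization instruction must govern at least one instruction")
    return (min(r[0] for r in ranges), max(r[1] for r in ranges))

# Example: a compiler inserting a sync over two stores encodes their combined range.
assert compute_sync_scope([(0x1000, 0x1400), (0x2000, 0x2100)]) == (0x1000, 0x2100)
```

A single enclosing range is conservative: it also covers the gap between the two buffers and can therefore create false overlaps. An implementation could instead carry a short list of ranges or register numbers; the clauses do not prescribe the encoding.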
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the disclosure; the above description of the embodiments is intended only to aid understanding of the methods of the disclosure and their core ideas. Moreover, those of ordinary skill in the art may, in light of the disclosure, make changes to the specific implementations and to the scope of application; in view of the foregoing, the contents of this specification should not be construed as limiting the present disclosure.

Claims (25)

1. A synchronization controller, comprising:
an instruction issue unit configured to, in response to a synchronization instruction, extract a synchronization scope field of the synchronization instruction, wherein the synchronization scope field indicates the scope of the instructions that need to be synchronized; and
an address comparator configured to determine, based on the scope, whether an address dependency exists between the synchronization instruction and an instruction governed by it;
wherein the instruction issue unit is further configured to determine, according to the result of the address comparator, whether to release early the hold that the synchronization instruction places on the instructions it governs.
2. The synchronization controller of claim 1, further comprising:
an instruction issue queue for buffering instructions to be executed;
wherein, when the synchronization instruction is a consumer-side instruction, the instruction issue unit is further configured to:
determine, according to the result of the address comparator, whether to issue subsequent instructions that follow the synchronization instruction to the instruction issue queue ahead of time.
3. The synchronization controller of claim 2, wherein:
the address comparator is further configured to traverse each subsequent instruction and determine whether the operand range of the subsequent instruction overlaps the scope; and
the instruction issue unit is further configured to issue the subsequent instruction to the corresponding instruction issue queue when the result of the address comparator indicates no overlap, and to block the subsequent instruction otherwise.
4. The synchronization controller of any one of claims 2-3, further comprising:
a commit queue for maintaining the execution order of, and the dependencies among, instructions in the instruction issue queue, the commit queue not recording consumer-side instructions.
5. The synchronization controller of claim 4, wherein:
when the synchronization instruction is a producer-side instruction, the commit queue is further configured to:
determine, according to the result of the address comparator, which preceding instructions ahead of the synchronization instruction the synchronization instruction must wait to complete.
6. The synchronization controller of claim 5, wherein:
the address comparator is further configured to traverse each preceding instruction that has been issued to the instruction issue queue but has not yet completed, and to determine whether the operand range of the preceding instruction overlaps the scope; and
the commit queue is further configured to mark the preceding instruction to be waited for when the result of the address comparator indicates an overlap, and not to mark it otherwise.
7. The synchronization controller of any one of claims 5-6, wherein:
the commit queue is further configured to record producer-side instructions.
8. The synchronization controller of any one of claims 1-7, wherein the scope is calculated from the scope of the instructions that need to be synchronized and written into the synchronization scope field when the synchronization instruction is inserted.
9. The synchronization controller of any one of claims 1-8, wherein the synchronization instruction is inserted by a user at programming time, by a compiler at compile time, or by hardware when instructions are dynamically split.
10. The synchronization controller of any one of claims 1-9, wherein the scope is represented by either or both of: an address range and a register number.
11. The synchronization controller of any one of claims 1-10, wherein:
the synchronization controller is an instruction decode unit within a single processing core, and the synchronization instruction is used for synchronization between different instruction queues within that processing core; or
the synchronization controller is a local or global synchronization controller in a multi-core processor, and the synchronization instruction is used for synchronization between the instruction sequences of different processing cores.
12. An instruction synchronization control method, comprising:
an instruction issue unit, in response to a synchronization instruction, extracting a synchronization scope field of the synchronization instruction, wherein the synchronization scope field indicates the scope of the instructions that need to be synchronized;
an address comparator determining, based on the scope, whether an address dependency exists between the synchronization instruction and an instruction governed by it; and
the instruction issue unit determining, based on the result of the address comparator, whether to release early the hold that the synchronization instruction places on the instructions it governs.
13. The method of claim 12, wherein the instruction issue unit determining whether to release the hold early based on the result of the address comparator comprises:
if the synchronization instruction is a consumer-side instruction, determining, based on the result of the address comparator, whether to issue subsequent instructions that follow the synchronization instruction to an instruction issue queue ahead of time.
14. The method of claim 13, further comprising:
the address comparator traversing each subsequent instruction and determining whether the operand range of the subsequent instruction overlaps the scope; and
the instruction issue unit issuing the subsequent instruction to the corresponding instruction issue queue when the result of the address comparator indicates no overlap, and blocking the subsequent instruction otherwise.
15. The method of any one of claims 13-14, further comprising:
not recording consumer-side instructions in a commit queue.
16. The method of claim 15, further comprising:
if the synchronization instruction is a producer-side instruction, determining in the commit queue, according to the result of the address comparator, which preceding instructions ahead of the synchronization instruction must be waited for.
17. The method of claim 16, further comprising:
the address comparator traversing each preceding instruction that has been issued to the instruction issue queue but has not yet completed, and determining whether the operand range of the preceding instruction overlaps the scope; and
marking the preceding instruction in the commit queue to be waited for when the result of the address comparator indicates an overlap, and not marking it otherwise.
18. The method of any one of claims 16-17, further comprising:
recording the producer-side instruction in the commit queue.
19. The method of any one of claims 12-18, wherein the scope is calculated from the scope of the instructions that need to be synchronized and written into the synchronization scope field when the synchronization instruction is inserted.
20. The method of any one of claims 12-19, wherein the synchronization instruction is inserted by a user at programming time, by a compiler at compile time, or by hardware when instructions are dynamically split.
21. The method of any one of claims 12-20, wherein the scope is represented by either or both of: an address range and a register number.
22. The method of any one of claims 12-21, wherein the synchronization instruction is used for synchronization between different instruction queues within the same processing core or between the instruction streams of different processing cores.
23. A processor comprising a synchronization controller according to any one of claims 1-11.
24. A chip comprising the processor of claim 23.
25. A board card comprising the chip of claim 24.
CN202211067771.3A 2022-09-01 2022-09-01 Instruction synchronous control method, synchronous controller, processor, chip and board card Pending CN117667211A (en)

Priority Applications (1)

Application Number: CN202211067771.3A (published as CN117667211A); Priority Date: 2022-09-01; Filing Date: 2022-09-01; Title: Instruction synchronous control method, synchronous controller, processor, chip and board card


Publications (1)

Publication Number: CN117667211A; Publication Date: 2024-03-08

Family

ID=90079598

Family Applications (1)

Application Number: CN202211067771.3A; Status: Pending; Publication: CN117667211A; Priority Date: 2022-09-01; Filing Date: 2022-09-01; Title: Instruction synchronous control method, synchronous controller, processor, chip and board card

Country Status (1)

Country: CN; Publication: CN117667211A


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination