CN101794271B

CN101794271B - Implementation method and device of consistency of multi-core internal memory

Info

Publication number: CN101794271B
Application number: CN2010101376998A
Authority: CN
Inventors: 周昔平; 李靖宇
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2010-03-31
Filing date: 2010-03-31
Publication date: 2012-05-23
Anticipated expiration: 2030-03-31
Also published as: CN101794271A; GB2479267A; US20120079209A1; US8799584B2; GB2479267B; GB201105414D0

Abstract

An embodiment of the invention discloses a method and a device for implementing multi-core memory consistency. The method comprises the following steps: a second-level cache of a first processor group receives a control signal for reading first data by the first processor group; if the first data is currently maintained by a second processor group, the first data is read from a first-level cache of the second processor group through a fast consistency interface of that first-level cache, wherein the second-level cache of the first processor group is connected with the fast consistency interface of the first-level cache of the second processor group; and the read first data is provided to the first processor group for processing through the second-level cache of the first processor group. The technical scheme can solve the problem of memory consistency among clusters of the ARM (Advanced RISC Machine) Cortex-A9 architecture.

Description

Method and device for realizing multi-core memory consistency

Technical Field

The invention relates to the technical field of computers, in particular to a method for realizing multi-core memory consistency and multi-core processing equipment.

Background

At present, ARM has introduced the Cortex-A9 multi-core (MPCore) processor. One processor group (cluster) of the Cortex-A9 MPCore supports 4 processor cores, and the cores within one cluster keep their memory accesses consistent (memory coherence) through a Snoop Control Unit (SCU).

A cluster is usually provided with a first-level cache (L1 cache) and, to further improve data-read efficiency, a second-level cache (L2 cache). However, the first-level cache of the Cortex-A9 MPCore supports only write-back operation and not write-through operation; that is, the L2 cache is not updated synchronously after the L1 cache is updated. This gives rise to a cache consistency problem among multiple clusters.
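The difference between the two write policies can be illustrated with a minimal Python sketch. It is a toy model only: the `Cache` class, the `write` function, and the address `0x100` are hypothetical names chosen for illustration and do not describe the actual Cortex-A9 hardware.

```python
class Cache:
    """Toy cache level: address -> value, plus the set of dirty addresses."""

    def __init__(self):
        self.lines = {}
        self.dirty = set()


def write(l1, l2, addr, value, write_through):
    """Store a value in L1; update L2 at once only under write-through."""
    l1.lines[addr] = value
    if write_through:
        l2.lines[addr] = value
    else:
        # Write-back: L2 keeps its old value until the line is written back.
        l1.dirty.add(addr)


# Write-back, the only policy the text attributes to the Cortex-A9 L1 cache.
wb_l1, wb_l2 = Cache(), Cache()
wb_l2.lines[0x100] = 1
write(wb_l1, wb_l2, 0x100, 2, write_through=False)
l2_is_stale = wb_l2.lines[0x100] != wb_l1.lines[0x100]

# Write-through, shown only for contrast.
wt_l1, wt_l2 = Cache(), Cache()
wt_l2.lines[0x100] = 1
write(wt_l1, wt_l2, 0x100, 2, write_through=True)
l2_in_sync = wt_l2.lines[0x100] == wt_l1.lines[0x100]
```

In the write-back case, any reader outside the L1 cache would still see the old value at `0x100`, which is the kind of inconsistency described above.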

The ARM Cortex-A9 series currently provides no solution for joint operation of more than 4 cores, which is a considerable inconvenience for applications that need more than 4 cores. For example, suppose a product application needs an architecture with more than 4 cores, and one task is divided across multiple cores to be processed as a pipeline. If processing of a task starts on one cluster and ends on another, the pipeline crosses clusters, and the cache consistency problem between the clusters must be solved.

Disclosure of Invention

The embodiment of the invention provides a method and a device for realizing multi-core memory consistency, which can solve the problem of memory consistency between clusters of the ARM Cortex-A9 architecture.

In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:

A method for implementing multi-core memory consistency comprises the following steps:

a second-level cache of the first processor group receives a control signal for reading first data by the first processor group;

if the first data is currently maintained by the second processor group, reading the first data in a first-level cache of the second processor group through a fast consistency interface of the first-level cache of the second processor group, wherein a second-level cache of the first processor group is connected with the fast consistency interface of the first-level cache of the second processor group;

and providing the read first data to the first processor group for processing through the second-level cache of the first processor group.

Another method for implementing multi-core memory consistency comprises the following steps:

a second-level cache of the first processor group receives a control signal for reading second data by the first processor group;

if the second data is currently maintained in the shared cache, reading the second data from the shared cache, wherein a fast consistency interface of a first-level cache of the first processor group is connected to the shared cache, and a second-level cache of the first processor group is connected to the shared cache;

and providing the read second data to the first processor group for processing through the second-level cache of the first processor group.

A multi-core processing device, comprising:

a first processor group, a second processor group, a second-level cache of the first processor group, and a second-level cache of the second processor group; wherein the second-level cache of the first processor group is connected with the fast consistency interface of the first-level cache of the second processor group;

the second-level cache of the first processor group is used for receiving a control signal for reading the first data by the first processor group; if the first data is currently maintained by the second processor group, reading the first data in a first-level cache of the second processor group through a fast consistency interface of the first-level cache of the second processor group; and providing the read first data to the first processor group for processing through the second-level cache of the first processor group.

A multi-core processing device, comprising:

a first processor group, a second processor group, a shared cache, a second level cache of the first processor group, and a second level cache of the second processor group; wherein, the second-level cache of the first processor group and the fast consistency interface of the first-level cache of the first processor group are connected with the shared cache;

the second-level cache of the first processor group is used for receiving a control signal for reading the second data by the first processor group; if the second data is currently maintained in the shared cache, reading the second data from the shared cache; and providing the read second data to the first processor group for processing through the second-level cache of the first processor group.

As can be seen from the above, in one scheme of the embodiment of the present invention, if multiple processor groups in a system need to perform joint processing across processor groups, then when a first processor group needs to process data currently maintained by a second processor group, the data is read from the first-level cache of the second processor group through the ACP interface of that first-level cache and is then processed. Because the data is read from the first-level cache of the processor group that currently maintains it, the validity of the data is ensured. This processing mode ensures memory consistency among the multiple processor groups and in turn makes joint work of more than 4 cores possible.

In another scheme of the embodiment of the present invention, if multiple processor groups in a system need to perform joint processing across processor groups, a shared cache is added, and both the ACP interface of the first-level cache and the second-level cache of each processor group are connected to the added shared cache. When, for example, a first processor group needs to process data currently maintained in the shared cache, the data is read from the shared cache and provided to the first processor group for processing through the second-level cache of the first processor group. Because data that needs joint processing across processor groups can be maintained in the shared cache, and each processor group obtains such data from the shared cache for processing, this processing mode ensures memory consistency among the multiple processor groups and in turn makes joint work of more than 4 cores possible.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of an external structure of a cluster provided in an embodiment of the present invention;

FIG. 2 is a flowchart of a method for implementing multi-core memory consistency according to an embodiment of the present invention;

FIG. 3-a is a schematic diagram of a multi-core processing device according to a second embodiment of the present invention;

FIG. 3-b is a schematic diagram of another multi-core processing device according to the second embodiment of the present invention;

FIG. 3-c is a flowchart of a method for implementing multi-core memory consistency according to the second embodiment of the present invention;

FIG. 3-d is a schematic diagram of another multi-core processing device according to the second embodiment of the present invention;

FIG. 4 is a flowchart of a method for implementing multi-core memory consistency according to a third embodiment of the present invention;

FIG. 5-a is a schematic diagram of a multi-core processing device according to a fourth embodiment of the present invention;

FIG. 5-b is a schematic diagram of another multi-core processing device according to the fourth embodiment of the present invention;

FIG. 5-c is a flowchart of a method for implementing multi-core memory consistency according to the fourth embodiment of the present invention;

FIG. 6-a is a schematic diagram of a multi-core processing device according to a fifth embodiment of the present invention;

FIG. 6-b is a schematic diagram of another multi-core processing device according to the fifth embodiment of the present invention;

FIG. 7-a is a schematic diagram of a multi-core processing device according to a sixth embodiment of the present invention;

FIG. 7-b is a schematic diagram of another multi-core processing device according to the sixth embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a method and a device for realizing multi-core memory consistency, which can solve the problem of memory consistency among multiple clusters of the ARM Cortex-A9 architecture.

Details are given below.

Referring first to FIG. 1, FIG. 1 is a schematic diagram of an external structure of a cluster (including 4 cores) of the Cortex-A9 architecture. The cluster comprises: 4 processor cores, a first-level cache, an SCU, and an Accelerator Coherency Port (ACP). The SCU of the cluster can be connected to a second-level cache associated with the cluster. An external device may access the first-level cache of the cluster through the ACP interface of the cluster; for example, the external device may read data from the first-level cache of the cluster through the ACP interface, or write data to the first-level cache of the cluster through the ACP interface.

The scheme of the embodiment of the invention is mainly based on the cluster with the structure, and solves the problem of memory consistency among the clusters when a plurality of clusters work jointly.

Example one

The following takes as an example a multi-core processing device based on the ARM Cortex-A9 architecture that includes at least a first processor group and a second processor group working jointly, and describes how multi-core memory consistency is implemented for the first processor group. This function is implemented mainly by the second-level cache of the first processor group. Referring to FIG. 2, an embodiment of the method for implementing multi-core memory consistency in an embodiment of the present invention may include:

210. The second-level cache of the first processor group receives a control signal for the first processor group to read the first data.

In one application scenario, when the first processor group needs to process the first data, it may issue a control signal for reading the first data. Here, for example, the first data is currently maintained by the second processor group, which may hold the first data in its first-level cache.

220. If the first data is currently maintained by the second processor group, the second-level cache of the first processor group reads the first data in the first-level cache of the second processor group through an ACP interface of the first-level cache of the second processor group, wherein the second-level cache of the first processor group is connected with the ACP interface of the first-level cache of the second processor group.

In practical application, the second level cache associated with the cluster may be provided with, for example, an Advanced Extensible Interface (AXI) bus Interface and an address filter (M-F) Interface, where the second level cache may be connected to a system bus through the AXI bus Interface and may also be connected to other devices through the M-F Interface.

For example, the second-level cache of the first processor group is directly connected, through its M-F interface, with the ACP interface of the first-level cache of the second processor group, or the two are connected through a bus (that is, the M-F interface of the second-level cache of the first processor group and the ACP interface of the first-level cache of the second processor group are each connected to the bus. Under this architecture, the M-F interfaces of the second-level caches and the ACP interfaces of the first-level caches of one or more other processor groups can also be connected to the bus, so that the multiple processor groups can communicate with one another).

At this time, the first data cached in the first-level cache of the second processor group may be read through the ACP interface of the first-level cache of the second processor group, and the read first data may be original data or result data processed by the second processor group or another processor group one or more times.

In one application scenario, when the first data is currently maintained by the second processor group, the second-level cache of the first processor group may send a control signal for reading the first data, with the operation attribute cacheable, to the first-level cache of the second processor group through the fast consistency interface (the ACP interface) of that first-level cache, and read the first data from the first-level cache of the second processor group, so as to obtain higher reading efficiency.

230. The second-level cache of the first processor group provides the read first data to the first processor group for processing.
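Steps 210 to 230 can be sketched as a small Python model. This is a sketch under stated assumptions rather than the patented implementation: the class names, the method names, and in particular the `maintained_by` argument (a stand-in for however the second-level cache learns which processor group maintains the data, which the text does not specify) are hypothetical.

```python
class L1Cache:
    """Toy first-level cache of one processor group."""

    def __init__(self, owner):
        self.owner = owner
        self.lines = {}

    def acp_read(self, addr):
        """Models an external read arriving through the ACP interface."""
        return self.lines[addr]


class L2Cache:
    """Toy second-level cache wired to the ACP of another group's L1."""

    def __init__(self, owner, remote_l1):
        self.owner = owner
        self.remote_l1 = remote_l1
        self.lines = {}

    def handle_read(self, addr, maintained_by):
        # Step 210: a read control signal from the own processor group arrives.
        if maintained_by == self.remote_l1.owner:
            # Step 220: read the data from the remote L1 through its ACP.
            value = self.remote_l1.acp_read(addr)
            self.lines[addr] = value
        else:
            # Assumption: otherwise the data is served from this L2 itself.
            value = self.lines[addr]
        # Step 230: hand the data to the own processor group.
        return value


l1_second = L1Cache("second")
l1_second.lines[0x40] = "first data"
l2_first = L2Cache("first", remote_l1=l1_second)
result = l2_first.handle_read(0x40, maintained_by="second")
```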

Of course, if the second processor group also needs to participate in the cross-processor group processing, the second-level cache of the second processor group may also be directly connected to the ACP interface of the first-level cache of the first processor group through the M-F interface, or connected through a bus (i.e., the M-F interface of the second-level cache of the second processor group and the ACP interface of the first-level cache of the first processor group are respectively connected to the bus). The function of implementing the multi-core memory consistency of the second processor group may also be implemented by the second level cache of the second processor group, and the implementation process may be similar to that of the second level cache of the first processor group, and is not described herein again.

It can be understood that, in this embodiment, the second-level cache of the first processor group is mainly responsible for implementing multi-core memory consistency. Of course, based on the above idea, other components may also be used to implement multi-core memory consistency.

As can be seen from the above, in this embodiment, if multiple processor groups in the system need to perform joint processing across processor groups, then when a first processor group needs to process data currently maintained by a second processor group, the data is read from the first-level cache of the second processor group through the ACP interface of that first-level cache and is then processed. Because the data is read from the first-level cache of the processor group that currently maintains it, the validity of the data is ensured. This processing mode ensures memory consistency among the multiple processor groups and in turn makes joint work of more than 4 cores possible.

Further, after the first processor group processes the first data, the processing result of the first processor group on the first data may be obtained. The processing result may be written into the first-level cache or the second-level cache of the second processor group through the ACP interface of the first-level cache of the second processor group (so that the result is maintained in the first-level cache or the second-level cache of the second processor group for subsequent processing by the second processor group or other processor groups); alternatively, the processing result may be cached in the second-level cache of the first processor group.

Example two

In order to better understand the technical scheme of the embodiment of the invention, the following describes, as an example, a process of cross-cluster processing in a multi-core processing device that is based on the ARM Cortex-A9 architecture and includes at least cluster-A and cluster-B working jointly.

The connection architecture of cluster-A and cluster-B of the multi-core processing device in this embodiment may be designed as shown in FIG. 3-a: the SCU of cluster-A is connected to its associated second-level cache, the AXI bus interface of the second-level cache of cluster-A is connected to the system bus, and the M-F interface of the second-level cache of cluster-A is directly connected to the ACP interface of the first-level cache of cluster-B; the SCU of cluster-B is connected to its associated second-level cache, the AXI bus interface of the second-level cache of cluster-B is connected to the system bus, and the M-F interface of the second-level cache of cluster-B is directly connected to the ACP interface of the first-level cache of cluster-A. Alternatively, as shown in FIG. 3-b, the M-F interfaces of the second-level caches of cluster-A and cluster-B may each be connected to a bus, and the ACP interfaces of the first-level caches of cluster-A and cluster-B may each be connected to the same bus; that is, the M-F interface of the second-level cache of cluster-A and the ACP interface of the first-level cache of cluster-B are connected through the bus, and the M-F interface of the second-level cache of cluster-B and the ACP interface of the first-level cache of cluster-A are connected through the bus. Based on the architecture shown in FIG. 3-b, the multi-core processing device can be further extended to connect more clusters.
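To illustrate how the bus-based architecture of FIG. 3-b extends to more clusters, the following Python sketch enumerates the logical paths the bus provides. It is only an illustration: the function name `bus_links` and the representation of a path as a `(reader, target)` pair are assumptions, where each pair means "the M-F interface of the reader's second-level cache can reach the ACP interface of the target's first-level cache".

```python
from itertools import permutations


def bus_links(clusters):
    """Return every ordered (reader, target) pair of distinct clusters."""
    return list(permutations(clusters, 2))


two_cluster_links = bus_links(["A", "B"])
three_cluster_links = bus_links(["A", "B", "C"])
```

With two clusters this yields the two paths described for FIG. 3-a and FIG. 3-b; adding a third cluster to the bus yields six paths without any additional point-to-point wiring.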

Taking the processing of the data N as an example, referring to fig. 3-c, another embodiment of the method for implementing multi-core memory consistency in the embodiment of the present invention may include:

301. When cluster-A needs to process the data N, cluster-A may issue a control signal for reading the data N. It is assumed here that the data N is currently maintained by cluster-B.

302. The second-level cache of cluster-A receives the control signal for reading the data N from cluster-A. If the second-level cache of cluster-A finds that the data N is currently maintained by cluster-B, it reads the data N from the first-level cache of cluster-B through the ACP interface of the first-level cache of cluster-B.

The second-level cache and the first-level cache in the embodiment of the present invention may be understood as hardware units with a certain data processing capability, rather than units used only for storing data. Data can, for example, be read or rewritten through the second-level cache of cluster-A.

In actual operation, if the data N is currently cached in the first-level cache of cluster-B, the second-level cache of cluster-A reads the data N directly from the first-level cache of cluster-B into the second-level cache of cluster-A through the ACP interface of the first-level cache of cluster-B. If instead the data N is currently cached in the second-level cache of cluster-B, then when the first-level cache of cluster-B learns that the second-level cache of cluster-A requests to read the data N, the first-level cache of cluster-B may first read the data N into itself, after which the second-level cache of cluster-A reads the data N from the first-level cache of cluster-B into the second-level cache of cluster-A through the ACP interface of the first-level cache of cluster-B.
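The two cases above can be sketched in Python as follows. This is a toy model under stated assumptions: `ClusterCaches` and `acp_read` are hypothetical names, and the caches are plain dictionaries rather than a model of real cache lines.

```python
class ClusterCaches:
    """Toy model of one cluster's first-level and second-level caches."""

    def __init__(self):
        self.l1 = {}
        self.l2 = {}

    def acp_read(self, addr):
        """Serve a read arriving through the ACP interface of the L1 cache."""
        if addr not in self.l1:
            # Data currently held only in L2: bring it into L1 first.
            self.l1[addr] = self.l2[addr]
        return self.l1[addr]


cluster_b = ClusterCaches()
cluster_b.l2[0x40] = "N"          # data N starts in cluster-B's L2 only
cluster_a_l2 = {}                 # second-level cache of cluster-A
cluster_a_l2[0x40] = cluster_b.acp_read(0x40)
```

If the address is in neither cache of cluster-B, this sketch simply raises `KeyError`; the text does not describe that case.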

When a cluster reads data from, or writes data to, a module other than its own first-level cache, it sets the operation attribute of the control signal for reading or writing the data to uncacheable (Non-Cacheable). Reads and writes whose operation attribute is cacheable (Cacheable) are comparatively more efficient, and the amount of data transferred in a single read or write differs between the two. The operation attribute of the control signal for reading or writing the data also reflects a read-write attribute of the data.

In this embodiment, when cluster-A needs to read the data N, the data N is not currently maintained in the first-level cache of cluster-A; that is, cluster-A needs to read the data N from a module other than its own first-level cache. Cluster-A therefore sets the operation attribute of the control signal for reading the data N to uncacheable.

Here, after receiving from cluster-A the control signal for reading the data N whose operation attribute is uncacheable, the second-level cache of cluster-A modifies the operation attribute of that control signal to cacheable (Cacheable) and reads the data N from the first-level cache of cluster-B through the ACP interface of that first-level cache. In other words, the second-level cache of cluster-A sends a control signal for reading the data N, with the operation attribute cacheable, to the first-level cache of cluster-B through the ACP interface of the first-level cache of cluster-B, and reads the data N from it.
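The attribute conversion performed by the second-level cache of cluster-A can be sketched as follows. The `ReadSignal` type and the `forward_to_remote_acp` function are hypothetical names introduced only for illustration; the sketch models the control signal as an address plus a single cacheable flag, which is an assumption about its shape.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReadSignal:
    """Toy read control signal: target address plus operation attribute."""
    addr: int
    cacheable: bool


def forward_to_remote_acp(signal):
    """Return the signal to send over the ACP, rewritten to cacheable."""
    if not signal.cacheable:
        return replace(signal, cacheable=True)
    return signal


from_cluster_a = ReadSignal(addr=0x80, cacheable=False)
to_cluster_b = forward_to_remote_acp(from_cluster_a)
```

The original signal is left unchanged, so the conversion stays local to the second-level cache and is invisible to the cluster that issued the read.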

Non-Cacheable reads and writes are less efficient than Cacheable ones. A Non-Cacheable operation usually reads or writes storage such as Double Data Rate (DDR) synchronous dynamic random access memory (SDRAM) or a mailbox (MailBox), which is much slower than accessing the first-level cache of another cluster through the ACP interface.

At this point, the data N that the second-level cache of cluster-A reads through the ACP interface of the first-level cache of cluster-B may be the original data of a task, or result data that has been processed one or more times by cluster-B or other processor groups.

303. The second-level cache of cluster-A provides the read data N to cluster-A for processing.

It can be found that the data N provided by the second-level cache of the cluster-A for the cluster-A is read from the first-level cache of the cluster-B which currently maintains the data N, so that the validity of the data N can be ensured, and the memory consistency during cross-processor group joint processing among a plurality of processor groups is realized.

304. The second-level cache of cluster-A obtains the processing result of cluster-A on the data N (for convenience of description, this result is denoted as data N') and writes the data N' into the first-level cache or the second-level cache of cluster-B through the ACP interface of the first-level cache of cluster-B.

In one application scenario, if cluster-A sets the operation attribute of the control signal for caching the data N' to uncacheable (Non-Cacheable), then the second-level cache of cluster-A, after receiving that control signal, may modify its operation attribute to cacheable (Cacheable) and write the data N' into the first-level cache or the second-level cache of cluster-B through the ACP interface of the first-level cache of cluster-B, so that the data N' is maintained in the first-level cache or the second-level cache of cluster-B for subsequent processing by cluster-B or other processor groups. The data N' may be written into the second-level cache of cluster-B as follows, for example: the data N' is first written into the first-level cache of cluster-B, the second-level cache of cluster-B then reads the data N' from that first-level cache, and the cached data N' is deleted from the first-level cache of cluster-B.
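The three-step way of placing the data N' in the second-level cache of cluster-B can be sketched as follows. The function name and the use of plain dictionaries for the two caches are assumptions made for illustration only.

```python
def write_result_to_remote_l2(remote_l1, remote_l2, addr, value):
    """Place a result in the remote L2 by way of the remote L1."""
    remote_l1[addr] = value             # 1. write into L1 through its ACP
    remote_l2[addr] = remote_l1[addr]   # 2. L2 reads the data from L1
    del remote_l1[addr]                 # 3. delete the cached copy from L1


b_l1, b_l2 = {}, {}
write_result_to_remote_l2(b_l1, b_l2, 0x40, "N'")
```

After the call, only the second-level cache of cluster-B holds the result, matching the sequence described above.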

In addition, cluster-A may cache the data N' obtained by processing the data N directly in its own first-level cache, or the second-level cache of cluster-A may cache the data N' directly in itself after obtaining it; in either case the data N' is maintained by cluster-A. Afterwards, if cluster-B (or another processor group) needs to process the data N' maintained by cluster-A, the second-level cache of cluster-B may, in a similar manner, read the data N' from the first-level cache of cluster-A through the ACP interface of the first-level cache of cluster-A and provide it to cluster-B for processing.

In addition, if, for example, the data N' is already the final processing result of the corresponding task, or no subsequent processor group needs to reuse or process the data N', the second-level cache of cluster-A may cache the data N' directly in itself, or further write it into the DDR, without writing the data N' into the first-level cache or the second-level cache of cluster-B through the ACP interface of the first-level cache of cluster-B.

It can be understood that if cluster-B needs to process data currently maintained by cluster-A, the second-level cache of cluster-B may, in a similar manner, read the data from the first-level cache of cluster-A through the ACP interface of the first-level cache of cluster-A and provide it to cluster-B for processing. The process may be the same as that of the second-level cache of cluster-A and is not described here again.

Next, as shown in FIG. 3-d, message data N1 input through a network port is, for example, cached in the first-level cache or the second-level cache of cluster-A through the ACP interface of the first-level cache of cluster-A. When cluster-B needs to process the message data N1, the second-level cache of cluster-B may read the message data N1 from the first-level cache of cluster-A through the ACP interface of the first-level cache of cluster-A and provide it to cluster-B for processing. To further improve access efficiency, the second-level cache of cluster-B may set the operation attribute of the control signal for reading the message data N1 to Cacheable. After the message data N1 is processed, the processing result is updated to the first-level cache or the second-level cache of cluster-A through the ACP interface of the first-level cache of cluster-A. If cluster-B has set the operation attribute of the control signal for caching the processing result of the message data N1 to Cacheable, a synchronization operation is further performed to clear the processing result of the message data N1 from the first-level cache of cluster-B.

As can be seen from the above, in this embodiment, if multiple processor groups in the multi-core processing device need to perform joint processing across processor groups, then when cluster-A needs to process data currently maintained by cluster-B, the data is read from the first-level cache of cluster-B through the ACP interface of that first-level cache and is then processed. Because the data is read from the first-level cache of cluster-B, which currently maintains it, the validity of the data is ensured. This processing mode ensures memory consistency among the multiple processor groups and in turn makes joint work of more than 4 cores possible.

Furthermore, the second-level cache of a cluster can further improve access efficiency by flexibly converting the operation attribute of the control signals for reading and writing data. All of this configuration can be completed by low-level software and is completely transparent to upper-layer applications. When a pipeline is configured to execute across clusters, resources such as ACP interface bandwidth can be allocated more flexibly and reasonably according to the occupancy of the second-level caches, further improving the operating efficiency of the multi-core processing device.

Example three

This embodiment again takes as an example a multi-core processing device based on the ARM Cortex-A9 architecture that includes at least a first processor group and a second processor group working jointly, and describes how multi-core memory consistency is implemented for the first processor group; the function is implemented mainly by the second-level cache of the first processor group. In this embodiment, a shared cache is added to the multi-core processing device, the ACP interface of the first-level cache of each processor group is connected to the shared cache, the second-level cache of each processor group is connected to the shared cache, and the shared cache is used to implement multi-core memory consistency. Referring to FIG. 4, another embodiment of the method for implementing multi-core memory consistency in an embodiment of the present invention may include:

410. A second-level cache of the first processor group receives a control signal for reading second data by the first processor group.

In one application scenario, when the first processor group needs to process the second data, the first processor group may issue a control signal for reading the second data; here, for example, the second data is currently maintained in the shared cache.

420. If the second data is currently maintained in the shared cache, reading the second data from the shared cache, wherein a fast consistency interface of a first-level cache of the first processor group is connected to the shared cache, and a second-level cache of the first processor group is connected to the shared cache;

430. providing the read second data to the first processor group for processing through the second-level cache of the first processor group.
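The read path described by the steps above can be sketched as a small software model. This is only an illustrative sketch under stated assumptions, not the hardware implementation: the class names `SharedCache` and `L2Cache`, the dictionary-backed storage, and the fallback to a backing memory when the shared cache does not hold the data are all hypothetical and do not come from the embodiment.

```python
class SharedCache:
    """Hypothetical model of the added shared cache."""

    def __init__(self):
        self.lines = {}  # address -> data currently maintained in the shared cache

    def holds(self, addr):
        return addr in self.lines

    def read(self, addr):
        return self.lines[addr]


class L2Cache:
    """Hypothetical model of a processor group's second-level cache."""

    def __init__(self, shared, memory):
        self.shared = shared  # reached over the M-F interface in the text
        self.memory = memory  # assumed fallback, e.g. memory behind the system bus

    def read(self, addr):
        # The read control signal issued by the processor group arrives here.
        if self.shared.holds(addr):
            # The data is currently maintained in the shared cache: read it there.
            data = self.shared.read(addr)
        else:
            # Assumed fallback path; the steps above do not describe this case.
            data = self.memory[addr]
        # Provide the data to the processor group for processing.
        return data


shared = SharedCache()
shared.lines[0x100] = "second data"
l2_first_group = L2Cache(shared, memory={0x200: "other data"})
print(l2_first_group.read(0x100))
```

The point of the model is that the decision is taken in the second-level cache, so the processor group itself only issues an ordinary read.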

In practical application, the second level cache of the cluster can be provided with an AXI bus interface and an address filtering M-F interface, for example. The second-level cache can be connected with a system bus through an AXI bus interface and can also be connected with other equipment through an M-F interface.

For example, the second-level cache of the first processor group is directly connected to the shared cache through the M-F interface, and the ACP interface of the first-level cache of the first processor group is directly connected to the shared cache; alternatively, both are connected to the shared cache through a bus (that is, the ACP interface of the first-level cache of the first processor group, the M-F interface of the second-level cache, and the corresponding interfaces of the shared cache are each connected to the bus). It can be understood that, under this bus architecture, the ACP interfaces of the first-level caches and the M-F interfaces of the second-level caches of one or more other processor groups may also be connected to the bus, so that the plurality of processor groups and the shared cache can communicate with one another.

At this time, the second data may be read from the shared cache through the M-F interface of the second level cache of the first processor group, and the read second data may be original data or result data after being processed by the second processor group or other processor groups one or more times. The data in the first-level cache of the first processor group or the second processor group can be written into the shared cache through the ACP interface of the first-level cache of the first processor group or the second processor group.

Further, after the first processor group processes the second data to obtain the processing result of the second data, the processing result of the second data by the first processor group may be written into the shared cache through the fast coherency interface of the first-level cache of the first processor group. Or, a processing result of the first processor group on the second data may also be obtained; and the processing result of the first processor group on the second data is cached in the second-level cache of the first processor group (for example, in the case that the processing result of the first processor group on the second data is already a final processing result of the corresponding task, or each subsequent processor group does not need to reuse or process the processing result of the second data).

Of course, if the second processor group also needs to participate in the cross-processor group processing, the M-F interface of the second-level cache of the second processor group and the ACP interface of the first-level cache of the second processor group may also be connected to the shared cache, either directly or through a bus (that is, the ACP interface of the first-level cache of the second processor group, the M-F interface of the second-level cache, and the corresponding interfaces of the shared cache are each connected to the bus). The function of implementing the multi-core memory consistency of the second processor group may also be implemented by the second-level cache of the second processor group, and the implementation process may be similar to that of the second-level cache of the first processor group, and is not described herein again.

As can be seen from the above, in this embodiment, if a plurality of processor groups in the system need to perform joint processing across the processor groups, a shared cache is additionally provided, and the ACP interface and the second level cache of the first level cache of each processor group are connected to the additionally provided shared cache, and when it is found that, for example, a first processor group needs to process a certain data currently maintained in the shared cache, the data is read from the shared cache and provided to the first processor group for processing through the second level cache of the first processor group. Because the data needing cross-processor group combined processing can be maintained in the shared cache, each processor group obtains the data of the type from the shared cache for processing, the processing mode can ensure the memory consistency among a plurality of processor groups, and further can realize the combined work of more than 4 cores.

Example four

In order to better understand the scheme of the embodiment of the invention, the following description takes as an example cross-cluster processing in a multi-core processing device that is again based on the ARM Cortex-A9 architecture and includes at least cluster-A and cluster-B working jointly.

The connection architecture of cluster-A and cluster-B of the multi-core processing device in this embodiment may be as shown in fig. 5-a: the SCU of cluster-A is connected to its associated second-level cache, the AXI bus interface of the second-level cache of cluster-A is connected to the system bus, and the ACP interface of the first-level cache of cluster-A and the M-F interface of its second-level cache are directly connected to the shared cache; the SCU of cluster-B is connected to its associated second-level cache, the AXI bus interface of the second-level cache of cluster-B is connected to the system bus, and the ACP interface of the first-level cache of cluster-B and the M-F interface of its second-level cache are directly connected to the shared cache. Alternatively, as shown in fig. 5-b, the ACP interface of the first-level cache and the M-F interface of the second-level cache of cluster-A are connected to a bus; the ACP interface of the first-level cache and the M-F interface of the second-level cache of cluster-B are connected to the bus; and the shared cache is also connected to the bus. That is, the ACP interface of the first-level cache and the M-F interface of the second-level cache of cluster-A are connected to the shared cache through the bus, and the ACP interface of the first-level cache and the M-F interface of the second-level cache of cluster-B are also connected to the shared cache through the bus. It can be seen that the multi-core processing device can be further extended to connect more clusters based on the architecture shown in fig. 5-b.

Taking the processing of the data M as an example, referring to fig. 5-c, another embodiment of the method for implementing multi-core memory consistency in the embodiment of the present invention may include:

501. when the cluster-A needs to process the data M, the cluster-A can send out a control signal for reading the data M; wherein, it is assumed here that data M is currently maintained in the shared cache;

502. and the second-level cache of the cluster-A receives a control signal for reading the data M by the cluster-A, and if the second-level cache of the cluster-A finds that the data M is currently maintained in the shared cache, the second-level cache of the cluster-A reads the data M from the shared cache and provides the data M for the cluster-A to process through the second-level cache of the cluster-A.

At this time, the data M read from the shared cache by the second-level cache of cluster-A may be original data of a certain task, or result data that was written into the shared cache through the ACP interface of the first-level cache of cluster-B or of another processor group after being processed one or more times by that processor group.

503. The second-level cache of cluster-A uses a control signal to cause the shared cache to read, through the ACP interface of the first-level cache of cluster-A, the processing result of the data M from the first-level cache of cluster-A and to cache it (for convenience of description, the processing result of the data M by cluster-A is referred to as data M'), so that it can be used for subsequent processing by cluster-B or other processor groups.

In addition, in the case that, for example, the data M' is already the final processing result of the corresponding task, or no subsequent processor group needs to reuse or process the data M', the second-level cache of cluster-A may also cache the data M' directly in the second-level cache of cluster-A, or further write it into the DDR, without writing the data M' into the shared cache through the ACP interface of the first-level cache of cluster-A.

It can be understood that, if cluster-B needs to process certain data currently maintained in the shared cache, the second-level cache of cluster-B may read the data from the shared cache in a similar manner and provide the data to cluster-B for processing through the second-level cache of cluster-B. The result of processing the data by cluster-B may then be written into the shared cache through the ACP interface of the first-level cache of cluster-B, or be cached in the second-level cache of cluster-B. The processing process may be the same as that of the second-level cache of cluster-A, and is not described herein again.
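The cross-cluster flow of this embodiment can be illustrated with the following minimal sketch, in which an intermediate result is placed in the shared cache for the next cluster while a final result stays in the producing cluster's second-level cache. The `Cluster` class, its `process` method, the `needed_later` flag and the arithmetic used as "processing" are hypothetical names and choices made only for illustration; the real mechanism is hardware signalling over the M-F and ACP interfaces.

```python
class SharedCache:
    """Hypothetical model of the shared cache connected to every cluster."""

    def __init__(self):
        self.lines = {}


class Cluster:
    """Hypothetical cluster with a first-level cache, a second-level cache and some work to do."""

    def __init__(self, name, shared, work):
        self.name = name
        self.shared = shared
        self.work = work  # stand-in for the cluster's processing of the data
        self.l1 = {}
        self.l2 = {}

    def process(self, key, result_key, needed_later):
        # The second-level cache reads the input from the shared cache.
        data = self.shared.lines[key]
        # The cluster processes it; the result first lands in its own first-level cache.
        self.l1[result_key] = self.work(data)
        if needed_later:
            # The shared cache obtains the result from this cluster's first-level
            # cache (ACP path in the text) so that other clusters can use it.
            self.shared.lines[result_key] = self.l1[result_key]
        else:
            # Final result: keep it in this cluster's second-level cache only.
            self.l2[result_key] = self.l1[result_key]
        return self.l1[result_key]


shared = SharedCache()
shared.lines["M"] = 3
cluster_a = Cluster("cluster-A", shared, work=lambda x: x + 1)
cluster_b = Cluster("cluster-B", shared, work=lambda x: x * 10)

cluster_a.process("M", "M'", needed_later=True)     # M' goes to the shared cache
cluster_b.process("M'", "M''", needed_later=False)  # final result stays in cluster-B's L2
```

In this sketch cluster-B never touches the caches of cluster-A; both clusters exchange data only through the shared cache, which is what allows further clusters to be added in the same way.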

As can be seen from the above, in this embodiment, if a plurality of processor groups in the multi-core processing device need to perform joint processing across the processor groups, a shared cache is added, and the ACP interface of the first-level cache and the second-level cache of each processor group are connected to the added shared cache. When it is found that, for example, cluster-A needs to process certain data currently maintained in the shared cache, the data is read from the shared cache and provided to cluster-A for processing through the second-level cache of cluster-A. Because the data needing cross-processor group joint processing can be maintained in the shared cache, and each processor group obtains data of this type from the shared cache for processing, this processing mode can ensure memory consistency among a plurality of processor groups, and further can realize the joint work of more than 4 cores.

Furthermore, access efficiency can be improved by accessing data through the shared cache, which also helps to reduce the physical area of the second-level cache. Moreover, all the configuration can be completed by low-level software and can be completely transparent to upper-layer applications. Pipelined execution across clusters can also be configured, and resources such as ACP interface bandwidth can be allocated more flexibly and reasonably according to the occupancy of the second-level cache, thereby further improving the operating efficiency of the multi-core processing device.

In order to better implement the technical solution of the embodiment of the present invention, an embodiment of the present invention further provides a multi-core processing device.

Example five

Referring to fig. 6-a and 6-b, a multi-core processing device 600 provided by an embodiment of the present invention may include:

a first processor group 610, a second processor group 620, a second level cache 612 of the first processor group, and a second level cache 622 of the second processor group; wherein the second level cache 612 of the first processor group is connected to the fast coherency interface of the first level cache of the second processor group 620. Specifically, the second level cache 612 of the first processor group and the fast coherency interface of the first level cache of the second processor group 620 may be directly connected; or they may be connected through a bus, i.e., the second level cache 612 of the first processor group and the fast coherency interface of the first level cache of the second processor group 620 are each connected to the bus.

The second level cache 612 of the first processor group is configured to receive a control signal for the first processor group 610 to read the first data; if the first data is currently maintained by the second processor group, the first data is read from the first level cache of the second processor group 620 through the fast coherency interface of the first level cache of the second processor group 620; the read first data is provided to the first processor group 610 for processing via the second level cache 612 of the first processor group.

In an application scenario, the second-level cache 612 of the first processor group may further be configured to obtain a processing result of the first processor group 610 on the first data; writing a processing result of the first data by the first processor group 610 into the first level cache or the second level cache of the second processor group 620 through a fast consistency interface of the first level cache of the second processor group 620; alternatively, the result of the processing of the first data by the first processor group is cached in the second level cache 612 of the first processor group.

In one application scenario, the second level cache 612 of the first processor group may include:

a first receiving module, configured to receive a control signal for the first processor group 610 to read the first data, where the operation attribute of the control signal for the first processor group 610 to read the first data is non-cacheable;

a first reading module, configured to send, through a fast coherency interface of a first-level cache of the second processor group 620, a control signal for reading the first data, an operation attribute of which is cacheable, to the first-level cache of the second processor group 620, and read the first data in the first-level cache of the second processor group 620 when the first data is currently maintained by the second processor group 620;

the first providing module is configured to provide the first data read by the first reading module to the first processor group 610 through the second level cache 612 of the first processor group for processing.
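The cooperation of the three modules above, in particular the conversion of the operation attribute from non-cacheable to cacheable, can be sketched as follows. This is a simplified model under explicit assumptions: `ReadRequest`, `PeerL1`, `acp_read` and `L2OfFirstGroup` are hypothetical names, and the rule that only a cacheable request looks up the peer first-level cache is an assumption of the sketch used to show why the conversion matters.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReadRequest:
    addr: int
    cacheable: bool  # operation attribute of the read control signal


class PeerL1:
    """Hypothetical first-level cache of the second processor group, reached via its ACP interface."""

    def __init__(self, lines):
        self.lines = lines

    def acp_read(self, request):
        # Assumption of this sketch: only a cacheable request is looked up
        # in the first-level cache contents.
        if request.cacheable and request.addr in self.lines:
            return self.lines[request.addr]
        return None


class L2OfFirstGroup:
    """Hypothetical second-level cache of the first processor group."""

    def __init__(self, peer_l1):
        self.peer_l1 = peer_l1

    def handle(self, request):
        # Receiving module: the first processor group issues the read as non-cacheable.
        # Reading module: re-issue it to the peer first-level cache with the
        # operation attribute converted to cacheable.
        converted = replace(request, cacheable=True)
        data = self.peer_l1.acp_read(converted)
        # Providing module: hand the data back to the first processor group.
        return data


peer = PeerL1({0x40: "first data"})
l2 = L2OfFirstGroup(peer)
result = l2.handle(ReadRequest(addr=0x40, cacheable=False))
```

With the attribute left as non-cacheable, the same request sent directly to `acp_read` would miss in this model, which is the situation the conversion in the reading module is meant to avoid.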

In an application scenario, the second level cache 622 of the second processor group is connected (directly connected or connected via a bus) to the fast coherency interface of the first level cache of the first processor group 610;

the second level cache 622 of the second processor group is configured to receive a control signal for the second processor group 620 to read the third data; if the third data is currently maintained by the first processor group 610, the third data is read from the first level cache of the first processor group 610 through the fast coherency interface of the first level cache of the first processor group 610; the read third data is provided to the second processor group 620 for processing through the second processor group's level two cache 622.

In an application scenario, the second-level cache 622 of the second processor group may further be configured to obtain a processing result of the second processor group 620 on the third data; writing the processing result of the second processor group 620 on the third data into the first level cache or the second level cache of the first processor group 610 through the fast consistency interface of the first level cache of the first processor group 610; alternatively, the result of the processing of the third data by the second processor group 620 is cached in the second level cache 622 of the second processor group.

In one application scenario, the second level cache 622 of the second processor group may include:

a second receiving module, configured to receive a control signal for the second processor group 620 to read the third data, where the operation attribute of the control signal for the second processor group 620 to read the third data is non-cacheable;

a second reading module, configured to send, through a fast coherency interface of a first-level cache of the first processor group 610, a control signal for reading the third data, an operation attribute of which is cacheable, to the first-level cache of the first processor group 610, and read the third data in the first-level cache of the first processor group 610 when the third data is currently maintained by the first processor group 610;

and a second providing module, configured to provide the third data read by the second reading module to the second processor group 620 for processing through the second level cache 622 of the second processor group.

It is to be understood that the functions of each component of the multi-core processing device 600 in this embodiment may be specifically implemented according to the method in the foregoing embodiment, and are not described herein again. The multi-core processing device 600 in this embodiment may be a network device such as a router and a switch.

As can be seen from the above, in this embodiment, if cross-processor group joint processing is required among multiple processor groups in the multi-core processing device 600, when the second-level cache of the first processor group finds that the first processor group needs to process a certain data currently maintained by the second processor group, the ACP interface of the first-level cache of the second processor group is used to read the data from the first-level cache of the second processor group for processing, and since the data is read from the first-level cache of the second processor group currently maintaining the data, validity of the data can be ensured, the processing mode can ensure memory consistency among the multiple processor groups, and further can implement joint work of more than 4 cores.

Example six

Referring to fig. 7-a and 7-b, a multi-core processing device 700 provided by an embodiment of the present invention may include:

a first processor group 710, a second processor group 720, a shared cache 730, a second level cache 712 of the first processor group, and a second level cache 722 of the second processor group;

the second level cache 712 of the first processor group and the fast coherency interface of the first level cache of the first processor group 710 are connected to the shared cache 730. Specifically, the second-level cache 712 of the first processor group and the fast coherency interface of the first-level cache of the first processor group 710 may be directly connected to the shared cache 730, respectively; or to the shared cache 730 via a bus, i.e., the second level cache 712 of the first processor group, the fast coherency interface of the first level cache of the first processor group 710, the shared cache 730 are coupled to the bus, respectively.

The second level cache 712 of the first processor group is configured to receive a control signal for the first processor group 710 to read the second data; if the second data is currently maintained in the shared cache 730, reading the second data from the shared cache 730; the read second data is provided to the first processor group 710 for processing through the second level cache 712 of the first processor group.

In an application scenario, the second-level cache 712 of the first processor group may further be configured to write a processing result of the second data by the first processor group 710 into the shared cache 730 through a fast coherency interface of the first-level cache of the first processor group 710; or, acquiring a processing result of the first processor group 710 on the second data; the results of the processing of the second data by the first processor group 710 are cached in the second level cache 712 of the first processor group.

In an application scenario, the fast coherency interfaces of the second level cache 722 of the second processor group and the first level cache of the second processor group 720 are connected (directly connected or connected through a bus) to the shared cache 730;

the second level cache 722 of the second processor group is configured to receive a control signal for the second processor group 720 to read the fourth data; if the fourth data is currently maintained in the shared cache 730, reading the fourth data from the shared cache 730; the read fourth data is provided to the second processor group 720 for processing through the second level cache 722 of the second processor group.

In an application scenario, the second-level cache 722 of the second processor group may further be configured to write, through a fast coherency interface of the first-level cache of the second processor group 720, a processing result of the fourth data by the second processor group 720 into the shared cache 730; or, acquiring a processing result of the second processor group 720 on the fourth data; the result of the processing of the fourth data by the second processor group 720 is cached in the second level cache 722 of the second processor group.

As can be seen from the above, in this embodiment, if it is necessary to perform joint processing across processor groups among multiple processor groups in the multi-core processing device 700, a shared cache is additionally provided, and an ACP interface and a second level cache of a first level cache of each processor group are connected to the additionally provided shared cache, and when it is found that, for example, a first processor group needs to process a certain data currently maintained in the shared cache, the data is read from the shared cache and provided to the first processor group for processing through the second level cache of the first processor group. Because the data needing cross-processor group combined processing can be maintained in the shared cache, each processor group obtains the data of the type from the shared cache for processing, the processing mode can ensure the memory consistency among a plurality of processor groups, and further can realize the combined work of more than 4 cores.

It is understood that the functions of the components of the multi-core processing device 700 in this embodiment may be specifically implemented according to the method in the foregoing embodiment, and are not described here again. The multi-core processing device 700 in this embodiment may be a network device such as a router or a switch.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

To sum up, in a scheme of the embodiment of the present invention, if cross-processor group joint processing is required among multiple processor groups in a system, when it is found that a first processor group needs to process a certain data currently maintained by a second processor group, the ACP interface of the first-level cache of the second processor group is used to read the data from the first-level cache of the second processor group for processing, and since the data is read from the first-level cache of the second processor group currently maintaining the data, validity of the data can be ensured, and the processing mode can ensure memory consistency among the multiple processor groups.

In another scheme of the embodiment of the present invention, if a plurality of processor groups in a system need to perform joint processing across the processor groups, a shared cache is added, an ACP interface and a second level cache of a first level cache of each processor group are connected to the added shared cache, and when it is found that, for example, a first processor group needs to process a certain data currently maintained in the shared cache, the data is read from the shared cache and provided to the first processor group for processing through the second level cache of the first processor group. Because the data needing to be jointly processed across the processor groups can be maintained in the shared cache, each processor group acquires the type of data from the shared cache for processing, and the processing mode can ensure the memory consistency among a plurality of processor groups.

Furthermore, access efficiency can be improved by accessing data through the shared cache, which also helps to reduce the physical area of the second-level cache. Moreover, all the configuration can be completed by low-level software and can be completely transparent to upper-layer applications. Pipelined execution across clusters can also be configured, and resources such as ACP interface bandwidth can be allocated more flexibly and reasonably according to the occupancy of the second-level cache, thereby further improving the operating efficiency of the multi-core processing device.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.

The method and the apparatus for implementing multi-core memory consistency provided by the embodiment of the present invention are described in detail above, and a specific example is applied in the present disclosure to explain the principle and the implementation manner of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for implementing multi-core memory consistency is characterized by comprising the following steps:

providing the read first data to a first processor group for processing through a second-level cache of the first processor group;

acquiring a processing result of the first processor group on the first data;

writing a processing result of the first data by the first processor group into a first-level cache or a second-level cache of the second processor group through a fast consistency interface of the first-level cache of the second processor group; or,

and caching a processing result of the first processor group on the first data in a second-level cache of the first processor group.

2. The method of claim 1,

the receiving a control signal for the first processor group to read the first data includes:

receiving a control signal for reading first data by a first processor group with an operation attribute of non-cacheability;

the reading the first data in the first-level cache of the second processor group through the fast coherency interface of the first-level cache of the second processor group includes:

and sending a control signal of reading the first data with the cacheable operation attribute to the first-level cache of the second processor group through a fast consistency interface of the first-level cache of the second processor group, and reading the first data in the first-level cache of the second processor group.

3. A method for implementing multi-core memory consistency is characterized by comprising the following steps:

providing the read second data to the first processor group for processing through a second-level cache of the first processor group;

writing a processing result of the first processor group on the second data into the shared cache through a fast consistency interface of a first-level cache of the first processor group;

or,

acquiring a processing result of the first processor group on the second data; and caching the processing result of the first processor group on the second data in a second-level cache of the first processor group.

4. A multi-core processing device, comprising:

the second-level cache of the first processor group is used for receiving a control signal for reading the first data by the first processor group; if the first data is currently maintained by the second processor group, reading the first data in a first-level cache of the second processor group through a fast consistency interface of the first-level cache of the second processor group; providing the read first data to a first processor group for processing through a second-level cache of the first processor group;

the second-level cache of the first processor group is also used for acquiring the processing result of the first processor group on the first data; writing a processing result of the first data by the first processor group into a first-level cache or a second-level cache of the second processor group through a fast consistency interface of the first-level cache of the second processor group; or caching the processing result of the first processor group on the first data in a second-level cache of the first processor group.

5. The multi-core processing device of claim 4, wherein:

the second level cache of the first processor group comprises:

the first receiving module is used for receiving a control signal of a first processor group for reading first data, and the operation attribute of the control signal of the first processor group for reading the first data is that the control signal is not cacheable;

a first reading module, configured to send a control signal for reading the first data with an operation attribute of cacheable to a first-level cache of the second processor group through a fast coherency interface of the first-level cache of the second processor group when the first data is currently maintained by the second processor group, and read the first data in the first-level cache of the second processor group;

and the first providing module is used for providing the first data read by the first reading module to the first processor group for processing through a second-level cache of the first processor group.

6. A multi-core processing device, comprising:

the second-level cache of the first processor group is used for receiving a control signal for reading the second data by the first processor group; if the second data is currently maintained in the shared cache, reading the second data from the shared cache; providing the read second data to the first processor group for processing through a second-level cache of the first processor group;

the second-level cache of the first processor group is also used for writing the processing result of the first processor group to the second data into the shared cache through the fast consistency interface of the first-level cache of the first processor group; or acquiring a processing result of the first processor group on the second data; and caching the processing result of the first processor group on the second data in a second-level cache of the first processor group.