CN111459872B - Quick inter-core data synchronization method for multi-core parallel computing - Google Patents


Publication number
CN111459872B
CN111459872B (application CN202010324853.6A)
Authority
CN
China
Prior art keywords
data
core
buffer
indicator
inter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010324853.6A
Other languages
Chinese (zh)
Other versions
CN111459872A (en)
Inventor
王旭
陈南希
张晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202010324853.6A
Publication of CN111459872A
Application granted
Publication of CN111459872B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G06F 15/167: Interprocessor communication using a common memory, e.g. mailbox
    • G06F 15/177: Initialisation or configuration control
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a fast inter-core data synchronization method for multi-core parallel computing, comprising the following steps: a buffer and a buffer data structure are configured between every two directly connected cores, and a buffer write indicator and a buffer read indicator are set in each buffer data structure; the buffer write indicator and buffer read indicator are initialized; all cores simultaneously execute the steps of reading inter-core data and writing inter-core data; and the reading and writing steps are repeated. Because the method configures both a buffer write indicator and a buffer read indicator, it reduces the number of operations on the prefetcher and the data cache, thereby improving the efficiency of the multi-core parallel computing scheduling method while still guaranteeing correct data transfer.

Description

Quick inter-core data synchronization method for multi-core parallel computing
Technical Field
The invention belongs to the field of parallel computing, and particularly relates to a fast inter-core data synchronization method for multi-core parallel computing.
Background
DSPs (digital signal processors) are widely used in image processing, consumer electronics, instrumentation, industrial control, automotive, medical, and other fields because of their strong computing power. To improve performance, DSPs often have multiple processor cores; for example, the C66x family of DSPs from Texas Instruments (TI) can be configured with up to 8 cores. In order to fully exploit the performance of a multi-core DSP, computing tasks must be distributed over multiple processor cores for parallel computing. The most popular DSP multi-core parallel computing method at present is TI's SYS/BIOS operating system, which is extensively adapted and packaged to the hardware characteristics of the DSP so that a user can easily convert a single-core program into a multi-core program.
Some application fields are sensitive to the time taken to process data and thus place strict requirements on system latency. Since an operating system runs multiple threads at the same time, it is difficult to precisely control the running time of a computing task on the DSP.
In multi-core parallel computing, different steps of one computing task may be assigned to different processor cores. The output of one core then becomes the input of the next. In order to transfer data from the previous core to the next, a buffer needs to be opened up in memory. The former core writes its partially processed data into the buffer, and the latter core takes the data out and carries out the next processing step. In a C66x-series DSP, each core has its own data cache and prefetcher, which are briefly described here. A DSP core typically runs much faster than the memory. Therefore, a high-speed data cache is built into the DSP core, and data frequently used by the core is kept close by in this cache. For example, when data is read from memory, the data is loaded into the DSP core and a copy is stored in the data cache. If the DSP core reads this data again shortly thereafter, it does not need to access the slow memory but can read directly from the data cache. Writing data to memory is cached similarly. The prefetcher goes one step beyond the data cache. Each DSP core has its own separate prefetcher; after one piece of data is read, the prefetcher predicts which memory regions will be read next and fetches them in advance. If the prediction is correct, the DSP core does not have to access the slow memory. If the prediction is wrong, the prefetched data is simply discarded.
The data cache and prefetcher of the DSP core create a data consistency problem when the DSP core reads and writes data: they speed up the core's access to memory, but at the same time the data in memory can become inconsistent with the data inside the DSP core (that is, the data in the data cache or prefetcher), so that data is not updated in time and may not be transferred correctly from one core to the other. "Not updated in time" has two meanings here: 1) when the DSP core reads data from memory, the data in memory may already have been updated, but the core does not know this and still uses the data in its data cache or prefetcher; this is the case where the data in memory cannot reach the DSP core in time, i.e., when the latter core reads data, part of it may come from the data cache or prefetcher rather than from memory; 2) when the DSP core writes data to memory, the computation result may be written only into the data cache and not into memory; this is the case where the data in the DSP core cannot reach memory in time, i.e., when the former core writes data, part of it may land only in the data cache and never actually reach memory. In the write case, the hardware nature of the data cache makes it possible for a result to travel from the DSP core only as far as the data cache and no further; this behavior is determined by the cache's hardware architecture and is not controlled by the DSP core's instructions. All of these conditions can prevent the data of the previous core from being transferred properly to the next core.
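As a toy illustration of this staleness problem (not a model of the actual TI hardware; all names here are invented), a cached read can be sketched as a core holding a private copy of a memory word that stays stale until it is explicitly invalidated:

```c
#include <assert.h>

/* Toy model: 'memory' is shared; the core holds a cached copy plus a
   valid flag. Reading through the cache returns the old copy until the
   core explicitly invalidates it. Names are invented for illustration. */
typedef struct { int copy; int valid; } toy_cache_t;

static int toy_read(const int *memory, toy_cache_t *c) {
    if (!c->valid) { c->copy = *memory; c->valid = 1; }  /* miss: fetch from memory */
    return c->copy;                                      /* hit: possibly stale */
}

static void toy_invalidate(toy_cache_t *c) { c->valid = 0; }
```

On real hardware the invalidation is performed with cache-control processor instructions rather than a flag, but the failure mode is the same: without the invalidate step, the second read returns the stale copy.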
A single-core program does not need to consider inter-core data consistency, so inter-core data synchronization is a problem specific to multi-core parallel computing, and the computing resources it consumes are pure overhead introduced by parallelization. This overhead reduces the operating efficiency of the parallel computing scheduling method and thus affects the performance of multi-core parallel computing.
Simply shutting down the data cache and prefetcher would cause a dramatic drop in the efficiency with which the DSP core processes data. It is therefore preferable that the DSP core keep using the data cache and prefetcher while processing data, and that the multi-core parallel computing scheduling method use processor instructions to explicitly maintain consistency between the data in the DSP core and the data in memory when transferring data from one core to another. Ensuring inter-core data consistency is thus a central problem in multi-core parallel computing and requires a dedicated inter-core data synchronization method.
To solve this problem, patent document No. CN201811305984.9 proposes a DSP multi-core parallel computing scheduling method that does not require an operating system. The scheduling method selects one main core, and then coordinates the operation of the other auxiliary cores by utilizing an inter-core interrupt mechanism of the DSP, so that parallel computation of a plurality of DSP cores is realized under the condition of no operating system. The scheduling method also introduces a special inter-core relation data structure and an inter-core buffer data structure, so that the scheduling method can be applied to various parallel computing models, and is a multi-core parallel computing scheduling method with high universality.
In the process of transferring data from a previous core to a next core, in order to avoid collision of the read-write buffers of two cores at the same time, a ping-pong buffer structure is generally adopted. In the scheduling method proposed in the above patent document CN201811305984.9, the following buffer data structure is opened up and used in the memory:
[Table: prior-art buffer data structure: data pointer 0 | data pointer 1 | buffer data length | buffer read indicator | buffer type]
Here, data pointer 0 and data pointer 1 point to two memory regions of the same size. One region stores data for reading and the other stores data for writing. The buffer data length indicates the size of each memory region. In some cases in a multi-core parallel computing model (for example, when a buffer stores the final processed data), no ping-pong structure is required, so the buffer type is needed to indicate whether the current buffer is a ping-pong buffer.
In many application areas, data processing is periodic. While the multi-core parallel computing scheduling method runs, the roles of the two memory regions of the ping-pong buffer are switched back and forth. Suppose core 1 needs to pass data to core 2. In the n-th data processing cycle, data pointer 0 points to the memory region from which core 2 reads data, and data pointer 1 points to the memory region into which core 1 writes data. In the (n+1)-th data processing cycle, core 2 needs to read the latest data from the memory region pointed to by data pointer 1, while the data in the region pointed to by data pointer 0 has already been processed by core 2 in the previous cycle, so core 1 can write new data into that region. Therefore, a buffer read indicator must be set to indicate which memory region is currently used for reading data; from it, one can directly infer that the other region is being used for writing.
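This prior-art bookkeeping can be sketched as follows; the field names mirror the table entries but are otherwise invented, and the struct is an illustrative reconstruction rather than code from CN201811305984.9:

```c
#include <assert.h>

/* Prior-art ping-pong bookkeeping: only the read indicator is stored,
   and the write region is inferred as "the other one". */
typedef struct {
    int     *data_ptr[2];    /* data pointer 0 / data pointer 1 */
    unsigned data_len;       /* buffer data length (per region) */
    int      read_indicator; /* which region holds this cycle's input */
    int      is_ping_pong;   /* buffer type */
} prior_buffer_t;

static int *read_region(prior_buffer_t *b)  { return b->data_ptr[b->read_indicator]; }
static int *write_region(prior_buffer_t *b) { return b->data_ptr[1 - b->read_indicator]; }
```

Inverting `read_indicator` each cycle swaps the roles of the two regions, which is exactly the role switching described above.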
According to the DSP specification document of TI company, in order to ensure data consistency, when the core 2 reads data from the memory, it first uses the processor instruction to discard the data in the prefetcher and the old data in the data cache, and then reads the latest data from the memory. After the core 1 writes the data into the memory, it needs to perform a write-back operation on the data cache by using the processor instruction.
Beyond operating on the memory regions pointed to by data pointer 0 and data pointer 1, the multi-core parallel computing scheduling method must also operate on the prefetcher and data cache strictly according to the DSP documentation. The data consistency problem described above also applies to the buffer read indicator itself, whose value is updated continuously while the scheduling method runs. Therefore, before core 1 or core 2 reads the value of the buffer read indicator, it must discard the data in the prefetcher and the old data in the data cache with processor instructions (here "old data" means all data possibly held in the data cache), and only then read the buffer read indicator from memory. After modifying the buffer read indicator, a write-back operation must be performed on the data cache with processor instructions.
This means that while transferring data from core 1 to core 2, the multi-core parallel computing scheduling method discards the data in the prefetcher and the old data in the data cache multiple times, and performs write-back operations on the data cache multiple times. All of these prefetcher and data-cache operations are overhead introduced by the scheduling method; this overhead reduces its operating efficiency and affects the performance of multi-core parallel computing.
In summary, the scheduling method disclosed in CN201811305984.9 transfers the data of the previous core to the next core correctly, thereby solving the data consistency problem. Its drawback is that it consumes considerable computing resources: because data in the prefetcher and old data in the data cache are discarded many times, and write-back operations are performed on the data cache many times, the method occupies many clock cycles and runs slowly.
Disclosure of Invention
The invention aims to provide a quick inter-core data synchronization method for multi-core parallel computing, which is used for effectively reducing the additional overhead when maintaining the inter-core data consistency of the multi-core parallel computing.
In order to achieve the above object, the present invention provides a fast inter-core data synchronization method for multi-core parallel computing, which is used for maintaining inter-core data consistency during DSP multi-core parallel computing, and includes:
s1: according to the inter-core connection relation of the multi-core parallel computing model, a buffer area and a buffer area data structure are configured between every two directly connected cores in the inter-core connection relation, and a buffer area writing indicator and a buffer area reading indicator are arranged in each buffer area data structure;
s2: initializing a corresponding buffer write indicator and buffer read indicator;
s3: all cores of the DSP simultaneously execute the steps of reading inter-core data and writing inter-core data;
s4: and entering the next data processing period, and repeating the step S3 until the operation of the multi-core parallel computing model is completed.
In the step S1, the buffer data structure is:
[Table: buffer data structure: data pointer 0 | data pointer 1 | buffer data length | buffer read indicator | buffer write indicator | buffer type]
wherein, the data pointer 0 and the data pointer 1 respectively point to two memory areas of the ping-pong buffer. The buffer data length refers to the length of each memory area. The buffer read indicator is used for indicating which of the two memory areas stores input data in the current data processing period; the buffer write indicator is used to indicate which of the two memory regions holds output data during the current data processing cycle. The buffer type is used to indicate whether the buffer is a ping-pong buffer.
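A minimal sketch of this extended buffer data structure (field names invented for illustration; the patent specifies the fields, not their C layout):

```c
#include <assert.h>

/* Extended buffer data structure: compared with the prior structure, a
   write indicator is added, so the writing core never has to consult
   (and cache-synchronize) the reader's indicator. */
typedef struct {
    int     *data_ptr[2];     /* data pointer 0 / data pointer 1 */
    unsigned data_len;        /* buffer data length */
    int      read_indicator;  /* region holding input this cycle */
    int      write_indicator; /* region receiving output this cycle */
    int      is_ping_pong;    /* buffer type */
} sync_buffer_t;
```

Both indicators are initialized to 0, but on any one buffer the writer activates a cycle before the reader and inverts its indicator at the end of each cycle, so once both cores are active the two indicators designate opposite regions.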
In step S2, each core of the DSP is provided with a cycle counter to count data processing cycles, and the interval in memory between the cycle counters of any two cores is larger than the minimum granularity of the DSP data cache's write-back operation. Step S2 is specifically performed as follows:
For each core of the DSP, according to its cycle counter and the multi-core parallel computing model, judge whether the current core has input data in each data processing cycle. If the current core has no input data in the (N-1)-th data processing cycle but has input data in the N-th data processing cycle, then at the beginning of the N-th cycle the buffer read indicator in the buffer data structure between the current core and its forward core is set to 0, and the buffer write indicator in the buffer data structure between the current core and its backward core is set to 0, where N is a positive integer greater than 1. In addition, if the current core already has input data in the 1st data processing cycle, the same two indicators are set to 0 at the start of that cycle.
In the step S3, the step of reading inter-core data and the step of writing inter-core data are performed simultaneously on different cores.
In the step S3, for each current core, the step of reading inter-core data is as follows:
s31: the current core does not perform any additional operation on the prefetcher and the data cache, and directly reads the buffer read indicator from the buffer data structure;
s32: according to the value of the buffer area reading indicator, a data pointer pointing to a memory area of input data is found;
s33: discarding the data in the prefetcher and the old data in the data cache by using the processor instruction, and then reading the input data through the data pointer in the step S32;
s34: the value of the buffer read indicator is directly inverted without any additional operation on the prefetcher and the data cache, and is used as the value of the buffer read indicator in the next data processing cycle.
For each current core, the step of writing inter-core data is as follows:
s31': the current core does not perform any additional operation on the prefetcher and the data cache, and directly reads the buffer write indicator from the buffer data structure;
s32': finding a pointer pointing to a memory area of output data according to a value of a buffer write indicator;
s33': writing output data through the data pointer, and then executing write-back operation on the data cache by utilizing a processor instruction;
s34': the value of the buffer write indicator is directly inverted without any additional operation on the prefetcher and the data cache, and is used as the value of the buffer write indicator in the next data processing cycle.
By configuring both a buffer write indicator and a buffer read indicator, the fast inter-core data synchronization method for multi-core parallel computing reduces the number of operations on the prefetcher and the data cache. It ensures that data is transferred correctly from the former core to the latter while reducing the clock cycles the transfer requires, thereby improving the efficiency of the multi-core parallel computing scheduling method.
Drawings
FIG. 1 is a general flow chart of the fast inter-core data synchronization method of the present invention.
FIG. 2 is a schematic diagram of a data flow model in a multi-core parallel computing model.
FIG. 3 is a schematic workflow diagram of the steps of the present invention for fast reading inter-core data.
FIG. 4 is a workflow diagram of the fast write inter-core data step of the present invention.
Detailed Description
The invention will be further illustrated with reference to specific examples. It should be understood that the following examples are illustrative of the present invention and are not intended to limit the scope of the present invention.
As shown in fig. 1, the fast inter-core data synchronization method of the present invention is used for maintaining inter-core data consistency during DSP multi-core parallel computing, and is applicable to DSP, for example, C66x series DSP, and is improved based on the multi-core parallel computing scheduling method disclosed in patent document CN201811305984.9, and includes the following steps:
step S1: according to the inter-core connection relation of the multi-core parallel computing model, a buffer area and a buffer area data structure are configured between every two directly connected cores in the inter-core connection relation, and a buffer area writing indicator and a buffer area reading indicator are arranged in each buffer area data structure.
As shown in fig. 2, the DSP has a plurality of cores, distinguished by numerical labels (each block in fig. 2 represents a core). Taking fig. 2 as an example, the DSP has 8 cores, numbered sequentially from 0 and denoted core 0, core 1, core 2, ..., core 7. In this embodiment the multi-core parallel computing model is a data flow model; in other embodiments it may be a different model.
The allocation of the runtime of each core of the DSP is pre-specified in the source code by way of configuration data structures, rather than being dynamically allocated during operation. The correspondingly configured data structures include inter-core connection relationships and buffer data structures.
The inter-core connection relationship is used for describing the inter-core input-output relationship of the DSP so as to adapt to various parallel computing models. In this embodiment, the inter-core connection relationship is shown in fig. 2.
The meaning of the two terms forward core and backward core is consistent with patent document CN 201811305984.9. That is, in the parallel computing model, if the output of a certain core is the input data of the current core, the core is referred to as a forward core of the current core; for example, in the data flow model shown in FIG. 2, core 1's forward core is core 0, while core 0 has no forward cores. If the input of a certain core is the output data of the current core, the core is called a backward core of the current core; for example, in the data flow model shown in FIG. 2, core 1 has a backward core of core 2, while core 7 has no backward core.
In other embodiments, there may be multiple forward cores for one core depending on the parallel computing model chosen. Likewise, there may be multiple backward cores for a core.
The buffer data structures are all configured before the algorithm runs; specifically, they are declared when the DSP program code is written. After the DSP program code is written, it is compiled and downloaded to the DSP for execution.
In the inter-core connection relationship, two directly connected cores have an input-output relationship: the output of one core is the input of the other. Since the two directly connected cores read and write data simultaneously, the buffer between them must be a ping-pong buffer. The invention configures a ping-pong buffer, together with a buffer data structure, between every pair of cores that need to exchange data; if two cores do not exchange data, no ping-pong buffer or buffer data structure is configured between them.
Specifically, taking fig. 2 as an example, since this embodiment has 8 cores in total, 7 ping-pong buffers (located in memory) and 7 buffer data structures (located in memory) are provided. The 7 buffer data structures are identical in layout, and the method described below must be performed on each of them; therefore only one is described in detail, and the remaining 6 are not repeated.
In the step S1, the buffer data structure is:
[Table: buffer data structure: data pointer 0 | data pointer 1 | buffer data length | buffer read indicator | buffer write indicator | buffer type]
data pointer 0 and data pointer 1 point to two memory regions of the ping-pong buffer, respectively. The buffer data length refers to the length of each memory area. The buffer read indicator is used for indicating which of the two memory areas stores input data in the current data processing period; the buffer write indicator is used to indicate which of the two memory regions holds output data during the current data processing cycle. The buffer type is used to indicate whether the buffer is a ping-pong buffer.
Since the buffer area between two cores is a ping-pong buffer area, two areas need to be opened up in the memory to store read and write data respectively.
It can be seen that, compared with the conventional buffer data structure, the buffer data structure designed by the invention adds one item: the buffer write indicator. Thus, in each data processing cycle, a core that needs to read data uses only the buffer read indicator, and a core that needs to write data uses only the buffer write indicator.
Step S2: corresponding buffer write indicators and buffer read indicators are initialized.
Each core of the DSP is provided with its own cycle counter for counting data processing cycles. A per-core counter is used because the cycle counter must be modified and updated continuously as the multi-core parallel computing scheduling method runs; if all cores shared one cycle counter, this would itself create a data consistency problem. Giving each core its own cycle counter avoids the extra overhead of the series of prefetcher and data-cache operations that every core would otherwise need when reading and writing a shared counter.
In addition, the interval in memory between the cycle counters of any two cores must be larger than the minimum granularity of the DSP data cache's write-back operation; the cycle counters of the cores must not be stored contiguously in memory. This is determined by the hardware characteristics of the DSP. A C66x DSP core operates on its data cache (discarding old data with processor instructions, or performing a write-back with processor instructions) at a minimum granularity, which is the same for discarding and for write-back. When data in the data cache is written back to memory, all data within one granularity unit is written back together. Setting the interval between any two cores' cycle counters larger than this minimum write-back granularity prevents two cores from interfering with each other when each writes the data in its own data cache back to memory.
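One way to satisfy this spacing constraint is to pad each counter out to a full granularity unit; the 64-byte figure below is an assumption for illustration, and the real minimum granularity must be taken from the device documentation:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed write-back granularity; check the device datasheet for the
   actual value on a given DSP. */
#define ASSUMED_CACHE_LINE 64
#define NUM_CORES 8

/* Each core's counter occupies its own granularity-sized slot, so a
   write-back from one core cannot clobber another core's counter. */
typedef struct {
    volatile unsigned count;
    unsigned char pad[ASSUMED_CACHE_LINE - sizeof(unsigned)];
} padded_counter_t;

static padded_counter_t cycle_counter[NUM_CORES];
```

The padding guarantees that any two counters are at least one granularity unit apart in memory, which is the condition the patent requires.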
The step S2 is specifically performed by the following manner:
for each core of the DSP, judging whether the current core has input data in each data processing period according to a period counter and a multi-core parallel computing model, if the current core has no input data in an N-1 data processing period and has input data in an N data processing period, setting a buffer reading indicator in a buffer data structure between the current core and a forward core to be 0 and setting a buffer writing indicator in a buffer data structure between the current core and a backward core to be 0 at the beginning of the N data processing period. N is a positive integer greater than 1. In addition, if the current core has input data in the 1 st data processing cycle, the buffer read indicator in the buffer data structure between the current core and its forward core is set to 0, and the buffer write indicator in the buffer data structure between the current core and its backward core is set to 0.
Taking the data flow model as an example, whether a core has input data in each data processing cycle is judged as follows. As shown in fig. 2, each block represents a processor core, and the number is the core's index. Input data is passed from core 0 to core 1, core 2, core 3, and so on in sequence. In the 1st data processing cycle only core 0 has input data and the rest have none, so only core 0 runs; for the ping-pong buffer between core 0 and core 1, the buffer write indicator is active and the buffer read indicator is inactive. In the 2nd processing cycle, core 0 passes its processed data to core 1, so core 1 also has input data; now both the write indicator and the read indicator of the buffer between core 0 and core 1 are active, while the read indicator of the buffer between core 1 and core 2 is still inactive. Core 2 has no input data until the 3rd processing cycle. Thus, with the cycle counters set up as above and the parallel computing model shown in fig. 2, it can be determined whether a given core has input data.
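For this pipeline model, the "does core k have input in cycle n" test and the resulting indicator-initialization condition of step S2 can be sketched as follows (an illustrative reconstruction, not code from the patent):

```c
#include <assert.h>

/* In the pipeline of fig. 2 (core 0 feeds core 1, which feeds core 2,
   and so on), core k first receives input in data processing cycle k+1. */
static int has_input(int core, int cycle) { return cycle >= core + 1; }

/* A core sets its buffer read/write indicators to 0 in the first cycle
   in which it has input but had none in the previous cycle (or, for a
   source core, in cycle 1). */
static int needs_indicator_init(int core, int cycle) {
    return has_input(core, cycle) && (cycle == 1 || !has_input(core, cycle - 1));
}
```

With this predicate, each core decides locally, from its own cycle counter, when to run the initialization of step S2, so no shared state is consulted.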
Step S3: all cores of the DSP perform the steps of reading inter-core data and writing inter-core data simultaneously.
The steps of reading inter-core data and writing inter-core data are performed simultaneously on different cores, with no fixed order between them. In this patent, all cores of the DSP run at the same time: the forward core writes data into the ping-pong buffer while the backward core reads data from it. "Simultaneously" here means truly parallel execution on different cores, not the usual time-slicing of multiple threads on one CPU.
Although the forward core and the backward core read and write at the same moment, the ping-pong buffer provides two memory regions, one used for reading and one for writing, so the two cores operate on different regions. The regions therefore never conflict, and simultaneous operation at the hardware level is possible.
As shown in fig. 3, for each current core, the inter-core data is read, which specifically includes the following steps:
step S31: the current core does not perform any additional operation on the prefetcher and the data cache, and directly reads the buffer read indicator from the buffer data structure;
Inter-core data consistency is maintained without any additional operation on the prefetcher or the data cache, which reduces the overhead of the multi-core parallel computing scheduling algorithm in maintaining inter-core data consistency.
Step S32: according to the value of the buffer area reading indicator, a data pointer pointing to a memory area of input data is found;
There are two data pointers in the buffer data structure, data pointer 0 and data pointer 1. When the value of the buffer read indicator is 0, data pointer 0 points to the memory area of the input data; when the value is 1, data pointer 1 points to the memory area of the input data.
Step S33: discarding the data in the prefetcher and the old data in the data cache by using the processor instruction, and then reading the input data through the data pointer in the step S32;
Reading and writing in this patent are described from the perspective of the DSP core, because the data cache is located inside the DSP core; data is read from memory into the data cache. "Reading" here means the DSP core using instructions to move data from memory into the data cache.
Step S34: the value of the buffer read indicator is directly inverted without any additional operation on the prefetcher and the data cache, and is used as the value of the buffer read indicator in the next data processing cycle.
Inter-core data coherency is maintained without any additional operations on prefetchers and data caches.
In the step S34, inverting the value of the buffer read indicator includes: the value of the buffer read indicator is changed from 0 to 1 or from 1 to 0.
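Steps S31 to S34 can be sketched as one C function. This is a host-runnable illustration, not the patent's implementation: the cache-control macro is a hypothetical stand-in (a no-op on a host) for the DSP's actual prefetcher-discard and cache-invalidate instructions.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the DSP instructions that discard prefetched data and
 * invalidate stale lines in the data cache (hypothetical name; a no-op
 * on a host, where there is no incoherent cache to invalidate). */
#define DISCARD_PREFETCH_AND_INVALIDATE_DCACHE(addr, len) ((void)0)

/* Steps S31-S34: read the buffer read indicator, select the data
 * pointer, drop stale cached copies of that area, copy the input data,
 * and return the inverted indicator for the next data processing cycle. */
static uint32_t read_inter_core_data(void *const ptr[2], uint32_t length,
                                     uint32_t read_ind, void *dst)
{
    void *src = ptr[read_ind];                           /* S31 + S32 */
    DISCARD_PREFETCH_AND_INVALIDATE_DCACHE(src, length); /* S33 */
    memcpy(dst, src, length);                            /* S33 */
    return 1u - read_ind;                                /* S34: 0<->1 */
}
```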
As shown in fig. 4, in step S3, for each current core, the step of writing inter-core data is as follows:
step S31': the current core does not perform any additional operation on the prefetcher and the data cache, and directly reads the buffer write indicator from the buffer data structure;
inter-core data coherency is maintained without any additional operations on prefetchers and data caches.
Step S32': according to the value of the buffer write indicator, a data pointer pointing to the memory area of output data is found;
There are two data pointers in the buffer data structure, data pointer 0 and data pointer 1. When the value of the buffer write indicator is 0, data pointer 0 points to the memory area of the output data; when the value is 1, data pointer 1 points to the memory area of the output data.
Step S33': writing output data through the data pointer in the step S32', and then executing write-back operation on the data cache by utilizing a processor instruction;
Reading and writing in this patent are described from the perspective of the DSP core, because the data cache is located inside the DSP core. Moving data from the data cache to memory is called a "write". Write-back here means the current core using special instructions to write data from the data cache out to memory.
The data cache is designed with the following characteristic: when an instruction asks the DSP core to write a calculation result to memory, the actual result is that the core writes it only to the data cache and does not write the data to memory. To ensure that the data reaches memory, a special write-back instruction must be invoked to force the data in the data cache to be written out to memory.
Step S34': the value of the buffer write indicator is inverted without any additional operation on the prefetcher and the data cache, and is used as the value of the buffer write indicator in the next data processing period;
inter-core data coherency is maintained without any additional operations on prefetchers and data caches.
In the step S34', the value of the buffer write indicator is inverted, including: the value of the buffer write indicator is changed from 0 to 1 or from 1 to 0.
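Steps S31' to S34' can be sketched in the same way. Again the write-back macro is a hypothetical stand-in (a no-op on a host); on a real DSP it would be the processor's data-cache write-back operation that forces the cached result out to memory.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the DSP's cache write-back instruction (hypothetical
 * name; a no-op on a host). On a real DSP this forces the cached
 * result out to memory, since an ordinary store may stop at the
 * data cache. */
#define WRITEBACK_DCACHE(addr, len) ((void)0)

/* Steps S31'-S34': read the buffer write indicator, select the data
 * pointer, write the output data, write the cache back to memory, and
 * return the inverted indicator for the next data processing cycle. */
static uint32_t write_inter_core_data(void *const ptr[2], uint32_t length,
                                      uint32_t write_ind, const void *src)
{
    void *dst = ptr[write_ind];    /* S31' + S32' */
    memcpy(dst, src, length);      /* S33': write the output data */
    WRITEBACK_DCACHE(dst, length); /* S33': force it out to memory */
    return 1u - write_ind;         /* S34': 0<->1 */
}
```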
Step S4: and entering the next data processing period, and repeating the step S3 until the operation of the multi-core parallel computing model is completed.
Each repetition of step S3 corresponds to the next data processing cycle, so the buffer read indicator and buffer write indicator used in each repetition take the values prepared for the next cycle in steps S34 and S34'.
The working principle of the fast inter-core data synchronization method for multi-core parallel computing of the present invention when data is transferred from a previous core to a next core is specifically described below by taking a buffer between core 1 and core 2 as an example.
For ease of discussion, it is assumed here that the buffer write indicator and the buffer read indicator have both been initialized. Suppose data pointer 0 points to the memory area from which core 2 reads data, and data pointer 1 points to the memory area to which core 1 writes data; then the value of the buffer read indicator is 0 and the value of the buffer write indicator is 1.
Core 1 can read the value of the buffer write indicator directly from memory without any additional operation on the prefetcher or the data cache. The reason is as follows. Core 2 never needs to use the buffer write indicator and therefore never operates on it; in other words, only core 1 has operated on the buffer write indicator up to this point. If core 1 already holds the buffer write indicator in its data cache, core 1's hardware obtains the value directly from the data cache. If the indicator is not held in core 1's data cache, core 1's hardware automatically reads its value from memory. In both cases core 1 correctly obtains the value of the buffer write indicator, namely 1. Next, core 1 writes data to the memory area pointed to by data pointer 1. After the write, core 1 performs a write-back operation on the data cache using a processor instruction. Core 1 then inverts the value of the buffer write indicator to obtain 0 and uses this as the new value of the buffer write indicator for the next data processing cycle.
While core 1 is processing data, core 2 is also processing data. Core 2 likewise reads the value of the buffer read indicator directly from memory without any additional operation on the prefetcher or the data cache. The reason is as follows. Core 1 never needs to use the buffer read indicator and therefore never operates on it; in other words, only core 2 has operated on the buffer read indicator up to this point. If core 2 already holds the buffer read indicator in its data cache, core 2's hardware obtains the value directly from the data cache. If not, core 2's hardware automatically reads its value from memory. In both cases core 2 correctly obtains the value of the buffer read indicator, namely 0. Next, core 2 discards the data in the prefetcher and the stale data in the data cache using processor instructions, and reads the data from the memory area pointed to by data pointer 0. Core 2 then inverts the value of the buffer read indicator to obtain 1, which becomes the new value of the buffer read indicator for the next data processing cycle.
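The core 1 / core 2 interplay can be simulated on a host as follows. This is a sketch under our own naming, with cache effects omitted; the key property it demonstrates is that each core only ever reads and toggles its own indicator, so the two indicators are never compared against each other.

```c
#include <stdint.h>
#include <string.h>

/* Host-side simulation of the core 1 (writer) / core 2 (reader) example.
 * Each core keeps and toggles only its own indicator; the indicators are
 * independent, so they need not look mutually consistent in memory. */
typedef struct {
    char     areas[2][4]; /* the two ping-pong memory areas */
    uint32_t read_ind;    /* owned and toggled by core 2 only */
    uint32_t write_ind;   /* owned and toggled by core 1 only */
} sim_buffer_t;

/* One data processing cycle: core 1 writes `in` to its area while
 * core 2 reads the other area into `out`; each core then inverts
 * its own indicator for the next cycle. */
static void sim_cycle(sim_buffer_t *b, const char in[4], char out[4])
{
    memcpy(b->areas[b->write_ind], in, 4);  /* core 1 writes */
    memcpy(out, b->areas[b->read_ind], 4);  /* core 2 reads the other area */
    b->write_ind = 1u - b->write_ind;
    b->read_ind  = 1u - b->read_ind;
}
```

Running two cycles shows the pipeline effect: what core 1 writes in cycle N is what core 2 reads in cycle N+1.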
It can be seen that, after the buffer write indicator is added, core 1 and core 2 need no additional operations on the prefetcher or the data cache when reading the buffer read indicator (or the buffer write indicator). Even though both hardware components, the prefetcher and the data cache, keep operating, core 1 always obtains the correct value of the buffer write indicator, and core 2 always obtains the correct value of the buffer read indicator. Note that, because no corresponding operations are performed on the prefetcher and the data cache, the values of the buffer read indicator (and write indicator) in memory may not be consistent with the values inside the DSP cores; that is, the values in memory may be stale and wrong. For example, if the value of the buffer write indicator is 0, the buffer read indicator should normally be 1, and if the buffer write indicator is 1, the buffer read indicator should be 0; yet with the method of this patent, the buffer write indicator and the buffer read indicator in memory may both be 0 or both be 1. The buffer write indicator and buffer read indicator are therefore two variables used internally by the multi-core parallel computing scheduling method: this patent only guarantees that the scheduling algorithm itself obtains the correct values, and other programs should not attempt to read these two variables from memory.
In summary, by providing both a buffer write indicator and a buffer read indicator, the fast inter-core data synchronization method for multi-core parallel computing of the present invention requires, during the transfer of data from the forward core to the backward core, only one discard of the data in the prefetcher and the data cache and one write-back operation on the data cache, thereby significantly reducing the overhead of the multi-core parallel computing scheduling method.
Experimental results
The fast inter-core data synchronization method is described above, and the performance of the fast inter-core data synchronization method is tested below. For comparison purposes, the same test environment as that of patent CN201811305984.9 was used here, as shown in table 1.
Table 1 values of parameters in performance testing
(Table 1 is provided as an image in the original publication and is not reproduced here.)
The test results are shown in Table 2. It can be seen that after the inter-core data synchronization method provided by the present invention is adopted, the overhead when reading and writing inter-core data is 27.8% and 18.7% of the multi-core parallel computing scheduling method of patent document CN201811305984.9, respectively.
TABLE 2 Performance test results
(Table 2 is provided as an image in the original publication and is not reproduced here.)
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention; various modifications can be made to the above-described embodiment. All simple, equivalent changes and modifications made in accordance with the claims and the specification of the present application fall within the scope of the patent claims. Matters not described in detail herein belong to the conventional art.

Claims (4)

1. The fast inter-core data synchronization method for multi-core parallel computing is used for maintaining inter-core data consistency during DSP multi-core parallel computing, and is characterized by comprising the following steps:
step S1: according to the inter-core connection relation of the multi-core parallel computing model, a buffer area and a buffer area data structure are configured between every two directly connected cores in the inter-core connection relation, and a buffer area writing indicator and a buffer area reading indicator are arranged in each buffer area data structure;
step S2: initializing a corresponding buffer write indicator and buffer read indicator;
step S3: all cores of the DSP simultaneously execute the steps of reading inter-core data and writing inter-core data;
step S4: entering the next data processing period, and repeating the step S3 until the operation of the multi-core parallel computing model is completed;
in the step S3, for each current core, the step of reading inter-core data is as follows:
step S31: the current core does not perform any additional operation on the prefetcher and the data cache, and directly reads the buffer read indicator from the buffer data structure;
step S32: according to the value of the buffer area reading indicator, a data pointer pointing to a memory area of input data is found;
step S33: discarding the data in the prefetcher and the old data in the data cache by using the processor instruction, and then reading the input data through the data pointer in the step S32;
step S34: directly inverting the value of the buffer read indicator without any additional operation on the prefetcher and the data cache, and taking the value as the value of the buffer read indicator in the next data processing period;
for each current core, the step of writing inter-core data is as follows:
step S31': the current core does not perform any additional operation on the prefetcher and the data cache, and directly reads the buffer write indicator from the buffer data structure;
step S32': according to the value of the buffer write-in indicator, a data pointer pointing to a memory area of output data is found;
step S33': writing output data through the data pointer of the step S32', and then executing write-back operation on the data cache by utilizing a processor instruction;
step S34': the value of the buffer write indicator is directly inverted without any additional operation on the prefetcher and the data cache, and is used as the value of the buffer write indicator in the next data processing cycle.
2. The method for fast inter-core data synchronization for multi-core parallel computing according to claim 1, wherein in the step S1, the buffer data structure is:
(The buffer data structure is shown as an image in the original; per the description that follows, its fields are: data pointer 0, data pointer 1, buffer data length, buffer read indicator, buffer write indicator, and buffer type.)
wherein, the data pointer 0 and the data pointer 1 respectively point to two memory areas of the ping-pong buffer; the data length of the buffer area refers to the length of each memory area; the buffer read indicator is used for indicating which of the two memory areas stores input data in the current data processing period; the buffer write-in indicator is used for indicating which of the two memory areas stores output data in the current data processing period; the buffer type is used to indicate whether the buffer is a ping-pong buffer.
3. The method for fast inter-core data synchronization for multi-core parallel computing according to claim 1, wherein in the step S2, each core of the DSP is provided with a cycle counter so as to count data processing cycles, and a distance between cycle counters of any two cores in the memory is larger than a minimum granularity of a write-back operation of the DSP data cache;
and the step S2 is specifically performed by:
judging whether the current core has input data in each data processing period according to a period counter and a multi-core parallel computing model of each core of the DSP, if the current core has no input data in an N-1 data processing period and has input data in an N data processing period, setting a buffer read indicator in a buffer data structure between the current core and a forward core thereof to be 0 and setting a buffer write indicator in a buffer data structure between the current core and a backward core thereof to be 0 at the beginning of the N data processing period, wherein N is a positive integer larger than 1; in addition, if the current core has input data in the 1 st data processing cycle, the buffer read indicator in the buffer data structure between the current core and its forward core is set to 0, and the buffer write indicator in the buffer data structure between the current core and its backward core is set to 0.
4. The method according to claim 1, wherein in step S3, the step of reading inter-core data and the step of writing inter-core data are performed simultaneously on different cores.
CN202010324853.6A 2020-04-22 2020-04-22 Quick inter-core data synchronization method for multi-core parallel computing Active CN111459872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010324853.6A CN111459872B (en) 2020-04-22 2020-04-22 Quick inter-core data synchronization method for multi-core parallel computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010324853.6A CN111459872B (en) 2020-04-22 2020-04-22 Quick inter-core data synchronization method for multi-core parallel computing

Publications (2)

Publication Number Publication Date
CN111459872A CN111459872A (en) 2020-07-28
CN111459872B true CN111459872B (en) 2023-05-12

Family

ID=71679590


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181893B (en) * 2020-09-29 2022-07-05 东风商用车有限公司 Communication method and system between multi-core processor cores in vehicle controller

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103970602A (en) * 2014-05-05 2014-08-06 华中科技大学 Data flow program scheduling method oriented to multi-core processor X86
EP3401784A1 (en) * 2017-05-11 2018-11-14 Tredzone SAS Multicore processing system
WO2019060386A2 (en) * 2017-09-19 2019-03-28 Bae Systems Controls Inc. System and method for managing multi-core accesses to shared ports
CN109558226A (en) * 2018-11-05 2019-04-02 上海无线通信研究中心 A kind of DSP multi-core parallel concurrent calculating dispatching method based on internuclear interruption


Non-Patent Citations (1)

Title
Last-level cache optimization method for multi-core processors based on data prefetching; Shan Shuchang et al.; Journal of Computer-Aided Design & Computer Graphics (No. 09); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant