CN118012818A - Inter-process communication optimization method based on new generation Shenwei many-core processor

Inter-process communication optimization method based on new generation Shenwei many-core processor

Info

Publication number
CN118012818A
CN118012818A (application CN202410428041.4A)
Authority
CN
China
Prior art keywords: communication, different, data, chip, inter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410428041.4A
Other languages
Chinese (zh)
Inventor
刘弢
张忠亮
高宝峰
郭莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.) 2024-04-10
Filing date 2024-04-10
Publication date 2024-05-10
Application filed by Qilu University of Technology and Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202410428041.4A
Publication of CN118012818A
Legal status: Pending


Landscapes

  • Computer And Data Communications (AREA)

Abstract

The invention relates to an inter-process communication optimization method based on a new generation Shenwei many-core processor, and belongs to the technical field of electronic information. The core groups within a chip are divided into different process communication domains. The method comprises inter-chip communication optimization and intra-chip communication optimization. Inter-chip communication optimization comprises: dividing communication domains; partitioning the processes and assigning different processes to different communication domains according to the data usage pattern and the machine hardware architecture; and performing different inter-process communication operations simultaneously in different communication domains. Intra-chip communication optimization comprises: for different types of inter-process communication operations, a process with a specific core-group number applies for a space in the cross segment; the different core groups on the same chip then write their data to specific positions in the cross segment simultaneously, after which a single synchronization is performed among all processes. The invention provides different optimization methods for different types of inter-process communication operations.

Description

Inter-process communication optimization method based on new generation Shenwei many-core processor
Technical Field
The invention belongs to the technical field of electronic information, and particularly relates to an inter-process communication optimization method based on a new generation Shenwei many-core processor.
Background
High-performance computing plays an important role in promoting technological innovation and economic development, and is a strategic 'commanding height' that developed countries around the world compete for. China has achieved remarkable results in the field of high-performance computing, in particular in the independent research, development and application of supercomputers. The new generation Shenwei supercomputer is a representative domestic supercomputer; for the first time it achieves full domestic production of core components such as the processor, the network chipset, and the storage and management systems, demonstrating China's innovation strength in the supercomputer field.
SW26010pro is a new generation high-performance heterogeneous many-core processor independently developed in China. The architecture of the SW26010pro processor is shown in FIG. 1: the processor has 6 core groups (CG), and each core group contains one management processing element (MPE) and 64 computing processing elements (CPEs). Each processor has 96 GB of memory in total, referred to as main memory. Main memory is managed through virtual addresses and is divided into two kinds of regions, the continuous segment and the cross segment. Each core group has a private space, called its continuous segment space, in which addresses are contiguous; both the master core and the slave cores can directly access the continuous segment space of the core group to which they belong. Within the same chip there is also a space shared between core groups, called the cross-segment space; this is a distributed shared area accessible to the master and slave cores of the entire chip.
Currently, there are two general approaches to parallel optimization on the new generation Shenwei supercomputer and Shenwei many-core processors: process-level parallel optimization across the homogeneous core groups, and heterogeneous thread-level parallel optimization within a chip. Mainstream work focuses on the latter, for example using OpenACC directives or the Athread library for load balancing of slave-core tasks, optimization of master-slave core data transfers, and SIMD (Single Instruction Multiple Data) optimization. Process-level parallel optimization across core groups has received much less attention.
Process-level parallel optimization is a common way to improve the performance of a computer system, make full use of hardware resources, and solve large-scale problems. In process-level parallelization, the overall task is divided and each process executes its own sub-task independently. A single process needs to exchange or provide data with other processes during and after the execution of its task, and in this procedure the communication between processes is a key factor affecting the parallel efficiency of the whole program. Reducing the number of communications between processes and increasing the proportion of time each process spends computing are therefore critical to improving overall efficiency. Compared with the previous generation, each chip of the new generation Shenwei many-core processor grows from 4 core groups to 6, which means that inter-process communication inside a chip becomes closer, and communication optimization of the intra-chip processes becomes an important part of process-level parallel optimization. When a program is parallelized at the process level, frequent communication between processes leads to high communication overhead.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an inter-process communication optimization method based on a new generation Shenwei many-core processor.
An inter-process communication optimization method matched to the Shenwei many-core processor hardware architecture is adopted both within the same chip and between different chips, so that the number of communications between processes is reduced, inter-process communication optimization is finally realized, and the running efficiency of the program is improved.
Unlike previous approaches in which processes communicate in the global communication domain, the present invention divides the 6 processes corresponding to the 6 core groups of each chip into 6 different communication domains. Data is first processed among the core groups within a chip, and a specific process then communicates within its divided communication domain. The invention thus splits inter-process communication into two parts, corresponding respectively to intra-chip and inter-chip communication, matching the hardware architecture of the Shenwei many-core processor. Intra-chip optimization adopts the method provided by the invention to improve the data transmission efficiency of the processes within a chip, while inter-chip communication optimization aims at reducing the number of data transmissions between chips and increasing the data transmission bandwidth. The invention also designs a dedicated inter-process communication optimization interface, SPCI (Sunway Process Communication Interface), which is convenient for a programmer to call and reduces the programming difficulty of parallel programs.
The invention finally realizes inter-process communication optimization based on the new generation Shenwei many-core processor. Experimental results show that the intra-chip data transmission method reduces the intra-chip inter-process communication time and reduces the number of communications between different processes, improving the running efficiency of high-performance programs on the new generation Shenwei cluster.
Term interpretation:
A communication domain defines the communication relationship between a set of processes. A communication domain can be regarded as a group of processes: processes belonging to the same communication domain can send messages to and receive messages from each other.
The technical scheme of the invention is as follows:
The inter-process communication optimization method based on the new generation Shenwei many-core processor is characterized in that, in the new generation Shenwei many-core processor, the core groups within a chip are divided into different process communication domains, and the cross-segment shared memory is used for communication between core groups within a chip; the method comprises inter-chip communication optimization and intra-chip communication optimization;
inter-chip communication optimization comprises: dividing communication domains; partitioning the processes and assigning different processes to different communication domains; different processes are divided into different communication domains according to the data usage pattern and the machine hardware architecture;
different inter-process communication operations are performed simultaneously in different communication domains, including communication operations on different data of the same type and on different data of different types;
intra-chip communication optimization comprises: for different types of inter-process communication operations, a process with a specific core-group number applies for a space in the cross segment for storing the data that needs to be collected among the core groups of the same chip, the size of this space being 6 times the size of the data collected from each process; the different core groups on the same chip then write their data to specific positions in the cross segment simultaneously, and after the data-writing operation a single synchronization is performed among all processes.
It is further preferred that, when the inter-process communication operation is an operation in which one process collects data from every process, the data to be collected is stored in the cross segment of main memory.
Further preferably, when the inter-process communication operation is of the data-scan type, each process calculates a local accumulated result and a relative local intermediate result; the local accumulated result of each process on a chip is the accumulated result from the first process of that chip up to the current process; the relative local intermediate result is the intermediate result from each process on a chip up to the last process of that chip; the local accumulated result and the relative local intermediate result are stored in the continuous segment of main memory, and after the inter-chip process communication is finished, the local accumulated result and the relative global intermediate result are processed to obtain the global accumulated result; the global accumulated result is the accumulated result from the global starting process up to the current process, and is stored in the continuous segment of main memory.
According to the present invention, the division of the communication domains comprises:
firstly, obtaining the number of the core group on the corresponding chip, each core group corresponding to one process;
then dividing the processes into different communication domains according to the core-group numbers on the same chip, as follows: processes corresponding to different core groups on the same chip are divided into 6 different communication domains, and processes with the same core-group number on different chips are divided into the same communication domain.
Further preferably, the communication domains comprise communication domain 0, communication domain 1, communication domain 2, communication domain 3, communication domain 4 and communication domain 5; communication domain 0 comprises process 0, process 6, …, process 6n; communication domain 1 comprises process 1, process 7, …, process 6n+1; communication domain 2 comprises process 2, process 8, …, process 6n+2; communication domain 3 comprises process 3, process 9, …, process 6n+3; communication domain 4 comprises process 4, process 10, …, process 6n+4; and communication domain 5 comprises process 5, process 11, …, process 6n+5, where n is the index of a new generation Shenwei many-core processor (chip).
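As an illustration of this partitioning, the following minimal sketch builds the 6 communication domains with a standard MPI communicator split. It assumes that MPI ranks are assigned consecutively, so that ranks 6n to 6n+5 occupy the six core groups of chip n; the SPCI_Init interface described later presumably encapsulates similar logic, but its internals are not disclosed, so this is only an analogy in plain MPI.

#include <mpi.h>

#define CORE_GROUPS_PER_CHIP 6   /* six core groups per SW26010pro chip */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* group_rank k (0..5): which core group of its chip this process occupies;
       chip_id n: which chip the process runs on. */
    int group_rank = world_rank % CORE_GROUPS_PER_CHIP;
    int chip_id    = world_rank / CORE_GROUPS_PER_CHIP;

    /* Communication domain k contains processes k, k+6, k+12, ... across chips:
       all processes with the same core-group number share one communicator. */
    MPI_Comm cross_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group_rank, chip_id, &cross_comm);

    /* A per-chip communicator for the intra-chip (cross-segment) cooperation. */
    MPI_Comm chip_comm;
    MPI_Comm_split(MPI_COMM_WORLD, chip_id, group_rank, &chip_comm);

    /* ... collective operations in the divided domains go here ... */

    MPI_Comm_free(&chip_comm);
    MPI_Comm_free(&cross_comm);
    MPI_Finalize();
    return 0;
}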
According to a preferred embodiment of the present invention, the SPCI (Sunway Process Communication Interface) is called to implement the different types of inter-process communication operations.
The beneficial effects of the invention are as follows:
1. The method provided by the invention is specifically optimized for different types of inter-process communication operations, providing a different method for each type, and abstracts the optimized communication operations into interface functions, thereby reducing programming complexity and improving usability.
2. The invention was tested with 1 chip (6 processes), 4 chips (24 processes), 16 chips (96 processes) and 32 chips (192 processes), measuring the process communication time of three different test samples (the collection (gather) type, the scan type, and the two used in combination) before and after optimization by the method provided by the invention. The experimental results show that the process communication of the test samples optimized by the method achieves a clear speed-up.
Drawings
FIG. 1 is a schematic diagram of a SW26010pro processor architecture;
FIG. 2 is a schematic diagram of a hardware architecture of an interprocess communication partition and Shenwei many-core processor;
FIG. 3 is a schematic diagram of communication domain partitioning logic based on a Shenwei many-core processor;
FIG. 4 is a schematic diagram of inter-chip communication optimization logic based on a Shenwei many-core processor;
FIG. 5 is a schematic diagram of an optimized memory partition in a chip based on a Shenwei many-core processor;
FIG. 6 is a schematic diagram of scan type process communication on-chip optimization logic based on Shenwei many-core processor.
Detailed Description
The invention is further described below with reference to the drawings and examples, but is not limited thereto.
Example 1
According to the special memory architecture of the Shenwei many-core processor, the core groups on the same chip can access a shared region of memory, called the cross segment. In the new generation Shenwei supercomputer, because of this special hardware architecture, the communication overhead between chips is larger than that between the processes inside a chip. In the new generation Shenwei many-core processor, the core groups within a chip are divided into different process communication domains, and the cross-segment shared memory is used for communication between core groups within a chip; inter-process communication is thereby divided into two parts, corresponding respectively to intra-chip and inter-chip communication. The correspondence between this partition of inter-process communication and the hardware architecture of the Shenwei many-core processor is shown in FIG. 2. The method comprises inter-chip communication optimization and intra-chip communication optimization.
Inter-chip communication optimization comprises: dividing communication domains; partitioning the processes and assigning different processes to different communication domains, which reduces the share of inter-chip communication in the overall process communication. Different processes are divided into different communication domains according to the data usage pattern and the machine hardware architecture. At the same time, intra-chip communication optimization and cross-chip communication optimization are designed together, separating the data handled between chips and defining the data communication modes between the different levels of the hardware architecture.
Inter-chip communication optimization is realized on the basis of the divided process communication domains, as shown in FIG. 4. Different inter-process communication operations are performed simultaneously in different communication domains, including communication operations on different data of the same type and on different data of different types. In communication domain 0, different processes send different data to process number 0 (process 0); this is a collection-type operation. In communication domain 5, a scan-type operation is performed by a specific process of each chip within the communication domain to which it belongs; after the scan-type operation is applied to the relative local intermediate results, a relative global intermediate result is obtained. The 6 communication domains may perform only collection-type operations, only scan-type operations, or both types simultaneously. With the inter-chip communication optimization provided by the invention, more data can be processed in the same time, the inter-process communication bandwidth is improved, the granularity of data processing is increased, and the number of inter-process communications is reduced.
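For illustration only, the fragment below shows how, given the communicators from the split sketch above, a collection-type operation in communication domain 0 and a scan-type operation in communication domain 5 can proceed at the same time, since the two domains involve disjoint sets of processes. The buffer names are assumptions, and plain MPI collectives stand in for the SPCI functions.

#include <mpi.h>

/* Each process calls this with its core-group number (0..5) and the
   cross-chip communicator of its own communication domain. */
void concurrent_domain_collectives(int group_rank, MPI_Comm cross_comm,
                                   const int *send_buf, int count,
                                   int *gather_buf, int *scan_buf)
{
    if (group_rank == 0) {
        /* Communication domain 0: processes 0, 6, 12, ... send their data to
           process 0 (rank 0 of cross_comm), a collection-type operation. */
        MPI_Gather(send_buf, count, MPI_INT,
                   gather_buf, count, MPI_INT, 0, cross_comm);
    } else if (group_rank == 5) {
        /* Communication domain 5: processes 5, 11, 17, ... perform an exclusive
           prefix scan over their relative local intermediate results. */
        MPI_Exscan(send_buf, scan_buf, count, MPI_INT, MPI_SUM, cross_comm);
    }
    /* The remaining domains are free to run other collectives concurrently. */
}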
Intra-chip communication optimization comprises:
Within the chip, the method provided by the invention is adopted to optimize the different inter-process communication operations, which improves the reuse efficiency of data within the chip and reduces the intra-chip data transmission overhead.
The allocation and use of the memory space within the chip is shown in FIG. 5. For different types of inter-process communication operations, a process with a specific core-group number applies for a space in the cross segment for storing the data that needs to be collected among the core groups of the same chip, the size of this space being 6 times the size of the data collected from each process. The different core groups on the same chip then write their data to specific positions in the cross segment simultaneously, and after the data-writing operation a single synchronization is performed among all processes, ensuring that the data is correctly written into the cross segment. The method reduces the data transmission time within the chip and increases the number of times data is reused among the core groups, and the communication operations between different processes are optimized according to their respective characteristics.
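The text does not spell out the Sunway API used to allocate space in the cross segment, so the sketch below uses an MPI-3 shared-memory window as a stand-in for it: the process with core-group number 0 owns a buffer six times the size of one per-process contribution, every core group on the chip writes its contribution at the slot given by its core-group number, and a single chip-wide synchronization follows. It illustrates the data flow of FIG. 5 by analogy and is not the actual on-chip implementation.

#include <mpi.h>
#include <string.h>

/* Illustrative stand-in for the chip's cross segment: an MPI-3 shared-memory
   window owned by the process whose core-group number is 0. chip_comm must
   contain exactly the 6 processes of one chip, ordered by core-group number. */
void chip_local_gather(MPI_Comm chip_comm, const int *my_data, int count,
                       int **gathered, MPI_Win *win_out)
{
    int group_rank;
    MPI_Comm_rank(chip_comm, &group_rank);

    /* Core group 0 reserves room for six contributions (6 * count elements);
       the other core groups reserve nothing and query core group 0's buffer. */
    MPI_Aint bytes = (group_rank == 0) ? (MPI_Aint)(6 * count) * sizeof(int) : 0;
    int *base = NULL;
    MPI_Win win;
    MPI_Win_allocate_shared(bytes, sizeof(int), MPI_INFO_NULL, chip_comm, &base, &win);
    if (group_rank != 0) {
        MPI_Aint qsize; int qdisp;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &base);
    }

    MPI_Win_fence(0, win);                 /* open the access epoch */
    /* Each core group writes its data into its own slot of the shared buffer. */
    memcpy(base + (size_t)group_rank * count, my_data, (size_t)count * sizeof(int));
    MPI_Win_fence(0, win);                 /* the single chip-wide synchronization */

    *gathered = base;   /* on core group 0 this now holds all six contributions */
    *win_out  = win;    /* the caller frees it later with MPI_Win_free */
}

On core group 0, the returned buffer then holds all six contributions in core-group order and can be handed to the subsequent inter-chip step.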
Example 2
The inter-process communication optimization method based on the new generation Shenwei many-core processor according to Example 1, characterized in that:
When an inter-process communication operation collects data from every process to the current process, it is an N-to-1 type of inter-process communication operation. This type of operation needs to collect data from all processes, and when the number of processes is large the required communication overhead is enormous; the communication time also grows as the number of processes and the amount of data collected increase.
For the on-chip optimization of collection-type inter-process communication, the invention adopts a method based on the special memory architecture of the Shenwei many-core processor: the data to be collected, which was previously stored in the continuous segment of main memory, is instead stored in the cross segment of main memory, increasing the sharing of data within the chip.
A data-scan type inter-process communication operation scans specific data across all processes, each process obtaining an accumulated value from the starting process up to the current process. If the number of processes is large, the communication time of the later-ranked processes is prolonged, which affects the overall communication time.
The invention's optimization of this data-scan type of inter-process communication is shown in FIG. 6. When the inter-process communication operation is of the data-scan type, each process calculates a local accumulated result and a relative local intermediate result. The local accumulated result of each process on a chip is the accumulated result from the first process of that chip up to the current process; the relative local intermediate result is the intermediate result from each process on a chip up to the last process of that chip. The local accumulated result and the relative local intermediate result are stored in the continuous segment of main memory, which holds the private data of each core group. After the inter-chip process communication is finished, the local accumulated result and the relative global intermediate result are processed; the processing here means performing different calculations according to the received calculation type, such as summation, multiplication, maximum and minimum. This yields the global accumulated result, i.e. the accumulated result from the global starting process up to the current process, which is stored in the continuous segment of main memory. The global accumulated result and the local accumulated result differ only in their starting process: the starting process of the global accumulated result is process 0, while the starting process of the local accumulation is process 6n, where n is the index of the new generation Shenwei many-core processor (chip).
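The following sketch restates the scan-type optimization as a two-level exclusive scan in plain MPI, assuming summation of a single integer per process and assuming that the relative local intermediate result handed to the inter-chip step is the chip-wide total. It mirrors the data flow of FIG. 6 but does not use the cross segment, so it should be read as an illustration of the scheme rather than the invention's implementation.

#include <mpi.h>

/* Hierarchical exclusive scan (sum).
   chip_comm:  the 6 processes of one chip, ordered by core-group number;
   cross_comm: processes with the same core-group number across chips.
   Only core group 5 takes part in the inter-chip step, matching the
   communication-domain-5 example in the text. */
int hierarchical_exscan(int my_value, MPI_Comm chip_comm, MPI_Comm cross_comm)
{
    int group_rank, chip_size;
    MPI_Comm_rank(chip_comm, &group_rank);
    MPI_Comm_size(chip_comm, &chip_size);          /* 6 on the SW26010pro */

    /* Local accumulated result: sum over the chip's processes that precede me. */
    int local_prefix = 0;
    MPI_Exscan(&my_value, &local_prefix, 1, MPI_INT, MPI_SUM, chip_comm);
    if (group_rank == 0) local_prefix = 0;         /* Exscan leaves rank 0 undefined */

    /* Chip-wide total, taken here as the relative local intermediate result. */
    int chip_total = 0;
    MPI_Allreduce(&my_value, &chip_total, 1, MPI_INT, MPI_SUM, chip_comm);

    /* Inter-chip step, performed only by core group 5 in its communication domain:
       an exclusive scan of the chip totals yields each chip's global offset. */
    int chip_offset = 0;
    if (group_rank == chip_size - 1) {
        int cross_rank;
        MPI_Comm_rank(cross_comm, &cross_rank);
        MPI_Exscan(&chip_total, &chip_offset, 1, MPI_INT, MPI_SUM, cross_comm);
        if (cross_rank == 0) chip_offset = 0;
    }
    /* Share the chip offset with the rest of the chip (the role played by the
       cross segment in the invention). */
    MPI_Bcast(&chip_offset, 1, MPI_INT, chip_size - 1, chip_comm);

    /* Global accumulated result for this process. */
    return chip_offset + local_prefix;
}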
As shown in FIG. 3, the division of the communication domains comprises:
firstly, obtaining the number of the core group on the corresponding chip, each core group corresponding to one process;
then dividing the processes into different communication domains according to the core-group numbers on the same chip, as follows: processes corresponding to different core groups on the same chip are divided into 6 different communication domains, and processes with the same core-group number on different chips are divided into the same communication domain. For example, communication domain 0 comprises process 0, process 6, …, process 6n, and communication domain 5 comprises process 5, process 11, …, process 6n+5, where n is the index of the new generation Shenwei many-core processor (chip).
The communication domains comprise communication domain 0, communication domain 1, communication domain 2, communication domain 3, communication domain 4 and communication domain 5; communication domain 0 comprises process 0, process 6, …, process 6n; communication domain 1 comprises process 1, process 7, …, process 6n+1; communication domain 2 comprises process 2, process 8, …, process 6n+2; communication domain 3 comprises process 3, process 9, …, process 6n+3; communication domain 4 comprises process 4, process 10, …, process 6n+4; and communication domain 5 comprises process 5, process 11, …, process 6n+5, where n is the index of a new generation Shenwei many-core processor (chip).
The SPCI (Sunway Process Communication Interface) is called to implement the different types of inter-process communication operations. The design and implementation of the SPCI interface improves the efficiency of process-level optimization on the new generation Shenwei supercomputer. The invention optimizes the two types of inter-process communication, N-to-1 and scan. Using the method provided by the invention, process communication within a chip is replaced by data sharing through the cross segment among the core groups, and the processes are then divided into different communication domains so as to reduce the number of communications between processes. Communication time is greatly reduced, and different communication tasks can be performed in different communication domains. The SPCI interface is described in Table 1.
Table 1 Description of the SPCI interface;
SPCI interface | Function
void SPCI_Init(MPI_Comm world_comm, MPI_Comm *cross_comm, int *world_rank, int *world_size, int *group_rank) | Initializes SPCI
int SPCI_gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) | N-to-1 (gather-type) collective communication function
int SPCI_Exscan(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) | Scan-type collective communication function
void SPCI_Finish() | Ends SPCI and releases resources
In Table 1, in void SPCI_Init(MPI_Comm world_comm, MPI_Comm *cross_comm, int *world_rank, int *world_size, int *group_rank): void indicates that the function returns no value; SPCI_Init is the function name for initializing the SPCI interface; MPI_Comm is the communication domain data type; world_comm is the global communication domain; MPI_Comm * is a communication domain pointer type; cross_comm is a pointer to the divided communication domain; int * is an integer pointer type; world_rank is a pointer to the process's global rank number; world_size is a pointer to the total number of global processes; and group_rank is a pointer to the process's rank number within the chip.
In int SPCI_gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm): int indicates that the function returns an int value; SPCI_gather is the function name of the gather-type process communication; const indicates a variable qualified with const, whose value is not changed; void * is a generic pointer type; sendbuf is a pointer to the send buffer containing the data to be sent; sendcount is the amount of data to be sent; MPI_Datatype is the data type of the communication data; sendtype is the specific data type of the data to be sent; recvbuf is a pointer to the receive buffer; recvcount is the amount of data to be received from each process; recvtype is the specific data type of the data to be received; root is the root process of the communication; MPI_Comm is the communication domain data type; and comm is the communication domain in which the process communication takes place.
In int SPCI_Exscan(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm): int indicates that the function returns an int value; SPCI_Exscan is the function name of the scan-type process communication; const indicates a variable qualified with const, whose value is not changed; void * is a generic pointer type; sendbuf is a pointer to the send buffer containing the data to be sent; recvbuf is a pointer to the receive buffer that stores the data received in the process communication; int is the integer data type; count is the amount of data to be transmitted; MPI_Datatype is the data type of the communication data; datatype is the specific data type of the data; MPI_Op is the data type describing the kind of calculation; op is the specific calculation type; MPI_Comm is the communication domain data type; and comm is the communication domain in which the process communication takes place.
When using the SPCI interface, a programmer can call the different inter-process communication functions of the SPCI interface according to the different inter-process communication requirements. Inter-process communication optimization using the SPCI interface is shown in Table 2.
Table 2 Inter-process communication optimization using the SPCI interface;
In Table 2, the header file of the SPCI interface is referenced first, after which the SPCI_Init function is called in the main function to initialize the SPCI interface. Process communication is performed after the computation of the hotspot program segment is completed: SPCI_gather is called for the collection-type operation and SPCI_Exscan is called for the scan-type operation. Finally, after the process communication ends, the SPCI_Finish() function is called to end the SPCI interface.
#include is a preprocessing directive in C and C++ for including an external header file or library file; SCPI.h is the header file of the SPCI interface.
int main denotes the main function, which returns an integer value; argc, an integer variable, is the number of command-line arguments; char is the character data type; char ** is a pointer-to-pointer (two-dimensional) char type; argv is the array of command-line arguments.
SPCI_Init is the function that initializes the SPCI interface; world_comm is the global communication domain; &cross_comm is the address of the variable cross_comm, the divided communication domain; &world_rank is the address of the variable world_rank, the process's rank number in the global communication domain; &world_size is the address of the variable world_size, the number of global processes; and &group_rank is the address of the variable group_rank, the process's rank number within the Shenwei chip.
SPCI_Gather is the function name of the gather-type operation; &group_bank is the address of the variable group_bank, the data to be sent; 6 indicates that each process sends 6 data items; MPI_INT indicates that the type of the data to be sent is integer; recv_buf is the buffer for the data to be received; 6 indicates that 6 data items are received from each process; MPI_INT indicates that the type of the data to be received is integer; 0 indicates that process number 0 is the root process responsible for collecting the data; and cross_comm is the divided communication domain.
SPCI_Exscan is the function name of the scan-type operation; &scan_buf is the address of the variable scan_buf, the data to be scanned; &recv_buf_1 is the address of the variable recv_buf_1, the data to be received; 1 indicates that one data item is scanned from each process; MPI_INT indicates that the type of the data to be scanned is integer; MPI_SUM indicates that the calculation type is summation; and cross_comm is the divided communication domain.
SPCI_Finish() ends the SPCI interface and releases the resources.
return 0 indicates that the main function returns the value 0.
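The code listing of Table 2 is not reproduced in the text, so the following is a reconstruction of the example program from the walkthrough above. The buffer sizes, the placeholder hotspot computation, and the assumption that MPI start-up happens inside SPCI_Init are not stated in the source; the header name SCPI.h and the identifiers follow the walkthrough, and the call signatures follow Table 1 (the walkthrough writes SPCI_Gather, Table 1 writes SPCI_gather).

#include "SCPI.h"   /* header name as given in the walkthrough; "SPCI.h" may be intended */
#include <mpi.h>    /* for the MPI types used in the SPCI signatures */

int main(int argc, char **argv)
{
    MPI_Comm cross_comm;                /* the divided communication domain */
    int world_rank, world_size, group_rank;

    /* Initialize SPCI; MPI start-up is assumed to happen inside SPCI_Init,
       since the walkthrough calls no MPI function directly. */
    SPCI_Init(MPI_COMM_WORLD, &cross_comm, &world_rank, &world_size, &group_rank);

    int group_bank[6];                  /* 6 values each process contributes        */
    int recv_buf[6 * 32];               /* assumed bound: 6 values from each of up
                                           to 32 chips in one communication domain  */
    int scan_buf = 0, recv_buf_1 = 0;

    /* ... hotspot computation filling group_bank and scan_buf ... */

    /* Gather-type operation: every process in cross_comm sends 6 integers to
       the root process (rank 0 of the communication domain). */
    SPCI_gather(group_bank, 6, MPI_INT, recv_buf, 6, MPI_INT, 0, cross_comm);

    /* Scan-type operation: exclusive prefix sum of one integer per process. */
    SPCI_Exscan(&scan_buf, &recv_buf_1, 1, MPI_INT, MPI_SUM, cross_comm);

    SPCI_Finish();                      /* end SPCI and release resources */
    return 0;
}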
OpenMC is a program that simulates particle transport by the Monte Carlo method; the operational stability of a nuclear reactor can be evaluated by computing quantities such as the k-eff value. The test sample program used in the invention is extracted from the synchronize_bank() function in the eigenvalue.cpp file of the OpenMC program, whose role is to calculate the index ranges of the particles assigned to the different processes in the fission bank. There are three different test samples in total, whose data comes from a portion of the inter-process communication data produced during OpenMC execution. The test sample program measures the execution time of the different test samples: the execution time of the original program without the optimization of this method and the execution time of the program optimized by this method, in microseconds. The reported results are the average of 3 runs.
The test sample corresponding to SPCI_Gather collects the calculation data of each process into the root process; in OpenMC, this gathers the calculation data of the different processes in each round so as to obtain the final result of that round. The test sample corresponding to SPCI_Exscan obtains, for each process, the accumulated result of the calculation data of all processes from the starting process up to the current process; in OpenMC, this determines the start and end positions, in the particle bank, of the particles assigned to each process for the next round of calculation. The test sample that uses SPCI_Gather and SPCI_Exscan in combination joins the first two samples, calling SPCI_Gather first and then SPCI_Exscan; this combined use is closer to the real running conditions of the OpenMC program.
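As context for the scan-type test sample, the sketch below shows how an exclusive prefix sum yields each process's start and end indices into a shared particle bank, which is the role the text attributes to SPCI_Exscan in synchronize_bank(). The function and variable names here are illustrative, not taken from OpenMC, and MPI_Exscan stands in for SPCI_Exscan.

#include <mpi.h>
#include <stdint.h>

/* Illustrative only: compute the index range [start, end) that this process's
   local_count particles occupy in the global fission bank, via an exclusive
   prefix sum. SPCI_Exscan from Table 1 could be used as a drop-in replacement
   for MPI_Exscan here. */
void particle_index_range(int64_t local_count, MPI_Comm comm,
                          int64_t *start, int64_t *end)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    int64_t offset = 0;
    MPI_Exscan(&local_count, &offset, 1, MPI_INT64_T, MPI_SUM, comm);
    if (rank == 0) offset = 0;          /* Exscan leaves rank 0's buffer undefined */

    *start = offset;                     /* first global index owned by this rank */
    *end   = offset + local_count;       /* one past the last owned index */
}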
The tests in the invention were carried out on a new generation Shenwei supercomputer, whose machine peak performance is 3.13 PFlops and whose node computation performance is 6.12 TFlops. The invention uses 16 nodes and optimizes the three different test sample programs through the method and the interface described above. Each test sample program was tested on 1, 4, 16 and 32 Shenwei many-core processors, corresponding to 6, 24, 96 and 192 processes respectively.
Table 3 Acceleration ratio of SPCI_Gather used alone:
Table 4 Acceleration ratio of SPCI_Exscan used alone:
Table 5 Acceleration ratio of SPCI_Gather and SPCI_Exscan used in combination:
It can be seen that the acceleration ratio of SPCI_Gather used alone is between 1.29 and 3.81, that of SPCI_Exscan used alone is between 1.38 and 2.15, and that of SPCI_Gather and SPCI_Exscan used in combination is between 1.29 and 2.07. The reason is that different types of inter-process communication have different execution characteristics, and the combined effect of the two communication optimization methods provided by the invention becomes more pronounced as the number of processes increases. The experimental results demonstrate that the method has a good acceleration effect.

Claims (6)

1. An inter-process communication optimization method based on a new generation Shenwei many-core processor, characterized in that, in the new generation Shenwei many-core processor, the core groups within a chip are divided into different process communication domains, and the cross-segment shared memory is used for communication between core groups within a chip; the method comprises inter-chip communication optimization and intra-chip communication optimization;
the inter-chip communication optimization comprises: dividing communication domains; partitioning the processes and assigning different processes to different communication domains; different processes are divided into different communication domains according to the data usage pattern and the machine hardware architecture;
different inter-process communication operations are performed simultaneously in different communication domains, including communication operations on different data of the same type and on different data of different types;
the intra-chip communication optimization comprises: for different types of inter-process communication operations, a process with a specific core-group number applies for a space in the cross segment for storing the data that needs to be collected among the core groups of the same chip, the size of this space being 6 times the size of the data collected from each process; the different core groups on the same chip then write their data to specific positions in the cross segment simultaneously, and after the data-writing operation a single synchronization is performed among all processes.
2. The method for optimizing inter-process communication based on a new generation Shenwei many-core processor of claim 1, wherein, when the inter-process communication operation is an operation in which one process collects data from every process, the data to be collected is stored in the cross segment of the main memory.
3. The method for optimizing inter-process communication based on a new generation Shenwei many-core processor of claim 1, wherein, when the inter-process communication operation is of the data-scan type, each process calculates a local accumulated result and a relative local intermediate result; the local accumulated result of each process on a chip is the accumulated result from the first process of that chip up to the current process; the relative local intermediate result is the intermediate result from each process on a chip up to the last process of that chip; the local accumulated result and the relative local intermediate result are stored in the continuous segment of the main memory, and after the inter-chip process communication is finished, the local accumulated result and the relative global intermediate result are processed to obtain the global accumulated result; the global accumulated result is the accumulated result from the global starting process up to the current process, and is stored in the continuous segment of the main memory.
4. The method for optimizing inter-process communication based on a new generation Shenwei many-core processor of claim 1, wherein the division of the communication domains comprises:
firstly, obtaining the number of the core group on the corresponding chip, each core group corresponding to one process;
then dividing the processes into different communication domains according to the core-group numbers on the same chip, as follows: processes corresponding to different core groups on the same chip are divided into 6 different communication domains, and processes with the same core-group number on different chips are divided into the same communication domain.
5. The method for optimizing inter-process communication based on a new generation Shenwei many-core processor of claim 4, wherein the communication domains comprise communication domain 0, communication domain 1, communication domain 2, communication domain 3, communication domain 4 and communication domain 5; communication domain 0 comprises process 0, process 6, …, process 6n; communication domain 1 comprises process 1, process 7, …, process 6n+1; communication domain 2 comprises process 2, process 8, …, process 6n+2; communication domain 3 comprises process 3, process 9, …, process 6n+3; communication domain 4 comprises process 4, process 10, …, process 6n+4; and communication domain 5 comprises process 5, process 11, …, process 6n+5, where n is the index of a new generation Shenwei many-core processor.
6. The method for optimizing inter-process communication based on a new generation Shenwei many-core processor of any one of claims 1-5, wherein the SPCI interface is called to implement different types of inter-process communication operations.
CN202410428041.4A 2024-04-10 2024-04-10 Inter-process communication optimization method based on new generation Shenwei many-core processor Pending CN118012818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410428041.4A CN118012818A (en) 2024-04-10 2024-04-10 Inter-process communication optimization method based on new generation Shenwei many-core processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410428041.4A CN118012818A (en) 2024-04-10 2024-04-10 Inter-process communication optimization method based on new generation Shenwei many-core processor

Publications (1)

Publication Number Publication Date
CN118012818A true CN118012818A (en) 2024-05-10

Family

ID=90954464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410428041.4A Pending CN118012818A (en) 2024-04-10 2024-04-10 Inter-process communication optimization method based on new generation Shenwei many-core processor

Country Status (1)

Country Link
CN (1) CN118012818A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271344A (en) * 2018-08-07 2019-01-25 浙江大学 The data preprocessing method read based on Shen prestige chip architecture parallel file
CN111104119A (en) * 2018-10-25 2020-05-05 祥明大学校产学协力团 MPI program conversion method and device for memory center type computer
CN117033026A (en) * 2023-08-17 2023-11-10 山东省计算中心(国家超级计算济南中心) Optimizing method for multilevel collective communication based on new generation Shenwei super computer hardware architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘侃, 杨磊, 薛巍, 陈文光: "Sparse matrix-matrix multiplication for the Shenwei (Sunway) many-core architecture" (适用于申威众核架构的稀疏矩阵-矩阵乘法), Chinese Journal of Computational Physics (计算物理), vol. 41, no. 1, 31 January 2024 (2024-01-31), pages 7-8 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination