CN117354370A - Method, system and equipment for synchronous aggregation in universal network for distributed application program - Google Patents

Info

Publication number
CN117354370A
Authority
CN
China
Prior art keywords
aggregation
sequence
data packet
aggregator
target task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311326990.3A
Other languages
Chinese (zh)
Inventor
任棒棒
夏俊旭
郭得科
罗来龙
程葛瑶
张千桢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202311326990.3A priority Critical patent/CN117354370A/en
Publication of CN117354370A publication Critical patent/CN117354370A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/50: Network services
    • H04L67/56: Provisioning of proxy services
    • H04L67/566: Grouping or aggregating service requests, e.g. for unified processing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00: Arrangements for detecting or preventing errors in the information received
    • H04L1/12: Arrangements for detecting or preventing errors in the information received by using return channel
    • H04L1/16: Arrangements for detecting or preventing errors in the information received by using return channel in which the return channel carries supervisory signals, e.g. repetition request signals
    • H04L1/1607: Details of the supervisory signal
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/34: Flow control; Congestion control ensuring sequence integrity, e.g. using sequence numbers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/43: Assembling or disassembling of packets, e.g. segmentation and reassembly [SAR]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00: Packet switching elements
    • H04L49/20: Support for services
    • H04L49/201: Multicast operation; Broadcast operation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00: Packet switching elements
    • H04L49/30: Peripheral units, e.g. input or output ports
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/50: Network services
    • H04L67/60: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application relates to a method, a system and a device for general in-network synchronous aggregation for distributed applications. The method comprises the following steps: an aggregation task request of an application is obtained; the aggregator resources of the target task and the execution order of the target task are determined according to the aggregation task request and a preset scheduling policy; the controller allocates an isolation region for each aggregator resource, sets the offset rule corresponding to the isolation region, and writes the isolation region and its offset rule into the controller's aggregation table to obtain the aggregation rule. The switch receives the data packet sequence sent by the sender according to the execution order and the aggregation rule, and locates it in the aggregation table to obtain the matching aggregator. The aggregator merges the data packet sequences into a result data packet, which is sent to the receiver. The method achieves automatic allocation and runtime scheduling of data-plane resources and paths, and effectively reduces switch resource overhead.

Description

Method, system and equipment for synchronous aggregation in universal network for distributed application program
Technical Field
The present disclosure relates to the field of in-network aggregation technologies, and in particular, to a method, a system and a device for general in-network synchronous aggregation for distributed applications.
Background
Driven by programmable network devices, a new communication and computation paradigm called in-network aggregation (INA) has been proposed and applied to a variety of distributed systems, including Distributed Training (DT), High-Performance Computing (HPC), distributed block storage, and network monitoring. INA offloads the aggregation of data flows onto switches to reduce traffic and overall job completion time. Existing prototypes have demonstrated the performance gains of INA, for example a 66% improvement in DT jobs and a 2.7-6.8x speedup in storage.
While INA has proven successful within individual applications, the tight coupling of application and INA functions leads to problems such as redundant development, inability to update at runtime, potential security risks, and inefficient resource utilization. These problems hinder the widespread adoption of INA in development, deployment and operation, and make parallel multiplexing of multiple distributed applications impossible.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, a system, and a device for general in-network synchronous aggregation for distributed applications that can reduce switch resource overhead when distributed applications are multiplexed in parallel.
A general in-network synchronous aggregation method for distributed applications, applied to a general network architecture in which distributed applications share a cluster, comprises the following steps:
an aggregate task request of an application is obtained.
Determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, allocating an isolation region for each aggregator resource through the controller, setting the offset rule corresponding to the isolation region, and writing the isolation region and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule.
The switch receives the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule, and locates the data packet sequence in the aggregation table to obtain the aggregator matching the data packet sequence.
The data packet sequences are merged through the aggregator to obtain a result data packet, the result data packet is sent to the receiver of the target task, the receiver replies an ACK message sequence according to the result data packet, and the ACK message sequence is multicast back to the senders through the switch.
In one embodiment, the method further comprises: in the general network architecture in which distributed applications share a cluster, a communication path from each sender to the receiver of the target task is generated according to a routing protocol, and the servers, controller and switches corresponding to the application form an aggregation hierarchy along these communication paths. The applications send aggregation task requests to their local agents, and the agents forward the requests to the controller in parallel.
In one embodiment, the method further comprises: the aggregator resources of the target task and the execution order of the target task are determined according to the aggregation task requests and the scheduling policy preset by the controller, and the controller designates task-executing switches for the target task according to the execution order. On each task-executing switch, the controller allocates an isolation region for each aggregator resource, sets the offset rule of the isolation region, and writes the isolation region and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule of the target task.
In one embodiment, the method further comprises: the sender of the target task divides the target task data block into a sequence of data packets and sends the sequence of data packets to the switch in a maintained window.
In one embodiment, the method further comprises: the switch receives the data packet sequence sent by the sender of the target task according to the execution order, and addresses and locates the aggregator in the aggregation table according to the sequence number of the data packet sequence and the offset rule:
Aggregator.index←packet.seq_num+Offset
wherein Aggregator.index is the index within the isolation region, packet.seq_num is the sequence number of the data packet, and Offset is the offset rule. The aggregator in the isolation region corresponding to the sequence number of the data packet sequence is thereby obtained.
In one embodiment, the method further comprises: the data packet sequences with the same message sequence number are merged through the aggregator to obtain a result data packet, and the result data packet is sent to the receiver of the target task; the receiver replies an ACK message sequence according to the result data packet; when the ACK message sequence reaches the switch, the switch clears the aggregator corresponding to each target task according to the switch unit formed by the number of target tasks and the aggregation hierarchy, and returns the ACK message sequence to the sender corresponding to the target task.
In one embodiment, the aggregation task request includes: the target task ID, the sender IDs of the target task, the receiver ID of the target task, the aggregation function, and the aggregation type. The aggregation types include: Reduce and Allreduce.
In one embodiment, the method further comprises: if the aggregation type is Reduce, the receiver reassembles the ACK message sequence into a feedback message, and the feedback message is transmitted by the controller to the sender's local agent as the aggregation result. If the aggregation type is Allreduce, the sender reassembles the payloads of the ACK message sequence into a feedback message, and the feedback message is transmitted by the controller to the sender's local agent as the aggregation result. The local agent returns the aggregation result through IPC to the application that started the target task.
A distributed application-oriented universal in-network synchronous aggregation system, the system comprising:
and the aggregation task request acquisition module is used for acquiring the aggregation task request of the application program.
The aggregation rule acquisition module is used for determining the execution sequence of the aggregator resources of the target task and the target task according to the aggregation task request and a preset scheduling strategy, distributing an isolation area for each aggregator resource through the controller, setting an offset rule corresponding to the isolation area, and writing the offset rule corresponding to the isolation area and the isolation area into an aggregation table of the controller to obtain the aggregation rule.
And the aggregator matching module is used for receiving the data packet sequence sent by the sender of the target task according to the execution sequence and the aggregation rule by the switch, positioning the data packet sequence to the aggregation table, and obtaining an aggregator for matching the data packet sequence.
And the aggregation module is used for merging the data packet sequences through the aggregator to obtain a result data packet, sending the result data packet to a receiver of the target task, replying an ACK message sequence according to the result data packet by the receiver, and synchronously returning the result data packet to the sender through the multicast of the exchange unit.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
An aggregate task request of an application is obtained.
Determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, allocating an isolation region for each aggregator resource through the controller, setting the offset rule corresponding to the isolation region, and writing the isolation region and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule.
The switch receives the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule, and locates the data packet sequence in the aggregation table to obtain the aggregator matching the data packet sequence.
The data packet sequences are merged through the aggregator to obtain a result data packet, the result data packet is sent to the receiver of the target task, the receiver replies an ACK message sequence according to the result data packet, and the ACK message sequence is multicast back to the senders through the switch.
The above method, system, computer device and storage medium for general in-network synchronous aggregation for distributed applications first obtain the aggregation task request of the application, which means the system can dynamically adjust and respond to application demands to achieve efficient data aggregation. The aggregator resources and the execution order of the target task are determined using a preset scheduling policy, which means data aggregation can be optimized under different policies to meet the requirements of different application scenarios; the controller allocates an isolation region for each aggregator resource and sets the corresponding offset rule, so that resources are used effectively and data packets are controlled precisely. The switch processes the data packet sequences sent by the senders according to the execution order and the aggregation rule, ensuring correct routing and aggregation of packets and reducing resource overhead. The aggregator merges data packets with the same sequence number into a result data packet, which is then sent to the receiver of the target task, further processing and merging data efficiently and reducing communication overhead. The receiver replies an ACK message sequence according to the result data packet, which is multicast back to the senders through the switch, providing communication reliability and a feedback mechanism. The technical problems of communication performance and resource overhead are thus solved through dynamic task scheduling, packet processing and merging, and effective resource management. This provides a flexible method that meets the requirements of different applications, reduces development complexity, and supports a wide range of concurrent applications.
Drawings
FIG. 1 is an application scenario diagram of a universal in-network synchronization aggregation method facing a distributed application program in one embodiment;
FIG. 2 is a flow diagram of a method of universal in-network synchronous aggregation for distributed applications in one embodiment;
FIG. 3 is a GISA interface code, according to one embodiment;
FIG. 4 is a schematic diagram of multi-level generic intra-network synchronization aggregation in one embodiment;
FIG. 5 is a diagram of a GISA message structure, according to one embodiment;
FIG. 6 is a diagram of a switch aggregator layout in one embodiment;
FIG. 7 is an aggregate performance graph using 7 source nodes and 1 target node in one embodiment;
FIG. 8 is a graph of throughput at different packet loss rate settings in one embodiment;
FIG. 9 is a performance diagram of one embodiment with different source node and concurrent task number settings, where FIG. 9 (a) is throughput performance with different source node number and FIG. 9 (b) is performance for processing multiple concurrent tasks;
FIG. 10 is a single task overhead diagram in one embodiment, where FIG. 10 (a) is the switch state overhead required to aggregate tasks and FIG. 10 (b) is the traffic overhead for the entire network;
FIG. 11 is a graph of throughput in a different distributed training model in one embodiment, where FIG. 11 (a) is throughput in the VGG16 model, FIG. 11 (b) is throughput in the AlexNet model, and FIG. 11 (c) is throughput in the ResNet50 model;
FIG. 12 is a performance diagram in an erasure code storage system in which FIG. 12 (a) is repair time and FIG. 12 (b) is network traffic overhead, under an embodiment;
FIG. 13 is a performance graph in a network measurement application in one embodiment, where FIG. 13 (a) is the completion time of the CMS for different numbers of monitoring nodes and FIG. 13 (b) is the impact of CMS size on the transmission completion time;
FIG. 14 is a block diagram of a general in-network synchronous aggregation system that faces distributed applications in one embodiment;
FIG. 15 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The general in-network synchronous aggregation method for distributed applications provided by the application can be applied to the general network architecture in which distributed applications share a cluster (namely, General In-network Synchronous Aggregation, GISA) shown in FIG. 1. The GISA network architecture includes an application plane, a control plane and a data plane. The application plane is composed of multiple servers; FIG. 1 uses 3 servers (server 1, server 2 and server 3) as a brief example. Multiple applications run in the domain of each server, and applications on different servers interact with each other in a distributed manner. The control plane consists of the controller, which can be deployed on any server in the current network and is responsible for determining the task execution order, allocating aggregator resources to tasks, and installing routing rules. On the data plane, a GISA agent is deployed within the local domain of each server and exchanges messages with applications via inter-process communication (IPC).
In one embodiment, as shown in FIG. 2, a general in-network synchronous aggregation method for distributed applications is provided. The method is described as applied to the general network architecture in which distributed applications share a cluster in FIG. 1, and includes the following steps:
step 202, an aggregate task request of an application is obtained.
The aggregation request specifies the following information: task ID, sender IDs, receiver ID, and other configurations (e.g., aggregation type: Reduce or Allreduce).
Specifically, when the general in-network aggregation service is started, the GISA starts the controller in the cluster and starts an agent as a daemon on each server. Each agent establishes a communication channel with the controller, and each agent is assigned an ID associated with its host MAC address for routing. When an application needs to perform an aggregation operation, each endpoint submits a request to its local agent.
Further, the application offloads aggregation operations of the specified type (Reduce or Allreduce) to the GISA, each with multiple senders and one receiver. In the network, the routing protocol generates a path from each sender to the receiver, and all paths form a tree structure in the topology, i.e., the aggregation hierarchy.
Further, the agent passes the request to the controller, which may receive a plurality of aggregate task requests.
Step 204, determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, allocating an isolation region for each aggregator resource through the controller, setting the offset rule corresponding to the isolation region, and writing the isolation region and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule.
Specifically, the controller selects target tasks whose endpoints are all ready according to a scheduling policy (e.g., first come first served); the scheduling policy decides the execution order of the tasks and the aggregator resources allocated to each target task.
Further, the controller first configures the task-executing switches on the task's aggregation hierarchy. On each task-executing switch, the controller reserves an isolated region in the Aggregator Table and installs a rule with the region offset to direct the task's traffic to that region. The controller then informs all endpoints to begin the aggregation operation, i.e., the senders transmit data packets into the network and the receiver obtains the results.
Further, the GISA selects any server as the central controller; when there are multiple target tasks, it coordinates their resource allocation. A sender of the target task can start transmitting data packets to the aggregator only when all of the task's endpoints are ready; otherwise, the aggregation would never complete due to the missing senders.
Step 206, the switch receives the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule, and locates the data packet sequence in the aggregation table to obtain the aggregator matching the data packet sequence.
Specifically, in each target task, each sender divides its data block into a sequence of data packets, and all senders initialize using the same sequence number. The sender maintains a window and always sends packets in the window.
Further, the switch aggregates the packet sequences according to the execution order (e.g., first come first served). When a packet arrives at the switch, it is mapped to an aggregator in the isolation region corresponding to the target task. The addressing method adds the sequence number of the data packet to the offset, namely:
Aggregator.index←packet.seq_num+Offset
wherein Aggregator.index is the index within the isolation region, packet.seq_num is the sequence number of the data packet, and Offset is the offset rule.
Further, the sequence number wraps around within the size of the isolation region, so packets never make out-of-range accesses. An aggregator is initialized to EMPTY and accumulates each arriving packet; it also maintains a PortBitmap to record the participation of its child nodes. After accumulating a packet, if the bitmap is not full, the packet is discarded because aggregation is not yet complete; if the bitmap is full, the aggregator value is copied back into the packet, which carries the result along its route to the downstream device.
Further, the GISA designs a retransmission mechanism on the host and a deduplication mechanism on the switch. The sender maintains a transmission timestamp for each data packet in its sliding window; if the packet's ACK does not arrive within the timeout threshold, the packet is retransmitted. Retransmission is repeated up to three times; for the special case of unsynchronized windows, the sender sends a packet with a special flag to bypass switch aggregation and obtain the result directly from the receiver.
Further, the first occurrence of each data packet is recorded in the bitmap (its bit is set to 1), so subsequent occurrences can be identified and are not counted again. The switch deduplicates only aggregated packets, not forwarded packets. That is, after the above steps, if the bitmap is not full, the packet is discarded; otherwise, the packet carries the aggregator's content to the downstream device.
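For illustration, the per-packet aggregation logic described above can be sketched in C++ as follows. This is a minimal host-language sketch; the actual switch logic is implemented in Verilog, and names such as REGION_SIZE, NUM_PORTS and the SUM payload operation are assumptions, not the patent's implementation.

#include <array>
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int REGION_SIZE = 512;   // assumed isolation-region size N
constexpr int NUM_PORTS   = 4;     // assumed switch port count

struct Packet {
    uint32_t seq_num;              // cyclic within [0, REGION_SIZE)
    uint8_t  in_port;              // ingress port (child link)
    uint8_t  cn;                   // fan-in of this switch, from the header
    std::vector<int64_t> payload;  // multiset values to accumulate
};

struct Aggregator {
    std::bitset<NUM_PORTS> port_bitmap;  // records which children arrived
    std::vector<int64_t>   value;        // accumulated payload (EMPTY if bitmap == 0)
};

std::array<Aggregator, REGION_SIZE> table;

// Returns true if the packet now carries the merged result downstream.
bool on_data_packet(Packet& pkt, uint32_t region_offset) {
    Aggregator& agg = table[(region_offset + pkt.seq_num) % REGION_SIZE];
    if (!agg.port_bitmap.test(pkt.in_port)) {       // first arrival from this child
        if (agg.port_bitmap.none())                 // aggregator is EMPTY
            agg.value.assign(pkt.payload.size(), 0);
        for (size_t i = 0; i < pkt.payload.size(); ++i)
            agg.value[i] += pkt.payload[i];         // accumulate (e.g. SUM)
        agg.port_bitmap.set(pkt.in_port);           // record participation
    }                                               // duplicates are not re-counted
    if (agg.port_bitmap.count() < pkt.cn)           // bitmap not full yet:
        return false;                               //   discard, aggregation incomplete
    pkt.payload = agg.value;                        // full: copy value into the packet
    return true;                                    //   forward the result downstream
}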
Step 208, merging the data packet sequences through the aggregator to obtain a result data packet, sending the result data packet to the receiver of the target task, the receiver replying an ACK message sequence according to the result data packet, which is multicast back to the senders through the switch.
Specifically, the switch merges all the senders' data packets with the same sequence number into a result data packet; the receiver of the target task receives the aggregated result packet and replies to the senders with an ACK message carrying the same sequence number. If the aggregation type is Reduce, the ACK carries no payload; if it is AllReduce, the ACK message carries the result. When the ACK packet reaches a switch, the switch clears the corresponding aggregator and multicasts the ACK along the aggregation hierarchy to the senders. After each sender receives the ACK message, it slides its window forward and continues to send new packets within the window.
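The sender-side sliding window and timeout retransmission described above can be sketched as follows. This is a hedged illustration: the window size, timeout value and data structures are assumptions, not the patent's implementation.

#include <chrono>
#include <cstdint>
#include <map>

using Clock = std::chrono::steady_clock;
constexpr uint32_t WINDOW = 256;   // assumed window size (must not exceed region size N)
constexpr auto TIMEOUT = std::chrono::milliseconds(10);  // assumed retransmission timeout

struct SenderWindow {
    uint32_t base = 0;   // lowest unacknowledged sequence number
    uint32_t next = 0;   // next sequence number to send
    std::map<uint32_t, Clock::time_point> sent;  // seq -> transmission timestamp

    bool can_send() const { return next < base + WINDOW; }

    void on_send(uint32_t seq) { sent[seq] = Clock::now(); }

    // An ACK with the same sequence number slides the window forward.
    void on_ack(uint32_t seq) {
        sent.erase(seq);
        while (base < next && !sent.count(base)) ++base;
    }

    // Packets whose ACK has not arrived within the timeout are retransmitted.
    template <typename RetransmitFn>
    void check_timeouts(RetransmitFn retransmit) {
        auto now = Clock::now();
        for (auto& [seq, ts] : sent)
            if (now - ts > TIMEOUT) { retransmit(seq); ts = now; }
    }
};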
In the above general in-network synchronous aggregation method for distributed applications, the aggregation task request of the application is first obtained, which means the system can dynamically adjust and respond to application demands to achieve efficient data aggregation. The aggregator resources and the execution order of the target task are determined using a preset scheduling policy, which means data aggregation can be optimized under different policies to meet the requirements of different application scenarios; the controller allocates an isolation region for each aggregator resource and sets the corresponding offset rule, so that resources are used effectively and data packets are controlled precisely. The switch processes the data packet sequences sent by the senders according to the execution order and the aggregation rule, ensuring correct routing and aggregation of packets and reducing resource overhead. The aggregator merges data packets with the same sequence number into a result data packet, which is then sent to the receiver of the target task, further processing and merging data efficiently and reducing communication overhead. The receiver replies an ACK message sequence according to the result data packet, which is multicast back to the senders through the switch, providing communication reliability and a feedback mechanism. The technical problems of communication performance and resource overhead are thus solved through dynamic task scheduling, packet processing and merging, and effective resource management. This provides a flexible method that meets the requirements of different applications, reduces development complexity, and supports a wide range of concurrent applications.
In one embodiment, in the general network architecture in which distributed applications share a cluster, a communication path from each sender to the receiver of the target task is generated according to a routing protocol, and the servers, controller and switches corresponding to the application form an aggregation hierarchy along these communication paths. The applications send aggregation task requests to their local agents, and the agents forward the requests to the controller in parallel.
It is worth noting that instead of using scalar values as sequence elements, each data stream is described as a "multiset sequence", an aggregation semantics that enriches general in-network synchronous aggregation. If the user requires the "average" of multiple vectors, each vector element value is converted into a multiset (value, 1); the switch aggregates the multisets by adding the two dimensions separately, and the receiver computes the average by dividing the first value by the second.
The symbol D_i is used to represent the data flow from transmitting node i. D_i is further expressed as:

D_i = <V_{i,1}, V_{i,2}, V_{i,3}, ..., V_{i,k}>

where k is the sequence length and V_{i,j}, 1 ≤ j ≤ k, is a multiset. The aggregation result D* of the sequences from n transmitting nodes is also a multiset sequence, whose j-th element can be expressed as:

D*_j = V_{1,j} ⊎ V_{2,j} ⊎ ... ⊎ V_{n,j}, 1 ≤ j ≤ k

where the symbol ⊎ denotes the standard multiset sum.
Further, the interface of the application program is shown in FIG. 3. The application initializes a task by calling init(), which informs the controller to allocate switch resources and assigns the task a task ID. At run time, the application calls request() to submit data to the GISA agent, where the data contains the multiset and its format. request() also specifies the operation on the multiset, the roles of the endpoints (sender/receiver), and the aggregation mode (Reduce or AllReduce). If the mode is Reduce, the call returns success/failure at the sender and returns the result at the receiver; if the mode is AllReduce, the call returns the result at the sender and returns success/failure at the receiver.
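The FIG. 3 interface code is not reproduced here; the following C++ sketch illustrates what such an interface could look like. The signatures, the enums and the name GisaTask are hypothetical, inferred only from the description above.

#include <cstdint>
#include <vector>

// Assumed endpoint roles, aggregation modes and operations.
enum class Role { Sender, Receiver };
enum class Mode { Reduce, AllReduce };
enum class Op   { SUM, MIN, MAX, XOR, COUNT };

struct GisaTask {
    uint32_t task_id;

    // init(): informs the controller to allocate switch resources
    // and assigns this task its task ID.
    static GisaTask init(/* endpoint configuration */);

    // request(): submits a multiset (with its format) to the local GISA
    // agent, specifying the operation, the endpoint role and the mode.
    // Reduce:    sender gets success/failure, receiver gets the result.
    // AllReduce: sender gets the result, receiver gets success/failure.
    std::vector<int64_t> request(const std::vector<int64_t>& multiset,
                                 Op op, Role role, Mode mode);
};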
In one embodiment, the aggregator resources of the target task and the execution order of the target task are determined according to the aggregation task requests and the scheduling policy preset by the controller, and the controller designates task-executing switches for the target task according to the execution order. On each task-executing switch, the controller allocates an isolation region for each aggregator resource, sets the offset rule of the isolation region, and writes the isolation region and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule of the target task.
In one embodiment, the sender of the target task divides the target task data block into a sequence of data packets and sends the sequence of data packets to the switch in a maintained window.
In one embodiment, the switch receives the data packet sequence sent by the sender of the target task according to the execution order, and addresses and locates the aggregator in the aggregation table according to the sequence number of the data packet sequence and the offset rule:
Aggregator.index←packet.seq_num+Offset
wherein Aggregator.index is the index within the isolation region, packet.seq_num is the sequence number of the data packet, and Offset is the offset rule. The aggregator in the isolation region corresponding to the sequence number of the data packet sequence is thereby obtained.
In one embodiment, the aggregator merges the data packet sequences with the same message sequence number to obtain a result data packet and sends the result data packet to the receiver of the target task; the receiver replies an ACK message sequence according to the result data packet; when the ACK message sequence arrives at the switch, the switch clears the aggregator corresponding to each target task according to the switch unit formed by the number of target tasks and the aggregation hierarchy, and returns the ACK message sequence to the sender corresponding to the target task.
In one embodiment, the aggregation task request includes: the target task ID, the sender IDs of the target task, the receiver ID of the target task, the aggregation function, and the aggregation type. The aggregation types include: Reduce and Allreduce.
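A possible in-memory layout of such a request is sketched below. The field names and types are assumptions based on the listed contents, not the patent's actual wire format.

#include <cstdint>
#include <vector>

enum class AggType : uint8_t { Reduce, AllReduce };

struct AggTaskRequest {
    uint32_t task_id;                  // target task ID
    std::vector<uint32_t> sender_ids;  // sender IDs of the target task
    uint32_t receiver_id;              // receiver ID of the target task
    uint8_t  agg_function;             // e.g. SUM, MIN, MAX, XOR, COUNT
    AggType  agg_type;                 // Reduce or Allreduce
};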
It is worth noting that the task-executing switch performs the aggregation operation according to the received aggregation function and aggregation type: if the aggregation function is a self-decomposable aggregation function, the switch performs the in-network aggregation directly; if the aggregation function is a decomposable aggregation function, the switch performs in-network aggregation on the self-decomposable component obtained by decomposing it. The specific cases are as follows:
1) The aggregation function f is a self-decomposable aggregation function if f satisfies f(X ⊎ Y) = f(X) ∘ f(Y) for some merging operator ∘ and all non-empty multisets X and Y, where the symbol ⊎ denotes the standard multiset sum.
Since the operations ⊎ and ∘ both satisfy the commutative and associative laws, the computation over multiple multisets f(X_1 ⊎ X_2 ⊎ ... ⊎ X_n) can be decomposed recursively in any order, all yielding the same final result.
Taking distributed training as an example, the multisets degenerate to scalar values in this case; gradient aggregation aggregates the gradient data from different working nodes through the SUM function, which is a self-decomposable aggregation function, namely SUM(X ⊎ Y) = SUM(X) + SUM(Y).
Self-decomposable functions also include MIN, MAX, XOR, COUNT, etc., which can be used in various systems. A self-decomposable function can be executed on the switch alone, without the assistance of end hosts.
2) If the aggregation function f can be written as the composition of a function g and a self-decomposable aggregation function h, i.e., f(X) = g(h(X)), then f is a decomposable aggregation function.
The GISA performs h on the switch and g on the receiver. The AVERAGE of vectors can be formalized as follows:

h(X) = Σ_{x ∈ X} (x, 1), merged with operator +
g(a, b) = a / b

where operator + is the standard pointwise sum over the two dimensions, i.e., (x_1, y_1) + (x_2, y_2) = (x_1 + x_2, y_1 + y_2).
Another example of a decomposable (but not self-decomposable) aggregation function is RANGE, which gives the difference between the maximum and the minimum of a statistical set; it can be decomposed into a form similar to the above and used as an aggregation function of the present method.
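For illustration, a decomposition of RANGE consistent with the description above can be sketched as follows, where h keeps a (max, min) pair on the switch and g computes the difference on the receiver. This is a hedged sketch; the names h, g and merge mirror the text, not an actual implementation.

#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// h: the self-decomposable part, executed on switches. It keeps
// (max, min) and merges two partial states pointwise.
using State = std::pair<long long, long long>;  // (max, min)

State h(const std::vector<long long>& x) {
    return {*std::max_element(x.begin(), x.end()),
            *std::min_element(x.begin(), x.end())};
}

State merge(State a, State b) {                 // the merging operator
    return {std::max(a.first, b.first), std::min(a.second, b.second)};
}

// g: the post-processing part, executed on the receiver.
long long g(State s) { return s.first - s.second; }

int main() {
    std::vector<long long> X{3, 9, 1}, Y{7, 2};
    // RANGE(X ⊎ Y) computed by decomposition: g(merge(h(X), h(Y))) == 8.
    assert(g(merge(h(X), h(Y))) == 9 - 1);
    return 0;
}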
It follows that aggregation operations can be performed on the switches along the path according to the aggregation function: when data from multiple links reaches the same switch, the switch aggregates them and sends the aggregated result to the next node. In this way, the incast transmission problem and output-port congestion are effectively avoided.
In one embodiment, if the aggregation type is Reduce, the receiver reassembles the ACK message sequence into a feedback message, and the feedback message is transmitted by the controller to the sender's local agent as the aggregation result. If the aggregation type is Allreduce, the sender reassembles the payloads of the ACK message sequence into a feedback message, and the feedback message is transmitted by the controller to the sender's local agent as the aggregation result. The local agent returns the aggregation result through IPC to the application that started the target task.
It should be noted that the scheduling policy of the method adopts a simple first-come-first-served (FCFS) approach to process concurrent tasks. When the controller decides whether to execute a task, it checks all switches in the task's aggregation hierarchy. If any switch does not have N consecutive free aggregators, the task is suspended and waits for available resources. If all switches have N consecutive free aggregators, the controller decides to execute the task and allocates these aggregators (the region start is denoted Offset) to it. The controller installs the switch rules on the switches, directing the traffic of the designated task to the region.
Further, in operation, a data packet is matched to a region according to its task ID and mapped to an aggregator in the region according to (Region_Offset + seq_num); if it is a data packet, aggregation is performed; if it is an ACK message, the aggregator is cleared. After the task completes, the task's agent notifies the controller. The controller reclaims the assigned aggregators by deleting the corresponding rules from the involved switches. These freed aggregators can then be reassigned to other tasks by issuing new rules.
This process does not require recompiling the data plane, because the switch's in-memory aggregator table is not altered. The switch needs to be recompiled only when an administrator intends to extend or shrink the aggregator resources on the switch, e.g., to increase the number of aggregators to support more tasks or higher aggregate throughput, or to add a new aggregation function.
In addition, the aggregator region size N also constrains two configurations on the host. The sequence number range should be [0, N-1]; if a message's packet sequence is longer than N, the sequence numbers cycle through this range. The sender's sliding window should be limited to at most N. These two configurations ensure that no two different data packets of one sequence map to the same aggregator and are erroneously aggregated.
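The controller-side FCFS check for N consecutive free aggregators can be sketched as follows. This is an illustrative sketch; the structures Switch and Task, and the omitted rule-installation step, are simplified assumptions.

#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

struct Switch {
    std::vector<bool> busy;  // one flag per aggregator slot

    // Find N consecutive free aggregators; return the region offset.
    std::optional<uint32_t> find_region(uint32_t n) const {
        uint32_t run = 0;
        for (uint32_t i = 0; i < busy.size(); ++i) {
            run = busy[i] ? 0 : run + 1;
            if (run == n) return i + 1 - n;
        }
        return std::nullopt;
    }
};

struct Task { uint32_t id, n; std::vector<Switch*> hierarchy; };

// FCFS: execute the head task only if every switch on its aggregation
// hierarchy has N consecutive free aggregators; otherwise it waits.
bool try_schedule(std::deque<Task>& queue) {
    if (queue.empty()) return false;
    Task& t = queue.front();
    std::vector<uint32_t> offsets;
    for (Switch* sw : t.hierarchy) {
        auto off = sw->find_region(t.n);
        if (!off) return false;              // suspend: wait for resources
        offsets.push_back(*off);
    }
    for (size_t i = 0; i < t.hierarchy.size(); ++i)       // allocate the
        for (uint32_t j = 0; j < t.n; ++j)                // regions (rule
            t.hierarchy[i]->busy[offsets[i] + j] = true;  // install omitted)
    queue.pop_front();
    return true;
}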
It follows that the present method is more versatile than existing INA solutions in three ways. First, it is decoupled from the applications and supports multiplexing of multiple applications. Second, its interface (FIG. 3) is generic, supporting more data formats and operations. Third, its deployment makes no assumptions about the network topology, and the aggregation function can be performed on any switch in the topology.
In one embodiment, an example of multi-level synchronous aggregation is shown in FIG. 4, where servers H1, H2, H4 and H5 are the source nodes and server H6 is the destination node. Solid arrows indicate the direction of the data packets, and dashed arrows indicate various packet loss scenarios. If a packet from H2 is lost (scenario 1 in the figure), H2 retransmits the packet to supplement the lost data. Retransmission is repeated up to three times. A loss such as the one at H2 leaves the windows unsynchronized: H1 receives no ACK and keeps retransmitting its packet, so the sender then sends the packet with a special flag to bypass switch aggregation and obtain the result directly from the receiver. In addition, if the aggregated packet from S1 to S5 is lost, none of H1, H2 and H4 receives an ACK and they all retransmit their packets, which triggers S1 to send the previous aggregation result to S5 again.
Because partial loss of ACK packets is a complex condition, the method further improves the handling of window desynchronization: if a sender still receives no ACK after three retries, it sends a special data packet (carrying a special FRD flag). The special packet is unicast: it bypasses the aggregation logic of all switches, reaches the receiver, and triggers the receiver to reply with a unicast ACK (carrying the result in the AllReduce case). The unicast ACK synchronizes the lossy sender's window with the other windows, solving the problem of the lossy sender getting stuck.
Further, since packet sequence numbers cycle over a range, an old packet (from a lossy sender) and a new packet (from a lossless sender) may be mapped to the same aggregator and erroneously aggregated. Limiting the window size to less than half of the aggregation region would keep old and new packets from overlapping, but halving the window size would halve the throughput and waste resources severely.
Therefore, a 1-bit indicator is added to the switch aggregator to distinguish new packets from old ones. The packet sequence is divided into batches, each containing the same number of packets as the aggregator region size N, with alternating odd and even batches. Each packet carries its batch parity (denoted VER, 0 for even and 1 for odd) in its header. Each aggregator also keeps a VER field to check whether the aggregator's value belongs to the same batch as the packet. With the VER field, the switch aggregation logic becomes: when a packet arrives, if the aggregator is empty (identified by the PortBitmap), the aggregator accepts the packet; if the aggregator's VER matches the packet's VER field, the aggregator processes the packet; otherwise the VER fields do not match and the packet is discarded. When an ACK packet arrives, the aggregator is cleared if the ACK's VER matches the aggregator's; otherwise the aggregator is not cleared. This solves the problem of the aggregator aggregating wrong data packets.
Further, the method adds an RST flag to the data packet; a retransmitted packet sets this bit. If an RST packet arrives at an empty aggregator, it is discarded directly; otherwise, the processing logic for normal packets is followed. This solves the switch memory leak caused by retransmitted data packets.
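The VER and RST handling described above can be sketched as follows. This is a minimal illustration; the aggregator structure and field widths are assumptions.

#include <bitset>
#include <cstdint>

constexpr int NUM_PORTS = 4;  // assumed port count

struct VerAggregator {
    std::bitset<NUM_PORTS> port_bitmap;
    uint8_t ver = 0;          // batch parity of the stored value
    // ... accumulated payload omitted ...
};

enum class Action { Accept, Drop };

// Batch-parity (VER) and RST check for an arriving data packet.
Action on_arrival(VerAggregator& agg, uint8_t pkt_ver, bool rst) {
    bool empty = agg.port_bitmap.none();
    if (rst && empty) return Action::Drop;      // stale retransmission: discard
    if (empty) { agg.ver = pkt_ver; return Action::Accept; }  // adopt the batch
    if (agg.ver == pkt_ver) return Action::Accept;  // same batch: process it
    return Action::Drop;                        // old/new batch mismatch: discard
}

// An ACK clears the aggregator only when its VER matches.
void on_ack(VerAggregator& agg, uint8_t ack_ver) {
    if (agg.ver == ack_ver) agg.port_bitmap.reset();  // clear for reuse
}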
It is worth noting that source routing is adopted for the data packets, so no new rules are needed and the existing routing is not disturbed. During initialization, the controller computes each path from a sender to the receiver in the aggregation hierarchy and converts each path into the list of switch output ports along that path. The controller informs each sender's agent of its path (in the format of a list of switch output ports).
In operation, each packet encodes the output-port list of its path in its header. At each switch hop, the switch forwards the data packet according to the head of the list and pops the head. Note that when multiple packets are aggregated into one, their output-port lists are not merged erroneously, because their hop lists after the current switch are identical.
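The per-hop source-routing behavior can be sketched as follows. This is illustrative: the header layout is an assumption, and on real hardware the port list would be a fixed-size header field rather than a std::vector.

#include <cstdint>
#include <vector>

struct SourceRoutedHeader {
    uint8_t hop;                     // number of remaining hops
    std::vector<uint8_t> out_ports;  // list head = output port at this switch
};

// At each hop the switch forwards on the list head, then pops it.
// When packets are merged into one, their remaining port lists are
// identical, so no list merging is needed.
uint8_t forward(SourceRoutedHeader& hdr) {
    uint8_t port = hdr.out_ports.front();
    hdr.out_ports.erase(hdr.out_ports.begin());
    --hdr.hop;
    return port;  // send the packet out of this port
}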
Route learning for ACK messages: the PortBitmap in the aggregator is reused for ACK routing. The PortBitmap is configured with the same number of bits as the number of switch ports, and each bit in the PortBitmap is further associated with one switch port. Thus, a 1 bit in the bitmap indicates not only that the packet from that child node arrived, but also which physical switch port the packet came from.
At run time, when a packet sets its bit in the PortBitmap, its ingress switch port is learned at the same time. When the corresponding ACK packet arrives, the aggregator's PortBitmap is read; the switch finds all the output ports and copies the ACK packet to them. The switch then resets all bits of the aggregator's PortBitmap to 0 and waits for the next batch of packets, or for other tasks to reuse it.
It follows that "bitmap full" does not mean that every bit is 1, but that the number of 1s in the bitmap equals the CN value in the header.
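The ACK multicast via the PortBitmap, together with the CN-based "bitmap full" test, can be sketched as follows. This is a hedged sketch; the bitmap width is an assumption.

#include <bitset>
#include <cstdint>
#include <vector>

constexpr int NUM_PORTS = 4;  // bitmap width == switch port count (assumed)

// The data path learned which ports the children's packets came in on;
// the ACK is copied to exactly those ports, then the bitmap is reset.
std::vector<int> multicast_ack(std::bitset<NUM_PORTS>& port_bitmap) {
    std::vector<int> out_ports;
    for (int p = 0; p < NUM_PORTS; ++p)
        if (port_bitmap.test(p)) out_ports.push_back(p);  // replicate ACK here
    port_bitmap.reset();  // aggregator ready for the next batch or task
    return out_ports;
}

// "Bitmap full" test: the count of 1s equals the CN value in the header.
bool bitmap_full(const std::bitset<NUM_PORTS>& port_bitmap, uint8_t cn) {
    return port_bitmap.count() == cn;
}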
In one embodiment, as shown in FIG. 5 and FIG. 6, the design uses FPGA devices as the network switches, which implement the packet forwarding logic and the INA logic of the GISA. In the implementation, the agent is built on DPDK with 1200 lines of C++ code, the controller with 800 lines of C++ code, and the switch with 1300 lines of Verilog code. FIG. 5 shows the format of the data packet. The GISA minimizes packet header overhead by replacing the TCP/IP header and compressing the GISA header. The GISA header contains the TaskID, SN (sequence number), ACK, VER, ECN, RST, FRD and other fields already described in the previous sections. ERR records errors that occur during an aggregation operation, such as additive overflow. FIN marks the last packet in the transmission sequence. PLD specifies whether the ACK packet should carry the result (for AllReduce). HOP and the tuples of OP and CN are used for source routing, where HOP is the number of remaining hops, OP is the output switch port, and CN is the fan-in of the current switch, i.e., the number of child nodes. The packet payload may encode multisets in an application-specific format, and the GISA installs a switch rule to specify its parsing method and the aggregation operation performed on it. FIG. 6 shows the data structure of the aggregator, which contains the PortBitmap, VER, ECN, ERR and Payload fields specified in the design.
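A possible C++ rendering of the header fields named above is sketched below. The field widths, ordering and packing are assumptions; the patent does not specify them here.

#include <cstdint>

#pragma pack(push, 1)
struct GisaHeader {
    uint16_t task_id;   // TaskID: matches the packet to its task's region
    uint32_t sn;        // SN: sequence number
    uint8_t  ack : 1;   // ACK: this packet acknowledges a result
    uint8_t  ver : 1;   // VER: batch parity (0 even, 1 odd)
    uint8_t  ecn : 1;   // ECN: congestion notification
    uint8_t  rst : 1;   // RST: set on retransmitted packets
    uint8_t  frd : 1;   // FRD: special packet bypassing switch aggregation
    uint8_t  err : 1;   // ERR: aggregation error, e.g. additive overflow
    uint8_t  fin : 1;   // FIN: last packet of the transmission sequence
    uint8_t  pld : 1;   // PLD: whether the ACK should carry the result
    uint8_t  hop;       // HOP: number of remaining hops
    // Followed by HOP (OP, CN) tuples for source routing, then the payload:
    // OP = output switch port, CN = fan-in (child count) of that switch.
};
#pragma pack(pop)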
In one embodiment, the GISA is implemented on an FPGA-based testbed to evaluate its performance and advantages in various distributed applications. The testbed includes 5 Intel Arria 10 FPGA devices and 9 workstations. The 5 FPGA devices are connected into a two-layer network: each device has four 10GbE ports; one device acts as the spine switch and four act as leaf switches, each leaf switch connected to the spine switch by a physical link. All FPGA devices are attached to a workstation with an Intel Xeon Platinum 8124M CPU, 128GB RAM and a 500GB SSD, which serves as the GISA controller. Each of the remaining 8 workstations is then connected to a leaf switch via a physical link, two workstations per leaf switch. These workstations run the GISA agent and are equipped with an Intel Core i9-13900K CPU, 64GB RAM, a 500GB SSD, an NVIDIA GeForce RTX 2080Ti GPU and an Intel 82599 10GbE network card, all running Ubuntu 20.04.6 with kernel 5.15.0-76.
1) Throughput and delay: as shown in FIG. 7, 512 aggregators are precompiled for the GISA on each FPGA device, with a maximum packet payload of 1024 bytes per aggregator. FIG. 7 also shows that increasing the number of aggregators yields higher GISA throughput. This is because more aggregators allow the source nodes to inject more packets into the network, reducing network idle time. However, as throughput approaches the hardware's performance limit, allocating more aggregators brings only marginal utility. For example, for a 1024-byte payload, throughput stops growing linearly and peaks at 10.02Gbps once more than 192 aggregators are allocated. The reason is that the bottleneck limiting throughput shifts from the number of packets in flight to the hardware's processing capacity. Injecting more packets into the network therefore does not further improve throughput, but rather causes more packet loss.
2) Reliability: the impact of packet loss on throughput is shown in FIG. 8, with 7 source nodes and 1 destination node. Packets are randomly dropped at the input port of each node with the specified probability. Since packet loss forces the GISA to reduce its send window size, an increasing loss rate slightly decreases its throughput: as the loss rate rises from 0% to 1%, the GISA's throughput gradually drops from 9.8Gbps to about 7.0Gbps, while unicast always remains below 1.8Gbps. Furthermore, the GISA prevents aggregation task interruption by triggering its timeout retransmission logic. It therefore maintains a significant advantage over the unicast method even in unreliable network environments.
3) Multi-source and multi-tasking: one of the main advantages of the GISA is that it can transmit data at line rate regardless of the number of source nodes. FIG. 9(a) shows the throughput with different numbers of source nodes. Specifically, the GISA transmits data using INA, ensuring that the same amount of data is transmitted on every link of the transmission path. Thus, the throughput of the GISA is not affected by converging traffic and maintains the maximum link rate. In contrast, the throughput of unicast drops rapidly as the number of source nodes increases. This demonstrates that the GISA provides substantial benefits over traditional transmission methods, particularly when handling large-scale communication groups.
FIG. 9(b) shows the performance of the GISA in handling multiple concurrent tasks, where the number of tasks varies from 1 to 4 and the aggregators are evenly distributed among the tasks. Each transfer involves 100MB of data from 7 source nodes to one destination node. Compared with the unicast communication mode, the GISA greatly shortens the task completion time. Notably, the GISA also supports aggregator allocation based on intelligent scheduling policies, which is critical for applications with various QoS requirements (e.g., deadlines). Our future work includes exploring scheduling policies for the GISA.
4) Cost and overhead: furthermore, a fat-tree topology with 1024 hosts is simulated to evaluate the routing overhead of the GISA in a large-scale network. FIG. 10(a) shows the switch state overhead required for an aggregation task. The GISA-native method relies on forwarding rules to determine packet routing, so the switch state grows significantly as the number of source nodes increases: to construct the transmission paths, it installs many routing rules to specify the next hop for each packet. In contrast, the GISA-optimized method uses the PortBitmap field in the aggregator to direct the output ports of ACK packets and uses source routing for packet forwarding, which significantly reduces switch state consumption. Specifically, with 640 source nodes, the number of entries on the switch with GISA optimization is 6.4 times smaller than without it.
Another significant advantage of the GISA is its ability to reduce traffic. The traffic overhead of the entire network is assessed by varying the number of source nodes, each transmitting 100MB of data. The results are shown in FIG. 10(b). Compared with unicast, the GISA uses the switches to aggregate the related traffic on the transmission path: traffic is reduced by up to 69.3% with 5 source nodes, and by a factor of up to 3.78 with 640 source nodes. This highlights the GISA's potential to significantly mitigate network traffic, which benefits applications involving a large number of communication nodes.
5) Distributed training: three typical training models are chosen to evaluate the performance of the GISA in DT applications and verify its versatility. As shown in FIG. 11, in the VGG16 and AlexNet models (FIGS. 11(a) and 11(b)), the unicast communication mode shows significant performance degradation as the number of working nodes increases. By aggregating gradient data at the leaf switches, ATP achieves higher throughput than unicast, but it does not use the spine switch to further aggregate gradient data; the resulting blocking at the spine switch reduces ATP's training throughput. In contrast, the GISA can efficiently aggregate data at the switches of every layer on the transmission path, so its training throughput is hardly affected as the number of working nodes grows. Specifically, in the AlexNet model (FIG. 11(b)) with 7 working nodes, the throughput of the GISA is 27.3% higher than ATP and 83.6% higher than unicast.
In ResNet50 (FIG. 11(c)), the performance gap between these methods is less pronounced because the model is computation-intensive; it is therefore difficult to obtain a significant training performance improvement through network communication optimizations like the GISA.
6) Distributed storage: to repair a failed block in an erasure-coded storage system, the traditional approach retrieves multiple related blocks over the network and recovers the failed block at a repair node. However, this approach tends to cause incast congestion on the repair node's ingress link, resulting in high degraded-read latency, and the repair time increases with the coding parameter k of RS(k, m). To address this problem, the state-of-the-art Repair Pipelining (RP) method divides blocks into sub-blocks and converts the repair operation into concurrent sub-operations on the sub-blocks, effectively avoiding the congestion caused by incast transmission. As shown in FIG. 12(a), the GISA achieves performance comparable to, or even higher than, RP, mainly because the switches can process packets at line rate and thus achieve higher throughput than aggregation performed on hosts.
While RP can alleviate congestion and reduce repair time, one significant disadvantage is that it still incurs high traffic. We further evaluate this overhead in a simulated fat-tree network with 1024 hosts, using coding parameters with more blocks, such as RS(9, 3) and RS(12, 4). As shown in FIG. 12(b), as the coding parameter k increases, the number of blocks required to repair one failed block increases, greatly increasing the traffic. The GISA effectively alleviates this problem by aggregating the coded blocks in the network: when k = 12, RP requires 8.01GB of network traffic to repair a failed block, while the GISA requires only 4.39GB, a traffic reduction of up to 45.19%.
7) Network monitoring: sketch-based network monitoring has been a prominent research field in recent years, and this structure is also suitable for synchronous aggregation. Using the MAX operation, the CMSs (count-min sketches) obtained from the various monitoring nodes can be aggregated before the aggregated data is transmitted to the collector. However, when the cluster size is quite large, the surge in traffic lengthens the transmission time required to collect these results, as shown for the unicast approach in FIG. 13(a). The high traffic can also severely impact other network operations and services.
The GISA can effectively reduce the traffic while reducing the collection delay of the CMS. FIG. 13(a) shows the completion time of collecting CMSs from different numbers of monitoring nodes. Notably, as the number of monitoring nodes increases, the transmission completion time of the unicast communication mode increases significantly. In contrast, the GISA is hardly affected, and its completion time remains relatively stable. This suggests that the GISA can serve monitoring tasks well in large-scale distributed clusters.
FIG. 13(b) further shows the effect of the CMS size on the transmission completion time with 7 monitoring nodes. When the CMS size increases from 2MB to 16MB, the completion time of the GISA increases only slightly, while the completion-time gap between the GISA and unicast grows by approximately 9.07 times. These two figures demonstrate the GISA's advantage in handling monitoring tasks with more nodes and larger CMSs.
In summary, even in large networks, the throughput of the GISA maintains its maximum line rate (about 10 Gbps), and the traffic can be significantly reduced by about 3.78 times compared to Unicast. In addition, the GISA is suitable for various application scenarios, and compared with the most advanced method, performance acceleration can be realized under acceptable system overhead.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily executed sequentially but may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 14, there is provided a universal in-network synchronous aggregation system for distributed applications, including: an aggregation task request acquisition module 1402, an aggregation rule acquisition module 1404, an aggregator matching module 1406, and an aggregation module 1408, wherein:
The aggregation task request acquisition module 1402 is configured to acquire the aggregation task request of an application.
The aggregation rule acquisition module 1404 is configured to determine the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, to allocate, through the controller, an isolation area for each aggregator resource and set the offset rule corresponding to that isolation area, and to write each isolation area and its corresponding offset rule into the aggregation table of the controller, thereby obtaining the aggregation rule.
The aggregator matching module 1406 is configured to receive, at the switch, the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule, and to locate the data packet sequence in the aggregation table to obtain the aggregator matching the data packet sequence.
The aggregation module 1408 is configured to merge the data packet sequences through the aggregator to obtain a result data packet and send the result data packet to the receiver of the target task; the receiver replies with an ACK message sequence according to the result data packet, which is synchronously returned to the senders by switch multicast.
For specific limitations of the universal in-network synchronous aggregation system for distributed applications, reference may be made to the limitations of the universal in-network synchronous aggregation method for distributed applications described above, which are not repeated here. Each module in the above system may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in fig. 15. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the universal in-network synchronous aggregation method for distributed applications. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
It will be appreciated by those skilled in the art that the structures shown in fig. 14-15 are block diagrams of only some of the structures related to the solution of the present application and do not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer device is provided comprising a memory storing a computer program and a processor that when executing the computer program performs the steps of:
an aggregation task request of an application is obtained.
Determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, allocating an isolation area for each aggregator resource through the controller, setting the offset rule corresponding to the isolation area, and writing each isolation area and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule.
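As an illustration of this allocation step, the following Python sketch shows a controller carving non-overlapping isolation areas out of a switch's aggregator array and recording each area together with its offset rule in the aggregation table. The data structures, the capacity, and the task identifiers are our assumptions, not details fixed by the patent.

    # A simplified controller-side allocator (assumed data structures):
    # each task gets a contiguous, non-overlapping isolation area whose
    # start index serves as the offset in index = packet.seq_num + offset.

    class Controller:
        def __init__(self, total_aggregators):
            self.next_free = 0
            self.total = total_aggregators
            self.aggregation_table = {}  # task_id -> (isolation area, offset)

        def allocate(self, task_id, area_size):
            if self.next_free + area_size > self.total:
                raise RuntimeError("no aggregator resources left on this switch")
            area = range(self.next_free, self.next_free + area_size)
            offset = self.next_free
            self.aggregation_table[task_id] = (area, offset)
            self.next_free += area_size
            return offset

    ctrl = Controller(total_aggregators=1024)
    print(ctrl.allocate("taskA", 256))  # 0
    print(ctrl.allocate("taskB", 256))  # 256: areas never overlap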
The switch receives the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule, and locates the data packet sequence in the aggregation table to obtain the aggregator matching the data packet sequence.
The data packet sequences are merged through the aggregator to obtain a result data packet, which is sent to the receiver of the target task; the receiver replies with an ACK message sequence according to the result data packet, and the ACK message sequence is synchronously returned to the senders by switch multicast.
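The switch-side behavior of the last two steps can likewise be sketched. The Python model below is only a host-level illustration with assumed data structures; a real programmable switch performs this in the data plane, and in the patent's design the aggregator is cleared when the receiver's ACK returns, whereas this sketch clears it eagerly for simplicity.

    # A toy switch-side lookup-and-merge (assumed structures): locate the
    # aggregator via index = packet.seq_num + offset, accumulate payloads,
    # and release the result once every sender has contributed.

    def on_packet(aggregators, table, task_id, seq_num, payload, num_senders):
        offset = table[task_id]           # the task's offset rule
        index = seq_num + offset          # Aggregator.index
        slot = aggregators.setdefault(index, {"sum": [0] * len(payload), "seen": 0})
        slot["sum"] = [a + b for a, b in zip(slot["sum"], payload)]
        slot["seen"] += 1
        if slot["seen"] == num_senders:   # all contributions merged
            return aggregators.pop(index)["sum"]  # result packet to forward
        return None                       # still waiting for other senders

    aggs, table = {}, {"taskA": 0}
    print(on_packet(aggs, table, "taskA", 5, [1, 2], num_senders=2))  # None
    print(on_packet(aggs, table, "taskA", 5, [3, 4], num_senders=2))  # [4, 6]

The result packet would then travel to the receiver, whose ACK message sequence is multicast back through the switches to all senders, as the steps above describe.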
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-volatile computer-readable storage medium which, when executed, may include the procedures of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The foregoing examples represent only a few embodiments of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A universal in-network synchronous aggregation method for distributed application programs, characterized in that the method is applied to a universal network architecture of a cluster shared by distributed application programs; the method comprises the following steps:
acquiring an aggregation task request of an application program;
determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, allocating an isolation area for each aggregator resource through a controller, setting an offset rule corresponding to the isolation area, and writing the isolation area and its corresponding offset rule into an aggregation table of the controller to obtain an aggregation rule;
The switch receives the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule, and locates the data packet sequence in the aggregation table to obtain an aggregator matching the data packet sequence;
and merging the data packet sequences through the aggregator to obtain a result data packet, sending the result data packet to a receiver of the target task, the receiver replying with an ACK message sequence according to the result data packet, the ACK message sequence being synchronously returned to the sender by switch multicast.
2. The method of claim 1, further comprising, prior to the step of acquiring the aggregation task request of the application program:
in the universal network architecture of the cluster shared by distributed application programs, a communication path from the sender to the receiver of the target task is generated according to a routing protocol, and the servers corresponding to the application programs, the controller, and the switches form an aggregation hierarchy according to the communication path;
and a plurality of application programs send their aggregation task requests to local agents, and the plurality of local agents send the aggregation task requests to the controller in parallel.
3. The method of claim 2, wherein determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and the preset scheduling policy, allocating an isolation area for each aggregator resource through the controller, setting the offset rule corresponding to the isolation area, and writing the isolation area and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule comprises:
determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and the scheduling policy preset in the controller, the controller designating a task-executing switch for the target task according to the execution order; and, on the task-executing switch, the controller allocating an isolation area for each aggregator resource, setting the offset rule of the isolation area, and writing the isolation area and its corresponding offset rule into the aggregation table of the controller to obtain the aggregation rule of the target task.
4. The method of claim 3, further comprising, before the step of the switch receiving the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule and locating the data packet sequence in the aggregation table to obtain the aggregator matching the data packet sequence:
And the sender of the target task divides the target task data block into a data packet sequence, and sends the data packet sequence to the switch within a maintained window.
5. The method of claim 4, wherein the step of the switch receiving the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule and locating the data packet sequence in the aggregation table to obtain the aggregator matching the data packet sequence comprises:
the switch receiving, according to the execution order, the data packet sequence sent by the sender of the target task, and addressing the aggregation table according to the sequence number of the data packet sequence and the offset rule:
Aggregator.index ← packet.seq_num + Offset
wherein Aggregator.index is the index of the isolation area, packet.seq_num is the sequence number of the data packet sequence, and Offset is the offset rule;
and obtaining the aggregator in the isolation area corresponding to the sequence number of the data packet sequence.
6. The method of claim 5, wherein merging the data packet sequences through the aggregator to obtain the result data packet, sending the result data packet to the receiver of the target task, the receiver replying with an ACK message sequence according to the result data packet, and the ACK message sequence being synchronously returned to the sender by switch multicast comprises:
And merging, through the aggregator, the data packet sequences having the same message sequence number to obtain the result data packet, and sending the result data packet to the receiver of the target task, the receiver returning an ACK message sequence according to the result data packet; when the ACK message sequence reaches the switch, clearing the aggregator corresponding to each target task according to the switch group formed by the number of target tasks and the aggregation hierarchy, and returning the ACK message sequence to the sender corresponding to the target task.
7. The method according to any one of claims 1 to 6, wherein the aggregation task request comprises: a target task ID, a sender ID of the target task, a receiver ID of the target task, an aggregation function, and an aggregation type;
the polymerization type includes: reduce and Allreduce.
8. The method of claim 7, further comprising, after the step of the receiver replying with an ACK message sequence according to the result data packet and the ACK message sequence being synchronously returned to the sender by switch multicast:
if the aggregation type is Reduce, the receiver reorganizing the ACK message sequence into a feedback message, the feedback message being transmitted by the controller, as the aggregation result, to the local agent of the sender; if the aggregation type is Allreduce, the sender reorganizing the payload of the ACK message sequence into a feedback message, the feedback message being transmitted by the controller, as the aggregation result, to the local agent of the sender;
And the local agent returning the aggregation result, through IPC, to the application program that initiated the target task.
9. A universal in-network synchronous aggregation system for distributed application programs, the system comprising:
the aggregation task request acquisition module is used for acquiring an aggregation task request of an application program;
the aggregation rule acquisition module is used for determining the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, allocating an isolation area for each aggregator resource through a controller, setting an offset rule corresponding to the isolation area, and writing the isolation area and its corresponding offset rule into an aggregation table of the controller to obtain an aggregation rule;
the aggregator matching module is used for the switch to receive the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule, and to locate the data packet sequence in the aggregation table to obtain an aggregator matching the data packet sequence;
and the aggregation module is used for merging the data packet sequences through the aggregator to obtain a result data packet and sending the result data packet to a receiver of the target task, the receiver replying with an ACK message sequence according to the result data packet, the ACK message sequence being synchronously returned to the sender by switch multicast.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
CN202311326990.3A 2023-10-13 2023-10-13 Method, system and equipment for synchronous aggregation in universal network for distributed application program Pending CN117354370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311326990.3A CN117354370A (en) 2023-10-13 2023-10-13 Method, system and equipment for synchronous aggregation in universal network for distributed application program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311326990.3A CN117354370A (en) 2023-10-13 2023-10-13 Method, system and equipment for synchronous aggregation in universal network for distributed application program

Publications (1)

Publication Number Publication Date
CN117354370A true CN117354370A (en) 2024-01-05

Family

ID=89355424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311326990.3A Pending CN117354370A (en) 2023-10-13 2023-10-13 Method, system and equipment for synchronous aggregation in universal network for distributed application program

Country Status (1)

Country Link
CN (1) CN117354370A (en)

Similar Documents

Publication Publication Date Title
US10013390B2 (en) Secure handle for intra-and inter-processor communications
CN103618673A (en) NoC routing method guaranteeing service quality
US20210218808A1 (en) Small Message Aggregation
CN111522656A (en) Edge calculation data scheduling and distributing method
Kao et al. Aggressive transmissions of short messages over redundant paths
CN117354370A (en) Method, system and equipment for synchronous aggregation in universal network for distributed application program
US7010548B2 (en) Sparse and non-sparse data management method and system
CN109586931A (en) Method of multicasting and terminal device
WO2019015487A1 (en) Data retransmission method, rlc entity and mac entity
Kumar et al. Adaptive fault tolerant routing in interconnection networks: a review
CN110912969A (en) High-speed file transmission source node, destination node device and system
Adieseshu et al. Reliable FIFO load balancing over multiple FIFO channels
Feng et al. Active resource allocation for active reliable multicast
Schneider et al. Fast networks and slow memories: A mechanism for mitigating bandwidth mismatches
CN117812027B (en) RDMA (remote direct memory access) acceleration multicast method, device, equipment and storage medium
Hariri et al. Communication system for high‐performance distributed computing
CN110971523B (en) Deep space delay tolerant network routing method based on historical submitted events
WO2024113830A1 (en) Data transmission method, apparatus, device and system, and storage medium
Feng et al. Optimal cache allocation and probabilistic caching for local loss recovery in reliable multicast
Saha et al. StorageFlow: SDN-enabled efficient data regeneration for distributed storage systems
Coll et al. Collective communication patterns on the quadrics network
Wang et al. An Optimized RDMA QP Communication Mechanism for Hyperscale AI Infrastructure
Shen et al. A Lightweight Routing Layer Using a Reliable Link-Layer Protocol
Gerla et al. Multicasting in Myrinet-high-speed, wormhole-routing network
CN115617566A (en) Parallelized network aggregation repair method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination