FR3091764A1

FR3091764A1 - Modification of the characteristics of fast network interconnection systems: Simulation and Application to predict the performance of parallel applications

Info

Publication number: FR3091764A1
Application number: FR1873784A
Authority: FR
Inventors: Noureddine TAGUELMIMT; Stephan JAURE
Original assignee: Bull SA
Current assignee: Bull SA
Priority date: 2018-12-21
Filing date: 2018-12-21
Publication date: 2020-07-17
Also published as: FR3091765A1; FR3091765B1

Abstract

..

Description

Modification of the characteristics of the interconnection systems of fast networks: Simulation and Application to the prediction of the performance of parallel applications

1) Name of author(s): Stephan Jaure, Nourredine Taguelmint

2) Proposed title: Modification of the characteristics of fast network interconnection systems: simulation and application to the prediction of the performance of parallel applications.

3) Technical field

This work was carried out in the field of benchmark studies of parallel applications, in the context of High Performance Computing (HPC). It focuses in particular on analyzing and predicting application performance on computing machines with different characteristics.

4) Technical problem addressed

High Performance Computing reaches an increasing number of scientific fields, such as environmental science, weather forecasting and physics. In these fields, the need to solve ever more complex physical problems, whether in the models themselves or in their numerical solution, creates a strong demand for computing capacity.

The machines that provide this capacity consist of a set of compute servers called "compute nodes". The nodes are connected through a fast network called the "interconnect", often referred to as the "Infiniband network". Note that Infiniband is in fact only one of several technologies for interconnecting compute nodes (Intel's OmniPath is another). In what follows, a "compute cluster" or "computing machine" designates the set made up of (see figure 1):

compute nodes
interconnection network
storage systems

General architecture of a compute cluster.

Illustration of communications between two processes.

Implementation options for the solution.

One of the main difficulties in designing a "compute cluster" is sizing it. A call for tenders, for example, is generally accompanied by a set of scientific applications, which gives rise to a campaign of performance analysis, testing and projection. The final objective of this campaign is to size the "compute cluster". This involves two steps:

Maximize the performance of the scientific applications (computational codes)
Optimize the price of the proposed machine.

A recurring problem in analyzing the performance of scientific applications is the unavailability of the target machine, that is, the one that will be proposed to the issuer of the call for tenders. There are two main reasons for this. First, given the variety of architectures and computing machines, it is impossible to have every existing model at hand. Second, the target machine may simply still be under development and may reach the market only months or even years after the date of the bid. The main characteristics of these future machines are nevertheless communicated in advance by their manufacturers. The target machines are therefore simulated on machines with different hardware characteristics, called "benchmark machines".

These simulations are carried out by the teams in charge of application performance, called benchmark teams. They use models of varying complexity that rely essentially on analyzing the "sensitivities" of parallel application performance to the different characteristics of a "compute cluster"; a sensitivity can be thought of as a kind of partial derivative of performance with respect to one characteristic. The main characteristics analyzed include:

the CPU frequency, in GHz,
the RAM bandwidth (memory bandwidth), in GB/s,
the bandwidth of the interconnection system (fast network, Infiniband), in GB/s.

The frequency, generally expressed in GHz, is the number of clock cycles per second of the CPU (central processing unit), each cycle corresponding to an elementary operation. For example, a processor clocked at 3 GHz performs 3 billion cycles per second.

Memory bandwidth, generally expressed in GB/s, is the number of bytes per second that can travel between the CPU and its memory.

The bandwidth of the interconnection system, also expressed in GB/s, is the number of bytes per second that the network can transfer between compute nodes.

Benchmark engineers use these characteristics as the "sensitivity parameters" of the applications under study. For a given application, these parameters are obtained by running it under different hardware conditions.

For example, analyzing sensitivity to CPU frequency is straightforward: the frequency can be varied through system settings. Changing the frequency also has the advantage of not causing significant nonlinear effects on the other sensitivity parameters. These nonlinear aspects matter a great deal in performance studies, as discussed further on.
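As a rough illustration of how a frequency sensitivity could be estimated from two runs, the sketch below uses a log-log finite difference, in the spirit of a partial derivative. The function name `sensitivity` and the timings are hypothetical and are not taken from this document. Under this assumed definition, a value near 1 means the run time is inversely proportional to the parameter, and a value near 0 means the application is insensitive to it.

```python
import math


def sensitivity(param_a, time_a, param_b, time_b):
    """Estimate -d(ln t)/d(ln p) from two runs (an assumed definition).

    1.0 means the run time scales as 1/p; 0.0 means p has no effect.
    """
    return -(math.log(time_b) - math.log(time_a)) / (
        math.log(param_b) - math.log(param_a)
    )


# Hypothetical measurements: the same application at 2.0 GHz and 3.0 GHz,
# with run times in seconds.
s_freq = sensitivity(2.0, 120.0, 3.0, 90.0)
print(f"frequency sensitivity ~ {s_freq:.2f}")
```

The same function could be applied to any other parameter that can be varied in isolation, which is exactly what is difficult for memory and interconnect bandwidth.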

By contrast, varying the memory bandwidth or the interconnect bandwidth is not easy. Varying the memory bandwidth generally requires changing the hardware configuration of the compute cluster, which means frequent machine shutdowns and substantial intervention time (swapping memory modules, for example). Likewise, changing the bandwidth of the interconnection system requires at least reconfiguring all the network switches, which in turn requires restarting a large portion of the machine. For practical and time reasons, modifications of this kind are therefore not feasible. Moreover, computing machines are shared among several users working on different projects, which makes such interventions even less practical.

It is therefore necessary to simulate the characteristics of the computing machine rather than actually modify them at the hardware level.

5) Known solutions

To analyze sensitivity to the bandwidth of the interconnection system, benchmark engineers use several indirect methods:

- Profiling and analysis of the application, in particular at the level of the MPI (Message Passing Interface) communication layer. This method gives the time spent in the "communication" part of the application. However, this time is not a function of the interconnect bandwidth alone: it also includes synchronization waits between compute nodes and communication latencies, as illustrated in figure 2.

Thus, an application may well be measured as spending 30% of its time in the "MPI" part without that time being exclusively sensitive to the interconnect bandwidth (in other words, that time may not change when the interconnect bandwidth is doubled).
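One way to see why MPI time and bandwidth-bound time differ is a per-message cost model, assumed here purely for illustration: time = synchronization wait + latency + size / bandwidth. The function name and all numerical values below are hypothetical.

```python
def message_time(size_bytes, bandwidth_bytes_per_s, latency_s, sync_wait_s=0.0):
    """Assumed cost model: wait for the peer, pay the latency, stream the payload."""
    return sync_wait_s + latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical small message dominated by waiting and latency.
base = message_time(1_000, 10e9, latency_s=1e-6, sync_wait_s=50e-6)
doubled = message_time(1_000, 20e9, latency_s=1e-6, sync_wait_s=50e-6)
print(f"speed-up from doubling the bandwidth: {base / doubled:.4f}x")
```

With these values the communication time barely moves when the bandwidth doubles, even though all of it would be reported as MPI time by a profiler.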

- Placement of the application at run time. One technique is to use twice the usual number of nodes while using only half of the available compute cores. Care is taken that, on each compute node, only the cores of a single processor are used (a node generally has two processors). This doubles the interconnect bandwidth available per process, since there are half as many processes per compute node, and therefore per network card. Memory bandwidth and compute capacity are kept constant. However, this change also affects the path taken by the messages exchanged. Doubling the number of nodes leads to:

A smaller proportion of intra-node communications, that is, those that take place within a single node and therefore go directly through the node's memory (which is why they are faster than inter-node communications).
Higher latencies: with more nodes, messages may have to cross more switches to travel between the processes of the application, and each switch crossed adds to the communication latency.
Lower bandwidth: on some network topologies, the bandwidth between switches can be lower than the bandwidth within a switch. This depends heavily on the number of cables placed between the switches, known as the "pruning factor". Pruning factors that divide the bandwidth by 2 or 3 when two nodes on different switches communicate are common.
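The arithmetic behind the placement technique, and the way a pruning factor can cancel its benefit, can be sketched as follows. This is a deliberately simplified model that assumes all traffic crosses the pruned links; the function name and the node configuration (one 12.5 GB/s network card, two processors of 16 cores each) are hypothetical.

```python
def per_process_bandwidth(nic_bandwidth, processes_per_node, pruning_factor=1.0):
    """Share of the network card's bandwidth available to each process.

    A pruning_factor above 1 models reduced bandwidth between switches.
    """
    return nic_bandwidth / pruning_factor / processes_per_node


normal = per_process_bandwidth(12.5, 32)              # every core in use
spread = per_process_bandwidth(12.5, 16)              # half the processes per node
spread_pruned = per_process_bandwidth(12.5, 16, 2.0)  # traffic now crosses pruned links

print(normal, spread, spread_pruned)
```

In this model, spreading the processes doubles the per-process bandwidth, but a pruning factor of 2 on the additional inter-switch traffic brings it back to the original value, which is one reason the method is hard to interpret.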

These methods of simulating the bandwidth of an interconnection network all produce side effects that are difficult to capture with the usual projection models.

6) Presentation of the invention

The solution proposed here is to simulate, all other things being equal, a lower interconnect bandwidth by increasing the size of the messages transmitted by the MPI communication layer. The aim is for the modified MPI layer to simulate a lower bandwidth while preserving both the latency and the intrinsic characteristics of the application's communication pattern (the way the application is synchronized). To this end, the size of the messages transmitted by the MPI layer is artificially increased. This change in message size must be completely transparent to the application and follows this methodology:

- When process X of the parallel application wants to transfer 1000 bytes to process Y, which in turn expects 1000 bytes from process X,

- the MPI layer modified by our solution transmits 2000 bytes to process Y but exposes only 1000 bytes to the application; everything then happens as if the bandwidth of the interconnection network had been halved. (Drafting comment [TN3]: would it not be preferable to halve the message size in order to simulate a doubled bandwidth? This would make it possible to replace a scatter on 128 nodes with a 'modified' normal one on 64 nodes.)
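The equivalence claimed here can be checked on a simple cost model, assumed for illustration only: transfer time = latency + size / bandwidth. Under that model, sending a message inflated by a factor k over the real network takes the same time as sending the original message over a network with the same latency and a bandwidth divided by k. The function name and values are hypothetical.

```python
def transfer_time(size_bytes, bandwidth_bytes_per_s, latency_s):
    """Assumed cost model: fixed latency plus streaming time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


real_bandwidth = 10e9   # bytes per second, hypothetical
latency = 1e-6          # seconds, hypothetical
app_bytes = 1000        # what the application asks to send
factor = 2              # inflation applied by the modified MPI layer

padded = transfer_time(app_bytes * factor, real_bandwidth, latency)
slower_network = transfer_time(app_bytes, real_bandwidth / factor, latency)

print(padded, slower_network)
```

The latency term is untouched in both cases, which is the property the method relies on: only the bandwidth-dependent part of the communication time changes.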

This invention can be implemented at different levels:

As an "MPI wrapper": a software library that encapsulates the MPI service, exposing the same interface while intercepting the application's calls to MPI primitives and altering their behavior. An implementation at this level may, however, require memory copies into intermediate buffers, which could skew the results. The results can nevertheless be corrected by measuring the time spent in these copies.
As a component of an MPI implementation. MPI is a standard with several implementations (IntelMPI, OpenMPI, MPICH, etc.). Each implementation translates the primitives of the MPI standard into calls to interfacing libraries that communicate with the drivers of the interconnection system (BTL, or Byte Transfer Layer, in OpenMPI; DAPL in IntelMPI, for example). In our view, these interfacing libraries are a better place to implement this modification, because it could be done without intermediate copies.
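The wrapper option can be sketched without any MPI library by standing in an in-memory queue for the real transport. Everything below is hypothetical: `InMemoryTransport` and `PaddingWrapper` are illustrative names, not part of any MPI implementation, and a real wrapper would intercept actual MPI primitives instead. The sketch shows the two points made above: the application sees only its own bytes, and the intermediate copy is timed so that its cost can be subtracted afterwards.

```python
import time
from collections import deque


class InMemoryTransport:
    """Hypothetical stand-in for the underlying MPI layer: a FIFO of messages."""

    def __init__(self):
        self._queue = deque()
        self.bytes_on_wire = 0

    def send(self, payload: bytes) -> None:
        self.bytes_on_wire += len(payload)
        self._queue.append(payload)

    def recv(self) -> bytes:
        return self._queue.popleft()


class PaddingWrapper:
    """Wraps a transport and inflates every message by an integer factor."""

    def __init__(self, transport, factor: int = 2):
        self._transport = transport
        self._factor = factor
        self.copy_seconds = 0.0  # time spent building padded buffers

    def send(self, payload: bytes) -> None:
        start = time.perf_counter()
        # Intermediate copy into a larger buffer (payload followed by zero bytes).
        padded = payload + bytes(len(payload) * (self._factor - 1))
        self.copy_seconds += time.perf_counter() - start
        self._transport.send(padded)

    def recv(self, nbytes: int) -> bytes:
        # The receiver asks for nbytes, as the application would; padding is dropped.
        return self._transport.recv()[:nbytes]


wire = InMemoryTransport()
mpi = PaddingWrapper(wire, factor=2)
mpi.send(b"x" * 1000)
received = mpi.recv(1000)
print(len(received), wire.bytes_on_wire)  # prints: 1000 2000
```

The `copy_seconds` counter corresponds to the correction mentioned for the wrapper approach; an implementation inside the interfacing libraries would aim to avoid that copy altogether.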

Claims

.