CN113792875B - Performance test method, device, equipment and medium of distributed communication library - Google Patents

Performance test method, device, equipment and medium of distributed communication library Download PDF

Info

Publication number
CN113792875B
CN113792875B (application number CN202111056060.1A)
Authority
CN
China
Prior art keywords
deep learning
training
learning model
communication
cluster
Prior art date
Legal status
Active
Application number
CN202111056060.1A
Other languages
Chinese (zh)
Other versions
CN113792875A
Inventor
艾珊珊
许涛
胡辰
Current Assignee
Sugon Nanjing Research Institute Co ltd
Original Assignee
Sugon Nanjing Research Institute Co ltd
Priority date
Filing date
Publication date
Application filed by Sugon Nanjing Research Institute Co ltd filed Critical Sugon Nanjing Research Institute Co ltd
Priority to CN202111056060.1A
Publication of CN113792875A
Application granted
Publication of CN113792875B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/60 Software deployment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a performance test method, device, equipment and medium for a distributed communication library, comprising the following steps: constructing, according to a preset topological structure and cluster scale, a cluster environment for performing distributed training of a deep learning model; completing, through all nodes included in the cluster environment, the distributed training task corresponding to the deep learning model according to a communication library to be tested; and acquiring training result data of the deep learning model obtained by the cluster environment while using the communication library to be tested, and evaluating, according to the training result data, the communication performance of the communication library to be tested for the cluster environment and the deep learning model. The technical scheme of the embodiment of the invention can effectively verify the applicability of a distributed communication library in a heterogeneous computing environment, and makes it convenient for a user to select a distributed communication library matched with the heterogeneous computing environment according to the performance test result.

Description

Performance test method, device, equipment and medium of distributed communication library
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a performance test method, device, equipment and medium of a distributed communication library.
Background
Currently, deep learning techniques are commonly used in a variety of fields, such as image classification, object detection, face recognition, pose estimation and speech recognition. The deep learning models behind these fields contain a very large number of parameters; taking natural language processing as an example, some deep learning networks have reached several billions of parameters. Such models cannot be trained quickly on a single machine equipped with a few accelerator cards, and the time and cost of training would be very high. Therefore, the prior art introduces large-scale distributed parallel learning technology, in which distributed training of the model is completed according to a preset communication library by invoking a plurality of processors and a plurality of accelerator cards simultaneously.
However, because heterogeneous computing environments formed by a plurality of processors and a plurality of accelerator cards are not yet widespread and computing resources are limited, the prior art contains relatively few cases of using such environments for distributed training of models, and a means of measuring whether a distributed communication library is really suitable for a specific heterogeneous computing environment is lacking.
Disclosure of Invention
The embodiment of the invention provides a performance test method, device, equipment and medium of a distributed communication library, which can effectively verify the applicability of the distributed communication library in a heterogeneous computing environment and facilitate a user to select an optimal distributed communication library matched with the heterogeneous computing environment according to a performance test result.
In a first aspect, an embodiment of the present invention provides a performance testing method for a distributed communications library, where the method includes:
According to a preset topological structure and cluster scale, a cluster environment for carrying out distributed training on the deep learning model is constructed;
Completing distributed training tasks corresponding to the deep learning model according to a to-be-tested communication library through all nodes included in the cluster environment; deploying at least one acceleration card in each node;
And acquiring training result data of the deep learning model of the cluster environment under the condition that the communication library to be tested is used, and evaluating the communication performance of the communication library to be tested for the cluster environment and the deep learning model according to the training result data.
Optionally, before completing the distributed training task corresponding to the deep learning model according to the to-be-tested communication library through all the nodes included in the cluster environment, the method further includes:
Judging whether the cluster environment supports running of a target program corresponding to the communication library to be tested;
If so, deploying communication modes among nodes in a cluster environment and between each node and a corresponding acceleration card according to the code file of the target program, so that all nodes in the cluster environment can complete distributed training tasks corresponding to the deep learning model according to the communication modes.
Optionally, after determining whether the cluster environment supports running the target program corresponding to the to-be-tested communication library, the method further includes:
if not, acquiring a code file of the target program, and converting the code file by using a preset transcoding tool to obtain a transcoding file;
Compiling the transcoding file, and migrating the transcoding file which is compiled and can run correctly to the cluster environment;
And deploying communication modes among nodes in a cluster environment and between each node and a corresponding acceleration card according to the transcoding file, so that all nodes in the cluster environment can complete distributed training tasks corresponding to the deep learning model according to the communication modes.
Optionally, completing, by all nodes included in the cluster environment, a distributed training task corresponding to the deep learning model according to a to-be-tested communication library, including:
Acquiring a plurality of training tasks corresponding to the deep learning model, and distributing each training task to a corresponding node in the cluster environment according to the topological structure and the cluster scale of the cluster environment;
Acquiring a plurality of training data corresponding to the deep learning model, and distributing the training data to corresponding nodes in the cluster environment according to the topological structure and the cluster scale of the cluster environment;
And training the deep learning model through at least one master node in the cluster environment and at least one slave node corresponding to the master node according to training tasks and training data corresponding to the nodes and a communication library to be tested so as to complete distributed training tasks corresponding to the deep learning model.
Optionally, obtaining training result data of the deep learning model by the cluster environment under the condition of using the to-be-tested communication library includes:
Acquiring the quantity of training data processed in unit time in the training process of the deep learning model by the cluster environment under the condition of using the communication library to be tested; and/or,
And obtaining time consumption of training the deep learning model by the cluster environment under the condition of using the communication library to be tested.
Optionally, the method further comprises:
deploying a plurality of different cluster environments, wherein the topological structure and the cluster scale of each cluster environment are different;
Through all nodes included in each cluster environment, according to different communication libraries to be tested, completing distributed training tasks corresponding to various deep learning models;
and acquiring training result data of each deep learning model of each cluster environment under the condition of using each communication library to be tested, and measuring the difference of communication performance of each communication library to be tested for different cluster environments and different deep learning models according to the training result data.
Optionally, after measuring the difference of the communication performance of each to-be-measured communication library for different cluster environments and different deep learning models according to the training result data, the method further includes:
Constructing a communication library performance lookup table according to the communication performance of each communication library to be tested aiming at different cluster environments and different deep learning models;
Responding to a recommendation request of a user for a communication library, and recommending a matched communication library to the user according to the recommendation request and the communication library performance lookup table;
The recommendation request comprises a target deep learning model to be trained and a target cluster environment for carrying out distributed training on the target deep learning model.
In a second aspect, an embodiment of the present invention further provides a performance testing apparatus for a distributed communications library, where the apparatus includes:
the environment deployment module is used for constructing a cluster environment for carrying out distributed training on the deep learning model according to a preset topological structure and cluster scale;
The training module is used for completing distributed training tasks corresponding to the deep learning model according to the to-be-tested communication library through all nodes included in the cluster environment; deploying at least one acceleration card in each node;
The performance evaluation module is used for acquiring training result data of the deep learning model of the cluster environment under the condition that the communication library to be tested is used, and evaluating the communication performance of the communication library to be tested for the cluster environment and the deep learning model according to the training result data.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including:
One or more processors;
a storage means for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to perform the performance testing method of the distributed communications library provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a computer program is stored, where the program when executed by a processor implements the performance testing method of the distributed communications library provided by any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, a cluster environment for performing distributed training of the deep learning model is constructed according to a preset topological structure and cluster scale; the distributed training task corresponding to the deep learning model is then completed, through all nodes included in the cluster environment, according to the communication library to be tested; training result data of the deep learning model obtained by the cluster environment while using the communication library to be tested is acquired; and finally, the communication performance of the communication library to be tested for the cluster environment and the deep learning model is evaluated according to the training result data. In this way, the applicability of the distributed communication library in a heterogeneous computing environment can be effectively verified, orderly communication among all nodes in the cluster is realized, performance testing of various distributed communication libraries is supported, the training efficiency of the deep learning model is improved, the applicability of the communication library to be tested in a large-scale heterogeneous computing environment can be accurately judged, and an optimal communication library can be automatically recommended to users according to the differences in communication performance under different cluster environments and different deep learning models.
Drawings
FIG. 1 is a flow chart of a method of performance testing of a distributed communications library in an embodiment of the present invention;
FIG. 2 is a flow chart of a method of testing performance of a distributed communications library in an embodiment of the present invention;
FIG. 3 is a block diagram of a performance testing apparatus of a distributed communication library in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a computer device in an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a flowchart of a performance testing method of a distributed communication library according to an embodiment of the present invention, where the present embodiment is applicable to a case of testing communication performance of the distributed communication library, and the method may be performed by a performance testing apparatus of the distributed communication library, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in a computer device having a data processing function, for example, a terminal device or a server, and specifically includes the following steps:
Step 110, constructing a cluster environment for carrying out distributed training on the deep learning model according to a preset topological structure and cluster scale.
In this embodiment, the topology is the communication structure by which the nodes included in a cluster are connected to one another, such as a star, ring, bus or tree structure, and the cluster scale is the maximum number of nodes that the cluster supports managing. The larger the cluster scale, the more nodes the corresponding cluster can manage.
In this step, according to a preset topology structure and cluster scale, environmental resources matched with the topology structure and cluster scale can be configured. In particular, the environmental resources may include hardware resources and software resources. The hardware resources may include a plurality of nodes matched with the topology structure and the cluster scale, and memory banks, accelerator cards, processors and the like included in each node. The software resources may include operating systems deployed in the nodes, compilers, and the like.
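As an illustration only, the environment resources described above could be captured in a simple configuration structure before deployment; the field names and values below are hypothetical and not taken from the patent.

```python
# A minimal sketch (assumed field names) of describing the preset topology,
# cluster scale and per-node hardware/software resources before the cluster
# environment is built.
cluster_config = {
    "topology": "ring",              # e.g. "star", "ring", "bus" or "tree"
    "cluster_scale": 8,              # maximum number of nodes the cluster manages
    "node": {
        "accelerator_cards": 4,      # at least one accelerator card per node
        "memory_gb": 256,
        "processor": "x86_64",
    },
    "software": {
        "os": "Linux",
        "compiler": "gcc",
    },
}
```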
After the environment resources matched with the topological structure and the cluster scale are configured, whether the working states of all the nodes are normal or not can be checked, and if the working states of all the nodes are determined to be normal, the construction of the cluster environment can be considered to be completed. The working state of the node may include an operation state of an accelerator card deployed in the node, an operation state of a processor, a corresponding network environment, and the like.
In a specific embodiment, the running state of the graphics processor (Graphics Processing Unit, GPU) in each node may be queried through the rocm-smi command, the architecture information of the central processor (Central Processing Unit, CPU) in each node may be queried through the lscpu command, the running state of the operating system in each node may be queried through the top command, and the statistics of the disk and CPU in each node may be queried through the iostat command.
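These checks can be scripted. The following is a minimal sketch, assuming the listed commands (rocm-smi, lscpu, top, iostat) are installed on every node; the wrapper function and its name are illustrative, not part of the patent.

```python
import subprocess

def check_node_state():
    """Run the working-state queries described above and collect their output."""
    checks = {
        "gpu_state": ["rocm-smi"],             # running state of each GPU
        "cpu_arch": ["lscpu"],                 # CPU architecture information
        "os_state": ["top", "-b", "-n", "1"],  # one batch-mode snapshot of system load
        "disk_cpu_stats": ["iostat"],          # disk and CPU statistics
    }
    results = {}
    for name, cmd in checks.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = {"ok": proc.returncode == 0, "output": proc.stdout}
    return results
```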
Step 120, completing a distributed training task corresponding to the deep learning model according to a to-be-tested communication library through all nodes included in the cluster environment; at least one accelerator card is deployed in each node.
In this embodiment, the communication library to be tested is used for communication between nodes in the cluster environment and between each node and its corresponding accelerator cards. Communication libraries to be tested include, but are not limited to, the open-source communication library Horovod and communication libraries built into the deep learning framework PyTorch, such as the NVIDIA Collective Communications Library (NCCL) and the Gloo communication library. The deep learning model includes, but is not limited to, an image classification model, an object detection model, a pose estimation model and the like.
In this step, training tasks corresponding to the deep learning model may be distributed to each node, communication may be performed between each node and the corresponding accelerator card according to the to-be-tested communication library, and each node and the corresponding accelerator card may implement distributed training of the deep learning model by executing respective training tasks in the clustered environment.
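As a hedged illustration of this step (not the patent's own code), a PyTorch-style worker might initialise its process group with the communication backend under test, for example "nccl" or "gloo", before executing its assigned training task. The environment variables used here are assumed to be set by the cluster launcher.

```python
import os
import torch
import torch.distributed as dist

def init_worker(backend: str = "gloo") -> int:
    """Join the process group using the communication backend under test."""
    dist.init_process_group(
        backend=backend,
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)  # bind this process to one accelerator card
    return local_rank
```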
And 130, acquiring training result data of the deep learning model of the cluster environment under the condition of using the communication library to be tested, and evaluating the communication performance of the communication library to be tested for the cluster environment and the deep learning model according to the training result data.
In this embodiment, optionally, the training result data may be the time taken by the cluster environment to train the deep learning model while using the communication library to be tested. If training takes a long time, the training efficiency of the cluster environment for the deep learning model can be considered low, that is, the communication performance of the communication library to be tested for this cluster environment and deep learning model is correspondingly poor.
In a specific embodiment, a difference between a standard time consumption and the training time consumption may be calculated, and a communication performance of the to-be-tested communication library for the cluster environment and the deep learning model may be determined according to the difference. The standard time consumption may be training time consumption of the deep learning model by a single device in which a preset number of accelerator cards are deployed. If the difference value is greater than a preset threshold value, the communication performance of the communication library to be tested for the cluster environment and the deep learning model can be considered to be good.
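A minimal sketch of this comparison is shown below, assuming the standard (single-device) time and the cluster training time are measured in the same unit; the function name and threshold are illustrative.

```python
def evaluate_communication_performance(standard_time_s: float,
                                       training_time_s: float,
                                       threshold_s: float) -> str:
    """Compare the single-device baseline time with the cluster training time."""
    difference = standard_time_s - training_time_s
    # Per the scheme above: a difference above the preset threshold
    # indicates good communication performance for this setup.
    return "good" if difference > threshold_s else "poor"
```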
In one implementation of this embodiment, optionally, the cluster environment may be built on the ROCm (Radeon Open Compute) platform, a non-mainstream accelerated computing environment, so that testing of the communication performance of a communication library in a non-mainstream heterogeneous computing environment can be realized.
According to the technical scheme of the embodiment of the invention, a cluster environment for performing distributed training of the deep learning model is constructed according to a preset topological structure and cluster scale; the distributed training task corresponding to the deep learning model is then completed, through all nodes included in the cluster environment, according to the communication library to be tested; training result data of the deep learning model obtained by the cluster environment while using the communication library to be tested is acquired; and finally, the communication performance of the communication library to be tested for the cluster environment and the deep learning model is evaluated according to the training result data. In this way, the applicability of a distributed communication library in a heterogeneous computing environment can be effectively verified, a user can conveniently select an optimal distributed communication library matched with the heterogeneous computing environment according to the performance test result, and the training efficiency of the deep learning model can be further improved.
Fig. 2 is a flowchart of a performance testing method of a distributed communications library according to an embodiment of the present invention, where, as shown in fig. 2, the method provided by the embodiment of the present invention may further include:
step 210, constructing a cluster environment for carrying out distributed training on the deep learning model according to a preset topological structure and cluster scale.
Step 220, determining whether the cluster environment supports running of the target program corresponding to the to-be-tested communication library, if yes, executing step 230, and if no, executing steps 240-260.
In this step, optionally, the code file of the target program may be compiled according to a compiler deployed in each node of the cluster environment, and if the code file is compiled and can operate correctly, the cluster environment may be considered to support to operate the target program corresponding to the to-be-tested communication library.
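As a rough illustration only (the patent does not specify the build system), support could be probed by attempting to build the target program with the compiler deployed on the nodes; the make-based command below is an assumption.

```python
import subprocess

def supports_target_program(source_dir: str) -> bool:
    """Return True if the target program compiles in the cluster environment."""
    build = subprocess.run(["make", "-C", source_dir], capture_output=True, text=True)
    return build.returncode == 0
```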
And step 230, deploying communication modes among nodes in a cluster environment and between each node and a corresponding acceleration card according to the code file of the target program, so that all nodes in the cluster environment can complete distributed training tasks corresponding to the deep learning model according to the communication modes.
Step 240, obtaining a code file of the target program, and converting the code file by using a preset transcoding tool to obtain a transcoded file.
In this embodiment, if the clustered environment does not support running of the target program corresponding to the to-be-tested communication library, the code file of the to-be-tested communication library needs to be converted into a code file adapted to the clustered environment.
In a specific embodiment, if the cluster environment is built on the ROCm platform and the target program is a CUDA program, the code file of the target program may be obtained from the CUDA platform, and a preset transcoding tool is then used to convert the CUDA code in the code file into HIP code suitable for the ROCm platform, thereby obtaining the transcoded file. Specifically, the transcoding tool may be the hipify-perl tool provided with HIP.
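A minimal sketch of this conversion step is shown below, assuming hipify-perl from the HIP toolkit is on the PATH; the helper name and output-file convention are illustrative (hipify-perl prints the translated source to standard output by default).

```python
import subprocess
from pathlib import Path

def hipify(cuda_source: str) -> str:
    """Convert a CUDA source file into HIP code suitable for the ROCm platform."""
    hip_source = str(Path(cuda_source).with_suffix("")) + "_hip.cpp"
    with open(hip_source, "w") as out:
        subprocess.run(["hipify-perl", cuda_source], stdout=out, check=True)
    return hip_source
```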
And 250, compiling the transcoding file, and migrating the transcoding file which is compiled and can run correctly to the cluster environment.
In this step, the transcoded file may be compiled according to the compiling rules corresponding to the transcoded file, and the transcoded file that compiles and runs correctly may be migrated to the cluster environment. The advantage of this arrangement is that it avoids the greater difficulty of error correction that would arise if the transcoded file only failed once it was already running in the cluster environment.
And step 260, deploying communication modes among nodes in the cluster environment and between each node and a corresponding acceleration card according to the transcoding file, so that all nodes in the cluster environment can complete distributed training tasks corresponding to the deep learning model according to the communication modes.
Step 270, completing a distributed training task corresponding to the deep learning model according to a to-be-tested communication library through all nodes included in the cluster environment; at least one accelerator card is deployed in each node.
In this step, training tasks corresponding to the deep learning model may be distributed to each node, communication may be performed between the nodes and between each node and its corresponding accelerator cards according to the communication mode determined by the communication library to be tested, and each node and its corresponding accelerator cards implement distributed training of the deep learning model by executing their respective training tasks in the cluster environment.
In one implementation manner of the embodiment of the present invention, through all the nodes included in the cluster environment, according to a to-be-tested communication library, completing a distributed training task corresponding to the deep learning model, including:
Step 271, acquiring a plurality of training tasks corresponding to the deep learning model, and distributing each training task to a corresponding node in the cluster environment according to the topology structure and the cluster scale of the cluster environment;
In this step, optionally, a plurality of training tasks corresponding to the deep learning model may be acquired according to a distributed learning framework (typically, for example, pyTorch framework) corresponding to the deep learning model, and then each training task is allocated to a master node or a slave node corresponding to the master node in the cluster environment according to a dependency relationship between each training task. If execution of one task is required to wait for completion of execution of another task, it may be determined that there is a dependency relationship between the two tasks, that is, the training task corresponding to the slave node needs to rely on the training task corresponding to the master node.
Step 272, obtaining a plurality of training data corresponding to the deep learning model, and distributing each training data to a corresponding node in the cluster environment according to the topology structure and the cluster scale of the cluster environment;
In this step, a plurality of training data corresponding to the deep learning model may be distributed to the master node in the cluster environment, and the plurality of training data may be distributed to the corresponding slave nodes by the master node.
In a specific embodiment, assuming that the deep learning model is an image classification model, training data corresponding to the deep learning model may be image sample data; if the deep learning model is a speech recognition model, training data corresponding to the deep learning model may be speech sample data.
And 273, training the deep learning model through at least one master node in the cluster environment and at least one slave node corresponding to the master node according to training tasks and training data corresponding to the nodes and a to-be-tested communication library to complete distributed training tasks corresponding to the deep learning model.
In this step, after the slave nodes execute the respective assigned training tasks, the trained model gradients may be returned to the master node, after the master node receives the model gradients fed back by the slave nodes, the master node may perform summary processing on the model gradients, calculate and update the model gradients, and then feed back the updated model gradients to the corresponding slave nodes, so that the slave nodes continue to execute the respective assigned training tasks, and perform iterative training on the deep learning model until the convergence of the deep learning model meets the preset requirement.
In a specific embodiment, the model parameters may be updated by a training optimizer (typically, e.g., adam optimizer) during training of the deep learning model by the clustered environment.
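The master/slave gradient exchange described above could look roughly like the following; this is a hedged illustration using generic collective primitives, and the patent does not prescribe these exact calls.

```python
import torch.distributed as dist

def sync_gradients(model, master_rank: int = 0):
    """Aggregate slave-node gradients on the master node and send them back."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum gradients from all nodes onto the master node
        dist.reduce(param.grad, dst=master_rank, op=dist.ReduceOp.SUM)
        if dist.get_rank() == master_rank:
            param.grad /= world_size        # average on the master node
        # Feed the aggregated gradient back to every slave node
        dist.broadcast(param.grad, src=master_rank)
```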
In this embodiment, each node in the cluster environment and each accelerator card deployed in each node carry unique identification information, so that each node in the cluster and each accelerator card can orderly communicate and synchronize information according to the identification information.
And 280, acquiring training result data of the deep learning model of the cluster environment under the condition of using the communication library to be tested, and evaluating the communication performance of the communication library to be tested for the cluster environment and the deep learning model according to the training result data.
In one implementation manner of the embodiment of the present invention, obtaining training result data of the deep learning model by the clustered environment using the to-be-tested communication library includes: acquiring the quantity of training data processed in unit time in the training process of the deep learning model by the cluster environment under the condition of using the communication library to be tested; and/or obtaining time consumption of training the deep learning model by the cluster environment under the condition of using the to-be-tested communication library.
In a specific embodiment, assuming that the deep learning model is an image classification model, the number of image sample data processed by the cluster environment in a unit of time may be acquired; if the deep learning model is a speech recognition model, the number of speech sample data processed by the cluster environment in a unit of time may be acquired. The more training data the cluster environment processes in a unit of time, the better the communication performance of the communication library to be tested can be considered to be for that cluster environment and deep learning model.
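A minimal sketch of collecting these two metrics (throughput and total training time) is given below; the helper, and the assumption that the per-epoch routine returns the number of processed samples, are illustrative.

```python
import time

def timed_training(train_one_epoch, data_loader, num_epochs: int) -> dict:
    """Measure samples processed per second and total training time."""
    processed = 0
    start = time.perf_counter()
    for _ in range(num_epochs):
        processed += train_one_epoch(data_loader)  # assumed to return a sample count
    elapsed = time.perf_counter() - start
    return {
        "samples_per_second": processed / elapsed,
        "training_time_seconds": elapsed,
    }
```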
According to the technical scheme of the embodiment of the invention, a cluster environment for performing distributed training of a deep learning model is constructed according to a preset topological structure and cluster scale, and it is judged whether the cluster environment supports running the target program corresponding to the communication library to be tested. If so, the communication modes among the nodes in the cluster environment and between each node and its corresponding accelerator cards are deployed according to the code file of the target program; if not, the code file of the target program is acquired, the code file is converted with a preset transcoding tool to obtain a transcoded file, the transcoded file is compiled, the transcoded file that compiles and runs correctly is migrated to the cluster environment, and the communication modes among the nodes and between each node and its corresponding accelerator cards are deployed according to the transcoded file. The distributed training task corresponding to the deep learning model is then completed, through all nodes included in the cluster environment, according to the communication library to be tested; finally, training result data of the deep learning model is acquired, and the communication performance of the communication library to be tested for the cluster environment and the deep learning model is evaluated according to the training result data. In this way, the applicability of the distributed communication library in a heterogeneous computing environment can be effectively verified, a user can conveniently select a distributed communication library matched with the heterogeneous computing environment according to the performance test result, and the training efficiency of the deep learning model can be improved.
Based on the foregoing embodiments, the performance testing method of the distributed communications library provided in this embodiment further includes: deploying a plurality of different cluster environments, wherein the topological structure and the cluster scale of each cluster environment are different; through all nodes included in each cluster environment, according to different communication libraries to be tested, completing distributed training tasks corresponding to various deep learning models; and acquiring training result data of each deep learning model of each cluster environment under the condition of using each communication library to be tested, and measuring the difference of communication performance of each communication library to be tested for different cluster environments and different deep learning models according to the training result data.
In this embodiment, a plurality of different cluster environments may be deployed, and the distributed training tasks corresponding to a plurality of deep learning models are completed according to different communication libraries to be tested through all nodes included in each cluster environment. The ratio between computation tasks and communication tasks differs among the training networks corresponding to different deep learning models.
In a specific embodiment, assuming that the deep learning model is a picture classification model, the number of computing tasks in the training tasks of the deep learning model may be less than the number of communication tasks; assuming that the deep learning model is a pose estimation model, the number of computing tasks in the training tasks of the deep learning model may be greater than the number of communication tasks.
In this embodiment, after the distributed training tasks corresponding to the plurality of deep learning models have been completed according to the different communication libraries to be tested through all the nodes included in each cluster environment, the training result data of each deep learning model obtained by each cluster environment while using each communication library to be tested can be acquired and compared with standard training result data, and, according to the comparison result, the differences in communication performance of each communication library to be tested for different cluster environments and different deep learning models can be measured.
The standard training result data may be training result data of a deep learning model by a single device (a preset number of accelerator cards are deployed in the device).
In one implementation manner of the embodiment of the present invention, after measuring differences of communication performance between different deep learning models and different cluster environments of each to-be-measured communication library according to the training result data, the method further includes: constructing a communication library performance lookup table according to the communication performance of each communication library to be tested aiming at different cluster environments and different deep learning models; responding to a recommendation request of a user for a communication library, and recommending a matched communication library to the user according to the recommendation request and the communication library performance lookup table; the recommendation request comprises a target deep learning model to be trained and a target cluster environment for carrying out distributed training on the target deep learning model.
In a specific embodiment, the communication library performance lookup table records the communication performance of different communication libraries for different cluster environments and different training networks, where the ratio between computation tasks and communication tasks differs among the training networks. Table 1 is a schematic example of such a communication library performance lookup table. As shown in Table 1, the training networks may, for example, be ResNet50, VGG16 and Inception_v3, and the table records the communication performance of different communication libraries (such as Horovod, NCCL and Gloo) for different cluster environments and different training networks. The relative quality of the communication performance is measured by a percentage value; the larger the percentage, the better the communication performance.
After a recommendation request is detected, the corresponding target training network can be obtained according to the target deep learning model in the recommendation request. Then, according to the ratio between computation tasks and communication tasks in the target training network, a training network with the same ratio is looked up in the communication library performance lookup table; according to that training network and the target cluster environment, the optimal communication library (namely, the communication library with the largest percentage) is determined in the communication library performance lookup table, and this communication library is recommended to the user.
TABLE 1 (table contents not reproduced in this text)
The specific values of the communication performance of the communication library for the cluster environment and the training network shown in table 1 are only examples, and the present embodiment is not limited thereto.
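The lookup-and-recommend step described above can be sketched as follows; the table entries are placeholder values, not measurements from the patent, and the keys and names are hypothetical.

```python
# (cluster environment, training network) -> {communication library: performance %}
perf_table = {
    ("4-node-ring", "ResNet50"):     {"Horovod": 92, "NCCL": 95, "Gloo": 80},
    ("4-node-ring", "VGG16"):        {"Horovod": 90, "NCCL": 93, "Gloo": 78},
    ("4-node-ring", "Inception_v3"): {"Horovod": 91, "NCCL": 94, "Gloo": 79},
}

def recommend_library(target_cluster_env: str, target_training_network: str) -> str:
    """Return the communication library with the largest recorded percentage."""
    scores = perf_table[(target_cluster_env, target_training_network)]
    return max(scores, key=scores.get)
```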
Fig. 3 is a block diagram of a performance testing apparatus for a distributed communications library according to an embodiment of the present invention, where the apparatus includes: an environment deployment module 310, a training module 320, and a performance evaluation module 330.
The environment deployment module 310 is configured to construct a cluster environment for performing distributed training on the deep learning model according to a preset topology structure and cluster scale;
the training module 320 is configured to complete, through all nodes included in the cluster environment, a distributed training task corresponding to the deep learning model according to a to-be-tested communication library; deploying at least one acceleration card in each node;
The performance evaluation module 330 is configured to acquire training result data of the deep learning model obtained by the cluster environment under the condition of using the communication library to be tested, and to evaluate, according to the training result data, the communication performance of the communication library to be tested for the cluster environment and the deep learning model.
According to the technical scheme of the embodiment of the invention, a cluster environment for performing distributed training of the deep learning model is constructed according to a preset topological structure and cluster scale; the distributed training task corresponding to the deep learning model is then completed, through all nodes included in the cluster environment, according to the communication library to be tested; training result data of the deep learning model obtained by the cluster environment while using the communication library to be tested is acquired; and finally, the communication performance of the communication library to be tested for the cluster environment and the deep learning model is evaluated according to the training result data. In this way, the applicability of a distributed communication library in a heterogeneous computing environment can be effectively verified, a user can conveniently select an optimal distributed communication library matched with the heterogeneous computing environment according to the performance test result, and the training efficiency of the deep learning model can be further improved.
Based on the above embodiments, the training module 320 may include:
The communication library judging unit is used for judging whether the cluster environment supports running of a target program corresponding to the communication library to be tested;
The first communication mode determining unit is used for deploying communication modes among all nodes in the cluster environment and among all nodes and corresponding acceleration cards according to a code file of a target program when the cluster environment supports running the target program corresponding to the communication library to be tested, so that all nodes included in the cluster environment can complete a distributed training task corresponding to the deep learning model according to the communication modes;
The transcoding unit is used for acquiring a code file of a target program corresponding to the communication library to be detected when the cluster environment does not support the running of the target program, and converting the code file by using a preset transcoding tool to obtain a transcoding file;
the compiling unit is used for compiling the transcoding file and migrating the transcoding file which is compiled and can run correctly to the cluster environment;
the second communication mode determining unit is used for deploying communication modes among all nodes in the cluster environment and among all nodes and corresponding acceleration cards according to the transcoding file, so that all nodes in the cluster environment can complete distributed training tasks corresponding to the deep learning model according to the communication modes;
The task allocation unit is used for acquiring a plurality of training tasks corresponding to the deep learning model, and allocating the training tasks to corresponding nodes in the cluster environment according to the topological structure and the cluster scale of the cluster environment;
The data distribution unit is used for acquiring a plurality of training data corresponding to the deep learning model, and distributing the training data to corresponding nodes in the cluster environment according to the topological structure and the cluster scale of the cluster environment;
The node training unit is used for training the deep learning model through at least one master node in the cluster environment and at least one slave node corresponding to the master node according to training tasks and training data corresponding to the nodes and the communication library to be tested so as to complete distributed training tasks corresponding to the deep learning model.
The performance evaluation module 330 may include:
the quantity acquisition unit is used for acquiring the quantity of training data processed in unit time in the training process of the deep learning model by the cluster environment under the condition of using the communication library to be tested;
the time consumption acquisition unit is used for acquiring the training time consumption of the deep learning model by the cluster environment under the condition of using the communication library to be tested.
The performance testing device of the distributed communication library may further include:
the cluster deployment modules are used for deploying a plurality of different cluster environments, and the topological structure and the cluster scale of each cluster environment are different;
the plurality of cluster training modules are used for completing distributed training tasks corresponding to various deep learning models according to different communication libraries to be tested through all nodes included in each cluster environment;
the performance measurement module is used for acquiring training result data of each deep learning model under the condition that each cluster environment uses each communication library to be measured, and measuring the difference of communication performance of each communication library to be measured aiming at different cluster environments and different deep learning models according to the training result data;
The performance measurement module may further include:
the query table construction unit is used for constructing a communication library performance query table according to the communication performance of different cluster environments and different deep learning models of each communication library to be tested;
The communication library recommending unit is used for responding to a recommending request of a user to the communication library and recommending the matched communication library to the user according to the recommending request and the communication library performance lookup table;
The recommendation request comprises a target deep learning model to be trained and a target cluster environment for carrying out distributed training on the target deep learning model.
The performance testing device of the distributed communication library provided by the embodiment of the invention can execute the performance testing method of the distributed communication library provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and as shown in fig. 4, the computer device includes a processor 410, a memory 420, an input device 430 and an output device 440; the number of processors 410 in the computer device may be one or more, one processor 410 being taken as an example in fig. 4; the processor 410, memory 420, input device 430, and output device 440 in the computer device may be connected by a bus or other means, for example in fig. 4. The memory 420 is used as a computer readable storage medium for storing software programs, computer executable programs, and modules, such as program instructions/modules corresponding to a performance testing method of a distributed communication library in any embodiment of the present invention (e.g., an environment deployment module 310, a training module 320, and a performance evaluation module 330 in a performance testing apparatus of a distributed communication library). The processor 410 executes various functional applications and data processing of the computer device by running software programs, instructions and modules stored in the memory 420, i.e., implements a distributed communications library performance testing method as described above. That is, the program, when executed by the processor, implements:
According to a preset topological structure and cluster scale, a cluster environment for carrying out distributed training on the deep learning model is constructed;
Completing distributed training tasks corresponding to the deep learning model according to a to-be-tested communication library through all nodes included in the cluster environment; deploying at least one acceleration card in each node;
And acquiring training result data of the deep learning model of the cluster environment under the condition that the communication library to be tested is used, and evaluating the communication performance of the communication library to be tested for the cluster environment and the deep learning model according to the training result data.
Memory 420 may include primarily a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for functionality; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 420 may further include memory remotely located relative to processor 410, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device, which may include a keyboard, mouse, and the like. The output 440 may include a display device such as a display screen.
The embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of the embodiments of the present invention. Of course, the computer readable storage medium provided by the embodiments of the present invention may perform the related operations in the performance test method of the distributed communication library provided by any of the embodiments of the present invention. That is, the program, when executed by the processor, implements:
According to a preset topological structure and cluster scale, a cluster environment for carrying out distributed training on the deep learning model is constructed;
Completing distributed training tasks corresponding to the deep learning model according to a to-be-tested communication library through all nodes included in the cluster environment; deploying at least one acceleration card in each node;
And acquiring training result data of the deep learning model of the cluster environment under the condition that the communication library to be tested is used, and evaluating the communication performance of the communication library to be tested for the cluster environment and the deep learning model according to the training result data.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the performance testing apparatus of a distributed communication library, each unit and module included in the performance testing apparatus are only divided according to the functional logic, but are not limited to the above-mentioned division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. A method for testing performance of a distributed communications library, comprising:
According to a preset topological structure and cluster scale, a cluster environment for carrying out distributed training on the deep learning model is constructed;
Completing distributed training tasks corresponding to the deep learning model according to a to-be-tested communication library through all nodes included in the cluster environment; deploying at least one acceleration card in each node;
acquiring training result data of the deep learning model of the cluster environment under the condition of using the communication library to be tested, and evaluating the communication performance of the communication library to be tested for the cluster environment and the deep learning model according to the training result data;
the distributed training task corresponding to the deep learning model is completed according to the to-be-tested communication library through all nodes included in the cluster environment, and the distributed training task comprises the following steps:
And distributing training tasks corresponding to the deep learning model to each node so as to enable the corresponding training tasks to be executed between each node and the corresponding acceleration card according to the to-be-tested communication library.
2. The method of claim 1, further comprising, before completing the distributed training task corresponding to the deep learning model according to the communication library to be tested through all nodes included in the cluster environment:
Judging whether the cluster environment supports running of a target program corresponding to the communication library to be tested;
If so, deploying communication modes among nodes in a cluster environment and between each node and a corresponding acceleration card according to the code file of the target program, so that all nodes in the cluster environment can complete distributed training tasks corresponding to the deep learning model according to the communication modes.
3. The method according to claim 2, further comprising, after determining whether the cluster environment supports running the target program corresponding to the communication library to be tested:
if not, acquiring the code file of the target program, and converting the code file by using a preset transcoding tool to obtain a transcoded file;
compiling the transcoded file, and migrating the transcoded file that compiles and runs correctly to the cluster environment; and
deploying, according to the transcoded file, communication modes among the nodes in the cluster environment and between each node and its corresponding acceleration card, so that all nodes in the cluster environment complete the distributed training task corresponding to the deep learning model according to the communication modes.
4. The method according to claim 1, wherein completing, by all nodes included in the cluster environment, the distributed training task corresponding to the deep learning model according to the communication library to be tested comprises:
acquiring a plurality of training tasks corresponding to the deep learning model, and distributing each training task to a corresponding node in the cluster environment according to the topology structure and the cluster scale of the cluster environment;
acquiring a plurality of pieces of training data corresponding to the deep learning model, and distributing the training data to corresponding nodes in the cluster environment according to the topology structure and the cluster scale of the cluster environment; and
training the deep learning model through at least one master node in the cluster environment and at least one slave node corresponding to the master node, according to the training tasks and training data corresponding to the respective nodes and the communication library to be tested, so as to complete the distributed training task corresponding to the deep learning model.
5. The method according to claim 1, wherein acquiring the training result data of the deep learning model obtained by the cluster environment when using the communication library to be tested comprises:
acquiring the amount of training data processed per unit time during training of the deep learning model by the cluster environment when using the communication library to be tested; and/or
acquiring the time consumed by the cluster environment to train the deep learning model when using the communication library to be tested.
6. The method according to claim 1, further comprising:
deploying a plurality of different cluster environments, wherein the topology structure and the cluster scale of each cluster environment are different;
completing, by all nodes included in each cluster environment, distributed training tasks corresponding to various deep learning models according to different communication libraries to be tested; and
acquiring training result data of each deep learning model obtained by each cluster environment when using each communication library to be tested, and measuring, according to the training result data, differences in the communication performance of the communication libraries to be tested with respect to different cluster environments and different deep learning models.
7. The method according to claim 6, further comprising, after measuring, according to the training result data, the differences in the communication performance of the communication libraries to be tested with respect to different cluster environments and different deep learning models:
constructing a communication library performance lookup table according to the communication performance of each communication library to be tested with respect to different cluster environments and different deep learning models; and
in response to a recommendation request of a user for a communication library, recommending a matched communication library to the user according to the recommendation request and the communication library performance lookup table;
wherein the recommendation request comprises a target deep learning model to be trained and a target cluster environment for performing distributed training of the target deep learning model.
8. A performance testing apparatus of a distributed communication library, comprising:
an environment deployment module, configured to construct, according to a preset topology structure and cluster scale, a cluster environment for performing distributed training of a deep learning model;
a training module, configured to complete, by all nodes included in the cluster environment, a distributed training task corresponding to the deep learning model according to a communication library to be tested, wherein at least one acceleration card is deployed in each node; and
a performance evaluation module, configured to acquire training result data of the deep learning model obtained by the cluster environment when using the communication library to be tested, and to evaluate, according to the training result data, the communication performance of the communication library to be tested with respect to the cluster environment and the deep learning model;
wherein completing, by all nodes included in the cluster environment, the distributed training task corresponding to the deep learning model according to the communication library to be tested comprises:
distributing training tasks corresponding to the deep learning model to each node, so that the corresponding training task is executed between each node and its corresponding acceleration card according to the communication library to be tested.
9. A computer device, comprising:
one or more processors; and
a storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the performance testing method of a distributed communication library according to any one of claims 1-7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the performance testing method of a distributed communication library according to any one of claims 1-7.
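
The following sketches are editorial illustrations of the claimed steps and are not part of the claimed subject matter. This first sketch shows, in Python, one possible way to represent a cluster environment built from a preset topology structure and cluster scale (claim 1) and to distribute training tasks and data shards to master and slave nodes that each hold at least one acceleration card (claim 4). All names (ClusterSpec, Node, build_cluster, distribute) and the round-robin assignment are assumptions chosen only for illustration.

from dataclasses import dataclass, field

@dataclass
class Node:
    """One host in the cluster; each node holds at least one acceleration card."""
    name: str
    accel_cards: int
    role: str = "slave"          # "master" or "slave"
    tasks: list = field(default_factory=list)
    shards: list = field(default_factory=list)

@dataclass
class ClusterSpec:
    """Preset topology structure and cluster scale."""
    topology: str                # e.g. "ring" or "tree" (assumed labels)
    num_nodes: int
    cards_per_node: int

def build_cluster(spec: ClusterSpec) -> list:
    """Construct the cluster environment from the preset spec (claim 1)."""
    nodes = [Node(f"node{i}", spec.cards_per_node) for i in range(spec.num_nodes)]
    nodes[0].role = "master"     # at least one master node (claim 4)
    return nodes

def distribute(nodes: list, tasks: list, shards: list) -> None:
    """Round-robin training tasks and data shards over the nodes (claim 4)."""
    for i, task in enumerate(tasks):
        nodes[i % len(nodes)].tasks.append(task)
    for i, shard in enumerate(shards):
        nodes[i % len(nodes)].shards.append(shard)

if __name__ == "__main__":
    cluster = build_cluster(ClusterSpec(topology="ring", num_nodes=4, cards_per_node=2))
    distribute(cluster, tasks=[f"task{i}" for i in range(8)],
               shards=[f"shard{i}" for i in range(8)])
    for node in cluster:
        print(node.name, node.role, node.tasks, node.shards)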
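
The next sketch outlines the compatibility branch of claims 2 and 3: check whether the cluster environment supports running the target program of the communication library to be tested, and if not, transcode, compile and migrate the code file before deploying the communication modes. The probe criterion and the tool invocations ("transcode-tool", the Makefile name) are placeholders, not real commands or the patented implementation.

import shutil
import subprocess

def cluster_supports(target_binary: str) -> bool:
    """Naive probe: the target program counts as supported if it is found on
    the PATH of the current node (an assumed criterion for illustration)."""
    return shutil.which(target_binary) is not None

def transcode_and_compile(code_file: str, transcoded_file: str) -> bool:
    """Convert the code file with a preset transcoding tool and compile it.
    'transcode-tool' and 'Makefile.transcoded' are placeholder names."""
    convert = subprocess.run(["transcode-tool", code_file, "-o", transcoded_file])
    if convert.returncode != 0:
        return False
    build = subprocess.run(["make", "-f", "Makefile.transcoded"])
    return build.returncode == 0

def prepare_communication(code_file: str, target_binary: str) -> str:
    """Claims 2-3: deploy communication modes from the original code file when
    the cluster supports it, otherwise from a transcoded file that compiles
    and runs correctly."""
    if cluster_supports(target_binary):
        return code_file
    transcoded = code_file + ".transcoded"
    if not transcode_and_compile(code_file, transcoded):
        raise RuntimeError("transcoded file failed to compile or run")
    return transcoded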
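
Claim 5 measures the amount of training data processed per unit time and/or the total time consumed by training. A minimal sketch of how those two figures could be derived from a timed run, assuming the training loop and step duration stand in for the real distributed job:

import time

def run_training(num_samples: int, step_seconds: float = 0.001) -> dict:
    """Stand-in for a distributed training run; sleeps to simulate one step."""
    start = time.perf_counter()
    for _ in range(num_samples):
        time.sleep(step_seconds)        # placeholder for one training step
    elapsed = time.perf_counter() - start
    return {"samples": num_samples, "elapsed_s": elapsed}

def training_metrics(result: dict) -> dict:
    """Claim 5: throughput (samples per second) and total time consumption."""
    return {
        "throughput_samples_per_s": result["samples"] / result["elapsed_s"],
        "time_consumption_s": result["elapsed_s"],
    }

print(training_metrics(run_training(100)))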
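
Finally, claims 6 and 7 aggregate results across cluster environments, deep learning models and communication libraries into a performance lookup table, then answer a recommendation request from it. A sketch of the table shape and lookup; the library names and throughput values are arbitrary placeholders, not measurements.

# Performance lookup table keyed by (cluster environment, deep learning model);
# each entry maps a communication library to its measured throughput.
# The numbers below are arbitrary placeholders.
lookup_table = {
    ("cluster_A", "resnet50"): {"lib_x": 120.0, "lib_y": 95.0},
    ("cluster_B", "bert"):     {"lib_x": 60.0,  "lib_y": 72.0},
}

def recommend(target_cluster: str, target_model: str) -> str:
    """Claim 7: recommend the matched communication library for a request that
    names a target cluster environment and a target deep learning model."""
    candidates = lookup_table[(target_cluster, target_model)]
    return max(candidates, key=candidates.get)

print(recommend("cluster_B", "bert"))   # picks the library with the best throughput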
CN202111056060.1A 2021-09-09 2021-09-09 Performance test method, device, equipment and medium of distributed communication library Active CN113792875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111056060.1A CN113792875B (en) 2021-09-09 2021-09-09 Performance test method, device, equipment and medium of distributed communication library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111056060.1A CN113792875B (en) 2021-09-09 2021-09-09 Performance test method, device, equipment and medium of distributed communication library

Publications (2)

Publication Number Publication Date
CN113792875A CN113792875A (en) 2021-12-14
CN113792875B true CN113792875B (en) 2024-07-23

Family

ID=78879753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111056060.1A Active CN113792875B (en) 2021-09-09 2021-09-09 Performance test method, device, equipment and medium of distributed communication library

Country Status (1)

Country Link
CN (1) CN113792875B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116258199B (en) * 2022-12-14 2023-10-10 浙江大学 Distributed training time prediction method and device for large-scale GPU cluster

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112866059A (en) * 2021-01-18 2021-05-28 中国信息通信研究院 Nondestructive network performance testing method and device based on artificial intelligence application

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025205B (en) * 2016-01-30 2021-06-22 华为技术有限公司 Method and equipment for training model in distributed system
CN109062692A (en) * 2018-07-24 2018-12-21 郑州云海信息技术有限公司 A kind of optimization method and system of recognition of face deep learning training platform
US11790239B2 (en) * 2018-12-29 2023-10-17 International Business Machines Corporation Deep learning testing
WO2021056043A1 (en) * 2019-09-23 2021-04-01 Presagen Pty Ltd Decentralised artificial intelligence (ai)/machine learning training system
CN112364897B (en) * 2020-10-27 2024-05-28 曙光信息产业(北京)有限公司 Distributed training method and device, storage medium and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112866059A (en) * 2021-01-18 2021-05-28 中国信息通信研究院 Nondestructive network performance testing method and device based on artificial intelligence application

Also Published As

Publication number Publication date
CN113792875A (en) 2021-12-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220530

Address after: 211100 No. 519 Chengxin Avenue, Jiangning District, Nanjing City, Jiangsu Province

Applicant after: SUGON NANJING RESEARCH INSTITUTE Co.,Ltd.

Address before: 100193 No. 36 Building, No. 8 Hospital, Wangxi Road, Haidian District, Beijing

Applicant before: Dawning Information Industry (Beijing) Co.,Ltd.

GR01 Patent grant