KR20220001008A - Distributed training method and system, device and storage medium

Info

Publication number: KR20220001008A
Application number: KR1020200164799A
Authority: KR
Inventors: 동 닥시앙; 공 웨이바오; 리우 이; 위 디안하이; 마 얀쥔; 왕 하이펑
Original assignee: 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디.
Priority date: 2020-06-28
Filing date: 2020-11-30
Publication date: 2022-01-04
Also published as: JP7138150B2; EP3929825A1; JP2022008781A; CN111753997B; CN111753997A; US20210406767A1

Abstract

The present application discloses a distributed training method, system, device, storage medium and program, and relates to the field of artificial intelligence technology, specifically to the fields of deep learning and cloud computing. The method includes: sending, by a job information server, a first training request and information on an available first computing server to at least a first data server among a plurality of data servers; sending, by the first data server, first training data to the first computing server based on the first training request; and performing, by the first computing server, model training based on the first training data, sending the model parameters obtained after training to the first data server for storage, and sending identification information of the first training data to the job information server for recording, wherein no computing server stores the model parameters. The embodiments of the present application enable an efficient training process in which computing resources scale elastically.

Description

DISTRIBUTED TRAINING METHOD AND SYSTEM, DEVICE AND STORAGE MEDIUM

The present invention relates to the field of artificial intelligence technology, specifically to the fields of deep learning and cloud computing, and more specifically to a distributed training method, system, device, storage medium and program.

In general, in a big-data environment, a deep learning model can be trained in a distributed manner to increase training speed, and most existing deep learning systems use fixed cluster resources and train in parallel until the model converges. In cloud training, however, the number of training resources allocated generally changes dynamically with the scheduling of the entire cluster, so a conventional deep learning framework cannot train normally when computing resources are dynamic, which impairs training efficiency.

The present application provides a distributed training method, system, device, storage medium and program.

According to a first aspect of the present application, a distributed training method is provided. The method is based on a distributed training system that performs model training based on training data. The distributed training system includes a job information server, data servers and computing servers, wherein there are a plurality of data servers and the number of computing servers is variable. The distributed training method includes: sending, by the job information server, a first training request and information on an available first computing server to at least a first data server among the plurality of data servers; sending, by the first data server, first training data to the first computing server based on the first training request; and performing, by the first computing server, model training based on the first training data, sending the model parameters obtained after training to the first data server for storage, and sending identification information of the first training data to the job information server for recording; wherein no computing server stores the model parameters.

According to a second aspect of the present application, a distributed training system is provided that includes a job information server, data servers and computing servers, wherein there are a plurality of data servers, the number of computing servers is variable, and the distributed training system performs model training based on training data. The job information server is configured to send a training request and information on available computing servers to each data server. The data server is configured to send training data to an available computing server based on the received training request. The computing server is configured to perform model training based on the received training data, send the model parameters obtained after training to the data server for storage, and send identification information of the training data on which training has been completed to the job information server for recording; wherein no computing server stores the model parameters.

According to a third aspect of the present application, an electronic device is provided that includes at least one processor and a memory communicatively connected to the at least one processor, wherein

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to execute the distributed training method described above.

According to a fourth aspect of the present application, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions cause a computer to execute the distributed training method described above.

By reasonably assigning the respective functions of the job information server, the data servers and the computing servers, and by reasonably designing the way they cooperate, the distributed training method provided by the embodiments of the present application allows the computing nodes of an elastic distributed training system to be scaled up and down promptly and quickly, and optimizes the overall computing performance of the system.

It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present application, nor to limit the scope of the present application. Other features of the present application will become easier to understand from the following description.

The present application has the effect of providing a distributed training method, system, device, storage medium and program.

The drawings are provided for a better understanding of the present technical solution and do not limit the present application.
FIG. 1 is a flowchart of a distributed training method according to an embodiment of the present application.
FIG. 2 is a structural block diagram of a distributed training system according to an embodiment of the present application.
FIG. 3 is an architecture diagram of a distributed training system according to another embodiment of the present application.
FIG. 4 is a block diagram of an electronic device for implementing the elastic distributed training method of the embodiment of the present application.

Exemplary embodiments of the present application are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those skilled in the art should therefore recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

FIG. 1 is a flowchart of a distributed training method provided in an embodiment of the present application. The method is based on a distributed training system that performs model training based on training data. The distributed training system includes a job information server, data servers and computing servers, wherein there are a plurality of data servers and the number of computing servers is variable. The distributed training method includes:

Step S101: the job information server sends a first training request and information on an available first computing server to at least a first data server among the plurality of data servers;

Step S102: the first data server sends first training data to the first computing server based on the first training request; and

Step S103: the first computing server performs model training based on the first training data, sends the model parameters obtained after training to the first data server for storage, and sends identification information of the first training data to the job information server for recording; wherein no computing server stores the model parameters.
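Steps S101 to S103 can be illustrated with a minimal single-process sketch. The class names, the batch identifier `b0`, the address string and the toy one-weight model are all hypothetical and chosen only for illustration; the sketch also assumes that the data server hands the current parameters to the computing server together with the training data, which the description does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class JobInfoServer:
    """Tracks available computing servers and which batches are finished."""
    available: list
    finished_ids: set = field(default_factory=set)

    def record(self, batch_id):
        self.finished_ids.add(batch_id)


class ComputeServer:
    """Stateless worker: it keeps no model parameters between calls."""

    def __init__(self, address):
        self.address = address

    def train(self, batch_id, samples, params, job_info, lr=0.1):
        # One pass of gradient descent on a toy model y = w * x.
        for x, y in samples:
            grad = 2 * (params["w"] * x - y) * x
            params["w"] -= lr * grad
        # S103 (second half): report the finished batch ID.
        job_info.record(batch_id)
        return params


@dataclass
class DataServer:
    """Holds the training data and the model parameters."""
    batches: dict  # batch_id -> list of (x, y) samples
    params: dict = field(default_factory=lambda: {"w": 0.0})

    def handle_request(self, batch_id, worker, job_info):
        # S102: send the requested batch (and a copy of the parameters).
        updated = worker.train(
            batch_id, self.batches[batch_id], dict(self.params), job_info
        )
        # S103 (first half): the trained parameters are stored here.
        self.params = updated


job_info = JobInfoServer(available=[ComputeServer("10.0.0.1:7000")])
data_server = DataServer(batches={"b0": [(1.0, 2.0), (2.0, 4.0)]})

# S101: the job information server names a batch and an available worker.
data_server.handle_request("b0", job_info.available[0], job_info)
```

After the call, the parameters live only on the data server and the batch ID is recorded on the job information server; the worker object holds nothing but its address, so it can be dropped or replaced without losing training state.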

According to an embodiment of the present application, the job information server sends a training request and information on available computing servers, for example the Internet Protocol (IP) address and/or port information of an available computing server, to a data server. The data server sends training data to the computing server, and the computing server completes the training process. No computing server stores the model parameters; instead, the model parameters are sent to the data server for storage. This keeps the computing nodes as lightweight as possible, so that a node joining or leaving the system has little effect on the system as a whole. Because the computing servers do not store model parameters, they occupy few resources, and computing resources can be devoted to model training, which improves the computing power available for training. In addition, the computing server sends identification information of the training data to the job information server for recording. In other words, the job information server records the training progress, so that training tasks can be assigned to each computing node in the system and the distributed training system as a whole can run efficiently.

By reasonably designing the respective processing of the job information server, the data servers and the computing servers and the way they cooperate, the distributed training method provided in the embodiments of the present application allows computing nodes to be adjusted quickly during distributed training, concentrates the computing power of the system on model training, and improves the training efficiency of the system as a whole.

FIG. 2 is a structural block diagram of a distributed training system according to an embodiment of the present application. The system includes a job information server 100, data servers 200 and computing servers 300, wherein there are a plurality of data servers 200, the number of computing servers 300 is variable, and the distributed training system performs model training based on training data; wherein

the job information server 100 is configured to send a training request and information on available computing servers to each data server 200;

the data server 200 is configured to send training data to an available computing server 300 based on the received training request; and

the computing server 300 is configured to perform model training based on the received training data, send the model parameters obtained after training to the data server 200 for storage, and send identification information of the training data on which training has been completed to the job information server 100 for recording; wherein no computing server 300 is used to store model parameters.

In the distributed training system of the embodiment of the present application, the functions of the job information server, the data servers and the computing servers are reasonably assigned and the way they cooperate is reasonably designed, so that computing nodes can be adjusted quickly during distributed training and the computing power of the system can be optimized.

FIG. 3 is an architecture diagram of a distributed training system according to an embodiment of the present application, and schematically shows the logical connections among the job information server, the data servers, the computing servers and so on. FIG. 3 includes three static nodes, each of which includes a data server and a parameter server, and four elastic nodes, which correspond to computing nodes (that is, computing servers).
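The topology of FIG. 3 can be written down as a small data structure. This is only an illustrative sketch: the names `StaticNode`, `Cluster`, `join`, `leave` and the node labels are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticNode:
    """A fixed node hosting a data server and its parameter server."""
    name: str
    roles: tuple = ("data_server", "parameter_server")


@dataclass
class Cluster:
    """Static nodes are fixed; elastic (computing) nodes may come and go."""
    static_nodes: tuple
    elastic_nodes: set = field(default_factory=set)

    def join(self, worker):
        self.elastic_nodes.add(worker)

    def leave(self, worker):
        self.elastic_nodes.discard(worker)


# Three static nodes and four elastic nodes, as drawn in FIG. 3.
cluster = Cluster(static_nodes=tuple(StaticNode(f"static-{i}") for i in range(3)))
for i in range(4):
    cluster.join(f"elastic-{i}")

# An elastic node can be reclaimed without touching the static nodes.
cluster.leave("elastic-2")
```

Keeping the static nodes in an immutable tuple and the elastic nodes in a mutable set mirrors the split in the figure: only the computing side is expected to change size during a job.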

Various implementations of the embodiments of the present application are described in detail below with reference to the drawings.

In an embodiment of the present application, before the system starts training, each data server downloads the training data and the information on the model to be trained from a distributed file system.

The advantage of this arrangement is that, because the data servers download and store the training data and the information on the model to be trained, the computing servers do not need to store training data. A computing server only receives the data required for training from a data server, performs the training, and returns the model parameters to the data server for storage after training is completed. This keeps the model parameters up to date and also reduces the effect on the system when a computing node leaves or joins the system.

In an embodiment of the present application, each data server includes a parameter server. After the first computing server sends the trained model parameters to the first data server, the method further includes storing the trained model parameters in a first parameter server in the first data server.

In other words, the trained model parameters are stored in the parameter server, while the data server is responsible for sending the training data and collecting the training results, so the processing is efficient.

In an embodiment of the present application, the job information server performs liveness detection on each computing server in the system, and if the number of available computing servers in the system has not changed, causes the parameter server in each data server to save the latest model parameters.

Through the liveness detection performed by the job information server, the number of available nodes in the system can be detected and updated, and it can also be determined whether the current model parameters of the system are valid for all computing nodes. If the number of nodes has not changed, the system can continue training stably. The currently updated model parameters are then saved in the parameter servers, which provides a basis for rollback when the nodes of the system change later.

In an embodiment of the present application, the job information server performs liveness detection on each computing server in the system, and if the number of available computing servers in the system has changed, updates the list of available computing servers and causes the parameter server in each data server to reload the model parameters saved at the previous liveness detection.

If the liveness detection performed by the job information server shows that the number of nodes has changed, the system data from before this liveness detection is already invalid. The parameter server in each data server is therefore made to reload the model parameters saved at the previous liveness detection, that is, to roll back to the data version of the previous liveness detection, which ensures that the training process is free of errors.
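The save-or-rollback rule of the two liveness detection embodiments can be sketched as follows. The names `ParameterServer`, `liveness_check` and `probe` are hypothetical; `probe` stands in for whatever health check a real deployment would use, and the comparison is made on the number of available servers, as in the description.

```python
import copy


class ParameterServer:
    """Holds live parameters plus the snapshot from the last stable check."""

    def __init__(self, params):
        self.params = params
        self.snapshot = copy.deepcopy(params)

    def save(self):
        # Keep the latest parameters as the rollback point.
        self.snapshot = copy.deepcopy(self.params)

    def reload(self):
        # Discard updates made since the previous stable check.
        self.params = copy.deepcopy(self.snapshot)


def liveness_check(previous, candidates, probe, parameter_servers):
    """Probe the workers, then save or roll back on every parameter server.

    Returns the updated list of available computing servers.
    """
    alive = [w for w in candidates if probe(w)]
    if len(alive) == len(previous):
        for ps in parameter_servers:
            ps.save()
    else:
        for ps in parameter_servers:
            ps.reload()
    return alive


ps = ParameterServer({"w": 1.0})
workers = ["a", "b"]

ps.params["w"] = 2.0  # training progressed
workers = liveness_check(workers, ["a", "b"], lambda w: True, [ps])

ps.params["w"] = 3.0  # more progress, then worker "b" disappears
workers = liveness_check(workers, ["a", "b"], lambda w: w != "b", [ps])
```

In this run the first check finds the worker count unchanged and saves `w = 2.0`; the second check finds one worker missing, so the parameter server falls back to that saved value and the available list shrinks to `["a"]`.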

In an embodiment of the present application, while the job information server performs liveness detection, the system temporarily suspends training. After the liveness detection is completed, the job information server sends a new training request to each data server based on the current model parameters and the recorded identification information of the training data on which training has been completed.

Training is temporarily suspended while the job information server performs liveness detection, and the new training tasks continue only after the model parameters have been updated, which ensures that the training process is stable and fast.
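One way the job information server might build the new training requests after a liveness detection is to skip every batch whose identification information has already been recorded and spread the rest over the currently available computing servers. The function name `plan_requests` and the round-robin assignment are assumptions made for illustration; the application does not specify an assignment policy.

```python
from itertools import cycle


def plan_requests(all_batch_ids, finished_ids, available_workers):
    """Pair each unfinished batch with an available worker, round-robin."""
    if not available_workers:
        return []  # nothing can be scheduled until a worker joins
    remaining = [b for b in all_batch_ids if b not in finished_ids]
    return list(zip(remaining, cycle(available_workers)))


requests = plan_requests(
    all_batch_ids=["b0", "b1", "b2", "b3"],
    finished_ids={"b1"},
    available_workers=["w1", "w2"],
)
# requests == [("b0", "w1"), ("b2", "w2"), ("b3", "w1")]
```

Because the plan is recomputed from the recorded batch IDs and the current worker list each time, the same function serves both the stable case and the case where workers have joined or left.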

In an embodiment of the present application, there is no information exchange between the computing servers, and each computing node obtains its training data from the parameter server, so that computing resources are devoted to model training to the greatest possible extent.

In an embodiment of the present application, the job information server is a static node.

Because the job information server is responsible for periodic liveness detection of the computing nodes in the system, maintains the list of available computing servers, and keeps the model parameters in the parameter servers valid, it is the central node of the distributed system. It should therefore be a static node, which can be understood as being located on a computing node that cannot be terminated or stopped, so that the job information server is highly available and the stability of the system is ensured.

The elastic distributed training method of the embodiments of the present application can be used in the training of various machine learning models. For a neural network deep learning framework, for example, it enables efficient training in a cloud whose resources change elastically, keeps the computing servers lightweight, provides the ability to adjust quickly and dynamically, and is of significant practical value.

The specific configurations and implementations of the embodiments of the present application have been described above from different perspectives through a plurality of embodiments. The elastic distributed training method of the embodiments of the present application is based on the distributed system described above, and its processing can be understood with reference to the corresponding descriptions in the foregoing embodiments, so the description is not repeated here.

According to an embodiment of the present application, the present application further provides an electronic device and a readable storage medium. FIG. 4 is a block diagram of an electronic device for the elastic distributed training method according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, mobile phones, smartphones, wearable devices and other similar computing devices. The components described herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present application described and/or claimed herein.

As shown in FIG. 4, the electronic device includes one or more processors 1001, a memory 1002, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as required. The processor can process instructions executed within the electronic device, including instructions stored in or on the memory for displaying graphical information of a graphical user interface (GUI) on an external input/output device (for example, a display device coupled to an interface). In other implementations, a plurality of processors and/or a plurality of buses may be used together with a plurality of memories, if necessary. Likewise, a plurality of electronic devices may be connected, with each device providing some of the necessary operations (for example, as a server array, a set of blade servers, or a multi-processor system). FIG. 4 takes one processor 1001 as an example.

The memory 1002 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor executes the elastic distributed training method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions cause a computer to execute the elastic distributed training method provided by the present application.

As a non-transitory computer-readable storage medium, the memory 1002 can be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the elastic distributed training method in the embodiments of the present application. By running the non-transitory software programs, instructions and modules stored in the memory 1002, the processor 1001 executes the various functional applications and data processing of the server, that is, implements the elastic distributed training method of the above method embodiments.

The memory 1002 may include a program storage area and a data storage area, wherein the program storage area can store an operating system and an application program required by at least one function, and the data storage area can store data generated through use of the electronic device for analyzing and processing search results, and the like. In addition, the memory 1002 may include high-speed random access memory and may also include non-transitory memory, for example at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state memory device. In some embodiments, the memory 1002 may optionally include memories located remotely from the processor 1001, and these remote memories may be connected through a network to the electronic device for analyzing and processing search results. Examples of such a network include, but are not limited to, the Internet, a corporate intranet, a local area network, a mobile communication network, and combinations thereof.

The electronic device corresponding to the elastic distributed training method of the embodiments of the present application may further include an input device 1003 and an output device 1004. The processor 1001, the memory 1002, the input device 1003 and the output device 1004 may be connected by a bus or in other ways; in the embodiment of FIG. 4 of the present application, connection by a bus is taken as an example.

The input device 1003 can receive entered numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch panel, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 1004 may include a display device, an auxiliary lighting device (for example, an LED), a tactile feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.

Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and can transmit data and instructions to that storage system, that at least one input device, and that at least one output device.

These computer programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented in a high-level procedural and/or object-oriented programming language and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, device, and/or apparatus (for example, a magnetic disk, a CD-ROM, memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has a display device for displaying information to the user (for example, a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user can be received in any form, including acoustic, voice, or tactile input.

The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, and front-end components. The components of the system can be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the result expected from the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.

The specific implementations above do not limit the scope of protection of the present application. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations, and substitutions can be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims

1. A distributed training method based on a distributed training system,
wherein the distributed training system is for training a model based on training data and includes a job information server, data servers, and computing servers, the number of data servers being plural and the number of computing servers being variable,
the distributed training method comprising:
sending, by the job information server, a first training request and information on an available first computing server to at least a first data server of the plurality of data servers;
sending, by the first data server, first training data to the first computing server based on the first training request; and
performing, by the first computing server, model training based on the first training data, sending the model parameters to the first data server for storage after the training is completed, and sending identification information of the first training data to the job information server for recording,
wherein no computing server stores model parameters.

2. The distributed training method according to claim 1, further comprising:
before training starts, downloading, by each data server, the training data and information on the model to be trained from a distributed file system.

3. The distributed training method according to claim 1,
wherein each data server includes a parameter server, and
the method further comprises, after the first computing server sends the trained model parameters to the first data server, storing the trained model parameters in a first parameter server of the first data server.

4. The distributed training method according to claim 1, further comprising:
performing, by the job information server, liveness detection on each computing server in the system, and, if the number of available computing servers in the system has not changed, causing the parameter server in each data server to save the latest model parameters.

5. The distributed training method according to claim 1, further comprising:
performing, by the job information server, liveness detection on each computing server in the system, and, if the number of available computing servers in the system has changed, updating the list of available computing servers and causing the parameter server in each data server to reload the model parameters from the previous liveness detection.

6. The distributed training method according to claim 4 or 5,
wherein the system suspends training while the job information server performs liveness detection, and
the method further comprises, after the liveness detection is completed, sending, by the job information server, a new training request to each data server based on the current model parameters and the recorded identification information of the training data for which training has been completed.

7. The distributed training method according to claim 1,
wherein no information is exchanged between the computing servers.

8. The distributed training method according to claim 1,
wherein the job information server is a static node.
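
As an illustration only, the message flow recited in the method claims above can be sketched as a small single-process Python simulation. Every class, method, and field name here is hypothetical, the "training" step is a toy update on a single weight, and having the computing server read the current parameters from the data server is an assumption that the claims do not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class JobInfoServer:
    """Static node: records the ids of training data that finished training."""
    finished_ids: list = field(default_factory=list)

    def record(self, data_id: int) -> None:
        self.finished_ids.append(data_id)


@dataclass
class DataServer:
    """Holds training data and the stored model parameters."""
    shards: dict                                  # data_id -> list of samples
    params: dict = field(default_factory=lambda: {"w": 0.0})

    def handle_request(self, data_id, worker, job_server) -> None:
        # Send the requested training data to the assigned computing server.
        worker.train(data_id, self.shards[data_id], self, job_server)

    def store(self, new_params: dict) -> None:
        self.params = dict(new_params)


class ComputingServer:
    """Stateless worker: it keeps no model parameters between calls."""

    def train(self, data_id, samples, data_server, job_server) -> None:
        w = data_server.params["w"]       # assumption: parameters are read from the data server
        mean = sum(samples) / len(samples)
        w = w + 0.5 * (mean - w)          # toy update standing in for real training
        data_server.store({"w": w})       # parameters go back to the data server
        job_server.record(data_id)        # the data id goes to the job information server


def run_demo():
    job = JobInfoServer()
    data = DataServer(shards={0: [1.0, 3.0], 1: [5.0, 7.0]})
    worker = ComputingServer()
    for data_id in data.shards:           # one training request per data id
        data.handle_request(data_id, worker, job)
    return data.params, job.finished_ids
```

Because the worker object holds no state, it could be replaced between the two requests without losing progress, which is the property the claims rely on when the number of computing servers varies.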

9. A distributed training system comprising a job information server, data servers, and computing servers,
wherein the number of data servers is plural, the number of computing servers is variable, and the distributed training system is for performing model training based on training data;
the job information server is configured to send a training request and information on available computing servers to each data server;
each data server is configured to send training data to an available computing server based on the received training request; and
each computing server is configured to perform model training based on the received training data, to send the model parameters to the data server for storage after the training is completed, and to send identification information of the training data for which training has been completed to the job information server for recording,
wherein no computing server stores model parameters.

10. The distributed training system according to claim 9,
wherein each data server is further configured to download the training data and information on the model to be trained from a distributed file system before the system starts training.

11. The distributed training system according to claim 9,
wherein each data server includes a parameter server, and the parameter server is configured to store the trained model parameters.

12. The distributed training system according to claim 9,
wherein the job information server is configured to perform liveness detection on each computing server in the system and, if the number of available computing servers in the system has not changed, to cause the parameter server in each data server to save the latest model parameters.

13. The distributed training system according to claim 9,
wherein the job information server is configured to perform liveness detection on each computing server in the system and, if the number of available computing servers in the system has changed, to update the list of available computing servers and to cause the parameter server in each data server to reload the model parameters from the previous liveness detection.

14. The distributed training system according to claim 12 or 13,
wherein the system suspends training while the job information server performs liveness detection, and
the job information server is configured to send, after the liveness detection is completed, a new training request to each data server based on the current model parameters and the recorded identification information of the training data for which training has been completed.

15. The distributed training system according to claim 9,
wherein no information is exchanged between the computing servers.

16. The distributed training system according to claim 9,
wherein the job information server is a static node.
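
As an illustration only, the liveness detection behaviour recited in the claims (save the latest parameters when the set of computing servers is unchanged, otherwise update the list and reload the parameters saved at the previous detection) can be sketched as follows. All names are hypothetical. Two points are assumptions rather than claim language: the sketch compares the set of live server names instead of only their count, and on a rollback it also discards data ids recorded after the last save so that those data are trained again.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ParameterServer:
    params: dict
    saved: dict = field(default_factory=dict)

    def __post_init__(self):
        self.saved = copy.deepcopy(self.params)   # the initial state counts as the first save

    def save(self) -> None:
        self.saved = copy.deepcopy(self.params)

    def reload(self) -> None:
        self.params = copy.deepcopy(self.saved)


@dataclass
class JobInfoServer:
    available: set                                 # names of available computing servers
    finished_ids: list = field(default_factory=list)
    saved_count: int = 0                           # len(finished_ids) at the last save

    def liveness_detection(self, alive, parameter_servers) -> bool:
        """Run while training is paused; return True if a rollback happened."""
        alive = set(alive)
        if alive == self.available:
            for ps in parameter_servers:
                ps.save()                          # keep the latest parameters
            self.saved_count = len(self.finished_ids)
            return False
        self.available = alive                     # update the available server list
        for ps in parameter_servers:
            ps.reload()                            # back to the previous detection's parameters
        del self.finished_ids[self.saved_count:]   # assumption: forget ids recorded since then
        return True

    def pending(self, all_ids) -> list:
        """Data ids that still need a training request."""
        return [i for i in all_ids if i not in self.finished_ids]
```

After each detection, `pending` gives the data ids for which the job information server would issue new training requests to the data servers.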

17. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 8.

18. A non-transitory computer-readable storage medium storing computer instructions,
wherein the computer instructions cause a computer to perform the method of any one of claims 1 to 8.

19. A program which, when executed by a processor of a computer, implements the distributed training method according to any one of claims 1 to 8.