KR20220054861A

KR20220054861A - Training methods for neural network models and related products

Info

Publication number: KR20220054861A
Application number: KR1020227010791A
Authority: KR
Inventors: 잉루이 왕; 저우양 리; 위안보 왕; 싱청 장
Original assignee: 상하이 센스타임 인텔리전트 테크놀로지 컴퍼니 리미티드
Priority date: 2020-06-03
Filing date: 2021-05-25
Publication date: 2022-05-03
Also published as: TW202147188A; CN111723933B; CN111723933A; WO2021244354A1

Abstract

An embodiment of the present invention discloses a method for training a neural network model and related products. The method includes: obtaining, by a first working node, local gradient information of at least one network layer of a neural network model based on a current iteration that the first working node performs on the neural network model; and, in the process of transmitting local gradient information of a first network layer of the neural network model with at least one second working node, updating, by the first working node, parameters of a second network layer of the neural network model in parallel. In the embodiment of the present invention, the first working node updates the parameters of the second network layer of the neural network model in parallel while transmitting the local gradient information of the first network layer with the at least one second working node.

Description

Training methods for neural network models and related products

[Cross-Reference to Related Applications]

This patent application claims priority to Chinese Patent Application No. 202010496921.7, filed on June 3, 2020 and entitled “Training method for neural network model and related products”, the entire content of which is incorporated herein by reference.

The present invention relates to the field of model training, and more particularly to a method for training a neural network model and related products.

Deep learning has brought great advances to many fields of society, and model training is an important part of it. Model training takes a long time because it reads a large amount of sample data and performs a large number of mathematical operations. Although the industry continues to make breakthroughs on benchmark tests with the ImageNet data set, efficient distributed model training on general-purpose training platforms remains a difficult practical problem. Research into more efficient distributed model training techniques is therefore needed.

Embodiments of the present invention disclose a method for training a neural network model and related products.

In a first aspect, an embodiment of the present invention provides a method for training a neural network model, the method including: obtaining, by a first working node, local gradient information of at least one network layer of a neural network model based on a current iteration that the first working node performs on the neural network model; and, in the process of transmitting local gradient information of a first network layer of the neural network model with at least one second working node, updating, by the first working node, parameters of a second network layer of the neural network model in parallel.

A neural network model may include multiple layers, and its distributed parallel training process can be divided into a forward pass and a backward pass for each layer, gradient data synchronization (for example, Allreduce Gradients), and parameter update. In some embodiments, the forward pass is a layer-by-layer operation in sequential order and the backward pass is a layer-by-layer operation in reverse order; gradient data synchronization mainly occupies network bandwidth, while the other operations occupy the computing resources of the processor. In an embodiment of the present invention, the first working node performs parameter update and gradient data synchronization in parallel, which hides communication overhead, fully exploits the parts of the training process that can overlap, reduces the delay caused by communication, and improves model training efficiency.

In an embodiment of the present invention, the first working node updates the parameters of the second network layer of the neural network model in parallel while transmitting the local gradient information of the first network layer with the at least one second working node. Overlapping the process of updating the parameters of the neural network model with the process of transmitting local gradient information can improve model training efficiency.
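The overlap described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: `allreduce` is a hypothetical stand-in that only sleeps to imitate network latency and returns the local gradient unchanged, and the layer indices, learning rate, and plain gradient-descent update are assumptions. A thread is used merely to show that the parameter update of a deeper layer can run while the gradient of a shallower layer is still in transit.

```python
import threading
import time


def allreduce(layer, grads, synced):
    """Hypothetical stand-in for a collective call: wait, then mark the layer as synchronized."""
    time.sleep(0.01)  # imitate network transfer time
    synced[layer] = grads[layer]  # assumption: a single node, so "global" equals local


def update(layer, params, synced, lr=0.1):
    """Plain gradient-descent step on one layer (assumed update rule)."""
    params[layer] -= lr * synced[layer]


def train_step(params, grads):
    """Synchronize gradients deepest layer first and overlap each transfer
    with the parameter update of the previously synchronized (deeper) layer."""
    synced = {}
    prev = None
    for layer in sorted(params, reverse=True):
        comm = threading.Thread(target=allreduce, args=(layer, grads, synced))
        comm.start()
        if prev is not None:
            update(prev, params, synced)  # runs while `layer` is being transmitted
        comm.join()
        prev = layer
    update(prev, params, synced)  # the shallowest layer has nothing left to overlap with
    return params


params = {1: 1.0, 2: 2.0, 3: 3.0}  # layer depth -> scalar parameter
grads = {1: 0.5, 2: 1.0, 3: 1.5}   # layer depth -> scalar local gradient
result = train_step(params, grads)
```

In this sketch the update of layer 3 overlaps with the transfer of layer 2, and the update of layer 2 overlaps with the transfer of layer 1, which matches the case in which the second network layer is deeper than the first.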

In one possible implementation, the method further includes: determining, by the first working node, dependency relationships among a plurality of operations of the current iteration based on the connection relationships among a plurality of network layers of the neural network model, where the plurality of operations include at least a transmission operation and a parameter update operation for the local gradient information of at least one network layer of the neural network model. The first working node performs the plurality of operations based on the dependency relationships among them.

In this implementation, the dependency relationships among the plurality of operations of the current iteration can be determined accurately from the connection relationships among the network layers of the neural network model, and each of the operations is then performed in turn according to those dependency relationships.
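One way to picture such dependency relationships is as a directed acyclic graph that is consumed in topological order. The sketch below is an illustration under stated assumptions rather than the method of the embodiment: it assumes a simple chain of layers, assumes that the backward pass of a layer depends on the backward pass of the next deeper layer, that a transmission depends on the backward pass of the same layer, and that a parameter update depends on the transmission of the same layer. The operation names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def build_dependencies(num_layers):
    """Map each operation to the set of operations it depends on (chain-shaped model assumed)."""
    deps = {}
    for layer in range(num_layers, 0, -1):
        bwd, comm, upd = f"backward{layer}", f"allreduce{layer}", f"update{layer}"
        # The backward pass runs in reverse order, so a layer waits for the next deeper layer.
        deps[bwd] = {f"backward{layer + 1}"} if layer < num_layers else set()
        deps[comm] = {bwd}   # a gradient is transmitted only after it has been computed
        deps[upd] = {comm}   # parameters are updated only after the gradient is synchronized
    return deps


def ready_batches(deps):
    """Group operations into rounds; operations in one round have no mutual dependency."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = ready_batches(build_dependencies(3))
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Under these assumptions the third round contains `allreduce2`, `backward1`, and `update3`, so the update of a deeper layer, the transmission of a shallower layer, and the backward pass of a still shallower layer are all eligible to run at the same time.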

In one possible implementation, the first working node updates the parameters of a plurality of network layers of the neural network model layer by layer in reverse order; and/or the network depth of the second network layer is greater than the network depth of the first network layer. Optionally, the first working node and the at least one second working node transmit the local gradient information of the plurality of network layers of the neural network model layer by layer in reverse order, and the first working node calculates the local gradient information of the plurality of network layers layer by layer in reverse order (the backward pass corresponds to a reverse-order, layer-by-layer operation).

In one possible implementation, the step in which the first working node updates the parameters of the second network layer of the neural network model in parallel, in the process of transmitting the local gradient information of the first network layer of the neural network model with the at least one second working node, includes:

updating, by the first working node, the parameters of the second network layer in parallel when, in the process of transmitting the local gradient information of the first network layer with the at least one second working node, the first working node determines that the operations on which the parameter update operation of the second network layer depends have been completed, where the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.

This implementation ensures that the operation of updating the parameters of the second network layer can be carried out successfully.

In one possible implementation, the method further includes: calculating, by the first working node, local gradient information of a third network layer of the neural network model in the process of transmitting the local gradient information of the first network layer of the neural network model with the at least one second working node.

In this implementation, the first working node calculates the local gradient information of the third network layer while transmitting the local gradient information of the first network layer with the at least one second working node. Overlapping the process of calculating the local gradient information of network layers with the process of transmitting local gradient information (that is, overlapping computation and communication) can improve model training efficiency.

In one possible implementation, before the first working node performs the current iteration on the neural network model, the method further includes: performing, by the first working node, at least one inner-layer iteration on the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration. Obtaining the local gradient information of at least one network layer of the neural network model based on the current iteration includes: obtaining, by the first working node, target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration. The local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.

By performing an inner-layer iteration on the neural network model at least once, the first working node can obtain a set of local gradient information. One set of local gradient information can be understood as all the local gradient information obtained when the first working node completes the forward pass and the backward pass of every network layer of the neural network model. The target fusion gradient information of a network layer can be understood as the gradient information obtained by fusing the several sets of local gradient information of that network layer obtained through multiple inner-layer iterations.

In this implementation, the first working node and the at least one second working node transmit the target fusion gradient information of network layers, which reduces the number of gradient transmissions and the total communication volume.

In one possible implementation, obtaining the target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration includes: accumulating, by the first working node, the intermediate fusion gradient information and the local gradient information obtained through the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
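The accumulation step can be sketched as follows. This is an illustrative example only: the layer name `fc`, the dictionary-of-lists representation, and element-wise summation as the accumulation rule are assumptions made for the sketch, not details given in the text.

```python
def accumulate(intermediate, local):
    """Element-wise sum of two gradient sets, layer by layer (assumed accumulation rule)."""
    return {
        layer: [a + b for a, b in zip(intermediate[layer], local[layer])]
        for layer in local
    }


def target_fusion_gradients(gradient_sets):
    """Fold several local gradient sets into one fused set.

    All sets but the last play the role of inner-layer iterations;
    the last one plays the role of the current iteration.
    """
    fused = {layer: list(values) for layer, values in gradient_sets[0].items()}
    for local in gradient_sets[1:]:
        fused = accumulate(fused, local)
    return fused


inner_iterations = [{"fc": [1.0, 2.0]}, {"fc": [0.5, 0.5]}]  # hypothetical inner-layer results
current_iteration = {"fc": [0.25, 0.25]}                      # hypothetical current-iteration result
target = target_fusion_gradients(inner_iterations + [current_iteration])
print(target)  # {'fc': [1.75, 2.75]}
```

Only `target` would then be transmitted, so three local gradient sets cost one transmission instead of three.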

In one possible implementation, the method further includes: transmitting target fusion gradient information of a fourth network layer of the neural network model with the at least one second working node in the process in which the first working node obtains target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration. Optionally, the network depth of the fourth network layer is greater than the network depth of the third network layer.

In this implementation, overlapping the process of calculating the target fusion gradient information of network layers with the process of transmitting it (that is, overlapping computation and communication) can improve model training efficiency.

In one possible implementation, before the local gradient information of the first network layer of the neural network model is transmitted with the at least one second working node, the method further includes: amplifying, by the first working node, every value in the local gradient information of the first network layer by a factor of M and converting each amplified value to half precision, where M is a real number greater than 1.

In this implementation, storing each value in the local gradient information at low precision reduces the data volume of the local gradient information.

In one possible implementation, before the first working node updates the parameters of the second network layer of the neural network model in parallel, the method further includes: converting, by the first working node, each value in the obtained local gradient information of the second network layer to single precision and reducing each converted value by a factor of M to obtain processed gradient information, where M is a real number greater than 1. Updating the parameters of the second network layer in parallel includes: updating, by the first working node, the parameters of the second network layer using the processed gradient information.
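The amplify-then-convert and convert-then-reduce steps can be sketched with the Python standard library, whose `struct` module supports IEEE 754 half precision through the `e` format character. The value M = 1024 and the sample gradients are assumptions chosen for illustration; the text only requires M to be a real number greater than 1. Python floats are double precision, so the receiving side here stands in for the single-precision conversion.

```python
import struct

M = 1024.0  # assumed scale factor; a power of two keeps the scaling itself exact


def to_half_bytes(values, scale):
    """Amplify each value by `scale`, then pack as half precision (2 bytes per value).

    struct raises OverflowError if an amplified value exceeds the half-precision range.
    """
    scaled = [v * scale for v in values]
    return struct.pack(f"<{len(scaled)}e", *scaled)


def from_half_bytes(payload, scale):
    """Unpack half-precision values and reduce each by `scale`."""
    count = len(payload) // 2
    return [v / scale for v in struct.unpack(f"<{count}e", payload)]


local_grads = [1e-8, 2.5e-4, -3.0e-6]  # hypothetical gradient values
payload = to_half_bytes(local_grads, M)
restored = from_half_bytes(payload, M)

# Without amplification the smallest value underflows to zero in half precision.
unscaled = struct.unpack("<e", struct.pack("<e", local_grads[0]))[0]
print(len(payload), unscaled, restored)
```

The payload takes 6 bytes instead of the 12 bytes that three single-precision values would need, and the amplification keeps the smallest gradient from being flushed to zero during the conversion.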

In one possible implementation, before the local gradient information of the first network layer of the neural network model is transmitted with the at least one second working node, the method further includes: storing, by the first working node, the calculated local gradient information of the first network layer in a pre-allocated target storage space based on an offset corresponding to the first network layer, where the target storage space is used to store the local gradient information of a plurality of network layers of the neural network model. The local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer; and/or the first working node updates the local gradient information of the first network layer stored in the target storage space based on the local gradient information of the first network layer received from the at least one second working node.

In this implementation, the local gradient information of the first network layer can be obtained from the target storage space quickly and accurately based on the offset corresponding to the first network layer, and/or the local gradient information of the first network layer stored in the target storage space can be updated.

In one possible implementation, before the local gradient information of the first network layer of the neural network model is transmitted with the at least one second working node, the method further includes: storing, by the first working node, the calculated local gradient information of a plurality of network layers of the neural network model in a pre-allocated target storage space, and determining, through a memory manager, an offset corresponding to each of the plurality of network layers, where the target storage space is one contiguous storage space; and obtaining, by the first working node, the local gradient information of at least two of the plurality of network layers from the target storage space based on the offset corresponding to each of the plurality of network layers, where the at least two network layers include the first network layer. Transmitting the local gradient information of the first network layer with the at least one second working node includes: transmitting the local gradient information of the at least two network layers of the neural network model with the at least one second working node.

It should be understood that the main principle of this implementation is to merge the local gradient information of several network layers into one large array and then start global communication; this improves global communication efficiency and reduces the number of global communications.
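A contiguous gradient buffer addressed by per-layer offsets can be sketched as below. The class name `GradientBuffer`, its methods, and the layer names are hypothetical; the sketch only illustrates the idea of one pre-allocated array in which adjacent layers can be read out as a single merged slice, and it is not the memory manager of the embodiment.

```python
from array import array


class GradientBuffer:
    """One pre-allocated contiguous float array holding the gradients of all layers."""

    def __init__(self, layer_sizes):
        self.offsets = {}  # layer name -> (start offset, element count)
        total = 0
        for name, size in layer_sizes.items():
            self.offsets[name] = (total, size)
            total += size
        self.data = array("f", [0.0] * total)

    def store(self, name, values):
        """Write (or overwrite) one layer's gradient at its offset."""
        start, size = self.offsets[name]
        if len(values) != size:
            raise ValueError(f"{name} expects {size} values")
        self.data[start:start + size] = array("f", values)

    def load(self, name):
        """Read one layer's gradient back from its offset."""
        start, size = self.offsets[name]
        return list(self.data[start:start + size])

    def fused_slice(self, first, last):
        """Return the contiguous region covering layers `first`..`last` in allocation order."""
        start = self.offsets[first][0]
        last_start, last_size = self.offsets[last]
        return self.data[start:last_start + last_size]


buf = GradientBuffer({"conv1": 2, "conv2": 3, "fc": 1})  # hypothetical layers and sizes
buf.store("conv2", [1.0, 2.0, 3.0])
buf.store("fc", [0.5])
merged = buf.fused_slice("conv2", "fc")  # one array, suitable for a single transmission
print(list(merged))  # [1.0, 2.0, 3.0, 0.5]
```

Because `conv2` and `fc` are adjacent in the buffer, their gradients can be handed to one communication call instead of two, which is the reduction in the number of global communications that the paragraph above describes.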

In a second aspect, an embodiment of the present invention provides an image prediction method, the method including: obtaining an image to be processed; and performing prediction processing on the image to be processed, using a neural network model trained by the method of the first aspect or any possible implementation thereof, to obtain a prediction result.

In a third aspect, an embodiment of the present invention provides a data processing apparatus, including: a processing module configured to obtain local gradient information of at least one network layer of a neural network model based on a current iteration performed on the neural network model; and a transceiver module configured to transmit local gradient information of a first network layer of the neural network model with at least one second working node. The processing module is further configured to update parameters of a second network layer of the neural network model in parallel while the transceiver module transmits the local gradient information of the first network layer with the at least one second working node.

For the technical effects of the third aspect or its various possible implementations, refer to the description of the technical effects of the first aspect or the corresponding implementations.

In a fourth aspect, an embodiment of the present invention provides a data processing apparatus, including: an acquisition module configured to obtain an image to be processed; and a processing module configured to perform prediction processing on the image to be processed, using a neural network model trained by the method of the first aspect or any possible implementation thereof, to obtain a prediction result.

In a fifth aspect, an embodiment of the present invention provides an electronic device including a processor and a memory, where the memory is configured to store instructions and the processor executes the instructions stored in the memory so that the processor performs the method of the first aspect or any possible implementation thereof.

In a sixth aspect, an embodiment of the present invention provides an electronic device including a processor and a memory, where the memory is configured to store instructions and the processor executes the instructions stored in the memory so that the processor performs the method of the second aspect or any possible implementation thereof.

In a seventh aspect, an embodiment of the present invention provides a chip including a data interface and a processor, where the processor is configured to perform the method of the first aspect or any possible implementation of the first aspect.

In an eighth aspect, an embodiment of the present invention provides a chip including a data interface and a processor, where the processor is configured to perform the method of the second aspect or any possible implementation of the second aspect.

In a ninth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or any possible implementation thereof.

In a tenth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the second aspect or any possible implementation thereof.

In an eleventh aspect, an embodiment of the present invention provides a computer program product including program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or any possible implementation thereof.

In a twelfth aspect, an embodiment of the present invention provides a computer program product including program instructions that, when executed by a processor, cause the processor to perform the method of the second aspect or any possible implementation thereof.

To describe the technical solutions in the embodiments of the present invention or in the background more clearly, the accompanying drawings used in the embodiments or the background are described below.
FIG. 1 is an example of a distributed training flowchart provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a neural network model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an example of overlapping computation and communication provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of another example of overlapping computation and communication provided by an embodiment of the present invention;
FIG. 5 is a flowchart of an inner-layer iteration method provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of an example of a communication fusion strategy provided by an embodiment of the present invention;
FIG. 7 is a flowchart of an image prediction method provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another data processing apparatus provided by an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of the present invention.

The terms "first", "second", "third", and the like in the description, embodiments, claims, and drawings of the present invention are used only to distinguish similar objects and do not necessarily describe a specific order or sequence. In addition, the terms "include", "have", and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.

Efficient distributed model training is a difficult practical problem. The present invention provides a neural network model training method applicable to distributed model training scenarios, which can improve model training efficiency. The scenarios to which the training method provided by the embodiments of the present invention applies are briefly introduced below.

Distributed model training scenario: a distributed training system includes a plurality of working nodes whose functions are basically the same, and each working node trains the neural network model over multiple iterations to obtain a trained neural network model. In one iteration, each working node trains the neural network model using its own training samples to obtain its own local gradient information. Data synchronization is then performed among the plurality of working nodes so that each working node obtains the local gradient information of all working nodes and fuses it to obtain global gradient information; alternatively, each working node obtains fused gradient information by fusing the local gradient information of all other working nodes, and then fuses its own local gradient information with the fused gradient information to obtain the global gradient information. As an example, each working node sends the local gradient information it has computed and/or the local gradient information received from at least one other working node to another working node, or sends the fused gradient information obtained by fusing its own local gradient information with the local gradient information received from at least one other working node, for example to the working node on its left or right, until every working node has obtained the local gradient information computed by all working nodes, the fused gradient information, or the global gradient information. Each working node then updates the parameters of the neural network model using the global gradient information obtained by fusing the local gradient information computed by all working nodes. This iteration is performed multiple times, and each working node repeats the foregoing operations in every iteration until a training end condition is reached, for example the neural network model converges or the number of training iterations reaches a preset number.

In this distributed model training scenario, in some embodiments, every working node uses the same neural network model, the working nodes update the parameters of the neural network model synchronously, and different working nodes train the neural network model with different training samples. In other words, the neural network model used by each working node is always the same. In some embodiments, the plurality of working nodes may be a plurality of processors on the same terminal device or server. For example, eight GPUs on a server serve as eight working nodes, that is, one GPU corresponds to one working node. In some embodiments, one working node, or at least two working nodes, correspond to one hardware entity such as a terminal device or a server. For example, eight laptops serve as eight working nodes, that is, one laptop serves as one working node. As another example, 256 GPUs on 32 servers serve as 256 working nodes. As a further example, the working nodes included in the distributed training system are a plurality of virtual machines, each running on one or more devices (for example, servers).
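
As a non-limiting illustration of the exchange in which each working node forwards gradient information to a neighboring working node until every node holds the local gradient information of all nodes, the following Python sketch simulates a ring of workers in a single process. The function name `ring_allgather_mean` is hypothetical, and fusing by taking the arithmetic mean is an assumption made for the sketch; the description above does not fix a particular fusion rule.

```python
from typing import Dict, List, Tuple


def ring_allgather_mean(local_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a ring exchange and return the fused gradient held by each worker.

    local_grads[k] is the local gradient vector computed by worker k.
    Fusion by averaging is an assumption of this sketch.
    """
    n = len(local_grads)
    dim = len(local_grads[0])
    # received[k][j] holds worker j's gradient once worker k has obtained it.
    received: List[Dict[int, List[float]]] = [
        {k: list(local_grads[k])} for k in range(n)
    ]
    # in_flight[k] is the (owner, gradient) pair that worker k forwards next.
    in_flight: List[Tuple[int, List[float]]] = [
        (k, list(local_grads[k])) for k in range(n)
    ]

    for _ in range(n - 1):
        # In each step every worker sends one gradient to its right neighbor.
        next_in_flight: List[Tuple[int, List[float]]] = list(in_flight)
        for k in range(n):
            right = (k + 1) % n
            owner, grad = in_flight[k]
            received[right][owner] = grad
            next_in_flight[right] = (owner, grad)
        in_flight = next_in_flight

    fused = []
    for k in range(n):
        # After n - 1 steps worker k holds the local gradients of all workers.
        assert len(received[k]) == n
        fused.append(
            [sum(received[k][j][d] for j in range(n)) / n for d in range(dim)]
        )
    return fused
```

With four workers, each worker ends up with the same fused vector after three forwarding steps, which is what allows all replicas to apply an identical parameter update.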

In the above scenario, with the neural network model training method provided by the embodiments of the present invention, a working node can update the parameters of the neural network model in parallel with the gradient data synchronization among the working nodes, thereby improving training efficiency.

The neural network model training method provided by the embodiments of the present invention is described below with reference to an example of a distributed training flowchart.

FIG. 1 is an example of a distributed training flowchart according to an embodiment of the present invention. As shown in FIG. 1, GPU 0, GPU 1, GPU 2, and GPU 3 are each a working node in the distributed training system, the neural network model includes several layers, and the parallel training process of GPU 0, GPU 1, GPU 2, and GPU 3 may include, for each layer, forward computation (forward pass), backpropagation (backward pass), gradient data synchronization (such as gradient protocol communication), and parameter update. In the forward computation, the layers of the neural network model sequentially process an image input to the neural network model to obtain a processing result for the image. The gradient of the last layer of the neural network model can then be obtained based on the processing result and a specific calculation rule. In backpropagation, the gradient of the last layer is propagated backward in reverse order so that the gradient of each layer of the neural network model is computed in turn. In gradient data synchronization, gradient data is synchronized among the plurality of working nodes. In the embodiments of the present invention, the purpose of gradient data synchronization is for every working node to obtain the global gradient information produced by fusing the local gradient information computed by all working nodes; the present invention does not limit how this purpose is achieved. In the parameter update, each working node uses the global gradient information obtained through gradient data synchronization to update the network parameters (for example, weights) of the neural network model.

In the example shown in FIG. 1, different working nodes input different training samples into the neural network model and perform forward computation and backward computation (that is, backpropagation) to obtain their respective local gradient information. After completing one round of global gradient data synchronization, every working node can obtain either the global gradient information produced by fusing the local gradient information computed by all working nodes, or the local gradient information computed by all working nodes; each working node then uses the global gradient information produced by fusing the local gradient information computed by all working nodes to update the parameters of its own neural network model. Here, the working nodes may update the parameters of the neural network model in the same manner.
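
The iteration described for FIG. 1 can be sketched, purely for illustration, with a one-parameter model y ≈ w·x trained by four simulated workers. The names `local_gradient` and `train_step`, the squared-error loss, the learning rate, and fusion by averaging are all assumptions of this sketch and are not taken from the embodiments.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def local_gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y ~ w * x over one worker's batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_step(weights: List[float], batches: List[List[Sample]], lr: float) -> List[float]:
    """Run one synchronous iteration for all workers and return their new weights."""
    # Forward and backward computation on each worker's own training samples.
    local = [local_gradient(w, b) for w, b in zip(weights, batches)]
    # Gradient data synchronization: every worker obtains the same fused value.
    global_grad = sum(local) / len(local)
    # Parameter update: every worker applies the same rule with the same gradient.
    return [w - lr * global_grad for w in weights]


# Four workers start from the same weight and see different samples of y = 2x.
weights = [0.0, 0.0, 0.0, 0.0]
batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
for _ in range(20):
    weights = train_step(weights, batches, lr=0.05)
```

Because every worker starts from the same weight and applies the same fused gradient, the replicas remain identical after every iteration, which matches the statement that the model used by each working node is always the same.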

In some embodiments, gradient data synchronization mainly occupies network bandwidth resources, while the other operations occupy GPU computing resources. To hide the communication overhead, the embodiments of the present invention provide a neural network model training method in which gradient data synchronization and parameter update overlap (that is, run in parallel). The method is introduced below with reference to the accompanying drawings.

FIG. 2 is a flowchart of a neural network model training method according to an embodiment of the present invention. As shown in FIG. 2, the method includes the following steps:

In step 201, a first working node obtains local gradient information of at least one network layer of a neural network model based on a current iteration that the first working node performs on the neural network model.

The first working node may be a terminal device such as a laptop, desktop, tablet, or mobile phone; a server; a virtual machine running on a server or a terminal device; or a processor on a terminal device or server, such as a graphics processing unit (GPU), a central processing unit (CPU), or a neural-network processing unit (NPU). As shown in FIG. 1, each GPU can obtain the local gradient information of each network layer through backward computation. In some embodiments, backward computation is a layer-by-layer operation in reverse order, and the first working node may compute the local gradient information of each network layer of the neural network model layer by layer in reverse order, as shown in FIG. 1.

In some embodiments, before the first working node transmits the local gradient information of the first network layer of the neural network model with at least one second working node (that is, before performing step 202), it may also perform the following operation: the first working node amplifies every value in the local gradient information of the first network layer by a factor of M and converts each amplified value to half precision, where M is a real number greater than 1. In this embodiment, before the first working node transmits the local gradient information of the first network layer with the at least one second working node, it first converts that information to half-precision floating-point (half-precision float) data, which occupies half the storage space of single-precision floating-point (single-precision float) data; gradient protocol communication is then performed; after the gradient protocol communication ends, the half-precision gradient obtained from the communication is first converted back to single precision, and the parameters are then updated. In this way the communication overhead can be reduced by half.

Note, however, that the range of positive values representable in the half-precision floating-point format is approximately 6.1×10^-5 to 65504, which is far narrower than the range of single-precision floating-point data, and the gradients of a neural network model are often very small. Therefore, the first working node first amplifies the local gradient information before communication and scales it back down after communication completes, which reduces the loss of precision during transfer of the local gradient information.
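
The effect of amplifying by M before the half-precision conversion can be illustrated with the following Python sketch, which uses the standard `struct` module's half-precision format code `"e"` to round a value to half precision. The helper names and the choice M = 1024 are hypothetical; the embodiments only require M to be a real number greater than 1. Python floats are double precision, so the conversion back to "single precision" is only approximated here.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def send_scaled(grad: float, m: float) -> float:
    """Amplify by m, then cast to half precision for transmission."""
    return to_half(grad * m)


def receive_unscaled(half_grad: float, m: float) -> float:
    """Cast back to a wider type (a Python float here) and shrink by m."""
    return half_grad / m


M = 1024.0   # hypothetical scale factor chosen for this sketch
grad = 1e-8  # a small gradient value, below the half-precision range

naive = to_half(grad)  # without scaling, the value underflows to zero
recovered = receive_unscaled(send_scaled(grad, M), M)
relative_error = abs(recovered - grad) / grad
```

In this sketch the unscaled value is lost entirely, while the scaled value survives the round trip with a relative error well under one percent. A scale factor that is too large would push large gradient values beyond 65504, so the choice of M is a trade-off that the sketch does not attempt to resolve.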

In step 202, while the first working node transmits the local gradient information of a first network layer of the neural network model with at least one second working node, the first working node updates the parameters of a second network layer of the neural network model in parallel.

The first network layer and the second network layer are different. In some embodiments, each of the at least one second working node performs operations similar to those of the first working node. In some embodiments, the first working node updates the parameters of a plurality of network layers of the neural network model layer by layer in reverse order; and/or the network depth of the second network layer is greater than the network depth of the first network layer. In some embodiments, the first working node implements gradient data synchronization as a layer-by-layer operation in reverse order, and implements parameter update as a layer-by-layer operation in reverse order. For example, the neural network model includes N layers, and the first working node and the at least one second working node successively transmit the local gradient information of the Nth network layer through the first network layer (corresponding to implementing gradient data synchronization layer by layer in reverse order). Here, "transmit" covers both "send" and "receive": for example, the first working node sends the local gradient information of the Nth network layer that it has computed to the at least one second working node and, at the same time, receives the local gradient information of the Nth network layer from the at least one second working node. The first working node then successively updates the parameters of the Nth network layer through the first network layer (corresponding to implementing parameter update layer by layer in reverse order).

FIG. 3 is a schematic diagram of an example of overlapping computation and communication according to an embodiment of the present invention. As shown in FIG. 3, 301 denotes data stream 1, which implements gradient data synchronization layer by layer in reverse order, and 302 denotes data stream 2, which implements parameter update layer by layer in reverse order; data stream 1 and data stream 2 run in parallel. Each rectangular box in 301 represents an operation in which the first working node and other working nodes transmit (or communicate, synchronize) the local gradient information of one network layer; for example, the box labeled "nth network layer" represents the operation in which the first working node and other working nodes transmit the local gradient information of the nth network layer. Each rectangular box in 302 represents an operation in which the first working node updates the parameters of one network layer; for example, the box labeled "nth network layer" represents the operation in which the first working node updates the parameters of the nth network layer. The arrow represents the time axis, and n is an integer greater than 1.

In FIG. 3, the first working node and the other working nodes transmit, in order, the local gradient information of the nth network layer, the (n-1)th network layer, ..., and the first network layer; the first working node updates, in order, the parameters of the nth network layer, the (n-1)th network layer, ..., and the first network layer; and while the first working node and the other working nodes transmit the local gradient information of the (n-i)th network layer, the first working node updates the parameters of the (n-i+1)th network layer in parallel, where i is an integer less than n. Because the first working node implements both gradient data synchronization and parameter update layer by layer in reverse order, it can, during gradient data synchronization, use the local gradient information of the network layers already obtained to update part of the parameters in parallel. Referring to FIG. 3, the first working node has already received the local gradient information of the nth network layer before it performs the operation of receiving the local gradient information of the (n-1)th network layer, so it can update the parameters of the nth network layer in parallel while receiving the local gradient information of the (n-1)th network layer.
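
The pairing described for FIG. 3, in which the (n-i)th layer is transmitted while the (n-i+1)th layer is updated, can be listed with a short Python sketch. The function name `overlap_schedule` is hypothetical, and the sketch assumes for simplicity that every transmission and every update takes one equal time slot, which FIG. 3 does not require.

```python
from typing import List, Optional, Tuple


def overlap_schedule(n: int) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return (layer being transmitted, layer being updated) for each time slot.

    Layers are numbered 1..n. Both streams walk from layer n down to layer 1,
    and the update of a layer starts one slot after its transmission.
    """
    slots = []
    for t in range(n + 1):
        comm = n - t if t < n else None         # data stream 1: gradient synchronization
        update = n - t + 1 if t >= 1 else None  # data stream 2: parameter update
        slots.append((comm, update))
    return slots


schedule = overlap_schedule(4)
# schedule == [(4, None), (3, 4), (2, 3), (1, 2), (None, 1)]
```

Under this equal-slot assumption, four layers finish in five slots instead of the eight slots that strictly sequential transmission and update would need.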

In some embodiments, the first working node determines dependency relationships among a plurality of operations of the current iteration based on the connection relationships among the plurality of network layers of the neural network model, where the plurality of operations include at least the transmission operation and the parameter update operation for the local gradient information of at least one network layer of the neural network model; the first working node performs the plurality of operations based on the dependency relationships among them. In other words, the first working node can determine the dependency relationships among the operations of the current iteration according to the order of the network layers to which the operations belong, so that the specific execution time of each operation is driven by its dependencies. As an illustration, the first working node implements gradient data synchronization as a layer-by-layer operation in reverse order and implements parameter update as a layer-by-layer operation in reverse order; the transmission operation for the local gradient information of any network layer depends on completion of the transmission operations for the local gradient information of every network layer after that layer, and the parameter update operation of any network layer depends on completion of the transmission operation for the local gradient information of that layer. For example, after the first working node completes the transmission operation for the local gradient information of the nth network layer of the neural network model, it can perform the transmission operation for the local gradient information of the (n-1)th network layer and the parameter update operation of the nth network layer.
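
A minimal sketch of how such dependency relationships might be recorded and queried is given below. The operation labels `"comm"` and `"update"` and the helper names are hypothetical. Because transmissions are chained in reverse layer order, the sketch records only the dependency on the immediately following layer, which implies the dependency on all later layers.

```python
from typing import Dict, List, Set, Tuple

Op = Tuple[str, int]  # ("comm", layer) or ("update", layer)


def build_dependencies(n: int) -> Dict[Op, List[Op]]:
    """Map each operation to the operations that must finish before it starts."""
    deps: Dict[Op, List[Op]] = {}
    for layer in range(n, 0, -1):
        # Transmission of a layer waits for the transmission of the layer after it.
        deps[("comm", layer)] = [("comm", layer + 1)] if layer < n else []
        # The parameter update of a layer waits for that layer's transmission.
        deps[("update", layer)] = [("comm", layer)]
    return deps


def ready_ops(deps: Dict[Op, List[Op]], done: Set[Op]) -> List[Op]:
    """Operations not yet done whose dependencies have all completed."""
    return [
        op for op, needs in deps.items()
        if op not in done and all(d in done for d in needs)
    ]


deps = build_dependencies(3)
after_first_comm = ready_ops(deps, {("comm", 3)})
```

Once the transmission for layer 3 has completed, the ready set contains the transmission for layer 2 and the update for layer 3, mirroring the example in the preceding paragraph.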

In some embodiments, step 202 is implemented as follows: while the first working node transmits the local gradient information of the first network layer of the neural network model with the at least one second working node, upon determining that the operations on which the parameter update operation of the second network layer depends have completed, the first working node updates the parameters of the second network layer in parallel with the transmission of the local gradient information of the first network layer, where the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node. In some embodiments, each operation performed by the first working node is bound to an event, and the events that each operation must wait for are determined from the dependency relationships among the operations; each data stream waits, through a lightweight blocking interface (for example, cudaStreamWaitEvent), until the events associated with the current operation have completed, and then starts the current operation.
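
The event-binding idea can be illustrated by analogy with plain Python threads, where each thread stands in for a data stream and `threading.Event` stands in for the event bound to an operation. This sketch does not call CUDA and is not the interface named above; the function name `run_two_streams` and the logged labels are hypothetical.

```python
import threading
from typing import Dict, List, Tuple


def run_two_streams(n: int) -> List[Tuple[str, int]]:
    """Run a communication stream and an update stream that synchronize via events."""
    # One event per layer, set when that layer's gradient transmission completes.
    comm_done: Dict[int, threading.Event] = {
        layer: threading.Event() for layer in range(1, n + 1)
    }
    log: List[Tuple[str, int]] = []
    log_lock = threading.Lock()

    def comm_stream() -> None:
        for layer in range(n, 0, -1):
            with log_lock:
                log.append(("comm", layer))    # stand-in for the gradient exchange
            comm_done[layer].set()             # signal the event bound to this operation

    def update_stream() -> None:
        for layer in range(n, 0, -1):
            comm_done[layer].wait()            # wait until the dependency has completed
            with log_lock:
                log.append(("update", layer))  # stand-in for the parameter update

    threads = [
        threading.Thread(target=comm_stream),
        threading.Thread(target=update_stream),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_two_streams(5)
```

The exact interleaving of the two threads varies from run to run, but the update of a layer is always logged after the transmission of that layer, which is the ordering that the dependency relationships require.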

In one embodiment, before the first working node updates the parameters of the second network layer of the neural network model, it may perform the following operation: the first working node converts each value in the obtained local gradient information of the second network layer to single precision and reduces each converted value by a factor of M to obtain processed gradient information, where M is a real number greater than 1. The first working node updating the parameters of the second network layer of the neural network model in parallel may then be the first working node using the processed gradient information to update the parameters of the second network layer of the neural network model.

In the embodiments of the present invention, while the first working node transmits the local gradient information of the first network layer of the neural network model with the at least one second working node, it updates the parameters of the second network layer of the neural network model in parallel; overlapping the process of updating the parameters of the neural network model with the process of transmitting local gradient information (that is, overlapping the parameter update computation with communication) can improve model training efficiency.

To hide the communication overhead further, the first working node may also overlap gradient data synchronization with backward computation. A possible implementation of overlapping gradient data synchronization and backward computation is introduced below with reference to the accompanying drawings.

In one embodiment, in addition to performing the method flow of FIG. 1, the first working node may perform the following operation: while the first working node transmits the local gradient information of the first network layer of the neural network model with the at least one second working node, it computes the local gradient information of a third network layer of the neural network model. The network depth of the third network layer is less than the network depth of the first network layer. In some embodiments, backward computation is a layer-by-layer operation in reverse order, the first working node implements gradient data synchronization as a layer-by-layer operation in reverse order, and the process in which the first working node performs backward computation overlaps the process of gradient data synchronization, so that backward computation and gradient data synchronization run in parallel.

FIG. 4 is a schematic diagram of another example of overlapping computation and communication provided by an embodiment of the present invention. As shown in FIG. 4, 401 denotes data flow 3, which implements backward computation layer by layer in reverse order; 301 denotes data flow 1, which implements gradient data synchronization layer by layer in reverse order; and 302 denotes data flow 2, which implements parameter updates layer by layer in reverse order. Data flow 1, data flow 2, and data flow 3 run in parallel. In 401, each rectangular box represents an operation in which the first working node computes the local gradient information of one network layer (corresponding to a backward operation); for example, the box for the nth network layer represents the operation in which the first working node computes the local gradient information of the nth network layer. In 301, each rectangular box represents an operation in which the first working node and the other working nodes transmit the local gradient information of one network layer; for example, the box for the nth network layer represents the operation in which the first working node and the other working nodes transmit the local gradient information of the nth network layer. In 302, each rectangular box represents an operation in which the first working node updates the parameters of one network layer; for example, the box for the nth network layer represents the operation in which the first working node updates the parameters of the nth network layer. Here, n is an integer greater than 1. In FIG. 4, the first working node computes, in order, the local gradient information of the nth network layer, the (n-1)th network layer, ..., and the first network layer; the first working node and the other working nodes transmit, in order, the local gradient information of the nth network layer, the (n-1)th network layer, ..., and the first network layer; and the first working node updates, in order, the parameters of the nth network layer, the (n-1)th network layer, ..., and the first network layer. While the first working node is receiving the local gradient information of the (n-i)th network layer, it updates the parameters of the (n-i+1)th network layer and computes the local gradient information of the (n-i-1)th network layer in parallel. Here, i is an integer less than (n-1).
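The three parallel data flows of FIG. 4 can be illustrated with a minimal sketch. It assumes, purely for illustration, that every backward, transmission, and update operation occupies exactly one time slot; the function name `overlap_schedule` is hypothetical and does not appear in the source.

```python
def overlap_schedule(n):
    """Return, per time slot, the layer handled by each of the three data flows.

    Each tuple is (backward_layer, sync_layer, update_layer); None means the
    corresponding data flow is idle in that slot. Assumes unit-time operations.
    """
    def valid(layer):
        return layer if 1 <= layer <= n else None

    slots = []
    for t in range(n + 2):
        backward = valid(n - t)      # data flow 3 (401): backward computation
        sync = valid(n - t + 1)      # data flow 1 (301): gradient synchronization
        update = valid(n - t + 2)    # data flow 2 (302): parameter update
        slots.append((backward, sync, update))
    return slots


for slot in overlap_schedule(4):
    print(slot)
```

For n = 4, the slot at index 2 is `(2, 3, 4)`: while the gradient of layer 3 is being transmitted, layer 4 is updated and the gradient of layer 2 is computed, which matches the (n-i-1), (n-i), (n-i+1) relation above with i = 1.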

In the above embodiment, while the first working node transmits the local gradient information of the first network layer of the neural network model together with the at least one second working node, it computes the local gradient information of a third network layer of the neural network model. Overlapping the computation of the local gradient information of network layers with the transmission of local gradient information improves model training efficiency.

The foregoing embodiments describe technical solutions in which computation and communication overlap. The essence of these solutions is to hide the communication time behind the parameter update time and/or the backward computation time. However, when the computation time of the neural network model is shorter than the communication time, the communication overhead cannot be fully hidden. It is therefore necessary to study communication reduction schemes that further compress the communication overhead.

Embodiments of the present invention introduce an inner-layer iteration strategy. Each inner-layer iteration performs one complete forward computation (Forward) and one complete backward computation (Backward) and accumulates local gradient information, but performs neither gradient data synchronization nor parameter updates; that is, the gradient data of the working nodes are not synchronized and the parameters of the neural network model are not updated. Multiple inner-layer iterations correspond to one global communication, in which protocol communication is performed on the local gradient information during the last inner-layer iteration and the parameter values are updated. In some embodiments, the global communication operation may overlap with the backward computation of the last inner-layer iteration. The inner-layer iteration strategy essentially increases the batch size of each iteration, which amounts to reducing the total communication volume over the whole training process. The inner-layer iteration method provided by embodiments of the present invention is introduced below with reference to the accompanying drawings.

FIG. 5 is a flowchart of an inner-layer iteration method provided by an embodiment of the present invention. As shown in FIG. 5, the inner-layer iteration method includes the following steps.

In step 501, the first working node inputs a training sample into the neural network model and performs forward computation to obtain a processing result.

In step 502, the first working node performs backward computation using the processing result and the neural network model to obtain local gradient information of at least one network layer of the neural network model.

Steps 501 and 502 may be understood as one implementation in which the first working node performs one inner-layer iteration on the neural network model to obtain local gradient information of at least one network layer of the neural network model. In some embodiments, step 502 may be replaced by the following: the first working node performs backward computation using the processing result and the neural network model to obtain local gradient information of each network layer of the neural network model. For example, the first working node implements backward computation layer by layer in reverse order to obtain the local gradient information of each network layer of the neural network model.

In step 503, the first working node obtains target fusion gradient information of at least one network layer of the neural network model based on intermediate fusion gradient information and the local gradient information corresponding to the current iteration (that is, this inner-layer iteration).

In some embodiments, the intermediate fusion gradient information may be the intermediate fusion gradient information corresponding to at least one inner-layer iteration, obtained by the first working node performing inner-layer iteration on the neural network model at least once. For example, the intermediate fusion gradient information may be the local gradient information of each network layer of the neural network model obtained by the first working node performing one inner-layer iteration, or it may be obtained by fusing at least two groups of local gradient information obtained by the first working node performing inner-layer iteration at least twice. It should be understood that when the first working node performs step 503 for the first time, no intermediate fusion gradient information exists yet, and step 503 may be implemented by taking the local gradient information of the at least one network layer of the neural network model obtained in step 502 as the intermediate fusion gradient information and storing it. When the first working node performs step 503 for the second time, step 503 may be implemented by obtaining new intermediate fusion gradient information (corresponding to the intermediate fusion gradient) from the current intermediate fusion gradient information and the local gradient information corresponding to this inner-layer iteration (that is, the gradient information obtained by performing step 502 for the second time). By analogy, after the first working node has performed step 503 K times, it obtains the target fusion gradient information of at least one network layer of the neural network model, where K is an integer greater than 1. It should be understood that the first working node obtains an initial intermediate fusion gradient (the gradient information obtained by performing step 502 for the first time) the first time it performs step 503, and each subsequent performance of step 503 uses the current intermediate fusion gradient information and the local gradient information corresponding to the current iteration (that is, this inner-layer iteration) to obtain new intermediate fusion gradient information.

In some embodiments, the first working node performs one inner-layer iteration to obtain one group of local gradient parameters, and each group of local gradient parameters includes the local gradient information of each network layer of the neural network model. Fusing at least two groups of local gradient information obtained by the first working node performing inner-layer iteration at least twice may consist of fusing, layer by layer, the local gradient information of each network layer contained in the at least two groups, to obtain the intermediate fusion gradient of each network layer. For example, the first working node fuses the local gradient information of the first network layer contained in each of the at least two groups of local gradient information to obtain the intermediate fusion gradient of the first network layer. This fusion may be performed by fusing, in order, the corresponding parameters of the first network layer contained in each group. For example, suppose the value of a particular parameter of the first network layer is a in the first group of local gradient information, b in the second group, and c in the third group. Taking this parameter as an example, the first working node may fuse the local gradient information of the first network layer contained in these three groups by first computing (a+b) and then computing ((a+b)+c). In this example, the value corresponding to this parameter in the intermediate fusion gradient information of the first network layer is ((a+b)+c).

In some embodiments, step 503 may be implemented as follows: the first working node accumulates the intermediate fusion gradient information and the local gradient information obtained in the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model. The gradients in the intermediate fusion gradient information correspond one-to-one to the gradients in the local gradient information obtained in the current iteration, and the accumulation may be performed on the one-to-one corresponding parameters of the intermediate fusion gradient information and the local gradient information obtained in the current iteration. For example, if the value of a particular parameter in the intermediate fusion gradient information is d, and the value corresponding to that parameter in the local gradient information obtained in the current iteration is e, then d and e are accumulated to obtain (d+e). It should be understood that the target fusion gradient information of any network layer of the neural network model may be obtained by fusing the multiple groups of local gradient information of that network layer obtained by the first working node over multiple inner-layer iterations.
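The accumulation in step 503 can be sketched as follows. This is an illustration under assumptions, not the implementation of the embodiment: gradients are modeled as plain lists of floats keyed by a layer name, and the names `fuse_step` and `target_fusion_gradient` are hypothetical.

```python
def fuse_step(intermediate, local):
    """One execution of step 503.

    `intermediate` is the current intermediate fusion gradient information
    (None on the first execution); `local` is the local gradient information
    of this inner-layer iteration. Both map a layer name to a list of values.
    """
    if intermediate is None:
        # First execution: the local gradients become the initial intermediate fusion gradient.
        return {layer: list(grads) for layer, grads in local.items()}
    # Later executions: accumulate one-to-one corresponding values (d + e).
    return {
        layer: [d + e for d, e in zip(intermediate[layer], grads)]
        for layer, grads in local.items()
    }


def target_fusion_gradient(local_groups):
    """Fuse K groups of local gradient information in order: ((a + b) + c) ..."""
    fused = None
    for local in local_groups:
        fused = fuse_step(fused, local)
    return fused


groups = [
    {"layer1": [1.0, 0.5]},   # a
    {"layer1": [2.0, 0.25]},  # b
    {"layer1": [3.0, 0.25]},  # c
]
print(target_fusion_gradient(groups))  # {'layer1': [6.0, 1.0]}
```

Copying the lists on the first call keeps the caller's gradient buffers untouched; an in-place accumulation would avoid the allocation at the cost of mutating its input.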

In step 504, the first working node determines whether the inner-layer iteration threshold has been reached.

If the threshold has been reached, step 505 is performed; otherwise, step 501 is performed. The inner-layer iteration threshold may be 3, 5, 10, 20, and so on, and the present invention is not limited in this respect. In practical applications, the first working node may set the inner-layer iteration threshold according to actual requirements. The larger the inner-layer iteration threshold, the fewer times the first working node performs global communication.

In step 505, the first working node performs a global communication operation to obtain global gradient information.

In some embodiments, the global gradient information may be gradient information obtained by fusing the local gradient information computed by all working nodes. For example, the global gradient information may be obtained by accumulating the corresponding gradients in the local gradient information computed by all working nodes. For instance, the local gradient information computed by each working node corresponds to a vector, and the vector corresponding to the global gradient information may be obtained by accumulating the elements at the same position in the vectors corresponding to the local gradient information computed by the individual working nodes. In some embodiments, after the first working node obtains the global gradient information, every working node in the distributed training system has obtained the global gradient information.
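The element-wise accumulation described above can be sketched in a single process. The sketch assumes each working node's local gradient information is a flat list of floats of equal length; the function name `global_gradient` is hypothetical, and a real distributed system would obtain the same result through a collective communication operation rather than a local loop.

```python
def global_gradient(local_vectors):
    """Accumulate the elements at the same position across all working nodes."""
    lengths = {len(v) for v in local_vectors}
    if len(lengths) != 1:
        raise ValueError("all local gradient vectors must have the same length")
    return [sum(elems) for elems in zip(*local_vectors)]


# Three working nodes, each holding a two-element local gradient vector.
print(global_gradient([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [9.0, 12.0]
```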

In step 506, the first working node updates the neural network model using the global gradient information.

It should be understood that every working node in the distributed training system updates the neural network model using the global gradient information, so that all working nodes obtain the same updated neural network model. Steps 501 to 506 describe the process by which the first working node performs one parameter update operation. In practical applications, the first working node may perform the method flow of FIG. 5 multiple times to obtain a converged neural network model.
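The control flow of steps 501 to 506 can be sketched end to end on a toy problem. Everything concrete here is an assumption made for illustration: the model is a single scalar weight w with prediction w*x and squared-error loss, the update rule is plain gradient descent with a learning rate `lr`, and all working nodes are simulated in one process. The names `local_gradient` and `outer_iteration` are hypothetical.

```python
def local_gradient(w, sample):
    """Steps 501 and 502 for the toy model: forward pass, then d/dw of (w*x - y)**2."""
    x, y = sample
    prediction = w * x                   # step 501: forward computation
    return 2.0 * (prediction - y) * x    # step 502: backward computation


def outer_iteration(w, per_node_samples, threshold, lr):
    """One pass through FIG. 5: `threshold` inner-layer iterations, then one global update."""
    fused_per_node = []
    for samples in per_node_samples:
        fused = 0.0
        for k in range(threshold):       # step 504: loop until the threshold is reached
            fused += local_gradient(w, samples[k])   # step 503: accumulate, no synchronization
        fused_per_node.append(fused)
    global_grad = sum(fused_per_node)    # step 505: one global communication
    return w - lr * global_grad          # step 506: every node applies the same update


samples = [[(1.0, 1.0), (1.0, 1.0)], [(1.0, 1.0), (1.0, 1.0)]]  # two nodes, two samples each
print(outer_iteration(0.0, samples, threshold=2, lr=0.125))  # 1.0
```

With a threshold of 2, the four samples cost one global communication instead of two, which is the saving the inner-layer iteration strategy is aiming for.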

In some embodiments, the first working node may also perform the following operation: while the first working node obtains the target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration, it transmits, in parallel and together with the at least one second working node, the target fusion gradient information of a fourth network layer of the neural network model. Optionally, the network depth of the fourth network layer is greater than the network depth of the third network layer. Because the first working node may perform the last inner-layer iteration layer by layer in reverse order, it obtains the target fusion gradient information successively from the highest network layer of the neural network model (the one with the greatest network depth) down to the lowest network layer (the one with the smallest network depth). It should be understood that, while computing the target fusion gradient information of a particular network layer, the first working node may transmit the already computed target fusion gradient information of other network layers to the other working nodes. In other words, the global communication operation may overlap with the backward computation of the last inner-layer iteration.

In embodiments of the present invention, the first working node and the at least one second working node transmit the target fusion gradient information of network layers, which reduces both the number of gradient transmissions and the total communication volume.

To further improve communication efficiency, embodiments of the present invention also provide a communication fusion strategy: the gradients of several network layers are merged into one large array before global communication is started. The communication fusion strategy can be applied to the foregoing embodiments to improve communication efficiency.

For most operators in typical neural network models, the number of gradient parameters is very small, usually a small constant multiple of the number of feature maps, and the communication volume is on the order of kilobytes or even bytes. According to related studies of base-layer communication, when the amount of data transmitted is small, the network bandwidth cannot be fully used. The communication fusion strategy is introduced to obtain a larger communication volume per transmission and thereby improve communication efficiency.

Several points require attention in this strategy. On the one hand, the scale of communication fusion (also called gradient fusion) must be configured reasonably. If the fusion scale is too small, communication efficiency is low; if it is too large, the start of the communication operation is delayed. Therefore, when the communication fusion strategy is implemented, the fusion size can be made configurable; for example, a dry run can be used to tune the fusion scale best suited to each neural network model and platform (for example, a distributed training system). On the other hand, in the original approach to communication fusion, several small arrays stored at discrete locations must be merged into one large contiguously stored array before communication and split apart again after communication, which introduces two memory copies and incurs additional overhead.
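The trade-off around the fusion scale can be made concrete with a small grouping sketch. It assumes gradients become available one at a time with known sizes in bytes, and that a group is sent as soon as its accumulated size reaches a configurable threshold; the function name `fuse_into_groups` and the parameter `fusion_bytes` are hypothetical.

```python
def fuse_into_groups(gradient_sizes, fusion_bytes):
    """Group gradient indices, in arrival order, into communication units.

    A group is closed as soon as its total size reaches `fusion_bytes`;
    any remainder forms a final, smaller group.
    """
    groups, current, current_bytes = [], [], 0
    for index, size in enumerate(gradient_sizes):
        current.append(index)
        current_bytes += size
        if current_bytes >= fusion_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)
    return groups


sizes = [4, 4, 16, 2, 2]  # bytes per gradient, in the order they are produced
print(fuse_into_groups(sizes, fusion_bytes=8))   # [[0, 1], [2], [3, 4]]
print(fuse_into_groups(sizes, fusion_bytes=64))  # [[0, 1, 2, 3, 4]]
```

A small threshold yields many small transmissions, while a large one yields a single transmission that cannot start until the last gradient is ready, which is why the fusion size is worth tuning per model and platform.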

In some embodiments, before performing step 201, the first working node may perform the following operation: the first working node stores the computed local gradient information of the first network layer in a pre-allocated target storage space based on an offset corresponding to the first network layer, wherein the target storage space is used to store the local gradient information of a plurality of network layers of the neural network model;

wherein the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the first working node updates the local gradient information of the first network layer stored in the target storage space based on the local gradient information of the first network layer received from the at least one second working node.

In the above embodiment, the first working node allocates in advance a unified contiguous memory space (corresponding to the target storage space) for all parameter gradients (corresponding to the gradient information) of the neural network model, and then, through a memory manager, points the parameter gradient of each network layer to its corresponding offset, thereby avoiding additional memory copies during communication.

In some embodiments, before performing step 201, the first working node may perform the following operations. The first working node stores the computed local gradient information of a plurality of network layers of the neural network model in a pre-allocated target storage space and determines, through a memory manager, the offset corresponding to each of the plurality of network layers, where the target storage space is one contiguous storage space. The first working node then obtains, from the target storage space, the local gradient information of at least two of the plurality of network layers based on the offset corresponding to each network layer, where the at least two network layers include the first network layer. Step 201 may be replaced by transmitting, together with the at least one second working node, the local gradient information of the at least two network layers of the neural network model.
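The pre-allocated contiguous storage space with per-layer offsets can be sketched with the Python standard library. This is an illustration under assumptions: the "memory manager" is reduced to a dictionary of offsets, gradients are double-precision values, and the names `build_gradient_buffer` and `layer_view` are hypothetical. `memoryview` slices share memory with the underlying `array`, so writing a layer's gradient and reading a fused range require no extra copy.

```python
from array import array


def build_gradient_buffer(layer_sizes):
    """Allocate one contiguous buffer and assign each layer an element offset."""
    offsets, total = {}, 0
    for name, size in layer_sizes.items():
        offsets[name] = total
        total += size
    return array("d", [0.0] * total), offsets


def layer_view(buffer, offsets, layer_sizes, name):
    """Return a zero-copy view of one layer's gradient slot inside the buffer."""
    start = offsets[name]
    return memoryview(buffer)[start:start + layer_sizes[name]]


layer_sizes = {"L1": 2, "L2": 3}
buffer, offsets = build_gradient_buffer(layer_sizes)

# Backward computation writes each layer's gradient directly at its offset.
layer_view(buffer, offsets, layer_sizes, "L2")[0] = 1.5
layer_view(buffer, offsets, layer_sizes, "L1")[1] = -0.5

# A fused transmission of both layers is one contiguous range of the same buffer.
fused = memoryview(buffer)[offsets["L1"]:offsets["L2"] + layer_sizes["L2"]]
print(list(fused))  # [0.0, -0.5, 1.5, 0.0, 0.0]
```

Because the layers are adjacent in the buffer, sending at least two layers together only requires choosing a start and end offset, which is the copy that the original merge-then-split approach had to pay for twice.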

FIG. 6 is a schematic diagram of an example of a communication fusion strategy provided by an embodiment of the present invention. As shown in FIG. 6, 601 denotes the network layers of the neural network model, where L1 denotes the first network layer and Ln denotes the nth network layer; 602 denotes the local gradient information of the network layers, where gradient m, gradient (m-1), ..., gradient 1 each denote one gradient or the gradient of one network layer; and 603 denotes the merged local gradient information of the network layers, where gradient group k, gradient group (k-1), ..., gradient group 1 each contain at least two gradients or the gradients of at least two network layers. In embodiments of the present invention, the network layers and gradients of the neural network model do not correspond one-to-one: some network layers may have several gradients, and some network layers may have none. In some embodiments, when each rectangular box in 602 (for example, gradient m) represents the gradient of one network layer, the first working node needs m transmissions if it sends the gradient of one network layer to the other working nodes at a time, and k transmissions if it sends one gradient group (for example, gradient group k) at a time, where k is less than m. In some embodiments, when each rectangular box in 602 (for example, gradient m) represents the gradient of one parameter vector, the first working node sends one gradient group (for example, gradient group k) to the other working nodes at a time and needs k transmissions. It should be understood that the first working node merges the local gradient information of several network layers into one large array and then starts global communication, which improves global communication efficiency.

The foregoing embodiments describe the flow of the method for training a neural network model. The following introduces an example of applying the trained neural network model to a prediction task.

FIG. 7 is a flowchart of an image prediction method provided by an embodiment of the present invention. As shown in FIG. 7, the method includes the following steps:

In step 701, an image processing apparatus acquires an image to be processed.

The image processing apparatus may be the first working node, may be another working node, or may be an apparatus that did not take part in training the neural network model, such as a terminal device or a server.

In some embodiments, the image processing apparatus is a server, and acquiring the image to be processed may mean that the server receives the image to be processed from a terminal device, or acquires the image to be processed from another device according to an instruction input by a user.

In some embodiments, the image processing apparatus is a server, and acquiring the image to be processed may mean that the server acquires an image to be processed that was uploaded by a user, or acquires the image to be processed from another device according to an instruction input by the user.

In step 702, prediction processing is performed on the image to be processed using the trained neural network model to obtain a prediction result.

The neural network model may have been trained using the method of the foregoing embodiments. It should be understood that FIG. 7 is one example of applying the neural network model. A neural network model trained using the training method of the foregoing embodiments can handle different prediction tasks, such as text recognition, image recognition, and image classification.

In some embodiments, the image processing apparatus is a server, and after performing step 702 it may further send the prediction result to a terminal device such as a mobile phone or a personal computer.

In some embodiments, the image processing apparatus is a terminal device, and after performing step 702 it may further output the prediction result, for example by showing the prediction result on a display.

In this embodiment of the present invention, prediction processing is performed on the image to be processed using the trained neural network model to obtain a prediction result, so that different image prediction tasks can be implemented efficiently.

The foregoing embodiments describe the training method for a neural network model implemented by the first working node. The functions of the modules of the first working node are introduced below with reference to the accompanying drawings.

FIG. 8 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present invention. The data processing apparatus of FIG. 8 may be the first working node of the foregoing embodiments. As shown in FIG. 8, the data processing apparatus may include:

a processing module 801 configured to obtain local gradient information of at least one network layer of a neural network model based on a current iteration performed on the neural network model; and

a transceiver module 802 configured to transmit local gradient information of a first network layer of the neural network model with at least one second working node.

The processing module 801 is further configured to update parameters of a second network layer of the neural network model in parallel while the transceiver module 802 transmits the local gradient information of the first network layer of the neural network model with the at least one second working node.

In some embodiments, the processing module 801 may be a processor such as a CPU, a GPU, or an NPU, and the transceiver module 802 may be a transceiver with a specific data transmitting and receiving function.
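The overlap between the transceiver module 802 and the processing module 801 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: `fake_all_reduce`, `sgd_update`, and `overlap_step` are hypothetical names, the exchange with second working nodes is simulated locally as an element-wise mean, and plain gradient descent stands in for whatever update rule is actually used.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_all_reduce(grad, peer_grads):
    """Hypothetical stand-in for exchanging gradients with second working
    nodes: returns the element-wise mean over this node and its peers."""
    n = 1 + len(peer_grads)
    return [(g + sum(p[i] for p in peer_grads)) / n for i, g in enumerate(grad)]


def sgd_update(params, grad, lr):
    """Plain gradient-descent update (an assumption, not taken from the source)."""
    return [p - lr * g for p, g in zip(params, grad)]


def overlap_step(params, synced_grad_layer2, local_grad_layer1, peer_grads_layer1, lr=0.1):
    """Update layer 2 while the layer 1 gradients are still being transmitted."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start transmitting the layer 1 gradients in the background.
        pending = pool.submit(fake_all_reduce, local_grad_layer1, peer_grads_layer1)
        # Meanwhile, update layer 2, whose gradients are already synchronized.
        params["layer2"] = sgd_update(params["layer2"], synced_grad_layer2, lr)
        synced_grad_layer1 = pending.result()
    # Layer 1 can only be updated once its own transmission has finished.
    params["layer1"] = sgd_update(params["layer1"], synced_grad_layer1, lr)
    return params
```

In this sketch the layer 2 update never waits on the layer 1 transmission, which is the source of the time saving; only the layer 1 update blocks on `pending.result()`.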

In one possible implementation, the processing module 801 is further configured to determine dependency relationships among a plurality of operations of the current iteration based on the connection relationships of a plurality of network layers of the neural network model, where the plurality of operations include at least a transmission operation and a parameter update operation for the local gradient information of at least one network layer of the neural network model; and to perform the plurality of operations based on the dependency relationships among them.
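One way to picture such dependency relationships is as a directed graph that is scheduled in topological order. The sketch below is only an illustration under stated assumptions: the network is taken to be a simple chain of layers numbered from 1 (shallowest) to `num_layers` (deepest), the operation names `backward`, `transmit`, and `update` are hypothetical labels, and Python's standard `graphlib` module is used in place of whatever scheduler an actual working node would employ.

```python
from graphlib import TopologicalSorter


def build_dependencies(num_layers):
    """Map each operation to the set of operations it depends on.

    Assumes a chain of layers; the deepest layer is differentiated first.
    """
    deps = {}
    for layer in range(num_layers, 0, -1):
        backward = ("backward", layer)
        transmit = ("transmit", layer)
        update = ("update", layer)
        # The gradient of a layer needs the gradient of the next deeper layer.
        deps[backward] = {("backward", layer + 1)} if layer < num_layers else set()
        # A layer's gradients can be transmitted once they have been computed.
        deps[transmit] = {backward}
        # A layer's parameters can be updated once its transmission is done.
        deps[update] = {transmit}
    return deps


def schedule(num_layers):
    """Return one valid execution order of all operations."""
    return list(TopologicalSorter(build_dependencies(num_layers)).static_order())
```

Because `("update", 3)` depends only on `("transmit", 3)`, a scheduler is free to run it while `("transmit", 2)` is still in flight, which is the kind of overlap the dependency analysis is meant to expose.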

In one possible implementation, the first working node updates the parameters of a plurality of network layers of the neural network model layer by layer in reverse order; and/or the network depth of the second network layer is greater than the network depth of the first network layer.

In one possible implementation, the processing module 801 is specifically configured to, while the transceiver module transmits the local gradient information of the first network layer of the neural network model with the at least one second working node, update the parameters of the second network layer in parallel upon determining that the operations on which the parameter update operation of the second network layer depends have been completed, where the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.

In one possible implementation, the processing module 801 is further configured to compute local gradient information of a third network layer of the neural network model while the transceiver module transmits the local gradient information of the first network layer of the neural network model with the at least one second working node.

In one possible implementation, the processing module 801 is further configured to perform at least one inner-layer iteration on the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration.

The processing module 801 is specifically configured to obtain target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration. The local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.

In one possible implementation, the processing module 801 is specifically configured to accumulate the intermediate fusion gradient information and the local gradient information obtained through the current iteration, to obtain the target fusion gradient information of at least one network layer of the neural network model.
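The accumulation can be sketched in a few lines of Python. This is a hedged illustration only: the function names are hypothetical, gradients are represented as flat lists of floats, and "accumulate" is assumed here to mean element-wise summation.

```python
def accumulate(intermediate, local):
    """Element-wise sum of two gradient vectors of equal length."""
    return [a + b for a, b in zip(intermediate, local)]


def target_fusion_gradient(inner_grads, current_grad):
    """Fold the gradients of the inner-layer iterations into an intermediate
    fusion gradient, then add the gradient of the current iteration."""
    intermediate = [0.0] * len(current_grad)
    for grad in inner_grads:
        intermediate = accumulate(intermediate, grad)
    return accumulate(intermediate, current_grad)
```

For example, two inner-layer iterations with gradients `[1.0, 2.0]` and `[0.5, 0.5]` followed by a current iteration with gradient `[1.0, 1.0]` yield `[2.5, 3.5]`, and only this fused result needs to be transmitted, so several iterations share one round of communication.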

In one possible implementation, the transceiver module 802 is further configured to transmit target fusion gradient information of a fourth network layer of the neural network model with the at least one second working node while the processing module 801 obtains target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration.

In one possible implementation, the processing module 801 is further configured to amplify every value in the local gradient information of the first network layer by a factor of M and convert each amplified value to half precision, where M is a real number greater than 1.

In one possible implementation, the processing module 801 is further configured to convert each value in the obtained local gradient information of the second network layer to single precision and reduce each converted value by a factor of M to obtain processed gradient information, where M is a real number greater than 1.

The processing module 801 is specifically configured to update the parameters of the second network layer of the neural network model using the processed gradient information.
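The effect of amplifying by M before the half-precision conversion, and reducing by M after converting back, can be seen in a small Python sketch. The helper names are hypothetical and the choice M = 1024 is an assumption made for the example; the rounding uses the IEEE 754 half (`"e"`) and single (`"f"`) formats of the standard `struct` module. Python floats are double precision, so `to_single` only mimics the rounding step.

```python
import struct


def to_half(x):
    """Round a float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round a float to IEEE 754 single precision and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def compress(grad, m):
    """Amplify each value by m, then convert it to half precision (sender side)."""
    return [to_half(g * m) for g in grad]


def decompress(half_grad, m):
    """Convert each value to single precision, then reduce it by m (receiver side)."""
    return [to_single(h) / m for h in half_grad]


M = 1024.0       # example scale factor; any real number greater than 1 is allowed
tiny = 1e-8      # a gradient value too small for half precision

lost = to_half(tiny)                            # underflows to zero without scaling
restored = decompress(compress([tiny], M), M)[0]  # survives with scaling
```

Without scaling, the value `1e-8` is flushed to zero in half precision; with scaling it is recovered to within roughly 0.1 percent, while the transmitted payload is half the size of a single-precision one.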

In one possible implementation, the processing module 801 is further configured to store the computed local gradient information of the first network layer in a pre-allocated target storage space based on an offset corresponding to the first network layer, where the target storage space is used to store local gradient information of a plurality of network layers of the neural network model.

Here, the local gradient information of the first network layer sent by the transceiver module 802 is obtained from the target storage space based on the offset corresponding to the first network layer; and/or the processing module 801 is further configured to update the local gradient information of the first network layer stored in the target storage space based on local gradient information of the first network layer received from the at least one second working node.

In one possible implementation, the processing module 801 is further configured to store the computed local gradient information of a plurality of network layers of the neural network model in a pre-allocated target storage space, and to determine, through a memory manager, an offset corresponding to each of the plurality of network layers, where the target storage space is one contiguous storage space. The first working node obtains the local gradient information of at least two of the plurality of network layers from the target storage space based on the offset corresponding to each network layer, the at least two network layers including the first network layer. The transceiver module is specifically configured to transmit the local gradient information of the at least two network layers of the neural network model with the at least one second working node.
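A contiguous gradient buffer addressed by per-layer offsets might look like the following Python sketch. Everything here is an assumption made for illustration: `plan_offsets` plays the role of the memory manager, `GradientBuffer` is a hypothetical class, and a Python list stands in for the pre-allocated target storage space.

```python
def plan_offsets(layer_sizes):
    """Assign each layer a (start, size) slot in one contiguous buffer.

    `layer_sizes` maps layer names to gradient element counts, in layout order.
    """
    offsets, cursor = {}, 0
    for name, size in layer_sizes.items():
        offsets[name] = (cursor, size)
        cursor += size
    return offsets, cursor


class GradientBuffer:
    """Pre-allocated contiguous storage for the gradients of several layers."""

    def __init__(self, layer_sizes):
        self.offsets, total = plan_offsets(layer_sizes)
        self.data = [0.0] * total

    def store(self, name, grad):
        """Write one layer's gradient into its slot."""
        start, size = self.offsets[name]
        if len(grad) != size:
            raise ValueError(f"expected {size} values for {name}, got {len(grad)}")
        self.data[start:start + size] = grad

    def load(self, name):
        """Read one layer's gradient from its slot."""
        start, size = self.offsets[name]
        return self.data[start:start + size]

    def load_span(self, first, last):
        """Read the gradients of adjacent layers as one contiguous slice,
        so that they can be transmitted in a single message."""
        start, _ = self.offsets[first]
        last_start, last_size = self.offsets[last]
        return self.data[start:last_start + last_size]
```

Because adjacent layers occupy adjacent slots, `load_span("layer1", "layer2")` returns both gradients without copying them into a separate send buffer first, which is what allows at least two layers to be transmitted together.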

FIG. 9 is a schematic structural diagram of another data processing apparatus provided by an embodiment of the present invention. As shown in FIG. 9, the data processing apparatus includes:

an acquisition module 901 configured to acquire an image to be processed; and

a processing module 902 configured to perform prediction processing on the image to be processed using a neural network model obtained by training, to obtain a prediction result.

It should be understood that the division of the data processing apparatus into units is only a division by logical function; in an actual implementation the units may be fully or partially integrated into one physical entity, or may be physically separate. For example, each unit may be a separately arranged processing element, or may be integrated into the same chip; alternatively, it may be stored in a storage element of a controller in the form of program code and invoked by a particular processing element of a processor to perform the function of that unit. The units may be integrated together or implemented independently. The processing element here may be an integrated circuit chip with signal processing capability. During implementation, each step of the foregoing method or each of the foregoing units may be completed by an integrated logic circuit in the hardware of the processor element or by instructions in the form of software. The processing element may be a general-purpose processor, such as a central processing unit (CPU), or may be one or more integrated circuits configured to implement the foregoing method, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA).

FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the present invention. The server 1000 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPU) 1022 (for example, one or more processors), a memory 1032, one or more storage media 1030 (for example, one or more mass storage devices) storing applications 1042 or data 1044, and one or more acceleration devices 1024 (for example, a GPU or an NPU). The memory 1032 and the storage medium 1030 may provide transient storage or persistent storage. A program stored in the storage medium 1030 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the server 1000, the series of instruction operations in the storage medium 1030. The acceleration device 1024 may perform tasks assigned by the central processing unit 1022, such as image processing tasks. The server 1000 may be the data processing apparatus provided by the embodiments of the present invention.

The server 1000 may further include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.

The steps performed by the data processing apparatus in the foregoing embodiments may be based on the server structure shown in FIG. 10. Specifically, the acceleration device 1024 may implement the functions of the processing module 801 of FIG. 8, and the wired or wireless network interface 1050 may implement the functions of the transceiver module 802 of FIG. 8. Likewise, the acceleration device 1024 may implement the functions of the processing module 902 of FIG. 9, and the wired or wireless network interface 1050 or the input/output interface 1058 may implement the functions of the acquisition module of FIG. 9.

FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of the present invention. As shown in FIG. 11, the terminal device 110 includes a processor 1101, a memory 1102, and a communication interface 1103, which are connected to one another through a bus 1104. The terminal device of FIG. 11 may be the data processing apparatus of the foregoing embodiments.

The memory 1102 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used to store related instructions and data. The communication interface 1103 is used to receive and send data.

The processor 1101 may include one or more CPUs and one or more GPUs. When the processor 1101 includes one CPU, that CPU may be a single-core CPU or a multi-core CPU. The steps performed by the data processing apparatus in the foregoing embodiments may be based on the structure of the terminal device shown in FIG. 11. Specifically, the processor 1101 may implement the functions of the processing module 801 of FIG. 8, and the communication interface 1103 may implement the functions of the transceiver module of FIG. 8. Likewise, the processor 1101 may implement the functions of the processing module 902 of FIG. 9, and the communication interface 1103 may implement the functions of the acquisition module of FIG. 9.

An embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the training method for a neural network model provided by the foregoing embodiments.

An embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the image prediction method provided by the foregoing embodiments.

An embodiment of the present invention provides a computer program product including instructions that, when run on a computer, cause the computer to perform the training method for a neural network model provided by the foregoing embodiments.

An embodiment of the present invention provides a computer program product including instructions that, when run on a computer, cause the computer to perform the image prediction method provided by the foregoing embodiments.

The foregoing is merely specific implementations of the present invention, and the protection scope of the present invention is not limited thereto. A person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present invention, and all such modifications or replacements shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for training a neural network model, comprising:
obtaining, by a first working node, local gradient information of at least one network layer of the neural network model based on a current iteration performed on the neural network model; and
updating, by the first working node, parameters of a second network layer of the neural network model in parallel while transmitting local gradient information of a first network layer of the neural network model with at least one second working node.

2. The method of claim 1, further comprising:
determining, by the first working node, dependency relationships among a plurality of operations of the current iteration based on connection relationships of a plurality of network layers of the neural network model, wherein the plurality of operations include at least a transmission operation and a parameter update operation for local gradient information of at least one network layer of the neural network model; and
performing, by the first working node, the plurality of operations based on the dependency relationships among the plurality of operations.

3. The method of claim 1 or 2, wherein:
the first working node updates parameters of a plurality of network layers of the neural network model layer by layer in reverse order; and/or
the network depth of the second network layer is greater than the network depth of the first network layer.

4. The method of any one of claims 1 to 3, wherein updating, by the first working node, the parameters of the second network layer of the neural network model in parallel while transmitting the local gradient information of the first network layer of the neural network model with the at least one second working node comprises:
updating, by the first working node, the parameters of the second network layer in parallel upon determining, while transmitting the local gradient information of the first network layer of the neural network model with the at least one second working node, that the operations on which the parameter update operation of the second network layer depends have been completed, wherein the operations on which the parameter update operation depends include transmitting the local gradient information of the second network layer with the at least one second working node.

5. The method of any one of claims 1 to 4, further comprising:
computing, by the first working node, local gradient information of a third network layer of the neural network model while transmitting the local gradient information of the first network layer of the neural network model with the at least one second working node.

6. The method of any one of claims 1 to 5, further comprising, before the first working node performs the current iteration on the neural network model:
performing, by the first working node, at least one inner-layer iteration on the neural network model to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration;
wherein obtaining the local gradient information of the at least one network layer of the neural network model based on the current iteration performed by the first working node on the neural network model comprises: obtaining, by the first working node, target fusion gradient information of the at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; and wherein the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.

7. The method of claim 6, wherein obtaining, by the first working node, the target fusion gradient information of the at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration comprises:
accumulating, by the first working node, the intermediate fusion gradient information and the local gradient information obtained through the current iteration to obtain the target fusion gradient information of the at least one network layer of the neural network model.

8. The method of claim 6 or 7, further comprising:
transmitting, by the first working node, target fusion gradient information of a fourth network layer of the neural network model with the at least one second working node in parallel while obtaining target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration.

9. The method of any one of claims 1 to 8, further comprising, before transmitting the local gradient information of the first network layer of the neural network model with the at least one second working node:
amplifying, by the first working node, every value in the local gradient information of the first network layer by a factor of M, and converting each amplified value to half precision, wherein M is a real number greater than 1.

10. The method of any one of claims 1 to 9, further comprising, before the first working node updates the parameters of the second network layer of the neural network model in parallel:
converting, by the first working node, each value in the obtained local gradient information of the second network layer to single precision, and reducing each converted value by a factor of M to obtain processed gradient information, wherein M is a real number greater than 1;
wherein updating, by the first working node, the parameters of the second network layer of the neural network model in parallel comprises:
updating, by the first working node, the parameters of the second network layer of the neural network model using the processed gradient information.

11. The method of any one of claims 1 to 10, further comprising, before transmitting the local gradient information of the first network layer of the neural network model with the at least one second working node:
storing, by the first working node, the computed local gradient information of the first network layer in a pre-allocated target storage space based on an offset corresponding to the first network layer, wherein the target storage space is used to store local gradient information of a plurality of network layers of the neural network model;
wherein the local gradient information of the first network layer sent by the first working node is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the first working node updates the local gradient information of the first network layer stored in the target storage space based on local gradient information of the first network layer received from the at least one second working node.

12. An image prediction method, comprising:
acquiring an image to be processed; and
performing prediction processing on the image to be processed using a neural network model obtained by training with the method of any one of claims 1 to 11, to obtain a prediction result.

13. A data processing apparatus, comprising:
a processing module configured to obtain local gradient information of at least one network layer of a neural network model based on a current iteration performed on the neural network model; and
a transceiver module configured to transmit local gradient information of a first network layer of the neural network model with at least one second working node;
wherein the processing module is further configured to update parameters of a second network layer of the neural network model in parallel while the transceiver module transmits the local gradient information of the first network layer of the neural network model with the at least one second working node.

14. The data processing apparatus of claim 13, wherein
the processing module is further configured to determine dependency relationships among a plurality of operations of the current iteration based on connection relationships of a plurality of network layers of the neural network model, wherein the plurality of operations include at least a transmission operation and a parameter update operation for local gradient information of at least one network layer of the neural network model; and to perform the plurality of operations based on the dependency relationships among the plurality of operations.

15. The data processing apparatus of claim 13 or 14,
wherein the processing module updates parameters of a plurality of network layers of the neural network model layer by layer in reverse order; and/or
a network depth of the second network layer is greater than a network depth of the first network layer.

16. The data processing apparatus according to any one of claims 13 to 15,
wherein the processing module is configured to, while the transceiver module performs the transmission of the local gradient information of the first network layer of the neural network model with the at least one second working node, update the parameters of the second network layer in parallel upon determining that the operation on which the parameter update operation of the second network layer depends has been completed, the operation on which the parameter update operation depends including transmission of the local gradient information of the second network layer with the at least one second working node.

17. The data processing apparatus according to any one of claims 13 to 16,
wherein the processing module is further configured to calculate local gradient information of a third network layer of the neural network model while the transceiver module performs the transmission of the local gradient information of the first network layer of the neural network model with the at least one second working node.

18. The data processing apparatus according to any one of claims 13 to 17,
wherein the processing module is further configured to perform inner-layer iteration on the neural network model at least once to obtain intermediate fusion gradient information corresponding to the at least one inner-layer iteration;
the processing module is configured to obtain target fusion gradient information of at least one network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration; and the local gradient information of the first network layer transmitted between the first working node and the at least one second working node includes the target fusion gradient information of the first network layer.

19. The data processing apparatus of claim 18,
wherein the processing module is configured to accumulate the intermediate fusion gradient information and the local gradient information obtained through the current iteration to obtain the target fusion gradient information of at least one network layer of the neural network model.
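The accumulation described in the two claims above can be sketched as an element-wise sum. The sketch assumes each gradient is a flat list of floats for one network layer; `fuse_gradients` is a hypothetical helper name.

```python
def fuse_gradients(inner_grads, current_grad):
    """Accumulate inner-iteration gradients, then add the current iteration's gradient."""
    # Intermediate fusion gradient: running sum over the inner-layer iterations.
    intermediate = [0.0] * len(current_grad)
    for grad in inner_grads:
        intermediate = [a + b for a, b in zip(intermediate, grad)]
    # Target fusion gradient: intermediate sum plus the current local gradient.
    return [a + b for a, b in zip(intermediate, current_grad)]
```

Only the target fusion gradient would then be exchanged with the other working nodes, so one transmission covers several local iterations.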

20. The data processing apparatus of claim 18 or 19,
wherein the transceiver module is further configured to perform, in parallel, transmission of target fusion gradient information of a fourth network layer of the neural network model with the at least one second working node while the processing module obtains target fusion gradient information of a third network layer of the neural network model based on the intermediate fusion gradient information and the local gradient information corresponding to the current iteration.

21. The data processing apparatus according to any one of claims 13 to 20,
wherein the processing module is further configured to amplify each value in the local gradient information of the first network layer by a factor of M and convert each amplified value to half precision, M being a real number greater than 1.

22. The data processing apparatus according to any one of claims 13 to 21,
wherein the processing module is further configured to convert each numerical value included in the obtained local gradient information of the second network layer to single precision, and reduce each converted value by a factor of M to obtain processed gradient information, M being a real number greater than 1; and
the processing module is configured to update the parameters of the second network layer of the neural network model using the processed gradient information.
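The amplify-then-convert step and its inverse can be sketched with the standard `struct` module, whose `e` and `f` formats round a value to IEEE 754 half and single precision. The function names and the choice M = 1024 in the usage note are assumptions for illustration only.

```python
import struct

def to_half(x):
    """Round a Python float to IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_single(x):
    """Round a Python float to IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def compress(grad, m):
    """Sender side: amplify each value by M, then convert to half precision."""
    return [to_half(g * m) for g in grad]

def restore(half_grad, m):
    """Receiver side: convert to single precision, then reduce by M."""
    return [to_single(h) / m for h in half_grad]
```

A gradient value such as 1e-8 rounds to zero in half precision when converted directly, whereas `restore(compress([1e-8], 1024.0), 1024.0)` recovers it to within about one percent. Amplifying before the conversion is what keeps small gradients from being lost while halving the amount of data transmitted.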

23. The data processing apparatus according to any one of claims 13 to 22,
wherein the processing module is further configured to store the calculated local gradient information of the first network layer in a pre-allocated target storage space based on an offset corresponding to the first network layer, the target storage space being for storing local gradient information of a plurality of network layers of the neural network model; and
the local gradient information of the first network layer sent through the transceiver module is obtained from the target storage space based on the offset corresponding to the first network layer, and/or the processing module is further configured to update the local gradient information of the first network layer stored in the target storage space based on the local gradient information of the first network layer received from the at least one second working node.
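The offset-based storage recited above can be sketched as one contiguous buffer shared by all layers. The class and method names are hypothetical, and averaging the stored and received gradients in `merge` is an assumed update rule, since the claim does not fix one.

```python
class GradientBuffer:
    """One pre-allocated buffer holding the gradients of every layer."""

    def __init__(self, layer_sizes):
        # Assign each layer an (offset, size) slot in a single contiguous space.
        self.offsets, total = {}, 0
        for name, size in layer_sizes.items():
            self.offsets[name] = (total, size)
            total += size
        self.data = [0.0] * total

    def store(self, layer, grad):
        """Write a layer's local gradient at its offset."""
        start, size = self.offsets[layer]
        if len(grad) != size:
            raise ValueError("gradient size does not match the layer's slot")
        self.data[start:start + size] = grad

    def load(self, layer):
        """Read a layer's gradient from its offset, e.g. before sending it."""
        start, size = self.offsets[layer]
        return self.data[start:start + size]

    def merge(self, layer, received):
        """Update the stored gradient in place with one received from a peer."""
        merged = [(a + b) / 2 for a, b in zip(self.load(layer), received)]
        self.store(layer, merged)
```

Allocating the space once avoids a separate allocation per layer and per iteration, and lets a layer's gradient be sent directly from its slot.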

24. A data processing apparatus comprising:
an acquisition module for acquiring an image to be processed; and
a processing module for obtaining a prediction result by performing prediction processing on the image to be processed using a neural network model trained by the method according to any one of claims 1 to 11.

25. A computer-readable storage medium storing a computer program including program instructions,
wherein the program instructions, when executed by a processor of a mobile device, cause the processor to perform the method according to any one of claims 1 to 12.

26. An electronic device comprising a memory and a processor,
wherein the memory is for storing instructions, and the processor executes the instructions stored in the memory to perform the method according to any one of claims 1 to 12.

27. A computer program,
wherein the computer program, when executed by a processor, causes the processor to perform the method according to any one of claims 1 to 12.