CN116151374B - Distributed model reasoning method, device, equipment, storage medium and program product - Google Patents

Distributed model reasoning method, device, equipment, storage medium and program product

Info

Publication number
CN116151374B
CN116151374B (granted from application CN202211532938.9A)
Authority
CN
China
Prior art keywords
model
distributed
target
network layer
tensor
Prior art date
Legal status
Active
Application number
CN202211532938.9A
Other languages
Chinese (zh)
Other versions
CN116151374A (en)
Inventor
郝宏翔
沈亮
巩伟宝
于佃海
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211532938.9A priority Critical patent/CN116151374B/en
Publication of CN116151374A publication Critical patent/CN116151374A/en
Application granted granted Critical
Publication of CN116151374B publication Critical patent/CN116151374B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology


Abstract

The disclosure provides a distributed model reasoning method, apparatus, device, storage medium and program product, which relate to the technical field of data processing, in particular to the technical fields of artificial intelligence, deep learning and distributed computing. The specific implementation scheme is as follows: determining model networking data for a target computing unit according to a distributed training strategy and model networking sub-data associated with a plurality of distributed training computing units; determining model parameters for the target computing unit according to the distributed training strategy and model subparameters associated with the plurality of distributed training computing units; determining a first target model calculation graph for the target computing unit according to the model networking data and the model parameters; sending the first target model calculation graph to a plurality of distributed reasoning computing units; and receiving a target reasoning result from the distributed model reasoning computing unit, wherein the target reasoning result is determined according to the data to be verified, the first target model calculation graph and a distributed reasoning strategy.

Description

Distributed model reasoning method, device, equipment, storage medium and program product
Technical Field
The disclosure relates to the technical field of data processing, in particular to the technical fields of artificial intelligence, deep learning and distributed computing, and specifically relates to a distributed model reasoning method, apparatus, device, storage medium and program product.
Background
With the development of artificial intelligence technology, deep learning, as an important branch of artificial intelligence, has wide application prospects in scenes such as computer vision, intelligent recommendation and natural language processing. As deep learning models are continuously optimized and evolved, the scale of model parameters and the associated data volume grow rapidly, and how to cope with this growth in model data volume has become an urgent technical problem to be solved.
Disclosure of Invention
The present disclosure provides a distributed model reasoning method, apparatus, device, storage medium and program product.
According to an aspect of the present disclosure, there is provided a distributed model reasoning method including: determining model networking data for the target computing unit according to the distributed training strategy and model networking sub-data associated with the plurality of distributed training computing units; determining model parameters for the target computing unit according to the distributed training strategy and model subparameters associated with the plurality of distributed training computing units; determining a first target model calculation graph for the target calculation unit according to the model networking data and the model parameters; transmitting the first target model calculation graph to a plurality of distributed reasoning calculation units; and receiving a target reasoning result from the distributed model reasoning calculation unit, wherein the target reasoning result is determined according to the data to be verified, the first target model calculation graph and the distributed reasoning strategy.
According to another aspect of the present disclosure, there is provided a distributed model reasoning method, including: in response to receiving the first target model computational graph, determining a target reasoning result according to the data to be verified, the first target model computational graph and the distributed reasoning strategy; and sending the target reasoning result.
According to another aspect of the present disclosure, there is provided a distributed model reasoning apparatus, including: the system comprises a model networking data determining module, a model parameter determining module, a first target model calculation map determining module, a first transmitting module and a receiving module. The model networking data determining module is used for determining model networking data aiming at the target computing unit according to the distributed training strategy and model networking sub-data associated with the distributed training computing units; the model parameter determining module is used for determining model parameters aiming at the target computing unit according to the distributed training strategy and the model subparameters associated with the distributed training computing units; the first target model calculation diagram determining module is used for determining a first target model calculation diagram aiming at the target calculation unit according to the model networking data and the model parameters; the first sending module is used for sending the first target model calculation graph to the distributed reasoning calculation units; and the receiving module is used for receiving the target reasoning result from the distributed model reasoning calculation unit, wherein the target reasoning result is determined according to the data to be verified, the first target model calculation graph and the distributed reasoning strategy.
According to another aspect of the present disclosure, there is provided a distributed model reasoning apparatus, including: the system comprises a target reasoning result determining module and a second sending module. The target reasoning result determining module is used for determining a target reasoning result according to the data to be verified, the first target model calculation diagram and the distributed reasoning strategy in response to receiving the first target model calculation diagram; and the second sending module is used for sending the target reasoning result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, the computer program when executed by a processor implementing a method of an embodiment of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a system architecture diagram of a distributed model reasoning method and apparatus in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a distributed model reasoning method in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of model networking data for a target computing unit for a distributed model reasoning method, in which the distributed training strategy is a tensor model parallel strategy, according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of determining model parameters for a target computing unit for a distributed model reasoning method, wherein the distributed strategy is a tensor model parallel strategy, according to a further embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of model networking data for a target computing unit for a distributed model reasoning method, in which the distributed training strategy is a pipelined parallel strategy, according to yet another embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of determining model parameters for a target computing unit for a distributed model reasoning method according to yet another embodiment of the present disclosure, wherein the distributed strategy is a pipelined parallel strategy;
FIG. 7 schematically illustrates a flow diagram of a distributed model reasoning method according to yet another embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a distributed model reasoning apparatus, according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a distributed model reasoning apparatus according to another embodiment of the present disclosure; and
FIG. 10 schematically illustrates a block diagram of an electronic device in which a distributed model reasoning method of an embodiment of the present disclosure may be implemented.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C, etc." are used, the expression should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
With the development of artificial intelligence technology, deep learning, as an important branch of artificial intelligence, has wide application prospects in scenes such as computer vision, intelligent recommendation and natural language processing. As deep learning models are continuously optimized and evolved, the scale of model parameters and the associated data volume grow rapidly, and how to cope with this growth in model data volume has become an urgent technical problem to be solved.
Deep learning models typically require training and reasoning prior to practical application. The deep learning model will be simply referred to as a model hereinafter. The deep learning model includes a plurality of network layers, each network layer having model parameters including weights and offsets. Model training may be understood as performing iterative forward computation and backward propagation on an initial deep learning model with initial model parameters using labeled training samples, updating model parameters, e.g., completing training of the deep learning model in the event that the loss function of the deep learning model with certain model parameters converges.
Unlike model training, model reasoning is forward computation with verification data for a trained model, and the resulting reasoning results can be used, for example, to characterize the performance of the currently trained model.
In some embodiments, model training or model reasoning is performed on a single machine, i.e., a single device. In some cases the single device has only one computing unit, which makes it difficult to handle a large number of model parameters, operations and the like, so model training or model reasoning on a single machine is subject to significant limitations.
In some embodiments, model training or model reasoning can be performed in a distributed manner: multiple devices participate in a distributed fashion, and the number of computing units can be expanded, so that a large number of model parameters, operations and the like can be handled. However, this requires the relevant personnel to have distributed-computing expertise, which imposes higher technical requirements and greater application difficulty on them.
For distributed model reasoning, since only forward computation is performed, distributed model reasoning of some embodiments requires the same number of distributed reasoning computation units as distributed training computation units for distributed model training, with poor flexibility.
In addition, distributed model reasoning of some embodiments has a situation that a specific distributed parallel strategy or a specific model type is not supported, and has a large application limitation.
Fig. 1 schematically illustrates a system architecture of a distributed model reasoning method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include distributed training computing units, a target computing unit 102, a terminal 103, distributed reasoning computing units, and a network 105.
the distributed training computing unit may be provided in an electronic device, for example, and may be used to perform deep learning model training, for example. In the example of FIG. 1, distributed training computing units 101-1 through 101-N are schematically illustrated 1 N is the sum of (C) 1 Specific examples of distributed training computing elements.
The target computing unit 102 may be provided in an electronic device, for example, which may be a computer.
The terminal 103 may be configured to receive, for example, a distributed training policy for a distributed training computing unit and model networking sub-data, and perform a distributed model reasoning method applied to the terminal according to an embodiment of the present disclosure. The terminal 103 may be, for example, a server providing various services.
The distributed reasoning computing units may be arranged in an electronic device, for example, and may be used, for example, to perform deep learning model reasoning after deep learning model training is finished. In the example of FIG. 1, distributed reasoning computing units 104-1 through 104-N2 are schematically illustrated, where N2 is the total number of distributed reasoning computing units.
The distributed reasoning computing unit may be adapted to determine, for example in response to receiving the first target model computation graph, a target reasoning result from the data to be verified, the first target model computation graph and the distributed reasoning strategy.
The network 105 is used as a medium to provide communication links between the distributed training computing units, the target computing unit 102, the terminal 103, and the distributed reasoning computing units. The network 105 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The target computing unit, the distributed training computing unit, and the distributed reasoning computing unit may be processors having computing and storage capabilities, such as a central processing unit, a graphics processor, a neural network processor, and the like. The central processing unit (Central Processing Unit) is referred to as CPU for short, the graphics processor (Graphics Processing Unit) as GPU, and the neural network processor (Neural Network Processing Unit) as NNP.
It should be noted that, the distributed model reasoning method provided by an embodiment of the present disclosure may be performed by a terminal, and the distributed model reasoning method provided by another embodiment of the present disclosure may be performed by a distributed reasoning computing unit.
It should be understood that the numbers of distributed training computing units, target computing units 102, terminals 103, distributed reasoning computing units, and networks 105 in fig. 1 are merely illustrative. There may be any number of distributed training computing units, target computing units 102, terminals 103, distributed reasoning computing units, and networks 105, as desired for implementation.
It should be noted that, in the technical solution of the present disclosure, the processes of collecting, storing, using, processing, transmitting, providing, disclosing, etc. related personal information of the user all conform to the rules of the related laws and regulations, and do not violate the public welfare.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
The embodiment of the present disclosure provides a distributed model reasoning method, and the distributed model reasoning method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 6 in conjunction with the system architecture of fig. 1. The distributed model reasoning method of the embodiments of the present disclosure may be performed, for example, by the terminal 103 shown in fig. 1.
Fig. 2 schematically illustrates a flow chart of a distributed model reasoning method according to an embodiment of the present disclosure.
As shown in fig. 2, the distributed model reasoning method 200 of an embodiment of the present disclosure may include, for example, operations S210-S250.
In operation S210, model networking data for a target computing unit is determined according to a distributed training strategy and model networking sub-data associated with a plurality of distributed training computing units.
A distributed training strategy may be understood as a strategy for distributed model training, which may include, for example, tensor model parallelism strategies and pipeline parallelism strategies.
A distributed training computing unit may be understood as a hardware unit that performs distributed model training, and in the following description, a computing unit will be taken as a GPU as an example, and a plurality of distributed training computing units may be understood as a plurality of GPUs that perform distributed model training, that is, multi-card training.
The target computing unit may be understood as a separate hardware unit, for example, model training or model reasoning for a single card may be performed at the target computing unit.
Model networking data can be understood as data characterizing the hierarchical structure of the deep learning model and the computation process of data from input to output.
In addition, unlike model training, model reasoning is a process of performing only forward computation on the basis of model training and does not involve, for example, an optimizer, so model reasoning can be performed after model training is completed.
For single-card model training on the target computing unit, complete model networking data, i.e., a single-card model networking, is required. For multi-card model training across a plurality of distributed training computing units, each distributed training computing unit involves only a portion of the complete model networking data, i.e., model networking sub-data, since each training unit performs only part of the model training.
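As a purely illustrative sketch (the disclosure does not prescribe a concrete data format, and all names below are hypothetical), model networking data for the target computing unit and the model networking sub-data of one distributed training computing unit might be represented as follows:

```python
# Hypothetical illustration only: the disclosure does not prescribe this format.
# Complete (single-card) model networking data for the target computing unit:
single_card_networking = {
    "layers": ["L0", "L1", "L2", "L3", "L4", "L5"],  # hierarchical structure
    "dataflow": [("input", "L0"), ("L0", "L1"), ("L1", "L2"),
                 ("L2", "L3"), ("L3", "L4"), ("L4", "L5"), ("L5", "output")],
}

# Model networking sub-data of one distributed training computing unit, here a
# pipeline stage that holds only part of the layers:
rank0_networking_sub_data = {
    "layers": ["L0", "L1", "L2"],
    "dataflow": [("input", "L0"), ("L0", "L1"), ("L1", "L2"),
                 ("L2", "send_to_next_stage")],
}
```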
In operation S220, model parameters for the target computing unit are determined according to the distributed training strategy and model subparameters associated with the plurality of distributed training computing units.
It should be noted that, the model networking data characterizes the hierarchical structure of the deep learning model and the calculation process of the data from input to output, and the deep learning model includes model parameters besides the model networking data.
In operation S230, a first target model calculation map for the target calculation unit is determined according to the model networking data and the model parameters.
The computational graph can be understood as graph structure data representing the topology of the computational operations and data involved in the training process of the deep learning model.
The first target model computation graph may be understood as a computation graph for a single card of target computation units.
In operation S240, the first object model calculation map is transmitted to a plurality of distributed inference calculation units.
The distributed reasoning computing unit may be understood as a hardware unit that performs model reasoning; a plurality of distributed reasoning computing units correspond to multi-card reasoning.
In operation S250, a target inference result from the distributed model inference calculation unit is received.
The target reasoning result is determined according to the data to be verified, the first target model calculation graph and the distributed reasoning strategy.
A distributed model reasoning strategy may be understood as a strategy for distributed model reasoning, which may also include, for example, tensor model parallelism strategies and pipeline parallelism strategies.
According to the distributed model reasoning method of the embodiment of the disclosure, by determining the model networking data for the target computing unit according to the distributed training strategy and the model networking sub-data associated with the plurality of distributed training computing units, the multi-card networking of distributed model training can be converted into a single-card networking, and the complete, global model networking can be characterized by the model networking data. By determining the model parameters for the target computing unit according to the distributed training strategy and the model subparameters associated with the plurality of distributed training computing units, the model subparameters of distributed model training can be converted into complete model parameters. The first target model computation graph determined for the target computing unit according to the model networking data and the model parameters is a single-card computation graph. Therefore, the distributed model reasoning method of the embodiment of the disclosure can realize the conversion from the data related to multi-card training into a single-card computation graph, which facilitates the subsequent execution of elastic and flexible distributed model reasoning on a plurality of distributed reasoning computing units.
In an actual application scenario, for example, on the basis of pre-defined multi-card training data, the complete model training process can be determined by utilizing the single-card computation graph through the distributed model reasoning method of the embodiment of the disclosure, providing a basis for model reasoning. Therefore, distributed model training and distributed model reasoning can be decoupled to a certain extent based on the first target model computation graph, so that the distributed model reasoning method of the embodiment of the disclosure can realize elastic and more flexible distributed model reasoning on the basis of distributed model training.
Elastic, more flexible distributed model reasoning is manifested, for example, in the following: distributed model reasoning can subsequently be performed using a distributed reasoning strategy different from the distributed training strategy, or using a number of distributed reasoning computing units different from the number of distributed training computing units.
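A minimal sketch of operations S210-S250 from the terminal's perspective, written here for the pipeline-parallel case only; the function names, data structures and the reduction of a computation graph to a (topology, parameters) pair are assumptions for illustration, not an API defined by the disclosure:

```python
from typing import Dict, List

def merge_networking(sub_networking: List[List[str]]) -> List[str]:
    # S210 (pipeline-parallel case): connect the sub-network layers of each
    # distributed training computing unit, in stage order, into a single-card networking.
    return [layer for stage in sub_networking for layer in stage]

def merge_parameters(sub_params: List[Dict[str, list]]) -> Dict[str, list]:
    # S220: collect the model subparameters of all training units into complete
    # model parameters keyed by network layer name.
    full: Dict[str, list] = {}
    for stage_params in sub_params:
        full.update(stage_params)
    return full

def build_first_target_model_graph(networking: List[str], params: Dict[str, list]) -> dict:
    # S230: the single-card "computation graph" is reduced here to (topology, parameters).
    return {"topology": networking, "params": params}

# S240 and S250 would then send this graph to the plurality of distributed reasoning
# computing units and receive the target reasoning result; communication is omitted here.
```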
Fig. 3 schematically illustrates a schematic diagram of a distributed model reasoning method according to another embodiment of the present disclosure. The distributed training strategy includes a tensor model parallel strategy.
As shown in fig. 3, the distributed model reasoning method according to an embodiment of the present disclosure further includes: and determining a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy. And determining the network layer parameter sub-tensor dimension of the target network layer according to the tensor model parallel strategy aiming at any one first distributed training calculation unit. And determining model networking sub-data according to the parameter sub-tensor dimension and the target network layer aiming at any one first distributed training calculation unit.
The network layer parameter sub-tensor dimension characterizes the dimension of a partial parameter tensor obtained by dividing the full parameter tensor of the target network layer.
The tensor model parallel strategy can be understood as a strategy of dividing the tensors of the model across different distributed training computing units to realize large-scale model training.
For a known tensor model parallel strategy, the plurality of first distributed training computing units corresponding to the tensor model parallel strategy can be determined. The tensor model parallel strategy further characterizes how the model parameters are divided over the first distributed training units, so the network layer parameter sub-tensor of the target network layer indicated by the tensor model parallel strategy can be determined, and then the network layer parameter sub-tensor dimension of the target network layer can be determined.
In the example of fig. 3, two first distributed training calculation units corresponding to the tensor model parallelism strategy s1 are schematically shown, and the identifiers of the two first distributed training calculation units are rank0 and rank1.
For the tensor model parallel strategy, the model structure corresponding to each first distributed training computing unit is the same. For example, the two first distributed training computing units identified as rank0 and rank1 each have a total of 6 network layers: target network layer L0, network layer L1, network layer L2, target network layer L3, target network layer L4 and network layer L5.
In the example of fig. 3, a parallel example of parameter tensors for a portion of the target network layers in the full-scale network layer is schematically shown.
For example, for the total of 6 network layers of the first distributed training computing unit identified as rank0 and the first distributed training computing unit identified as rank1, the parameter tensors of 3 target network layers, namely target network layer L0, target network layer L3 and target network layer L4, are divided in parallel. Among the total of 6 network layers, the parameter tensors of network layer L1, network layer L2 and network layer L5 are not divided.
In the example of fig. 3, examples of different parallel ways of parameter tensors are also schematically shown.
For example, the tensor model parallel strategy may represent parameter tensors in an embedding-parallel manner. Taking the target network layer L0 of the first distributed computing units as an example, the parameter sub-tensors of target network layer L0 may be determined in parallel using embedding parallelism. For example, according to the tensor model parallel strategy in the embedding-parallel manner, the target network layer L0 parameter sub-tensor dimensions of the two first distributed computing units identified as rank0 and rank1 may be determined as parallel embedding[0] and parallel embedding[1].
For example, the tensor model parallel strategy may represent parameter tensors in a column-dimension-parallel manner. Taking the target network layer L3 of the first distributed computing units as an example, the parameter sub-tensors of target network layer L3 are determined, for example, in a parallel manner along the column dimension (column). For example, according to the tensor model parallel strategy in the column-dimension-parallel manner, the target network layer L3 parameter sub-tensor dimensions of the two first distributed computing units identified as rank0 and rank1 may be determined as column parallel[0] and column parallel[1].
For example, the tensor model parallel strategy may represent parameter tensors in a row-dimension-parallel manner. Taking the target network layer L4 of the first distributed computing units as an example, the parameter sub-tensors of target network layer L4 are determined, for example, in a parallel manner along the row dimension (row). For example, according to the tensor model parallel strategy in the row-dimension-parallel manner, the target network layer L4 parameter sub-tensor dimensions of the two first distributed computing units identified as rank0 and rank1 may be determined as row parallel[0] and row parallel[1].
It should be noted that, in the distributed model reasoning method of the embodiment of the present disclosure, the model reasoning process is based on the computation graph obtained from the model training process, and the reasoning process only needs the specific values of the parameter sub-tensors of each network layer when the corresponding network layer runs. Therefore, by using the network layer parameter sub-tensor dimension of the target network layer, determined according to the tensor model parallel strategy, as the basis for subsequently determining the first target model computation graph, only the dimension of the network layer parameter sub-tensor of the target network layer is of concern at this stage, and the model networking sub-data is determined based on the parameter sub-tensor dimension and the target network layer. The storage occupation of the specific values of the network layer parameter sub-tensors of the target network layer can thus be reduced, and the distributed model reasoning efficiency is higher.
According to the distributed model reasoning method disclosed by the embodiment of the disclosure, the parameter tensor parallelism of a part of target network layers and different parallel modes of the parameter tensors are supported in the full-scale network layers, so that an application scene supporting a more flexible tensor model partitioning strategy and covering a larger range of tensor model partitioning strategies can be provided.
Illustratively, a distributed model reasoning method according to yet another embodiment of the present disclosure may enable, for example, determining a specific example of model networking data for a target computing unit from a distributed training strategy and model networking sub-data associated with a plurality of distributed training computing units using the following embodiments: and connecting the network layer parameter sub-tensor dimension of each target network layer according to a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy to obtain the parameter tensor total dimension of the target network layer. And determining model networking data aiming at the target computing unit according to the target network layer and the parameter tensor overall dimension.
The parameter tensor full-quantity dimension is for the target computing unit.
For example, in the above embodiment, in the case where the target network layer L0 parameter sub-tensor dimensions parallel embedding[0] and parallel embedding[1] of the two first distributed computing units identified as rank0 and rank1 are 50 and 50 respectively, the parameter sub-tensor dimensions of the target network layer L0 of the two first distributed computing units may be connected to determine that the parameter tensor full-quantity dimension embedding of the target network layer L0 for the target computing unit SC is 100 (50+50).
For example, in the above embodiment, in the case where the target network layer L3 parameter sub-tensor dimensions column parallel[0] and column parallel[1] of the two first distributed computing units identified as rank0 and rank1 are 30×50 and 30×50 respectively, the parameter sub-tensor dimensions of the target network layer L3 of the two first distributed computing units may be connected to determine that the parameter tensor full-quantity dimension column of the target network layer L3 for the target computing unit SC is 30×100 (30×[50+50]).
For example, in the above embodiment, in the case where the target network layer L4 parameter sub-tensor dimensions row parallel[0] and row parallel[1] of the two first distributed computing units identified as rank0 and rank1 are 30×50 and 30×50 respectively, the parameter sub-tensor dimensions of the target network layer L4 of the two first distributed computing units may be connected to determine that the parameter tensor full-quantity dimension row of the target network layer L4 for the target computing unit SC is 60×50 ([30+30]×50).
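A minimal sketch, assuming the parameter sub-tensor dimensions are available as shape tuples, of how the full-quantity dimensions in the three examples above are obtained by connecting the sub-dimensions along the parallel axis (the axis conventions are assumptions for illustration):

```python
def connect_dims(sub_dims, axis):
    """Connect network layer parameter sub-tensor dimensions along the parallel axis."""
    full = list(sub_dims[0])
    for dim in sub_dims[1:]:
        full[axis] += dim[axis]
    return tuple(full)

# Embedding parallel: 50 and 50 -> 100
print(connect_dims([(50,), (50,)], axis=0))        # (100,)
# Column-dimension parallel: 30x50 and 30x50 -> 30x100
print(connect_dims([(30, 50), (30, 50)], axis=1))  # (30, 100)
# Row-dimension parallel: 30x50 and 30x50 -> 60x50
print(connect_dims([(30, 50), (30, 50)], axis=0))  # (60, 50)
```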
According to the distributed model reasoning method, the network layer parameter tensor dimension of each target network layer is connected through the plurality of first distributed training calculation units corresponding to the tensor model parallel strategy, so that the parameter tensor full dimension of the target network layer can be accurately and rapidly obtained. By determining model networking data for the target computing unit according to the target network layer and the parameter tensor overall dimension, conversion from model networking sub-data of a plurality of distributed training computing units to model networking data for the target computing unit (multi-card networking to single-card networking) is achieved, and subsequent determination of the first target model computing graph based on the model networking data is facilitated.
As shown in fig. 4, a distributed model reasoning method according to yet another embodiment of the present disclosure may implement, for example, a specific example of determining model parameters for a target computing unit from a distributed training strategy and model sub-parameters associated with a plurality of distributed training computing units using the following embodiments: and determining the parameter sub-tensors of the target network layer of each first distributed training computing unit according to the parameter list of the first distributed training computing units. A full-scale parameter tensor for each target network layer is determined from a first global index for a plurality of first distributed training computing elements.
The full-quantity parameter tensor comprises a plurality of parameter sub-tensors, and the first global index characterizes the mapping relation between the global parameter tensor of any one target network layer and a plurality of first distributed training calculation units.
Each first distributed training computing unit may be associated with a corresponding parameter list, i.e. a first distributed training computing unit parameter list. For example, in the example of FIG. 4, the first distributed training computing unit parameter list pa-list0 corresponding to the first distributed training computing unit identified as rank0 and the first distributed training computing unit parameter list pa-list1 corresponding to the first distributed training computing unit identified as rank1 are schematically shown. Taking the example in which 3 target network layers are included among the 6 network layers of the above embodiment, for the target network layer L0 corresponding to the first distributed training computing unit identified as rank0, the parameter sub-tensor param0 slice[0] of target network layer L0 may be determined, for example, from the first distributed training computing unit parameter list pa-list0; similarly, the parameter sub-tensor param0 slice[1] of target network layer L0 of the first distributed training computing unit identified as rank1 can be obtained.
Because the first global index characterizes the mapping relation between the global parameter tensor of any one target network layer and the plurality of first distributed training calculation units, under the condition of determining the parameter tensor of the target network layer of each first distributed training calculation unit, the total parameter tensor of each target network layer can be determined according to the first global index.
For example, to determine the full-quantity parameter tensor param0 of the target network layer L0, the two parameter sub-tensors param0 slice[0] and param0 slice[1] of the target network layer, held by rank0 and rank1 respectively, and the first global index are used.
According to the distributed model reasoning method of the embodiment of the disclosure, the parameter sub tensor of the target network layer of each first distributed training computing unit can be searched and determined according to the first distributed training computing unit parameter list, and the full parameter tensor of each target network layer can be accurately and conveniently mapped and determined according to the first global indexes of the first distributed training computing units. It should be noted that, the parameter tensor total dimension of the target network layer determined in the above embodiment is a dimension of the parameter tensor of the target network layer, and the global parameter tensor determined in the embodiment according to the present disclosure is a specific value of the parameter tensor of the target network layer, so that the parameter tensor of the target network layer for the target computing unit SC (single card) can be determined.
For example, for network layer L1, network layer L2 and network layer L5, since their parameter tensors are not divided, the parameter tensors of network layer L1, network layer L2 and network layer L5 are identical and known at each first distributed training computing unit; therefore the parameter tensors of network layer L1, network layer L2 and network layer L5 of the target computing unit are respectively the same as the parameter tensors of network layer L1, network layer L2 and network layer L5 of the first distributed training computing units.
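A sketch of merging the parameter sub-tensors into full-quantity parameter tensors, assuming numpy arrays and a first global index that records, for each target network layer, the ranks holding its slices and the concatenation axis; this index format and the concrete shapes are assumptions for illustration:

```python
import numpy as np

# First distributed training computing unit parameter lists (one per rank).
pa_list0 = {"L0": np.zeros((50, 8)), "L3": np.ones((30, 50)), "L4": np.ones((30, 50))}
pa_list1 = {"L0": np.zeros((50, 8)), "L3": np.ones((30, 50)), "L4": np.ones((30, 50))}
param_lists = {"rank0": pa_list0, "rank1": pa_list1}

# Hypothetical first global index: target network layer -> (owning ranks, concat axis).
first_global_index = {"L0": (["rank0", "rank1"], 0),
                      "L3": (["rank0", "rank1"], 1),
                      "L4": (["rank0", "rank1"], 0)}

def merge_target_layer_params(index, lists):
    """Map each target network layer to its full-quantity parameter tensor."""
    return {layer: np.concatenate([lists[rank][layer] for rank in ranks], axis=axis)
            for layer, (ranks, axis) in index.items()}

full_params = merge_target_layer_params(first_global_index, param_lists)
print(full_params["L0"].shape)  # (100, 8) -- embedding dimension 50 + 50
print(full_params["L3"].shape)  # (30, 100) -- column-dimension parallel
print(full_params["L4"].shape)  # (60, 50)  -- row-dimension parallel
```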
Fig. 5 schematically illustrates a schematic diagram of a distributed model reasoning method according to a further embodiment of the present disclosure.
The distributed training strategy includes a pipelined parallel strategy.
As shown in fig. 5, the distributed model reasoning method according to an embodiment of the present disclosure further includes: a plurality of second distributed training computing units corresponding to the pipelined parallel strategy is determined. And determining a sub-network layer and a parameter tensor dimension corresponding to the sub-network layer according to the pipeline parallel strategy aiming at any one second distributed training calculation unit. And aiming at any one second distributed training calculation unit, determining model networking sub-data according to the sub-network layer and the parameter tensor dimension corresponding to the sub-network layer.
The sub-network layer is obtained by dividing the total network layer.
Pipeline parallel strategy can be understood as a strategy of placing different network layers of a model into different distributed training calculation units to realize large-scale model training.
For the known pipeline parallel strategy, a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy can be determined, the pipeline parallel strategy further characterizes the division mode of the full-scale network layer of the model on the second distributed training units, and then the sub-network layer and the parameter tensor dimension corresponding to the sub-network layer can be determined.
In the example of fig. 5, two second distributed training computing units corresponding to the pipeline parallel strategy s2 are schematically shown, and the identifications of the two second distributed training computing units are still rank0 and rank1.
FIG. 5 schematically illustrates that the model includes a full quantity of 6 network layers, network layer L0 to network layer L5. Each network layer has its parameter tensor. For the pipeline parallel strategy, each second distributed training computing unit corresponds to a portion of the model. For example, the second distributed training computing unit identified as rank0 corresponds to network layer L0, network layer L1 and network layer L2, and these three can be understood as its sub-network layer. The second distributed training computing unit identified as rank1 corresponds to network layer L3, network layer L4 and network layer L5, and these three can be understood as its sub-network layer.
It should be noted that, similarly to the tensor model parallel policy, the model reasoning process related to the distributed model reasoning method in the embodiment of the present disclosure is based on the calculation graph obtained by the model training process, and the reasoning process only needs specific numerical values of the parameters of each network layer when the corresponding network layer operates, so, according to the distributed model reasoning method in the embodiment of the present disclosure, only the sub-network layer and the parameter tensor dimension corresponding to the sub-network layer are focused currently according to the pipeline parallel policy, and the model networking sub-data is determined based on the sub-network layer and the parameter tensor dimension corresponding to the sub-network layer. The storage occupation of the numerical value of the parameter tensor of the sub-network layer can be reduced, and the distributed model reasoning efficiency is higher.
As shown in fig. 5, a distributed model reasoning method according to yet another embodiment of the present disclosure may enable, for example, determining a specific example of model networking data for a target computing unit from a distributed training strategy and model networking sub-data associated with a plurality of distributed training computing units using the following embodiments: and connecting the sub-network layers according to a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy to obtain a full network layer. And determining model networking data aiming at the target computing unit according to the parameter tensor dimension corresponding to the full-quantity network layer and the sub-network layer.
In the example of fig. 5, the sub-network layer identified as rank0 (for example, network layer L0, network layer L1 and network layer L2) and the sub-network layer identified as rank1 (network layer L3, network layer L4 and network layer L5) are connected to obtain the full-quantity network layer (network layer L0 to network layer L5).
It should be noted that the unique connection order between the sub-network layers may be determined according to the pipeline parallelism strategy S2. For example, in the above example, the sub-network layer connection identified as rank1 is determined following the sub-network layer identified as rank0 according to the pipeline parallelism policy S2.
It should also be noted that, because the pipeline parallel strategy distributes the network layers of the model among multiple distributed training computing units, the parameter tensor of each network layer is not itself divided among multiple distributed training computing units. The dimensions of the parameter tensors of the sub-network layers are therefore unchanged, whether at any one of the second distributed training computing units or at the target computing unit.
According to the distributed model reasoning method, the total network layer can be accurately and rapidly obtained by connecting the sub-network layers corresponding to each second distributed training computing unit according to the plurality of second distributed training computing units corresponding to the pipeline parallel strategy. By determining model networking data for the target computing unit according to the parameter tensor dimensions corresponding to the full-scale network layer and the sub-network layer, conversion from model networking sub-data of a plurality of distributed training computing units to model networking data for the target computing unit (multi-card networking to single-card networking conversion) is realized, and subsequent determination of the first target model computing graph based on the model networking data is facilitated.
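A minimal sketch of connecting the sub-network layers in the order given by the pipeline parallel strategy; representing the strategy as a rank-to-stage mapping is an assumption for illustration:

```python
# Hypothetical pipeline parallel strategy: rank -> pipeline stage order.
pipeline_stage_order = {"rank0": 0, "rank1": 1}

# Sub-network layers held by each second distributed training computing unit.
sub_network_layers = {"rank0": ["L0", "L1", "L2"], "rank1": ["L3", "L4", "L5"]}

def connect_sub_network_layers(stage_order, sub_layers):
    """Connect the sub-network layers in stage order to obtain the full-quantity network layer."""
    full_layers = []
    for rank in sorted(sub_layers, key=stage_order.get):
        full_layers.extend(sub_layers[rank])
    return full_layers

print(connect_sub_network_layers(pipeline_stage_order, sub_network_layers))
# ['L0', 'L1', 'L2', 'L3', 'L4', 'L5']
```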
As shown in fig. 6, a distributed model reasoning method according to yet another embodiment of the present disclosure may enable, for example, the following embodiments to determine specific examples of model parameters for a target computing unit from a distributed training strategy and model sub-parameters associated with a plurality of distributed training computing units: and determining the parameter tensor of the sub-network layer of each second distributed training computing unit according to the parameter list of the second distributed training computing unit. A parameter tensor for the full-scale network layer is determined from the second global index for the plurality of second distributed training computing elements.
The parameter tensor of the full-scale network layer comprises parameter tensors of a plurality of sub-network layers, and the second global index characterizes the mapping relation between the parameter tensors of the full-scale network layer and a plurality of second distributed training calculation units.
Each second distributed training computing unit may be associated with a corresponding parameter list, i.e. a second distributed training computing unit parameter list. For example, in the example of fig. 6, a second distributed training computing unit parameter list pa-list0 corresponding to the second distributed training computing unit identified as rank0 and a second distributed training computing unit parameter list pa-list1 corresponding to the second distributed training computing unit identified as rank1 are schematically shown.
Still taking as an example the full quantity of 6 network layers (network layer L0 to network layer L5) of the above embodiment, where the sub-network layer of the second distributed training computing unit identified as rank0 comprises network layer L0, network layer L1 and network layer L2 and the sub-network layer of the second distributed training computing unit identified as rank1 comprises network layer L3, network layer L4 and network layer L5, the parameter tensors of the sub-network layer of the second distributed training computing unit identified as rank0 (including the parameter tensor param0 of network layer L0, the parameter tensor param1 of network layer L1 and the parameter tensor param2 of network layer L2) may be determined, for example, from the second distributed training computing unit parameter list pa-list0. Similarly, the parameter tensors of the sub-network layer of the second distributed training computing unit identified as rank1 may be obtained.
Because the second global index characterizes the mapping relation between the parameter tensor of the full-scale network layer and the plurality of second distributed training calculation units, under the condition of determining the parameter tensor of the sub-network layer of each second distributed training calculation unit, the parameter tensor of the full-scale network layer can be determined according to the second global index.
According to the distributed model reasoning method, the parameter tensors of the sub-network layers of each second distributed training computing unit can be looked up and determined according to the second distributed training computing unit parameter list, and the parameter tensors of the full-quantity network layer can be accurately and conveniently determined by mapping according to the second global indexes of the plurality of second distributed training computing units.
It should be noted that the parameter tensor dimension corresponding to the sub-network layer determined in the above embodiment is only the dimension of the parameter tensor of the sub-network layer, whereas the parameter tensor of the full-quantity network layer determined according to the embodiment of the present disclosure contains the specific values of the parameter tensors of the plurality of sub-network layers, so the parameter tensor of the full-quantity network layer for the target computing unit SC (single card) may be determined.
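A sketch of assembling the parameter tensors of the full-quantity network layer for the pipeline-parallel case, assuming the second global index simply maps each network layer to the rank that holds its undivided parameter tensor; the index format and shapes are assumptions for illustration:

```python
import numpy as np

# Second distributed training computing unit parameter lists (one per rank).
pa_list0 = {"L0": np.zeros((50, 8)), "L1": np.zeros((8, 8)), "L2": np.zeros((8, 8))}
pa_list1 = {"L3": np.zeros((8, 8)), "L4": np.zeros((8, 8)), "L5": np.zeros((8, 2))}
param_lists = {"rank0": pa_list0, "rank1": pa_list1}

# Hypothetical second global index: network layer -> rank holding its parameter tensor.
second_global_index = {"L0": "rank0", "L1": "rank0", "L2": "rank0",
                       "L3": "rank1", "L4": "rank1", "L5": "rank1"}

def merge_full_network_params(index, lists):
    """Collect the parameter tensor of every network layer; no concatenation is needed,
    because pipeline parallelism does not divide the per-layer parameter tensors."""
    return {layer: lists[rank][layer] for layer, rank in index.items()}

full_params = merge_full_network_params(second_global_index, param_lists)
print(sorted(full_params))  # ['L0', 'L1', 'L2', 'L3', 'L4', 'L5']
```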
Illustratively, in accordance with a distributed model reasoning method of a further embodiment of the present disclosure, the first target model computational graph is characterized by a dynamic computational graph.
The distributed model reasoning method according to the embodiment of the disclosure can further comprise: and converting the first target model calculation graph characterized by the dynamic calculation graph to obtain the first target model calculation graph characterized by the static calculation graph.
In deep learning model construction, some embodiments support both dynamic computational graph (dynamic graph) programming and static computational graph (static graph) programming, which differ in terms of both code writing and execution.
For example, a dynamic computational graph employs the Python programming style, executes each line of network code analytically, and returns the results of the computation at the same time. The static calculation graph adopts a mode of compiling before executing. The complete neural network structure needs to be predefined in the code first.
According to the distributed model reasoning method, by converting the first target model computation graph characterized by a dynamic computation graph into a first target model computation graph characterized by a static computation graph, dynamic and static unification can be achieved, the conversion from dynamic graph to static graph is imperceptible to the relevant technical personnel, the execution performance of distributed model reasoning can be improved, and global optimization is facilitated.
According to the distributed model reasoning method of the further embodiment of the disclosure, for example, the first target model calculation graph using the static graph can be packaged into a program, the program can be understood as a static description of the calculation graph, and the calculation graph can be directly executed based on the program, so that higher execution efficiency is achieved.
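A sketch of the dynamic-to-static conversion and the packaging of the static graph into a program, assuming a PaddlePaddle-style framework (the disclosure does not name a specific framework, and the model definition below is a hypothetical stand-in for the rebuilt single-card networking):

```python
# Sketch only: assumes a PaddlePaddle-style framework and a toy single-card model.
import paddle
from paddle.static import InputSpec

class SingleCardModel(paddle.nn.Layer):
    """Hypothetical single-card networking rebuilt from the merged networking data."""
    def __init__(self):
        super().__init__()
        self.embedding = paddle.nn.Embedding(100, 8)  # full-quantity embedding dimension
        self.linear = paddle.nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(self.embedding(x))

model = SingleCardModel()
# Convert the dynamic-graph model into a static-graph representation ...
static_model = paddle.jit.to_static(
    model, input_spec=[InputSpec(shape=[None, 16], dtype="int64")])
# ... and save the static description ("program") of the computation graph for reuse.
paddle.jit.save(static_model, "./first_target_model")
```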
The embodiment of the present disclosure further provides a distributed model reasoning method, and the distributed model reasoning method according to the exemplary embodiment of the present disclosure is described below with reference to fig. 7 in conjunction with the system architecture of fig. 1. The distributed model reasoning method of the embodiments of the present disclosure may be performed, for example, by the distributed reasoning computing unit shown in fig. 1.
Fig. 7 schematically illustrates a flow chart of a distributed model reasoning method 700 provided by an embodiment of the present disclosure.
As shown in fig. 7, the distributed model reasoning method 700 according to an embodiment of the present disclosure includes operations S710 to S720.
In response to receiving the first target model computation graph, a target inference result is determined according to the data to be validated, the first target model computation graph, and the distributed inference policy in operation S710.
In operation S720, a target inference result is transmitted.
According to the distributed model reasoning method, by determining the target inference result according to the data to be verified, the first target model computation graph and the distributed inference policy in response to receiving the first target model computation graph, distributed model reasoning can be performed on the first target model computation graph for the target computing unit (single card) according to the distributed inference policy, so that the distributed model reasoning efficiency is higher.
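The two operations can be pictured with the following sketch of what one distributed inference computing unit might do. The transport helpers recv/send and the inference_policy object are hypothetical placeholders, since the disclosure does not prescribe a concrete communication interface or policy format.

    import torch

    def distributed_inference_worker(recv, send, inference_policy):
        # recv/send are placeholder transport callables supplied by the runtime.
        static_model = recv("first_target_model_computation_graph")  # e.g. a loaded static graph
        data_to_verify = recv("data_to_be_verified")

        # Operation S710: determine the target inference result from the data to be
        # verified, the first target model computation graph and the distributed
        # inference policy (here modelled as selecting this unit's shard of the data).
        local_batch = inference_policy.shard(data_to_verify)
        with torch.no_grad():
            target_inference_result = static_model(local_batch)

        # Operation S720: send the target inference result.
        send("target_inference_result", target_inference_result)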
Illustratively, according to the distributed model reasoning method of another embodiment of the present disclosure, the distributed inference policy corresponds to a plurality of distributed inference computing units, and the number of distributed inference computing units is different from the number of distributed training computing units.
According to the distributed model reasoning method, since the distributed inference policy corresponds to a plurality of distributed inference computing units whose number may differ from the number of distributed training computing units, flexible and elastic distributed model reasoning can be performed after distributed model training, breaking through the limitation that the number of distributed training computing units performing distributed model training must equal the number of distributed inference computing units performing distributed model reasoning.
Fig. 8 schematically illustrates a block diagram of a distributed model reasoning apparatus, according to an embodiment of the present disclosure.
As shown in fig. 8, the distributed model inference apparatus 800 of the embodiment of the present disclosure includes, for example, a model networking data determination module 810, a model parameter determination module 820, a first target model calculation graph determination module 830, a first sending module 840, and a receiving module 850.
The model networking data determination module 810 is configured to determine model networking data for the target computing unit based on the distributed training strategy and model networking sub-data associated with the plurality of distributed training computing units.
The model parameter determination module 820 is configured to determine model parameters for the target computing unit according to the distributed training strategy and model sub-parameters associated with the plurality of distributed training computing units.
The first target model calculation graph determining module 830 is configured to determine a first target model calculation graph for the target computing unit according to the model networking data and the model parameters.

The first sending module 840 is configured to send the first target model calculation graph to a plurality of distributed inference computing units.

The receiving module 850 is configured to receive target inference results from the distributed inference computing units, where the target inference results are determined according to the data to be verified, the first target model calculation graph, and the distributed inference policy.
According to an embodiment of the present disclosure, the distributed training strategy includes a tensor model parallel strategy; the apparatus further comprises: the first distributed training calculation unit determining module is used for determining a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy; the first dimension determining module is used for determining the network layer parameter sub-tensor dimension of the target network layer according to the tensor model parallel strategy aiming at any one first distributed training computing unit, wherein the network layer parameter sub-tensor dimension represents the dimension of a partial parameter tensor obtained by dividing the full-tensor parameter tensor of the target network layer; the model networking sub-data first determining module is used for determining the model networking sub-data according to the parameter sub-tensor dimension and the target network layer aiming at any one of the first distributed training calculation units.
According to an embodiment of the present disclosure, the model networking data determining module includes: the parameter tensor full-tensor dimension determination submodule is used for connecting the network layer parameter sub-tensor dimension of each target network layer according to a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy to obtain the parameter tensor full-tensor dimension of the target network layer; and the model networking data first determining submodule is used for determining model networking data aiming at the target computing unit according to the target network layer and the parameter tensor overall dimension.
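The "connect the sub-tensor dimensions" step amounts to simple shape arithmetic along the split axis. The short sketch below is only an illustration under assumed inputs; the helper name connect_sub_tensor_dims and the choice of axis are hypothetical.

    def connect_sub_tensor_dims(sub_dims, axis=1):
        # sub_dims: list of per-unit parameter sub-tensor shapes, e.g. [(4, 4), (4, 4)]
        # Returns the full-tensor dimension of the target network layer's parameter tensor.
        full = list(sub_dims[0])
        full[axis] = sum(d[axis] for d in sub_dims)
        return tuple(full)

    assert connect_sub_tensor_dims([(4, 4), (4, 4)], axis=1) == (4, 8)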
According to an embodiment of the present disclosure, the model parameter determining module includes: the parameter sub-tensor determining sub-module is used for determining the parameter sub-tensor of the target network layer of each first distributed training computing unit according to the parameter list of the first distributed training computing unit; the full-quantity parameter tensor determining submodule is used for determining the full-quantity parameter tensor of each target network layer according to first global indexes aiming at a plurality of first distributed training computing units, wherein the full-quantity parameter tensor comprises a plurality of parameter sub-tensors, and the first global indexes represent the mapping relation between the global parameter tensor of any target network layer and the plurality of first distributed training computing units.
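As a concrete illustration of assembling the full-quantity parameter tensor from the parameter sub-tensors according to the first global index, consider the following sketch. The split axis, the helper name and the index format are assumptions for illustration only, not part of the disclosure.

    import numpy as np

    def merge_tensor_parallel_params(sub_tensors, first_global_index, axis=1):
        # sub_tensors: dict mapping first-distributed-training-unit rank -> parameter sub-tensor
        # first_global_index: ordered list of ranks describing how the full parameter tensor
        # of the target network layer maps onto the first distributed training computing units.
        ordered = [sub_tensors[rank] for rank in first_global_index]
        return np.concatenate(ordered, axis=axis)

    # Example: a 4x8 weight of one target network layer split column-wise across two units.
    full = np.arange(32, dtype=np.float32).reshape(4, 8)
    shards = {0: full[:, :4], 1: full[:, 4:]}
    restored = merge_tensor_parallel_params(shards, first_global_index=[0, 1], axis=1)
    assert np.array_equal(restored, full)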
According to an embodiment of the present disclosure, the distributed training strategy comprises a pipeline parallel strategy; the apparatus further comprises: the second distributed training calculation unit determining module is used for determining a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy; the second dimension determining module is used for determining a sub-network layer and parameter tensor dimensions corresponding to the sub-network layer according to the pipeline parallel strategy aiming at any one second distributed training computing unit, wherein the sub-network layer is obtained by dividing a full-scale network layer; the model networking sub-data second determining module is used for determining model networking sub-data according to the sub-network layer and the parameter tensor dimension corresponding to the sub-network layer aiming at any one second distributed training calculation unit.
According to an embodiment of the present disclosure, the model networking data determining module includes: the full-quantity network layer determining submodule is used for connecting the sub-network layers according to a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy to obtain a full-quantity network layer; and the model networking data second determining submodule is used for determining model networking data aiming at the target computing unit according to the parameter tensor dimension corresponding to the full network layer and the sub network layer.
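A minimal sketch of the "connect the sub-network layers" step follows, assuming each pipeline stage reports an ordered list of the layer names it holds; the helper name connect_sub_network_layers and the layer names are hypothetical.

    def connect_sub_network_layers(stage_layer_names):
        # stage_layer_names: list indexed by pipeline-stage rank; each entry is the ordered
        # list of sub-network layers held by that second distributed training computing unit.
        full_network_layer = []
        for stage in stage_layer_names:  # stages are traversed in pipeline order
            full_network_layer.extend(stage)
        return full_network_layer

    stages = [["embedding", "block0"], ["block1", "head"]]
    assert connect_sub_network_layers(stages) == ["embedding", "block0", "block1", "head"]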
According to an embodiment of the present disclosure, the model parameter determining module includes: the sub-network layer parameter tensor determining sub-module is used for determining the parameter tensor of the sub-network layer of each second distributed training computing unit according to the parameter list of the second distributed training computing unit; a full-scale network layer parameter tensor determination submodule for determining a parameter tensor of the full-scale network layer according to the second global index for the plurality of second distributed training calculation units; the parameter tensor of the full-quantity network layer comprises parameter tensors of a plurality of sub-network layers, and the second global index characterizes the mapping relation between the parameter tensor of the full-quantity network layer and a plurality of second distributed training calculation units.
According to an embodiment of the present disclosure, the first target model computational graph is characterized using a dynamic computational graph; the apparatus is further configured to convert the first target model computation graph characterized by the dynamic computational graph to obtain the first target model computation graph characterized by a static computational graph.
Fig. 9 schematically illustrates a block diagram of a distributed model reasoning apparatus according to another embodiment of the present disclosure.
As shown in fig. 9, the distributed model inference apparatus 900 of the embodiment of the present disclosure includes, for example, a target inference result determination module 910 and a second transmission module 920.
The target inference result determining module 910 is configured to determine, in response to receiving the first target model computation graph, a target inference result according to the data to be verified, the first target model computation graph, and the distributed inference policy.
And a second sending module 920, configured to send the target inference result.
According to the embodiment of the disclosure, the distributed reasoning strategy corresponds to a plurality of distributed reasoning computing units, and the number of the distributed reasoning computing units is different from that of the distributed training computing units.
It should be understood that the embodiments of the apparatus portion of the present disclosure correspond to the same or similar embodiments of the method portion of the present disclosure, and the technical problems to be solved and the technical effects to be achieved also correspond to the same or similar embodiments, which are not described herein in detail.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as a distributed model reasoning method. For example, in some embodiments, the distributed model reasoning method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the distributed model reasoning method described above can be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the distributed model reasoning method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A distributed model reasoning method, comprising:
determining model networking data for a target computing unit according to a distributed training strategy and a plurality of model networking sub-data respectively associated with a plurality of distributed training computing units, the plurality of distributed training computing units being a plurality of hardware units for performing distributed model training, each distributed training computing unit performing a portion of the distributed model training using the respective model networking sub-data; the target computing unit is an independent hardware unit for independently executing model training or model reasoning by using the model networking data;
determining the model parameters for the target computing unit according to the distributed training strategy and the model subparameters associated with a plurality of distributed training computing units;
Determining a first target model calculation graph for a target calculation unit according to the model networking data and the model parameters;
transmitting the first target model calculation graph to a plurality of distributed reasoning calculation units, wherein the plurality of distributed reasoning calculation units are a plurality of hardware units used for executing distributed model reasoning according to the first target model calculation graph, and the number of the plurality of distributed training calculation units is different from that of the plurality of distributed reasoning calculation units;
receiving target reasoning results from the plurality of distributed reasoning computing units, wherein the target reasoning results are determined according to data to be verified, the first target model computing graph and a distributed reasoning strategy;
the model is a deep learning model for computer vision, intelligent recommendation and natural language; the distributed training computing unit, the target computing unit, and the distributed reasoning computing unit include at least one of a central processor, a graphics processor, and a neural network processor.
2. The method of claim 1, wherein the distributed training strategy comprises a tensor model parallelism strategy; the method further comprises the steps of:
Determining a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy;
determining the network layer parameter sub-tensor dimension of the target network layer according to the tensor model parallel strategy by aiming at any one first distributed training calculation unit, wherein the network layer parameter sub-tensor dimension represents the dimension of partial parameter tensors obtained by dividing the full-tensor parameter tensor of the target network layer;
and determining the model networking sub-data according to the parameter sub-tensor dimension and a target network layer aiming at any one of the first distributed training calculation units.
3. The method of claim 2, wherein the determining model networking data for the target computing unit based on the distributed training strategy and model networking sub-data associated with the plurality of distributed training computing units comprises:
connecting the network layer parameter sub-tensor dimension of each target network layer according to a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy to obtain the parameter tensor total dimension of the target network layer;
and determining the model networking data aiming at the target computing unit according to the target network layer and the parameter tensor overall dimension.
4. The method of claim 2, wherein the determining the model parameters for the target computing unit according to the distributed training strategy and model subparameters associated with a plurality of the distributed training computing units comprises:
determining a parameter sub-tensor of the target network layer of each first distributed training computing unit according to a first distributed training computing unit parameter list;
determining a full-quantity parameter tensor of each target network layer according to a first global index for a plurality of first distributed training computing units, wherein the full-quantity parameter tensor comprises a plurality of parameter sub-tensors, and the first global index characterizes the mapping relation between the global parameter tensor of any one target network layer and the plurality of first distributed training computing units.
5. The method of claim 1, wherein the distributed training strategy comprises a pipelined parallel strategy; the method further comprises the steps of:
determining a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy;
determining a sub-network layer and parameter tensor dimensions corresponding to the sub-network layer according to the pipeline parallel strategy aiming at any one second distributed training calculation unit, wherein the sub-network layer is obtained by dividing a full-scale network layer;
And determining the model networking sub-data according to the sub-network layer and the parameter tensor dimension corresponding to the sub-network layer aiming at any one second distributed training calculation unit.
6. The method of claim 5, wherein the determining model networking data for the target computing unit based on the distributed training strategy and model networking sub-data associated with the plurality of distributed training computing units comprises:
connecting the sub-network layers according to a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy to obtain a full network layer;
and determining the model networking data aiming at the target computing unit according to the parameter tensor dimension corresponding to the full-scale network layer and the sub-network layer.
7. The method of claim 6, wherein the determining the model parameters for the target computing unit according to the distributed training strategy and model subparameters associated with a plurality of the distributed training computing units comprises:
determining a parameter tensor of the sub-network layer of each second distributed training computing unit according to a second distributed training computing unit parameter list;
Determining a parameter tensor for the full-scale network layer from a second global index for a plurality of the second distributed training computing elements; the parameter tensor of the full-scale network layer comprises parameter tensors of a plurality of sub-network layers, and the second global index characterizes the mapping relation between the parameter tensor of the full-scale network layer and the second distributed training calculation units.
8. The method of any of claims 1-7, wherein the first target model computational graph is characterized by a dynamic computational graph; the method further comprises the steps of:
and converting the first target model calculation graph characterized by the dynamic calculation graph to obtain the first target model calculation graph characterized by the static calculation graph.
9. A distributed model reasoning apparatus comprising:
a model networking data determining module configured to determine model networking data for a target computing unit according to a distributed training strategy and a plurality of model networking sub-data associated with each of a plurality of distributed training computing units, the plurality of distributed training computing units being a plurality of hardware units for performing distributed model training, each distributed training computing unit performing a portion of the distributed model training using the respective model networking sub-data; the target computing unit is an independent hardware unit for independently executing model training or model reasoning by using the model networking data;
The model parameter determining module is used for determining the model parameters aiming at the target computing unit according to the distributed training strategy and the model subparameters associated with a plurality of distributed training computing units;
the first target model calculation diagram determining module is used for determining a first target model calculation diagram aiming at a target calculation unit according to the model networking data and the model parameters;
a first sending module, configured to send the first target model computation graph to a plurality of distributed inference computation units, where the plurality of distributed inference computation units are a plurality of hardware units configured to perform distributed model inference according to the first target model computation graph, and the number of the plurality of distributed training computation units is different from the number of the plurality of distributed inference computation units;
the receiving module is used for receiving target reasoning results from the distributed reasoning calculation units, wherein the target reasoning results are determined according to the data to be verified, the first target model calculation graph and the distributed reasoning strategy;
the model is a deep learning model for computer vision, intelligent recommendation and natural language; the distributed training computing unit, the target computing unit, and the distributed reasoning computing unit include at least one of a central processor, a graphics processor, and a neural network processor.
10. The apparatus of claim 9, wherein the distributed training strategy comprises a tensor model parallelism strategy; the apparatus further comprises:
the first distributed training calculation unit determining module is used for determining a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy;
the first dimension determining module is configured to determine, for any one of the first distributed training computing units, a network layer parameter sub-tensor dimension of a target network layer according to the tensor model parallel policy, where the network layer parameter sub-tensor dimension characterizes a dimension of a partial parameter tensor obtained by dividing a full-tensor parameter tensor of the target network layer;
the first determining module is used for determining the model networking sub-data according to the parameter tensor dimension and the target network layer for any one of the first distributed training calculation units.
11. The apparatus of claim 10, wherein the model networking data determination module comprises:
the parameter tensor full-tensor dimension determining submodule is used for connecting the network layer parameter sub-tensor dimension of each target network layer according to a plurality of first distributed training calculation units corresponding to the tensor model parallel strategy to obtain the parameter tensor full-tensor dimension of the target network layer;
And the model networking data first determining submodule is used for determining the model networking data aiming at the target computing unit according to the target network layer and the parameter tensor overall dimension.
12. The apparatus of claim 10, wherein the model parameter determination module comprises:
a parameter sub-tensor determining sub-module, configured to determine a parameter sub-tensor of the target network layer of each of the first distributed training computing units according to a parameter list of the first distributed training computing units;
the full-quantity parameter tensor determining sub-module is configured to determine a full-quantity parameter tensor of each of the target network layers according to a first global index for a plurality of first distributed training computing units, where the full-quantity parameter tensor includes a plurality of parameter sub-tensors, and the first global index characterizes a mapping relationship between a global parameter tensor of any one of the target network layers and the plurality of first distributed training computing units.
13. The apparatus of claim 9, wherein the distributed training strategy comprises a pipelined parallel strategy; the apparatus further comprises:
the second distributed training calculation unit determining module is used for determining a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy;
The dimension second determining module is used for determining a sub-network layer and parameter tensor dimensions corresponding to the sub-network layer according to the pipeline parallel strategy aiming at any one second distributed training computing unit, wherein the sub-network layer is obtained by dividing a full network layer;
the second determining module is configured to determine, for any one of the second distributed training calculation units, model networking sub-data according to the sub-network layer and the parameter tensor dimension corresponding to the sub-network layer.
14. The apparatus of claim 13, wherein the model networking data determination module comprises:
the full-quantity network layer determining submodule is used for connecting the sub-network layers according to a plurality of second distributed training calculation units corresponding to the pipeline parallel strategy to obtain a full-quantity network layer;
and the model networking data second determining submodule is used for determining the model networking data aiming at the target computing unit according to the parameter tensor dimension corresponding to the full network layer and the sub network layer.
15. The apparatus of claim 14, wherein the model parameter determination module comprises:
A sub-network layer parameter tensor determining sub-module, configured to determine, according to a second distributed training calculation unit parameter list, a parameter tensor of the sub-network layer of each of the second distributed training calculation units;
a full-scale network layer parameter tensor determination submodule for determining a parameter tensor of the full-scale network layer according to second global indexes for a plurality of second distributed training calculation units; the parameter tensor of the full-scale network layer comprises parameter tensors of a plurality of sub-network layers, and the second global index characterizes the mapping relation between the parameter tensor of the full-scale network layer and the second distributed training calculation units.
16. The apparatus of any of claims 9-15, wherein the first target model computational graph is characterized by a dynamic computational graph; the apparatus further comprises:
and converting the first target model calculation graph characterized by the dynamic calculation graph to obtain the first target model calculation graph characterized by the static calculation graph.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202211532938.9A 2022-11-29 2022-11-29 Distributed model reasoning method, device, equipment, storage medium and program product Active CN116151374B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211532938.9A CN116151374B (en) 2022-11-29 2022-11-29 Distributed model reasoning method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211532938.9A CN116151374B (en) 2022-11-29 2022-11-29 Distributed model reasoning method, device, equipment, storage medium and program product

Publications (2)

Publication Number Publication Date
CN116151374A CN116151374A (en) 2023-05-23
CN116151374B (en) 2024-02-13

Family

ID=86349713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211532938.9A Active CN116151374B (en) 2022-11-29 2022-11-29 Distributed model reasoning method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN116151374B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117494816B (en) * 2023-12-31 2024-03-26 摩尔线程智能科技(北京)有限责任公司 Model reasoning method, device, equipment and medium based on computing unit deployment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021062226A1 (en) * 2019-09-25 2021-04-01 Google Llc Reinforcement learning with centralized inference and training
CN114091589A (en) * 2021-11-11 2022-02-25 北京百度网讯科技有限公司 Model training method and device, electronic equipment and medium
CN114862656A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Method for acquiring training cost of distributed deep learning model based on multiple GPUs

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A cloud computing resource scheduling method for distributed machine learning; Liu Yongbo; Li Yaqiong; Zhou Bo; Li Shouchao; Song Yunkui; Computer and Digital Engineering; 2019-12-20 (Issue 12); full text *

Also Published As

Publication number Publication date
CN116151374A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110766142A (en) Model generation method and device
US11651198B2 (en) Data processing method and apparatus for neural network
CN111126590B (en) Device and method for artificial neural network operation
CN111966361B (en) Method, device, equipment and storage medium for determining model to be deployed
CN116151374B (en) Distributed model reasoning method, device, equipment, storage medium and program product
CN114841315A (en) Method and system for implementing hybrid expert model, electronic device and storage medium
CN114816393B (en) Information generation method, device, equipment and storage medium
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN114924862A (en) Task processing method, device and medium implemented by integer programming solver
EP3451240A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
CN114020950A (en) Training method, device and equipment of image retrieval model and storage medium
CN113657468A (en) Pre-training model generation method and device, electronic equipment and storage medium
CN112241761A (en) Model training method and device and electronic equipment
CN114742035B (en) Text processing method and network model training method based on attention mechanism optimization
CN115809688B (en) Model debugging method and device, electronic equipment and storage medium
CN112052152A (en) Simulation test method and device
CN109977011A (en) Automatic generation method, device, storage medium and the electronic equipment of test script
CN114091686B (en) Data processing method and device, electronic equipment and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114998649A (en) Training method of image classification model, and image classification method and device
CN114357180A (en) Knowledge graph updating method and electronic equipment
CN114882333A (en) Training method and device of data processing model, electronic equipment and storage medium
CN113361574A (en) Training method and device of data processing model, electronic equipment and storage medium
CN113760497A (en) Scheduling task configuration method and device
CN115906983B (en) Distributed model training method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant