CN117350384A - Model parallel reasoning method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117350384A
Authority
CN
China
Prior art keywords
model
reasoning
parallel
static
heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311296700.5A
Other languages
Chinese (zh)
Inventor
廖金龙
姚建国
吴长平
许士芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suiyuan Intelligent Technology Chengdu Co ltd
Original Assignee
Suiyuan Intelligent Technology Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suiyuan Intelligent Technology Chengdu Co ltd filed Critical Suiyuan Intelligent Technology Chengdu Co ltd
Priority to CN202311296700.5A
Publication of CN117350384A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a model parallel reasoning method and apparatus, an electronic device and a storage medium. The model parallel reasoning method comprises the following steps: determining a heterogeneous combined model for parallel inference from the models to be inferred, wherein the heterogeneous combined model comprises at least two models to be inferred; performing fusion optimization processing on the static graphs of the models to be inferred in the heterogeneous combined model to obtain a parallel inference static graph of the heterogeneous combined model; constructing parallel inference input data according to the input data of each model to be inferred in the heterogeneous combined model and the parallel inference static graph; and performing parallel inference on the heterogeneous combined model according to the parallel inference input data and the parallel inference static graph. The technical scheme of the embodiment of the invention can improve the performance and efficiency of model parallel inference.

Description

Model parallel reasoning method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, and in particular to a model parallel reasoning method and apparatus, an electronic device, and a storage medium.
Background
Multi-model parallel inference is an important technology in the field of deep learning. By fully utilizing hardware resources such as the GPU (Graphics Processing Unit), it improves the inference efficiency of models as much as possible, shortens inference time, and raises the utilization of a single inference card; multi-model parallel inference is necessary in application scenarios with high real-time requirements.
At present, multi-model parallel inference is mainly implemented in two ways: a multi-process mode and a multi-thread mode. Both modes essentially complete parallel inference by scheduling the computing resources of a single inference card, and involve no improvement of parallel inference at the level of the model infrastructure.
In the process of implementing the present invention, the inventors found the following drawbacks in the prior art: the multi-threaded multi-model parallel inference mode cannot achieve truly parallel inference due to the limitation of the GIL (Global Interpreter Lock) of Python. The multi-process multi-model parallel inference mode suffers from the inference overhead added by process scheduling, and when several heterogeneous models are inferred in parallel, a model with a relatively small inference workload cannot fully utilize the parallel computing capability of the inference card.
Disclosure of Invention
The embodiment of the invention provides a model parallel reasoning method and apparatus, an electronic device and a storage medium, which can improve the performance and efficiency of model parallel inference.
According to an aspect of the present invention, there is provided a model parallel reasoning method, including:
determining a heterogeneous combined model for parallel inference from the models to be inferred, wherein the heterogeneous combined model comprises at least two models to be inferred;
performing fusion optimization processing on the static graphs of the models to be inferred in the heterogeneous combined model to obtain a parallel inference static graph of the heterogeneous combined model;
constructing parallel inference input data according to the input data of each model to be inferred in the heterogeneous combined model and the parallel inference static graph;
and performing parallel inference on the heterogeneous combined model according to the parallel inference input data and the parallel inference static graph.
According to another aspect of the present invention, there is provided a model parallel reasoning apparatus comprising:
a heterogeneous combined model determining module, configured to determine a heterogeneous combined model for parallel inference from the models to be inferred, wherein the heterogeneous combined model comprises at least two models to be inferred;
a parallel inference static graph acquisition module, configured to perform fusion optimization processing on the static graphs of the models to be inferred in the heterogeneous combined model to obtain a parallel inference static graph of the heterogeneous combined model;
a parallel inference input data construction module, configured to construct parallel inference input data according to the input data of each model to be inferred in the heterogeneous combined model and the parallel inference static graph;
and a heterogeneous combined model parallel inference module, configured to perform parallel inference on the heterogeneous combined model according to the parallel inference input data and the parallel inference static graph.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model parallel reasoning method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the model parallel reasoning method of any of the embodiments of the present invention when executed.
According to the embodiment of the invention, a heterogeneous combined model for parallel inference is determined from the models to be inferred; fusion optimization processing is performed on the static graphs of the models to be inferred in the heterogeneous combined model to obtain a parallel inference static graph of the heterogeneous combined model; parallel inference input data is constructed according to the input data of each model to be inferred in the heterogeneous combined model and the parallel inference static graph; and parallel inference is performed on the heterogeneous combined model according to the parallel inference input data and the parallel inference static graph. This solves problems of existing model parallel inference methods, such as low efficiency and poor performance, and can improve the performance and efficiency of model parallel inference.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a model parallel reasoning method provided in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of a model parallel reasoning method provided in a second embodiment of the present invention;
FIG. 3 is a schematic flow chart of a model parallel reasoning method provided by an embodiment of the invention;
FIG. 4 is a schematic flow diagram of heterogeneous model parallel reasoning provided by an embodiment of the present invention;
fig. 5 is a schematic diagram of a model parallel reasoning apparatus according to a third embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and "object" in the description of the present invention and the claims and the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a model parallel reasoning method provided in Embodiment 1 of the present invention. The embodiment is applicable to performing parallel inference on models according to the result of fusion optimization processing of each model's static graph. The method may be performed by a model parallel reasoning apparatus, which may be implemented in software and/or hardware and may generally be integrated in an electronic device. The electronic device may be a terminal device or a server device, as long as it can be used to run model inference; the embodiment of the present invention does not limit the specific type of the electronic device. As shown in Fig. 1, the method includes the following operations:
s110, determining a heterogeneous combination model of parallel reasoning from each model to be reasoning; wherein the heterogeneous combination model comprises at least two models to be inferred.
The model to be inferred may be any model with inference requirements, for example, any type of deep learning network model, reinforcement learning model, or other artificial intelligence network model; the embodiment of the present invention does not limit the specific type of the model to be inferred. The heterogeneous combined model may be a model set formed by combining different models to be inferred.
In the embodiment of the invention, in order to realize parallel inference starting from the underlying structure of the models, the static graphs corresponding to the models to be inferred may be fused, so that the models can be inferred in parallel according to the fusion result of the static graphs. Because the graph-fusion parallel inference process can be realized within a single process and involves no scheduling across processes, the time overhead of process scheduling can be effectively reduced; in addition, each model in the graph-fusion parallel inference process executes its inference on the same thread, so the parallel inference performance of the inference card can be improved.
It will be appreciated that models to be inferred of different types often have inconsistent inference requirements. Therefore, when parallel inference is performed on the models to be inferred through graph fusion, the synchronism of inference requirements should first be guaranteed; that is, only models that need to be inferred at the same time can participate in graph-fusion parallel inference. Hence, the models requiring parallel inference need to be determined from the models to be inferred to form a heterogeneous combined model.
In a specific example, if n models to be inferred are deployed on the inference card and there are heterogeneous models among them, it is theoretically possible to obtain up to Σ_{k=2}^{n} C(n, k) = 2^n - n - 1 heterogeneous combined models, where each heterogeneous combined model comprises at least two models to be inferred.
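As an illustrative sketch (not part of the patent text), these candidate heterogeneous combined models are all subsets containing at least two of the n deployed models and can be enumerated as follows:

```python
from itertools import combinations

def candidate_combined_models(models):
    """Enumerate all candidate heterogeneous combined models: every
    subset of the deployed models containing at least two members."""
    combos = []
    for k in range(2, len(models) + 1):
        combos.extend(combinations(models, k))
    return combos

# With n = 4 deployed models there are 2**4 - 4 - 1 = 11 candidates.
print(len(candidate_combined_models(["A", "B", "C", "D"])))  # 11
```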
S120, performing fusion optimization processing on the static graphs of the models to be inferred in the heterogeneous combined model to obtain a parallel inference static graph of the heterogeneous combined model.
The parallel inference static graph may be the static graph obtained by fusion optimization processing of the static graphs of all models to be inferred in the heterogeneous combined model. That is, the parallel inference static graph is the static graph matched to the heterogeneous combined model; it serves as the reference when the models to be inferred in the heterogeneous combined model are inferred in parallel, so as to execute the data processing logic inside the models.
Because the models to be inferred are mutually independent, static graph compilation can be performed independently for each model to be inferred in the heterogeneous combined model, obtaining the static graph corresponding to each of them. For example, frameworks such as ONNX (Open Neural Network Exchange) or PyTorch (an open-source Python machine learning library based on Torch, used for applications such as natural language processing) may be used to compile a separate static graph for each model to be inferred in the heterogeneous combined model.
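For illustration, a separate static graph can be compiled for one toy model using PyTorch's TorchScript tracing; this is a hedged sketch of one possible compilation path, not the patent's prescribed implementation:

```python
import torch
import torch.nn as nn

class SmallModel(nn.Module):
    """A toy model standing in for one model to be inferred."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SmallModel().eval()
example_input = torch.randn(1, 8)
# Trace the model into a static (TorchScript) graph; exporting with
# torch.onnx.export would yield an ONNX static graph instead.
static_graph = torch.jit.trace(model, example_input)
print(static_graph.graph)  # the compiled static computation graph IR
```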
Correspondingly, after the independent static graphs of the models to be inferred in the heterogeneous combined model are obtained, fusion optimization processing can be performed on them to obtain the parallel inference static graph of the heterogeneous combined model. For the other models to be inferred that do not participate in graph fusion, the static graph optimization process can be performed independently to obtain the computation graphs finally used for their inference.
It can be understood that when multiple models to be inferred that have not undergone graph fusion are inferred independently, each model's current inference request must be served by invoking a process, so multi-process scheduling exists; furthermore, the more scattered and the smaller the static graphs, the greater the inference overhead, which reduces the parallel inference performance of the models. If models to be inferred with the same or similar inference requirements are combined into a heterogeneous combined model, and the static graphs of the models in the heterogeneous combined model are then fused and optimized, the models can be inferred in parallel within one process based on the parallel inference static graph of the heterogeneous combined model, thereby improving the performance and efficiency of model parallel inference. The method can be applied to different hardware devices, AI (Artificial Intelligence) accelerator cards, algorithm frameworks, and the like.
S130, constructing parallel inference input data according to the input data of each model to be inferred in the heterogeneous combined model and the parallel inference static graph.
The parallel inference input data is the data input to the inference card so that parallel inference can be performed on each model to be inferred in the heterogeneous combined model.
It is understood that model inference obtains an inference result from given input data according to the operational rules established by the model. Since the heterogeneous combined model uses a single process at the graph level to infer multiple models in parallel, the inference input data of the models contained in the heterogeneous combined model may differ. Therefore, before the process is started, the input data of the models to be inferred in the heterogeneous combined model can be spliced according to reference factors such as how the heterogeneous combined model was formed, obtaining the parallel inference input data.
S140, performing parallel inference on the heterogeneous combined model according to the parallel inference input data and the parallel inference static graph.
Correspondingly, after the parallel inference input data is constructed, it can be input to the inference card. According to the parallel inference input data, the inference card can perform parallel inference on each model to be inferred in the heterogeneous combined model based on the parallel inference static graph obtained by fusion optimization processing.
According to the embodiment of the invention, a heterogeneous combined model for parallel inference is determined from the models to be inferred; fusion optimization processing is performed on the static graphs of the models to be inferred in the heterogeneous combined model to obtain a parallel inference static graph of the heterogeneous combined model; parallel inference input data is constructed according to the input data of each model to be inferred in the heterogeneous combined model and the parallel inference static graph; and parallel inference is performed on the heterogeneous combined model according to the parallel inference input data and the parallel inference static graph. This solves problems of existing model parallel inference methods, such as low efficiency and poor performance, and can improve the performance and efficiency of model parallel inference.
Example two
Fig. 2 is a flowchart of a model parallel reasoning method provided in Embodiment 2 of the present invention. This embodiment is implemented based on the foregoing embodiment and provides specific alternative implementations of determining the heterogeneous combined model for parallel inference, performing fusion optimization processing on the static graphs, and constructing the parallel inference input data. As shown in Fig. 2, the method of this embodiment may include:
s210, determining a heterogeneous model combination strategy.
The heterogeneous model combination strategy is the strategy referenced when forming heterogeneous combined models from the models to be inferred.
Optionally, the heterogeneous model combination strategy may include at least one of the following: a model structure similarity strategy, a model algorithm similarity strategy, and an operator use similarity strategy.
The model structure similarity strategy combines models whose model structures are partially similar into a heterogeneous combined model. The model algorithm similarity strategy combines models whose algorithm principles or types are similar into a heterogeneous combined model. The operator use similarity strategy combines models that use similar operators into a heterogeneous combined model.
S220, determining the heterogeneous combined model for parallel inference from the models to be inferred according to the heterogeneous model combination strategy.
It can be appreciated that, in practice, some models to be inferred differ greatly from one another, for example in model structure, algorithm type, or internal operators. For such models, graph fusion may not yield a worthwhile improvement in inference performance. Therefore, when determining the heterogeneous combined model, one or more heterogeneous model combination strategies can be selected, and the heterogeneous combined model for parallel inference is then determined from the models to be inferred according to the selected strategies.
In a specific example, assume that three heterogeneous models to be inferred, A, B and C, are deployed on the inference card. Parts of the structures of model A and model B are consistent, parts of the structures of model B and model C are consistent, and model A and model C differ greatly; that is, inference performance may not improve after graph fusion of model A and model C. Thus, model A and model B may be selected as one heterogeneous combined model, and model B and model C as another.
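A minimal sketch of how such a combination strategy might be applied is given below; the operator sets, the Jaccard metric, and the threshold are all hypothetical illustrations of the operator use similarity strategy, which the patent does not specify concretely:

```python
def operator_similarity(ops_a, ops_b):
    """Jaccard similarity between the operator sets of two models."""
    union = ops_a | ops_b
    return len(ops_a & ops_b) / len(union) if union else 0.0

def select_combinations(models, threshold=0.3):
    """Pair models whose operator usage is similar enough that graph
    fusion can be expected to pay off (operator use similarity strategy)."""
    names = list(models)
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if operator_similarity(models[names[i]], models[names[j]]) >= threshold
    ]

# Hypothetical operator sets mirroring the A/B/C example above:
# A and B overlap, B and C overlap, A and C share nothing.
models = {
    "A": {"conv", "relu"},
    "B": {"conv", "relu", "add", "sigmoid"},
    "C": {"add", "sigmoid", "lstm", "tanh"},
}
print(select_combinations(models))  # [('A', 'B'), ('B', 'C')]
```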
S230, fusing the static graphs of the models to be inferred in the heterogeneous combined model according to a sub-graph splicing and fusion strategy to obtain a fused static graph.
The sub-graph splicing and fusion strategy is the strategy used to fuse the static graphs of the models to be inferred in the heterogeneous combined model. The fused static graph is the static graph obtained by that fusion.
In the embodiment of the invention, when the static graphs of the models to be inferred in the heterogeneous combined model undergo fusion optimization processing, they can first be fused according to the sub-graph splicing and fusion strategy to obtain the fused static graph.
In an alternative embodiment of the present invention, the sub-graph splicing and fusion strategy may include constant folding, computing node fusion, and parallelism analysis. Fusing the static graphs of the models to be inferred in the heterogeneous combined model according to the sub-graph splicing and fusion strategy may include: performing constant folding on the target constant expressions of the static graphs of the models to be inferred in the heterogeneous combined model; performing computing node fusion on the first target computing nodes of those static graphs; and performing parallel computation analysis on the second target computing nodes of those static graphs.
The target constant expression may be an expression in a static graph on which constant folding can be performed. The first target computing node may be a type of computing node on which node fusion can be performed. The second target computing node may be a type of computing node on which parallel computation can be performed.
Optionally, static graph fusion techniques such as constant folding, computing node fusion, and parallelism analysis can serve as the sub-graph splicing and fusion strategy, according to which the static graphs of the models to be inferred in the heterogeneous combined model are fused.
Specifically, an expression in the static graph of a model to be inferred on which constant folding can be performed is determined as a target constant expression, and constant folding is applied to it. For example, during compiler parsing, the target constant expression is evaluated, and the expression is replaced with the resulting value, which is placed in the constant table.
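The constant folding step can be illustrated with a toy expression IR (a sketch only; the patent's actual graph representation is not specified):

```python
def constant_fold(node):
    """Fold constant sub-expressions of a tiny expression IR.
    A node is a number, a variable name (str), or a tuple
    (op, left, right) with op in {'+', '*'}."""
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = constant_fold(left), constant_fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        # Evaluate at compile time and substitute the resulting value.
        return left + right if op == "+" else left * right
    return (op, left, right)

# x + (2 * 3) folds to x + 6: the constant sub-expression is replaced
# by its value, which would be placed in the constant table.
print(constant_fold(("+", "x", ("*", 2, 3))))  # ('+', 'x', 6)
```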
Specifically, the computing nodes of the static graphs of the models to be inferred in the heterogeneous combined model can be analyzed and classified to screen out the first target computing nodes and the second target computing nodes, and computing node fusion is performed on the first target computing nodes. For example, to obtain higher execution performance for a neural network with a multi-branch structure, several computing nodes that can execute in parallel and consume few computational resources are determined as first target computing nodes, and their computation logic is fused before execution. The ways of fusing the first target computing nodes may include, but are not limited to, horizontal fusion and operator-level parallelism. Correspondingly, parallel computation analysis can be performed on the second target computing nodes.
In this way, the static graphs of the models to be inferred in the heterogeneous combined model can be fused uniformly through the available sub-graph splicing and fusion strategy, splicing several small static graphs into one larger fused static graph.
S240, optimizing the fused static graph according to a static graph optimization strategy to obtain the parallel inference static graph.
The static graph optimization strategy is the strategy used to optimize the fused static graph.
Correspondingly, after the fused static graph is obtained, it can be optimized uniformly once more to obtain the final parallel inference static graph.
In an optional embodiment of the present invention, optimizing the fused static graph according to the static graph optimization strategy may include: performing arrangement optimization on the execution timing and data layout of the fused static graph; performing serial-parallel combination optimization on the control flow of the fused static graph; and fusing the target operators of the fused static graph.
Optionally, when the fused static graph is optimized according to the static graph optimization strategy, its execution timing and data layout can be optimized. Specifically, the execution timing of the fused static graph can be optimized automatically to improve the concurrency of communication and computation. Meanwhile, data-layout optimization and static memory arrangement can be applied to the fused static graph, improving memory utilization, reducing memory fragmentation, and allowing a larger batch size (the number of samples fed to the program at a time), thereby improving computing performance.
Optionally, when the fused static graph is optimized according to the static graph optimization strategy, its control flow can undergo serial-parallel combination optimization; that is, the serialization and parallelization of the control flow are optimized to avoid the extra parallel read-write contention introduced by optimization and code transformation, improving overall performance.
Optionally, when the fused static graph is optimized according to the static graph optimization strategy, its target operators can be fused. For example, several small operators of the fused static graph are fused into one large operator, which reduces access to intermediate data and increases computation density, thereby improving performance.
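As a sketch of this idea (using NumPy purely for illustration; the real fusion happens inside the graph compiler), three small operators (matmul, bias add, ReLU) can be replaced by one large fused operator so that the intermediates are not materialized between graph nodes:

```python
import numpy as np

def unfused(x, w, b):
    t1 = x @ w                # intermediate tensor written out
    t2 = t1 + b               # second intermediate tensor
    return np.maximum(t2, 0)  # ReLU

def fused_matmul_bias_relu(x, w, b):
    """One 'large operator' covering matmul + bias + ReLU; in a graph
    compiler this removes two intermediate-tensor reads/writes."""
    return np.maximum(x @ w + b, 0)

x, w, b = np.ones((2, 3)), np.ones((3, 4)), np.zeros(4)
# Both forms compute identical results; only the memory traffic differs.
assert np.allclose(unfused(x, w, b), fused_matmul_bias_relu(x, w, b))
```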
The fused static graph is optimized through the optional static graph optimization strategies above, finally yielding the fusion-optimized parallel inference static graph. At this point, the heterogeneous combined model can execute computation on the obtained parallel inference static graph within a single process, which not only reduces process scheduling overhead but also reduces the number of scattered small graphs, improving computation density and performance.
S250, determining the data dimensions of the input data of each model to be inferred in the heterogeneous combined model.
It is understood that the dimensions of the input data of different models to be inferred may also differ. For example, the input data of model A may have dimensions 4×8, while the input data of model B may have dimensions 4×12. Therefore, when constructing the parallel inference input data, the data dimensions of the input data of each model to be inferred in the heterogeneous combined model can be determined, so that the input data can be spliced according to those dimensions.
S260, determining the fusion sequence of each static diagram in the parallel reasoning static diagram.
S270, determining the splicing sequence of the input data according to the fusion sequence of the static diagrams in the parallel reasoning static diagrams.
S280, splicing the input data according to the data dimension of the input data according to the splicing sequence of the input data, and obtaining the parallel reasoning input data.
In the embodiment of the invention, when the input data of the models to be inferred are spliced, the fusion order of the static diagrams in the parallel reasoning static diagram can be considered in addition to the data dimensions of the input data. The splicing order of the input data is determined according to the fusion order of the static diagrams, and the input data are then spliced according to that splicing order and their data dimensions, finally yielding the parallel reasoning input data.
In a specific example, assume that the fusion order of static diagram A and static diagram B in the parallel reasoning static diagram is static diagram A followed by static diagram B, where static diagram A belongs to model A and static diagram B belongs to model B. The splicing order of the input data can then be determined as the input data of model A followed by the input data of model B. Assuming the input data of model A has dimension 4×8 and the input data of model B has dimension 4×4, splicing the input data in this order according to their dimensions appends the 4×4 input of model B after the 4×8 input of model A, yielding parallel reasoning input data of dimension 4×12.
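The splicing in this example can be sketched with NumPy (an illustrative assumption; the patent does not prescribe a tensor library), concatenating the two inputs along the last axis in the fusion order of the static diagrams:

```python
import numpy as np

# Hypothetical inputs matching the example: model A takes a 4x8 tensor,
# model B takes a 4x4 tensor; the fusion order is A then B.
input_a = np.zeros((4, 8))
input_b = np.ones((4, 4))

# Splice along the last axis in the splicing order A, B.
parallel_input = np.concatenate([input_a, input_b], axis=1)
print(parallel_input.shape)  # (4, 12)
```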
S290, determining the reasoning priority sequence of each heterogeneous combination model according to model operation performance and/or model reasoning requirements.
The inference priority order is used to determine the inference order of the different heterogeneous combination models. It will be appreciated that the higher the inference priority of a heterogeneous combination model, the earlier it is inferred.
Optionally, when the inference priority order is determined according to model running performance, the better the running performance of a model, the higher its inference priority. The model running performance may be determined from factors such as the model running duration, computation speed, and storage footprint; for example, the shorter the running duration, the faster the computation speed, and the smaller the storage footprint, the better the model running performance.
Optionally, when the inference priority order is determined according to model inference requirements, the higher the matching degree between a model and the inference requirements, the higher its inference priority. By way of example, the model inference requirements may be determined from factors such as user priority information.
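A minimal sketch of deriving the inference priority order from running performance (the record fields and the timings are hypothetical): sorting the heterogeneous combination models by running duration, shortest first, yields the priority order described above.

```python
# Hypothetical performance records for the heterogeneous combination models.
models = [
    {"name": "bert+gpt", "run_time_ms": 12.0},
    {"name": "bert+vit", "run_time_ms": 15.0},
    {"name": "gpt+vit",  "run_time_ms": 9.0},
]

# Shorter running duration -> better running performance -> higher priority.
priority_order = [m["name"] for m in sorted(models, key=lambda m: m["run_time_ms"])]
print(priority_order)  # ['gpt+vit', 'bert+gpt', 'bert+vit']
```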
S2110, performing parallel reasoning on each heterogeneous combination model according to the inference priority order, the parallel reasoning input data, and the parallel reasoning static diagram.
Because graph-fusion parallel reasoning requires synchrony across the fused models, when a large number of heterogeneous combination models need parallel reasoning, the inference priority order of the heterogeneous combination models can be determined according to model running performance and/or model inference requirements under a greedy strategy. The heterogeneous combination model to be inferred next is then selected according to this priority order, and parallel reasoning is performed on it using its parallel reasoning input data and its parallel reasoning static diagram.
In a specific example, when the models to be inferred are actually deployed for parallel reasoning, the inference card background can continuously receive the inference requests of the models and execute inference in queue fashion. First, the running durations of the parallel reasoning static diagrams of the heterogeneous combination models are sorted; the shorter the running duration of a parallel reasoning static diagram, the better the performance of its heterogeneous combination model. Then, according to the queued inference requests, the requests matching the best-performing heterogeneous combination model are combined, its input data are spliced, and its parallel reasoning static diagram is used to execute parallel reasoning. This parallel reasoning mode reduces process scheduling overhead and fully utilizes the computing power of the accelerator card to improve inference performance.
The technical scheme provides a graph-fusion-based heterogeneous model parallel reasoning method: the static diagrams of the models to be inferred are fused and optimized uniformly, and multiple models to be inferred are reasoned in parallel within a single process, which effectively reduces the process scheduling overhead of model parallel reasoning and improves its performance.
In order to more clearly describe the technical solution provided by the embodiment of the present invention, in a specific example, heterogeneous models such as bert (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model), gpt (Generative Pre-Trained Transformer, a generative pre-trained transformer model), and vit (Vision Transformer, a self-attention-based image classification model) are deployed on a single GPU inference card, to specifically describe the model parallel reasoning method provided by the embodiment of the present invention. Fig. 3 is a schematic flow chart of a model parallel reasoning method according to an embodiment of the present invention. Accordingly, as shown in fig. 3, the model parallel reasoning method may include the following operations:
and (1) determining a heterogeneous model and a heterogeneous combined model of parallel reasoning.
The main structures of the bert, gpt, and vit models are all transformers, so the model structures are highly similar. According to the model structure similarity strategy, combining bert and gpt yields the bert+gpt heterogeneous combination model; combining bert and vit yields bert+vit; combining gpt and vit yields gpt+vit; and combining all three yields bert+gpt+vit. In total, four heterogeneous combination models are obtained: bert+gpt, bert+vit, gpt+vit, and bert+gpt+vit. Together with the three independent heterogeneous models bert, gpt, and vit, this is equivalent to the inference card deploying 7 models. At inference time, a process is started for each of the 7 models to perform inference computation. Compared with independent model inference, a heterogeneous combination model completes multiple inference requests in one process, which effectively saves process scheduling and improves inference performance.
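The enumeration of the four heterogeneous combination models can be sketched as follows (illustrative only; for bert, gpt, and vit the structure similarity strategy admits every subset of two or more models, since all three share a transformer backbone):

```python
from itertools import combinations

models = ["bert", "gpt", "vit"]

# All three models share a transformer backbone, so under the structure
# similarity strategy every subset of two or more models is a valid
# heterogeneous combination model.
combos = [
    "+".join(c)
    for r in range(2, len(models) + 1)
    for c in combinations(models, r)
]
print(combos)  # ['bert+gpt', 'bert+vit', 'gpt+vit', 'bert+gpt+vit']
```

Together with the three independent models, this gives the 7 deployed models mentioned above.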
And (2) compiling a static diagram of the heterogeneous model.
Since the heterogeneous model structures are mutually independent, independent static diagram compilation is performed on bert, gpt, and vit respectively, yielding a fixed compiled code text for each. For example, the pytorch framework may be used to compile static diagrams for bert, gpt, and vit separately, yielding three independent static diagrams. On one hand, graph optimization can be performed on the independent static diagrams to obtain independent optimized inference static diagrams; on the other hand, static-diagram fusion optimization can be performed for the four heterogeneous combination models bert+gpt, bert+vit, gpt+vit, and bert+gpt+vit.
And (3) fusing and optimizing the heterogeneous model static diagram.
When graph fusion optimization is performed for each of the four heterogeneous combination models bert+gpt, bert+vit, gpt+vit, and bert+gpt+vit, all static diagram information is already available, so the fusion optimization can proceed according to a given strategy. Specifically, when the static diagrams in a heterogeneous combination model are fused, a sub-graph splicing fusion operation can be performed: several small graphs are spliced into one large graph through constant folding, computation node fusion, parallel analysis, and similar means, yielding the fusion static diagram of each heterogeneous combination model. Furthermore, the timing and data arrangement of the fusion static diagram can be optimized: the execution timing is optimized automatically to increase the concurrency of communication and computation, and data arrangement optimization together with static memory arrangement improves memory utilization, reduces memory fragmentation, and allows larger batch sizes, thereby improving performance. Serial-parallel merging optimization of the control flow of the fusion static diagram avoids the extra parallel read-write contention introduced by optimization and code conversion, improving overall performance; and fusing several small operators of the fusion static diagram into one large operator reduces access to intermediate data and increases computation density, further improving performance. Through this static diagram fusion optimization, each heterogeneous combination model finally obtains its optimized parallel reasoning static diagram.
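As an illustrative sketch of the constant-folding step named above (a toy expression graph, not the patent's actual graph representation), sub-expressions whose inputs are all compile-time constants are evaluated once at fusion time:

```python
# Toy expression graph: ("const", v), ("var", name), or (op, left, right).
def fold(node):
    if node[0] in ("const", "var"):
        return node
    op, left, right = node[0], fold(node[1]), fold(node[2])
    if left[0] == "const" and right[0] == "const":
        # Both inputs are known at compile time: evaluate the node now,
        # so no computation node remains in the fused graph.
        value = left[1] + right[1] if op == "add" else left[1] * right[1]
        return ("const", value)
    return (op, left, right)

# (2 + 3) * 4 folds to the single constant 20 at compile time.
print(fold(("mul", ("add", ("const", 2), ("const", 3)), ("const", 4))))
```

Nodes depending on a runtime variable are left untouched, which is why folding shrinks the graph without changing its result.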
And (4) splicing and parallel reasoning of heterogeneous model input data.
Model inference obtains an inference result by applying a given operation rule to given input data. A heterogeneous combination model performs single-process parallel inference of multiple heterogeneous models at the graph level. The heterogeneous combination model contains several heterogeneous models whose input data have different shapes; for example, the input data of bert and gpt are inconsistent: the input of bert has dimension 4×8 while that of gpt has dimension 4×4. Therefore, before the inference process is started, the input data of the heterogeneous models in the heterogeneous combination model are spliced according to their data dimensions and the composition of the combination model, yielding the parallel reasoning input data. In the above example, the input data of the heterogeneous models of the bert+gpt combination model are spliced into data of dimension 4×12 before the process starts. After the parallel reasoning input data are obtained, parallel inference computation is executed according to the fused parallel reasoning static diagram, yielding the parallel inference result. Of course, the input data may also be preprocessed according to its actual shape, for example by padding (filling a tensor out to a target shape) or cropping.
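The padding/cropping preprocessing mentioned above can be sketched as follows (an illustrative NumPy helper; the function name and the zero-fill choice are assumptions, not the patent's prescribed scheme):

```python
import numpy as np

def pad_to_width(x, width):
    # Pad the last axis with zeros up to `width`, or crop if it is wider,
    # so every model's input matches the shape expected by the fused graph.
    if x.shape[-1] >= width:
        return x[..., :width]
    pad = [(0, 0)] * (x.ndim - 1) + [(0, width - x.shape[-1])]
    return np.pad(x, pad)

print(pad_to_width(np.ones((4, 4)), 8).shape)  # (4, 8)
```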
When several heterogeneous combination models are inferred in parallel, inference synchrony must be ensured. For a large number of parallel inference requests over heterogeneous combination models, the requests matching the best-performing combination models can be combined preferentially according to a greedy strategy. Fig. 4 is a schematic flow chart of heterogeneous model parallel reasoning provided by the embodiment of the invention. In a specific example, as shown in fig. 4, the specific procedure of heterogeneous model parallel reasoning is as follows:
In actual heterogeneous model parallel inference deployment, the background continuously receives inference requests and executes inference in queue fashion. First, the running durations of the parallel reasoning static diagrams of the four heterogeneous combination models obtained in step (3) are sorted; for example, if the running durations are ordered as bert+gpt+vit > bert+gpt > bert+vit > gpt+vit, the current heterogeneous combination model is determined by combining the inference requests of the well-performing heterogeneous combination models, e.g. the first combination model bert+gpt+vit is taken as the current heterogeneous combination model. After the current heterogeneous combination model is determined, its parallel reasoning input data and its fused parallel reasoning static diagram are used to perform parallel inference on it. This approach reduces process scheduling overhead, fully utilizes the computing power of the accelerator card, and improves inference performance.
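The greedy selection described above can be sketched as follows (illustrative only; the priority ordering is the one assumed in the example, and the request-matching rule is a simplification of the queue logic): the scheduler scans combination models in priority order and picks the first one whose member models all have pending inference requests.

```python
# Priority ordering assumed in the example above, highest priority first.
PRIORITY = ["bert+gpt+vit", "bert+gpt", "bert+vit", "gpt+vit"]

def pick_next(pending_models):
    # Greedy: take the highest-priority combination model whose members
    # all have a pending inference request in the queue.
    for combo in PRIORITY:
        if set(combo.split("+")) <= set(pending_models):
            return combo
    return None

print(pick_next({"bert", "gpt"}))  # bert+gpt
```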
In a specific example, heterogeneous models such as yolo (You Only Look Once, a deep-learning-based object detection algorithm), resnet (deep residual network), transformers (transformer models), and gpt are deployed on a general AI acceleration device, to further illustrate the model parallel reasoning method provided by an embodiment of the present invention. Accordingly, as shown in fig. 3, the model parallel reasoning method may include the following operations:
and (1) determining a heterogeneous model and a heterogeneous combined model of parallel reasoning.
Among these heterogeneous models, the main structures of yolo and resnet are convolution computations, while the main structures of transformers and gpt are transformers, so each pair has high structural similarity. Combining yolo and resnet therefore yields the yolo+resnet heterogeneous combination model, and combining transformers and gpt yields transformers+gpt. Although convolution-type and transformer-type models have different network structures, both are ultimately lowered to matrix operations at execution time, so they too could form heterogeneous combination models; to simplify processing, only the two combination models above are formed. This is equivalent to the inference card deploying 6 models: the four independent heterogeneous models yolo, resnet, transformers, and gpt, and the two heterogeneous combination models yolo+resnet and transformers+gpt. At inference time, a process is started for each of the 6 models to perform inference computation. Compared with independent model inference, a heterogeneous combination model completes two inference requests in one process, effectively saving process scheduling and improving inference performance.
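The structure-similarity grouping in step (1) can be sketched as follows (the structure labels are illustrative assumptions): models sharing a main computation structure are grouped into candidate heterogeneous combination models.

```python
# Hypothetical mapping from model name to its dominant computation structure.
MAIN_STRUCTURE = {
    "yolo": "convolution",
    "resnet": "convolution",
    "transformers": "transformer",
    "gpt": "transformer",
}

# Group models that share a main structure into combination candidates.
groups = {}
for name, structure in MAIN_STRUCTURE.items():
    groups.setdefault(structure, []).append(name)

combos = ["+".join(members) for members in groups.values() if len(members) > 1]
print(combos)  # ['yolo+resnet', 'transformers+gpt']
```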
And (2) compiling a static diagram of the heterogeneous model.
Since the heterogeneous model structures are mutually independent, independent static diagram compilation is performed on yolo, resnet, transformers, and gpt respectively, yielding a fixed compiled code text for each. For example, the onnx framework may be used to compile static diagrams for yolo, resnet, transformers, and gpt separately, yielding four independent static diagrams. On one hand, graph optimization can be performed on the independent static diagrams to obtain independent optimized inference static diagrams; on the other hand, static-diagram fusion optimization can be performed for the two heterogeneous combination models yolo+resnet and transformers+gpt.
And (3) fusing and optimizing the heterogeneous model static diagram.
When graph fusion optimization is performed for the two heterogeneous combination models yolo+resnet and transformers+gpt, all static diagram information is already available, so the fusion optimization can proceed according to a given strategy. Specifically, when the static diagrams in a heterogeneous combination model are fused, a sub-graph splicing fusion operation can be performed: several small graphs are spliced into one large graph through constant folding, computation node fusion, parallel analysis, and similar means, yielding the fusion static diagram of each heterogeneous combination model. The timing and data arrangement of the fusion static diagram are then optimized to increase the concurrency of communication and computation, while data arrangement optimization and static memory arrangement improve memory utilization, reduce memory fragmentation, and allow larger batch sizes, thereby improving performance. Serial-parallel merging optimization of the control flow avoids the extra parallel read-write contention introduced by optimization and code conversion, improving overall performance; and fusing several small operators into one large operator reduces access to intermediate data and increases computation density. Through this static diagram fusion optimization, each heterogeneous combination model finally obtains its optimized parallel reasoning static diagram.
And (4) splicing and parallel reasoning of heterogeneous model input data.
Similarly, the input data of the heterogeneous combination models are spliced and inference is performed in parallel.
It should be noted that any permutation and combination of the technical features in the above embodiments also belong to the protection scope of the present invention.
Example III
Fig. 5 is a schematic diagram of a model parallel reasoning apparatus according to a third embodiment of the present invention, as shown in fig. 5, where the apparatus includes: a heterogeneous combined model determination module 310, a parallel reasoning static diagram acquisition module 320, a parallel reasoning input data construction module 330, and a heterogeneous combined model parallel reasoning module 340, wherein:
a heterogeneous combination model determining module 310, configured to determine a heterogeneous combination model of parallel reasoning from the models to be reasoning; wherein the heterogeneous combination model comprises at least two models to be inferred;
the parallel reasoning static diagram obtaining module 320 is configured to perform fusion optimization processing on the static diagrams of each model to be reasoning in the heterogeneous combination model, so as to obtain a parallel reasoning static diagram of the heterogeneous combination model;
a parallel reasoning input data construction module 330, configured to construct parallel reasoning input data according to the input data of each model to be reasoning in the heterogeneous combination model and the parallel reasoning static diagram;
And the heterogeneous combination model parallel reasoning module 340 is configured to perform parallel reasoning on the heterogeneous combination model according to the parallel reasoning input data and the parallel reasoning static diagram.
According to the embodiment of the invention, a heterogeneous combination model for parallel reasoning is determined from the models to be inferred; fusion optimization is performed on the static diagrams of the models to be inferred in the heterogeneous combination model to obtain its parallel reasoning static diagram; parallel reasoning input data is constructed according to the input data of the models to be inferred and the parallel reasoning static diagram; and parallel reasoning is performed on the heterogeneous combination model according to the parallel reasoning input data and the parallel reasoning static diagram. This solves the problems of low efficiency and poor performance of existing model parallel reasoning methods, and improves the performance and efficiency of model parallel reasoning.
Optionally, the heterogeneous combination model determining module 310 is specifically configured to: determining a heterogeneous model combination strategy; determining a heterogeneous combination model of the parallel reasoning from each model to be inferred according to the heterogeneous model combination strategy; wherein the heterogeneous model combining strategy comprises at least one of the following: model structure similarity strategy, model algorithm similarity strategy and operator use similarity strategy.
Optionally, the parallel reasoning static diagram acquisition module 320 is specifically configured to: carrying out fusion processing on the static diagrams of each model to be inferred in the heterogeneous combination model according to a sub-graph splicing fusion strategy to obtain fusion static diagrams; and optimizing the fusion static diagram according to a static diagram optimization strategy to obtain the parallel reasoning static diagram.
Optionally, the subgraph splicing and fusing strategy comprises constant folding, computing node fusion and parallel analysis; the parallel reasoning static diagram acquisition module 320 is specifically configured to: performing constant folding processing on target constant expressions of static diagrams of the models to be inferred in the heterogeneous combination model; performing computing node fusion processing on a first target computing node of the static diagram of each model to be inferred in the heterogeneous combination model; and carrying out calculation node parallel calculation analysis processing on a second target calculation node of the static diagram of each model to be inferred in the heterogeneous combination model.
Optionally, the parallel reasoning static diagram acquisition module 320 is specifically configured to: performing arrangement optimization on the time sequence and the data of the fusion static diagram; performing serial-parallel combination optimization on the control flow of the fusion static diagram; and carrying out fusion processing on the target operators of the fusion static diagram.
Optionally, the parallel reasoning input data construction module 330 is specifically configured to: determining the data dimension of the input data of each model to be inferred in the heterogeneous combination model; determining the fusion sequence of each static diagram in the parallel reasoning static diagram; determining the splicing sequence of the input data according to the fusion sequence of each static diagram in the parallel reasoning static diagram; and splicing the input data according to the data dimension of the input data according to the splicing sequence of the input data to obtain the parallel reasoning input data.
Optionally, there are a plurality of heterogeneous combination models; the heterogeneous combined model parallel reasoning module 340 is specifically configured to: determine the inference priority order of the heterogeneous combination models according to model running performance and/or model inference requirements; and perform parallel reasoning on each heterogeneous combination model according to the inference priority order, the parallel reasoning input data, and the parallel reasoning static diagram.
The model parallel reasoning device can execute the model parallel reasoning method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details which are not described in detail in this embodiment can be seen in the model parallel reasoning method provided by any embodiment of the present invention.
Since the model parallel reasoning apparatus described above is an apparatus capable of executing the model parallel reasoning method in the embodiment of the present invention, those skilled in the art can, based on the model parallel reasoning method described herein, understand the specific implementation of the model parallel reasoning apparatus of this embodiment and its various variations; therefore, how the apparatus implements the method will not be described in detail here. Any apparatus used by a person skilled in the art to implement the model parallel reasoning method in the embodiment of the present invention falls within the scope of protection sought by the present application.
Example IV
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the model parallel reasoning method.
In some embodiments, the model parallel reasoning method may be implemented as a computer program, which is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the model parallel reasoning method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the model parallel reasoning method in any other suitable way (e.g. by means of firmware).
By way of example, the model parallel reasoning method may include the following operations:
determining a heterogeneous combination model of parallel reasoning from each model to be reasoning; wherein the heterogeneous combination model comprises at least two models to be inferred;
carrying out fusion optimization treatment on the static diagrams of each model to be inferred in the heterogeneous combination model to obtain a parallel inference static diagram of the heterogeneous combination model;
constructing parallel reasoning input data according to the input data of each model to be reasoning in the heterogeneous combination model and the parallel reasoning static diagram;
and carrying out parallel reasoning on the heterogeneous combination model according to the parallel reasoning input data and the parallel reasoning static diagram.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also known as a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.

Claims (10)

1. A model parallel reasoning method, comprising:
determining a heterogeneous combination model for parallel reasoning from models to be inferred; wherein the heterogeneous combination model comprises at least two models to be inferred;
performing fusion optimization on the static graphs of the models to be inferred in the heterogeneous combination model to obtain a parallel reasoning static graph of the heterogeneous combination model;
constructing parallel reasoning input data according to the input data of each model to be inferred in the heterogeneous combination model and the parallel reasoning static graph;
and performing parallel reasoning on the heterogeneous combination model according to the parallel reasoning input data and the parallel reasoning static graph.
2. The method of claim 1, wherein determining a heterogeneous combination model for parallel reasoning from the models to be inferred comprises:
determining a heterogeneous model combination strategy;
determining the heterogeneous combination model for parallel reasoning from the models to be inferred according to the heterogeneous model combination strategy;
wherein the heterogeneous model combination strategy comprises at least one of the following: a model structure similarity strategy, a model algorithm similarity strategy, and an operator usage similarity strategy.
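The operator usage similarity strategy of claim 2 can be sketched as grouping models whose operator sets overlap sufficiently. This is an illustrative sketch only; the Jaccard measure, the threshold value, and the model/operator names are hypothetical choices, not mandated by the claims:

```python
def operator_similarity(ops_a, ops_b):
    """Jaccard similarity of the operator sets used by two models."""
    union = len(ops_a | ops_b)
    return len(ops_a & ops_b) / union if union else 0.0

def combine_models(model_ops, threshold=0.5):
    """Greedily group models whose operator usage is similar enough
    to make them candidates for one heterogeneous combination model."""
    groups = []
    for name, ops in model_ops.items():
        for group in groups:
            if all(operator_similarity(ops, model_ops[m]) >= threshold
                   for m in group):
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found: start a new one
    return groups

# Hypothetical models described by the operators they use
models = {
    "resnet": {"conv2d", "relu", "batchnorm", "add"},
    "vgg":    {"conv2d", "relu", "maxpool"},
    "bert":   {"matmul", "softmax", "layernorm", "gelu"},
}
groups = combine_models(models, threshold=0.4)
```

With the threshold at 0.4 the two convolutional models land in one group while the transformer stays alone.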
3. The method of claim 1, wherein performing fusion optimization on the static graphs of the models to be inferred in the heterogeneous combination model to obtain the parallel reasoning static graph of the heterogeneous combination model comprises:
fusing the static graphs of the models to be inferred in the heterogeneous combination model according to a sub-graph stitching fusion strategy to obtain a fused static graph;
and optimizing the fused static graph according to a static graph optimization strategy to obtain the parallel reasoning static graph.
4. The method of claim 3, wherein the sub-graph stitching fusion strategy comprises constant folding, compute node fusion, and parallel analysis; and fusing the static graphs of the models to be inferred in the heterogeneous combination model according to the sub-graph stitching fusion strategy comprises:
performing constant folding on target constant expressions of the static graphs of the models to be inferred in the heterogeneous combination model;
performing compute node fusion on first target compute nodes of the static graphs of the models to be inferred in the heterogeneous combination model;
and performing parallel-computation analysis on second target compute nodes of the static graphs of the models to be inferred in the heterogeneous combination model.
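The constant-folding step of claim 4 can be sketched on a minimal expression-tree IR: any sub-tree whose inputs are all constants is evaluated at fusion time. The tuple-based IR and the two-operator table below are hypothetical illustrations, not the patent's representation:

```python
import operator

# Minimal IR: ("const", value) | ("var", name) | ("op", op_name, children)
OPS = {"add": operator.add, "mul": operator.mul}

def constant_fold(node):
    """Recursively replace sub-trees whose children are all constants
    with a single pre-computed constant node."""
    kind = node[0]
    if kind in ("const", "var"):
        return node
    _, op, children = node
    children = [constant_fold(c) for c in children]
    if all(c[0] == "const" for c in children):
        return ("const", OPS[op](*(c[1] for c in children)))
    return ("op", op, children)

# x * (2 + 3): the (2 + 3) sub-expression is a target constant expression
expr = ("op", "mul", [("var", "x"),
                      ("op", "add", [("const", 2), ("const", 3)])])
folded = constant_fold(expr)
```

After folding, the addition has been collapsed so the fused graph multiplies `x` by a single constant at reasoning time.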
5. The method according to claim 3, wherein optimizing the fused static graph according to the static graph optimization strategy comprises:
optimizing the timing and data arrangement of the fused static graph;
performing serial-parallel combination optimization on the control flow of the fused static graph;
and fusing target operators of the fused static graph.
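The operator-fusion step of claim 5 can be sketched as collapsing a chain of elementwise operators into one operator, so intermediate results are never materialized. The operator chain below is a hypothetical example:

```python
def fuse_elementwise(ops):
    """Fuse a chain of elementwise operator functions into a single
    operator that applies them in sequence in one pass."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused

# Three hypothetical adjacent elementwise operators in the fused graph
scale = lambda x: x * 0.5
shift = lambda x: x + 1.0
relu  = lambda x: max(x, 0.0)

fused_op = fuse_elementwise([scale, shift, relu])
```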
6. The method of claim 1, wherein constructing the parallel reasoning input data from the input data of each model to be inferred in the heterogeneous combination model and the parallel reasoning static graph comprises:
determining the data dimension of the input data of each model to be inferred in the heterogeneous combination model;
determining the fusion order of the static graphs within the parallel reasoning static graph;
determining the stitching order of the input data according to the fusion order of the static graphs within the parallel reasoning static graph;
and stitching the input data along their data dimensions in the stitching order to obtain the parallel reasoning input data.
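The input-stitching of claim 6 can be sketched as concatenating each model's input along an agreed dimension in fusion order, while recording split points so each sub-graph's slice can be recovered. Model names, shapes, and the batch-axis choice are hypothetical:

```python
import numpy as np

def build_parallel_input(inputs_by_model, fusion_order, axis=0):
    """Concatenate per-model inputs along one data dimension, following
    the fusion order of the sub-graphs, and record the split points so
    outputs can later be routed back to the right model."""
    ordered = [inputs_by_model[name] for name in fusion_order]
    split_points = np.cumsum([a.shape[axis] for a in ordered])[:-1]
    return np.concatenate(ordered, axis=axis), split_points

# Two hypothetical models whose inputs share the feature dimension (4)
inputs = {"model_a": np.ones((2, 4)), "model_b": np.zeros((3, 4))}
parallel_input, splits = build_parallel_input(inputs, ["model_a", "model_b"])
```

`np.split(parallel_input, splits)` would then recover the two original input blocks.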
7. The method of claim 1, wherein there are a plurality of heterogeneous combination models, and performing parallel reasoning on the heterogeneous combination models according to the parallel reasoning input data and the parallel reasoning static graph comprises:
determining a reasoning priority order of the heterogeneous combination models according to model running performance and/or model reasoning requirements;
and performing parallel reasoning on each heterogeneous combination model in the reasoning priority order according to the parallel reasoning input data and the parallel reasoning static graph.
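The priority ordering of claim 7 can be sketched with a heap keyed on reasoning demand and expected runtime. The two keys, their relative weighting, and the model records below are hypothetical choices for illustration:

```python
import heapq

def schedule(combined_models):
    """Order heterogeneous combination models for dispatch: higher
    reasoning demand first, faster expected runtime breaking ties."""
    # Negate demand so the min-heap pops the highest-demand model first
    heap = [(-m["demand"], m["runtime_ms"], m["name"]) for m in combined_models]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order

models = [
    {"name": "combo_a", "demand": 1, "runtime_ms": 30},
    {"name": "combo_b", "demand": 5, "runtime_ms": 80},
    {"name": "combo_c", "demand": 5, "runtime_ms": 20},
]
dispatch_order = schedule(models)
```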
8. A model parallel reasoning apparatus, comprising:
a heterogeneous combination model determining module, configured to determine a heterogeneous combination model for parallel reasoning from models to be inferred; wherein the heterogeneous combination model comprises at least two models to be inferred;
a parallel reasoning static graph acquisition module, configured to perform fusion optimization on the static graphs of the models to be inferred in the heterogeneous combination model to obtain a parallel reasoning static graph of the heterogeneous combination model;
a parallel reasoning input data construction module, configured to construct parallel reasoning input data according to the input data of each model to be inferred in the heterogeneous combination model and the parallel reasoning static graph;
and a heterogeneous combination model parallel reasoning module, configured to perform parallel reasoning on the heterogeneous combination model according to the parallel reasoning input data and the parallel reasoning static graph.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model parallel reasoning method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores computer instructions which, when executed, cause a processor to implement the model parallel reasoning method of any of claims 1-7.
CN202311296700.5A 2023-10-08 2023-10-08 Model parallel reasoning method and device, electronic equipment and storage medium Pending CN117350384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311296700.5A CN117350384A (en) 2023-10-08 2023-10-08 Model parallel reasoning method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117350384A true CN117350384A (en) 2024-01-05

Family

ID=89355308




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination