CN116402141B - Model reasoning method and device, electronic equipment and storage medium

Info

Publication number: CN116402141B (granted publication of application CN116402141A)
Application number: CN202310678319.9A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 孙瑞鑫 (Sun Ruixin), 吴志华 (Wu Zhihua)
Assignee (original and current): Taichu Wuxi Electronic Technology Co., Ltd.
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources to service a request
    • G06F9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the invention disclose a model reasoning method and device, electronic equipment and a storage medium. The model reasoning method specifically comprises the following steps: determining a reasoning acceleration card corresponding to a current reasoning model, and determining acceleration card model parameters corresponding to the reasoning acceleration card; performing parameter segmentation on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card; determining the parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card; and, when a model reasoning request is received, executing model reasoning of the current reasoning model through the reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space. The technical scheme of the embodiments of the invention can reduce the video memory overhead of the reasoning acceleration card and improve the performance of the reasoning acceleration card, thereby improving the model reasoning performance of the current reasoning model.

Description

Model reasoning method and device, electronic equipment and storage medium
Technical Field
The embodiments of the invention relate to the technical field of computers, and in particular to a model reasoning method and device, electronic equipment and a storage medium.
Background
With the continuous development of artificial intelligence technology, large models at the billion/trillion parameter scale are becoming more and more common, and the demand for online model reasoning services provided by such large models keeps growing.
However, when a large model at the billion/trillion parameter scale performs online model reasoning, a single accelerator card cannot hold the model parameters of the whole model, so the prior art generally uses multiple accelerator cards to load the model parameters of the large model and performs online model reasoning across those cards. Implementing model reasoning with multiple accelerator cards in this way, however, generally introduces video memory redundancy, which reduces the model reasoning capability of the large model.
Disclosure of Invention
The embodiments of the invention provide a model reasoning method and device, electronic equipment and a storage medium, which can reduce the video memory overhead of a reasoning acceleration card and improve the performance of the reasoning acceleration card, thereby improving the model reasoning performance of a current reasoning model.
According to an aspect of the present invention, there is provided a model reasoning method including:
determining an inference acceleration card corresponding to a current inference model, and determining acceleration card model parameters corresponding to the inference acceleration card;
performing parameter segmentation on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card;
determining a corresponding parameter storage space of the model segmentation parameter in the reasoning acceleration card;
and, when a model reasoning request is received, executing model reasoning of the current reasoning model through the reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space.
According to another aspect of the present invention, there is provided a model reasoning apparatus comprising:
the reasoning acceleration card determining module is used for determining a reasoning acceleration card corresponding to the current reasoning model and determining acceleration card model parameters corresponding to the reasoning acceleration card;
the parameter segmentation module is used for carrying out parameter segmentation on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card;
the storage space determining module is used for determining a parameter storage space corresponding to the model segmentation parameter in the reasoning acceleration card;
and the model reasoning module is used for executing the model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space through the reasoning acceleration card under the condition of receiving a model reasoning request.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model reasoning method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the model reasoning method of any of the embodiments of the present invention when executed.
According to the technical scheme of the embodiments of the invention, the reasoning acceleration card corresponding to the current reasoning model is determined along with the acceleration card model parameters corresponding to it; parameter segmentation is performed on the acceleration card model parameters to obtain the model segmentation parameters corresponding to the reasoning acceleration card; and the parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card is determined, so that when a model reasoning request is received, model reasoning of the current reasoning model is executed through the reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space. This solves the prior-art problem of low model reasoning capability caused by video memory redundancy, reduces the video memory overhead of the reasoning acceleration card, and improves the performance of the reasoning acceleration card, thereby improving the model reasoning performance of the current reasoning model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a model reasoning method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a model reasoning method according to a second embodiment of the present invention.
Fig. 3 is a flowchart of a model reasoning method according to a third embodiment of the present invention.
Fig. 4 is an exemplary flowchart of a model reasoning method according to a fourth embodiment of the present invention.
Fig. 5 is a schematic diagram of the video memory segmentation of an accelerator card according to a fourth embodiment of the invention.
Fig. 6 is a schematic diagram of a current inference model according to a fourth embodiment of the present invention.
Fig. 7 is a schematic diagram of a model inference apparatus according to a fifth embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an electronic device implementing a model reasoning method of an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a model reasoning method provided in an embodiment of the present invention. The embodiment is applicable to reducing the video memory overhead of a reasoning acceleration card and improving the performance of the reasoning acceleration card. The method may be carried out by a model reasoning device, which may be implemented in software and/or hardware and may generally be integrated directly in the electronic device that performs the method; the electronic device may be a terminal device or a server device, and the embodiment of the present invention does not limit the type of electronic device that performs the model reasoning method. Specifically, as shown in fig. 1, the model reasoning method includes the following steps:
S110, determining an inference acceleration card corresponding to the current inference model, and determining acceleration card model parameters corresponding to the inference acceleration card.
The current reasoning model can be any large model which is currently used for model reasoning. The inference accelerator card may be an accelerator card device capable of performing model inference tasks. It is to be appreciated that the current inference model can correspond to one or more inference accelerator cards to enable model inference of the current inference model by the inference accelerator cards. The accelerator card model parameters may be model parameters used by the inference accelerator card in performing model inference tasks. It is understood that different inferential accelerator cards may correspond to different accelerator card model parameters.
In the embodiment of the invention, the reasoning acceleration card corresponding to the current reasoning model is determined so as to determine the acceleration card model parameters corresponding to the reasoning acceleration card. Alternatively, the inference accelerator card may be a heterogeneous many-core device. It should be noted that, the embodiment of the present invention does not limit a specific implementation manner of determining the accelerator card model parameter, so long as the determination of the accelerator card model parameter can be achieved.
S120, carrying out parameter segmentation on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card.
The parameter segmentation may be to segment the acceleration card model parameters, so as to store the segmented parameters into the corresponding storage spaces. The model segmentation parameters may be parameters obtained after parameter segmentation is performed on the accelerator card model parameters.
In the embodiment of the invention, after the acceleration card model parameters corresponding to the reasoning acceleration card are determined, the acceleration card model parameters can be further subjected to parameter segmentation to obtain the model segmentation parameters corresponding to the reasoning acceleration card. It can be understood that the parameter segmentation is performed on the acceleration card model parameters, so that a plurality of model segmentation parameters can be obtained.
S130, determining a corresponding parameter storage space of the model segmentation parameter in the reasoning acceleration card.
The parameter storage space can be a space in the reasoning acceleration card for storing model segmentation parameters. It is understood that the number of parameter storage spaces may be consistent with the number of model segmentation parameters.
In the embodiment of the invention, after the model parameters of the acceleration card are subjected to parameter segmentation to obtain the model segmentation parameters corresponding to the inference acceleration card, the corresponding parameter storage space of the model segmentation parameters in the inference acceleration card can be further determined so as to store the model segmentation parameters in the corresponding parameter storage space.
S140, when a model reasoning request is received, executing model reasoning of the current reasoning model through the reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space.
The model reasoning request may be a request for the current reasoning model to perform model reasoning.
In the embodiment of the invention, after the corresponding parameter storage space of the model segmentation parameters in the reasoning acceleration card is determined, the model segmentation parameters can be further stored in the parameter storage space, so that when a model reasoning request is received, the model reasoning of the current reasoning model is executed by the reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space.
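For illustration only, the following minimal Python sketch outlines the S110-S140 flow. All names (InferenceCard, assign_card_parameters, and so on) are assumptions made for exposition, not an API defined by this embodiment, and the round-robin assignment is just one possible mapping of parameters to cards.

```python
# Illustrative sketch of the S110-S140 flow; names and data structures are
# assumptions, and plain dicts stand in for on-card video memory.
from dataclasses import dataclass, field

@dataclass
class InferenceCard:
    card_id: int
    storage: dict = field(default_factory=dict)  # parameter storage spaces (S130)

def assign_card_parameters(model_params: dict, cards: list) -> dict:
    """S110: map every parameter tensor of the current model to one card."""
    assignment = {card.card_id: {} for card in cards}
    for i, (name, tensor) in enumerate(model_params.items()):
        assignment[cards[i % len(cards)].card_id][name] = tensor
    return assignment

def split_parameters(card_params: dict) -> list:
    """S120: split one card's parameters into model segmentation parameters."""
    return [{name: tensor} for name, tensor in card_params.items()]

def allocate_storage(card: InferenceCard, partitions: list) -> None:
    """S130: reserve one parameter storage space per segmentation parameter."""
    for idx, part in enumerate(partitions):
        card.storage[f"space_{idx}"] = part

def handle_request(card: InferenceCard, request):
    """S140: run inference on the card from the stored segmentation parameters."""
    return [f"card {card.card_id} / {space}: inferred({request})"
            for space in card.storage]

cards = [InferenceCard(0), InferenceCard(1)]
params = {f"w{i}": object() for i in range(4)}       # stand-in tensors
for card_id, p in assign_card_parameters(params, cards).items():
    allocate_storage(cards[card_id], split_parameters(p))  # S120 + S130
print(handle_request(cards[0], "request#1"))               # S140
```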
According to the technical scheme of this embodiment, the reasoning acceleration card corresponding to the current reasoning model is determined along with the acceleration card model parameters corresponding to it; parameter segmentation is performed on the acceleration card model parameters to obtain the model segmentation parameters corresponding to the reasoning acceleration card; and the parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card is determined, so that when a model reasoning request is received, model reasoning of the current reasoning model is executed through the reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space. This solves the prior-art problem of low model reasoning capability caused by video memory redundancy, reduces the video memory overhead of the reasoning acceleration card, and improves the performance of the reasoning acceleration card, thereby improving the model reasoning performance of the current reasoning model.
Example 2
Fig. 2 is a flowchart of a model reasoning method provided by a second embodiment of the present invention. This embodiment further refines the above technical solutions and provides several specific alternative implementations of performing parameter segmentation on the acceleration card model parameters to obtain the model segmentation parameters corresponding to the reasoning acceleration card, determining the parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card, and executing, through the reasoning acceleration card, model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space. The technical solution in this embodiment may be combined with each of the alternatives in one or more of the embodiments described above. As shown in fig. 2, the method may include the following steps:
S210, determining an inference acceleration card corresponding to the current inference model, and determining acceleration card model parameters corresponding to the inference acceleration card.
S220, determining the core group access type corresponding to the acceleration card model parameter.
The core group access type may be an access type of a core group in the inference accelerator card when accessing the accelerator card model parameter, for example, a type that a plurality of core groups access at the same time, a type that a single core group accesses separately, etc., which is not limited by the embodiment of the present invention. It can be appreciated that there may be multiple core groups in the inference accelerator card, and that when the inference accelerator card performs the model inference task, the model inference task may be performed by the multiple core groups in the inference accelerator card. When the core groups in the reasoning acceleration card execute model reasoning tasks, the core groups need to access the acceleration card model parameters.
In the embodiment of the invention, after the acceleration card model parameters corresponding to the reasoning acceleration card are determined, the core group access type corresponding to the acceleration card model parameters can be further determined.
S230, segmenting acceleration card model parameters whose core group access type is the single-core access type into first segmentation parameters corresponding to the reasoning acceleration card; and segmenting acceleration card model parameters whose core group access type is the multi-core access type into second segmentation parameters corresponding to the reasoning acceleration card.
The single-core access type may be a type in which an accelerator card model parameter is accessed by only a single core group, that is, it does not need to be accessed by multiple core groups simultaneously. The first segmentation parameters may be parameters accessible by only a single core group, i.e., parameters that do not require simultaneous access by multiple core groups. The multi-core access type may be a type in which an accelerator card model parameter needs to be accessed by multiple core groups simultaneously. The second segmentation parameters may be parameters that require simultaneous access by multiple core groups.
In the embodiment of the invention, after the core group access type corresponding to the acceleration card model parameter is determined, the acceleration card model parameter can be further subjected to parameter segmentation according to the core group access type. Specifically, the accelerator card model parameters with the core group access type being the single-core access type are segmented into first segmentation parameters, and the accelerator card model parameters with the core group access type being the multi-core access type are segmented into second segmentation parameters.
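A minimal sketch of the access-type-based segmentation of S220-S230 follows. It assumes the per-parameter access type is already known (how it is derived, for example from the model structure, is not specified here); all names are illustrative.

```python
# Sketch of S220-S230: segment accelerator-card model parameters by their
# core group access type. The access_type mapping is an assumed input.
SINGLE_CORE_GROUP = "single"   # accessed by exactly one core group
MULTI_CORE_GROUP = "multi"     # must be visible to all core groups

def split_by_access_type(card_params: dict, access_type: dict):
    """Return (first_segmentation_params, second_segmentation_params)."""
    first, second = {}, {}
    for name, tensor in card_params.items():
        if access_type.get(name, MULTI_CORE_GROUP) == SINGLE_CORE_GROUP:
            first[name] = tensor    # single-core access -> private space later
        else:
            second[name] = tensor   # multi-core access -> shared space later
    return first, second
```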
S240, determining a current core group storage space corresponding to each current core group in the reasoning acceleration card, and determining a current shared space and a current private space corresponding to each current core group in each current core group storage space.
Wherein the current core group may be one of the core groups of the inference accelerator card. The current core group memory space may be a memory space corresponding to the current core group. It will be appreciated that there may be multiple core groups, for example four core groups, in an inference accelerator card, and that each core group may correspond to a storage space for storing parameters required by the core group to perform the model inference task.
The current shared space may be a space in the current core group storage space that allows access to other core groups other than the current core group. The current private space may be a space in the current core group storage space that is not allowed to be accessed by other core groups than the current core group.
In the embodiment of the invention, after parameter segmentation is performed on the accelerator card model parameters according to the core group access type, the current core group storage space corresponding to each current core group in the reasoning accelerator card can be further determined, so that the current shared space and the current private space corresponding to each current core group are determined in each current core group storage space. That is, the sum of the size of the current shared space and the size of the current private space is the size of the current core group storage space. It is understood that the storage space corresponding to each core group in one inference accelerator card can be divided into a shared space and a private space.
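The division of each current core group storage space into a shared and a private region can be pictured with the following sketch; the sizes and the four-core-group configuration are illustrative assumptions, not values given by this embodiment.

```python
# Sketch of S240: each core group's storage space is split into a shared
# region (visible to other core groups) and a private region.
from dataclasses import dataclass

@dataclass
class CoreGroupSpace:
    core_group_id: int
    total_bytes: int    # current core group storage space
    shared_bytes: int   # current shared space

    @property
    def private_bytes(self) -> int:
        # shared space + private space together make up the whole space
        return self.total_bytes - self.shared_bytes

# e.g. four core groups, each with 16 GiB of which 4 GiB is shared
spaces = [CoreGroupSpace(cg, total_bytes=16 << 30, shared_bytes=4 << 30)
          for cg in range(4)]
```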
S250, determining each current private space as a parameter storage space corresponding to the first segmentation parameters; and determining each current shared space as a parameter storage space corresponding to the second segmentation parameters.
In the embodiment of the invention, after determining the current shared space and the current private space corresponding to each current core group in the storage space of each current core group, each current private space can be further determined as a parameter storage space corresponding to a first segmentation parameter, and each current shared space is determined as a parameter storage space corresponding to a second segmentation parameter.
Optionally, after determining each current private space as the parameter storage space corresponding to the first segmentation parameters and each current shared space as the parameter storage space corresponding to the second segmentation parameters, the first segmentation parameters may further be stored into the current private spaces and the second segmentation parameters into the current shared spaces. Optionally, the first segmentation parameters may be stored into the current private space at half precision or at full precision; likewise, the second segmentation parameters may be stored into the current shared space at half precision or at full precision.
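A sketch of this optional precision choice, using numpy arrays as a stand-in for the accelerator card's video memory (an assumption; the real storage is on-card):

```python
# Sketch: store segmentation parameters at half (fp16) or full (fp32) precision.
import numpy as np

def store_segmentation_params(space: dict, params: dict, half_precision: bool):
    """Store parameters into a (private or shared) space at the chosen precision."""
    dtype = np.float16 if half_precision else np.float32
    for name, tensor in params.items():
        space[name] = np.asarray(tensor, dtype=dtype)

private_space, shared_space = {}, {}
store_segmentation_params(private_space, {"w_up": np.ones((8, 8))},
                          half_precision=True)    # first segmentation params
store_segmentation_params(shared_space, {"w_embed": np.ones((8, 8))},
                          half_precision=False)   # second segmentation params
```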
S260, when a model reasoning request is received, executing the first model reasoning of the current reasoning model through each current core group in the reasoning acceleration card according to the first segmentation parameters stored in each current private space, and obtaining first model reasoning results.
Wherein the first model reasoning may be a model reasoning task. The first model reasoning result may be a result of the first model reasoning. It is understood that the first model reasoning may be a model reasoning task performed by each core group according to a first segmentation parameter in the private space corresponding to each core group.
In the embodiment of the invention, when a model reasoning request is received, the first model reasoning of the current reasoning model is executed by each current core group in the reasoning acceleration card according to the first segmentation parameters stored in each current private space, so as to obtain the first model reasoning results. For example, assume that the reasoning acceleration card includes core group 1 and core group 2, where the first segmentation parameter in the private space of core group 1 is parameter A and the first segmentation parameter in the private space of core group 2 is parameter B; then the first model reasoning is performed by core group 1 according to parameter A and by core group 2 according to parameter B at the same time.
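A sketch of this first model reasoning step, with ThreadPoolExecutor standing in for core groups that would run concurrently in hardware; the matrix shapes and values are illustrative assumptions.

```python
# Sketch of S260: each core group runs the first model reasoning using only
# the first segmentation parameter held in its own private space.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def first_inference(private_param: np.ndarray, activations: np.ndarray):
    # each core group computes only its own slice with its private parameter
    return activations @ private_param

# first segmentation parameters: parameter A (core group 1), B (core group 2)
private_params = {1: np.ones((4, 2)), 2: np.full((4, 2), 2.0)}
x = np.ones((1, 4))
with ThreadPoolExecutor() as pool:
    futures = {cg: pool.submit(first_inference, p, x)
               for cg, p in private_params.items()}
    first_results = {cg: f.result() for cg, f in futures.items()}
# first_results[1], first_results[2]: per-core-group first reasoning results
```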
S270, determining a second segmentation parameter stored in each current shared space as a shared space parameter.
S280, executing second model reasoning of the current reasoning model according to each first model reasoning result and the shared space parameter through each current core group in the reasoning acceleration card to obtain a second model reasoning result, so that the second model reasoning result is determined to be the acceleration card reasoning result of the reasoning acceleration card.
Wherein the shared space parameter may be a sum of parameters stored in each current shared space. The second model reasoning can be another model reasoning task. The second model reasoning result may be a result of the second model reasoning. It is understood that the second model reasoning may be a model reasoning task performed by the respective core groups based on the first model reasoning result and the shared space parameters. The accelerator card reasoning result may be a result of a reasoning accelerator card performing a model reasoning task.
In the embodiment of the invention, the second segmentation parameters stored in each current shared space are determined as the shared space parameters, so that each current core group in the reasoning acceleration card executes second model reasoning according to each first model reasoning result and the shared space parameters, thereby obtaining a second model reasoning result, and further determining the second model reasoning result as the acceleration card reasoning result of the reasoning acceleration card.
For example, assume that the reasoning acceleration card includes core group 1 and core group 2, where the second segmentation parameter in the shared space of core group 1 is parameter a and the second segmentation parameter in the shared space of core group 2 is parameter b; the shared space parameter is then parameter a + parameter b. If the first model reasoning result obtained by core group 1 is result 1 and the first model reasoning result obtained by core group 2 is result 2, core group 1 executes the second model reasoning according to result 1, result 2 and parameter a + parameter b, and core group 2 likewise executes the second model reasoning according to result 1, result 2 and parameter a + parameter b.
Optionally, before the second model reasoning of the current reasoning model is performed by each current core group in the reasoning acceleration card according to the first model reasoning results and the shared space parameter, the first model reasoning results may be aggregated by each current core group, for example by calculating the average value of the first model reasoning results or by calculating their maximum value; the embodiment of the present invention does not limit this.
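The following sketch puts S270-S280 together: the shared space parameter is formed from the second segmentation parameters of all current shared spaces (concatenation is an illustrative choice), the first model reasoning results are summarized (here by averaging, one of the aggregations mentioned above), and each core group then runs the second model reasoning on the summary. All shapes are assumptions.

```python
# Sketch of S270-S280: second model reasoning from shared space parameters.
import numpy as np

def second_inference(first_results: dict, shared_spaces: list) -> np.ndarray:
    # shared space parameter: second segmentation parameters of every current
    # shared space taken together
    shared_param = np.concatenate([s["w_shared"] for s in shared_spaces], axis=0)
    # summarize the first model reasoning results across core groups
    summary = np.mean(list(first_results.values()), axis=0)
    # every core group computes the same second reasoning from the summary
    return summary @ shared_param

shared_spaces = [{"w_shared": np.ones((2, 3))},   # parameter a (core group 1)
                 {"w_shared": np.ones((2, 3))}]   # parameter b (core group 2)
first_results = {1: np.ones((1, 4)), 2: np.ones((1, 4))}
card_result = second_inference(first_results, shared_spaces)  # shape (1, 3)
```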
According to the technical scheme of this embodiment, the core group access type corresponding to the acceleration card model parameters is determined, parameters of the single-core access type are segmented into first segmentation parameters and parameters of the multi-core access type into second segmentation parameters; the current core group storage space corresponding to each current core group in the reasoning acceleration card is determined, and a current shared space and a current private space are determined within each current core group storage space; each current private space is determined as the parameter storage space corresponding to the first segmentation parameters, and each current shared space as the parameter storage space corresponding to the second segmentation parameters. When a model reasoning request is received, each current core group executes the first model reasoning according to the first segmentation parameters stored in its current private space to obtain the first model reasoning results, the second segmentation parameters stored in the current shared spaces are determined as the shared space parameter, and each current core group then executes the second model reasoning according to the first model reasoning results and the shared space parameter, so that the second model reasoning result is determined as the acceleration card reasoning result. This reduces the video memory overhead of the reasoning acceleration card and improves its performance, thereby improving the model reasoning performance of the current reasoning model.
Example 3
Fig. 3 is a flowchart of a model reasoning method provided by the third embodiment of the present invention, which is a further refinement of the above technical solutions, and provides various specific alternative implementations of executing, by the reasoning acceleration card, model reasoning of the current reasoning model according to model segmentation parameters stored in the parameter storage space. The technical solution in this embodiment may be combined with each of the alternatives in one or more embodiments described above. As shown in fig. 3, the method may include the steps of:
S310, determining an inference acceleration card corresponding to the current inference model, and determining acceleration card model parameters corresponding to the inference acceleration card.
S320, performing parameter segmentation on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card.
S330, determining a corresponding parameter storage space of the model segmentation parameter in the reasoning acceleration card.
S340, when a model reasoning request is received and the number of reasoning acceleration cards is plural, determining the accelerator card execution sequence corresponding to each reasoning acceleration card.
The number of accelerator cards may be the number of reasoning accelerator cards that perform the model reasoning task of the current reasoning model. The accelerator card execution sequence may be the order in which the reasoning accelerator cards execute model reasoning tasks. It will be appreciated that different reasoning accelerator cards may have different execution sequences when performing model reasoning tasks. For example, assuming the reasoning accelerator cards include accelerator card a and accelerator card b, the execution sequence of accelerator card a may be sequence 1 and that of accelerator card b sequence 2; when performing model reasoning of the current reasoning model, model reasoning is executed first through accelerator card a and then through accelerator card b.
In the embodiment of the invention, when the model reasoning request is received and there are multiple reasoning acceleration cards, the accelerator card execution sequence corresponding to each reasoning acceleration card is determined. Optionally, one reasoning accelerator card may correspond to one accelerator card execution sequence or to several. For example, assuming the reasoning accelerator cards include accelerator card a and accelerator card b, the execution sequences of accelerator card a may be sequence 1 and sequence 3, and those of accelerator card b sequence 2 and sequence 4; when executing the model reasoning of the current reasoning model, model reasoning is executed through accelerator card a, then through accelerator card b, then again through accelerator card a, and finally through accelerator card b. Optionally, the accelerator card execution sequence may be determined based on the model reasoning process of the current reasoning model.
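A sketch of one way such execution sequences could be assigned follows. The round-robin rule reproduces the accelerator card a / accelerator card b example above, but it is an assumption: the embodiment only requires that the sequences follow the model reasoning process.

```python
# Sketch of S340: assign accelerator card execution sequences round-robin.
def execution_orders(num_cards: int, num_stages: int) -> dict:
    """Return {card_id: [sequences]}; one card may hold several sequences."""
    orders = {card: [] for card in range(1, num_cards + 1)}
    for stage in range(num_stages):
        orders[stage % num_cards + 1].append(stage + 1)  # sequences start at 1
    return orders

# two cards, four pipeline stages: card 1 gets sequences 1 and 3, card 2 gets
# sequences 2 and 4, matching the accelerator card a / b example above
print(execution_orders(num_cards=2, num_stages=4))  # {1: [1, 3], 2: [2, 4]}
```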
S350, determining a current reasoning acceleration card, and determining a current acceleration card execution sequence corresponding to the current reasoning acceleration card.
The current reasoning acceleration card may be the acceleration card currently executing the model reasoning task. The current accelerator card execution order may be an execution order corresponding to the current inferential accelerator card.
In the embodiment of the invention, after the execution sequence of the acceleration card corresponding to each reasoning acceleration card is determined, the current reasoning acceleration card can be further determined so as to determine the current acceleration card execution sequence corresponding to the current reasoning acceleration card.
Optionally, the number of current reasoning acceleration cards corresponding to one current accelerator card execution sequence may be one or more. Correspondingly, when it is determined that the current accelerator card execution sequence corresponds to a plurality of current reasoning acceleration cards, the operations of executing model reasoning of the current reasoning model through each current reasoning acceleration card, according to the current accelerator card execution sequence and the model segmentation parameters stored in each parameter storage space, are executed simultaneously.
Specifically, the number of the current inference acceleration cards corresponding to the current acceleration card execution sequence may be one or more. That is, there may be one or more inference accelerator cards whose execution order is the same execution order. Illustratively, it is assumed that the execution order of the inference accelerator card a may be order 1, and the execution order of the inference accelerator card b may be order 1.
Specifically, when the current accelerator card execution order corresponds to a plurality of current inference accelerator cards, the operation of executing the model inference of the current inference model by each current inference accelerator card according to the current accelerator card execution order and the model segmentation parameters stored in each parameter storage space may be simultaneously executed. That is, model reasoning of the current reasoning model can be performed simultaneously by each current reasoning acceleration card. For example, assuming that the execution sequence of the inference acceleration card a is sequence 1 and the execution sequence of the inference acceleration card b is also sequence 1, the model inference task may be executed simultaneously by the inference acceleration card a and the inference acceleration card b.
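A sketch of this simultaneous case, again with ThreadPoolExecutor standing in for concurrently running cards and a placeholder inference function (both assumptions for illustration):

```python
# Sketch: all current inference accelerator cards that share one execution
# sequence run their model inference at the same time.
from concurrent.futures import ThreadPoolExecutor

def card_inference(card_id: str, segmentation_params, inputs):
    return f"result(card={card_id})"   # placeholder for a real inference step

def run_same_order(cards: list, params_by_card: dict, inputs) -> dict:
    with ThreadPoolExecutor() as pool:
        futures = {c: pool.submit(card_inference, c, params_by_card[c], inputs)
                   for c in cards}
        return {c: f.result() for c, f in futures.items()}

# accelerator cards a and b both have execution sequence 1, so they run together
results = run_same_order(["a", "b"], {"a": None, "b": None}, inputs=None)
```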
S360, judging whether the current accelerator card execution sequence is the first execution sequence; if yes, executing S370; otherwise, executing S380.
Wherein the first execution order may be the order of the first execution.
In the embodiment of the invention, after determining the current accelerator card execution sequence corresponding to the current reasoning accelerator card, whether the current accelerator card execution sequence is the first execution sequence can be further determined.
S370, executing, through the current reasoning acceleration card, the model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning acceleration card.
In the embodiment of the invention, when the current acceleration card execution sequence is determined to be the first execution sequence, the current reasoning acceleration card is used for executing the model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning acceleration card.
S380, acquiring the accelerator card reasoning result of the reasoning accelerator card corresponding to the accelerator card execution sequence immediately preceding the current accelerator card execution sequence.
S390, executing, through the current reasoning acceleration card, the model reasoning of the current reasoning model according to the acquired accelerator card reasoning result and the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning acceleration card.
In the embodiment of the invention, when it is determined that the current accelerator card execution sequence is not the first execution sequence, the accelerator card reasoning result of the reasoning accelerator card corresponding to the previous accelerator card execution sequence can be obtained, so that model reasoning of the current reasoning model can be executed through the current reasoning accelerator card according to that accelerator card reasoning result and the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning accelerator card.
For example, assuming that the current accelerator card execution sequence is sequence 2, the accelerator card reasoning result of the reasoning accelerator card corresponding to the previous accelerator card execution sequence, i.e., sequence 1, is obtained, so that model reasoning of the current reasoning model is executed through the current reasoning accelerator card according to that accelerator card reasoning result and the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning accelerator card.
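The S360-S390 control flow can be condensed into a short sketch; consecutive integer execution sequences starting at 1 are an assumption made for simplicity, as are all function names.

```python
# Sketch of S360-S390: a pipeline over accelerator card execution sequences.
def pipelined_inference(cards_by_order: dict, request_input):
    """cards_by_order: {execution sequence (1, 2, ...): inference function}."""
    results = {}
    for order in sorted(cards_by_order):
        infer = cards_by_order[order]
        if order == 1:
            results[order] = infer(request_input)       # S370: first sequence
        else:
            results[order] = infer(results[order - 1])  # S380/S390: prior result
    return results[max(cards_by_order)]

# two-card pipeline: card a (sequence 1) feeds card b (sequence 2)
final = pipelined_inference({1: lambda x: x + " -> a",
                             2: lambda x: x + " -> b"}, "request")
print(final)  # request -> a -> b
```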
According to the technical scheme of this embodiment, the reasoning acceleration card corresponding to the current reasoning model and the acceleration card model parameters are determined, parameter segmentation is performed on the acceleration card model parameters to obtain the model segmentation parameters corresponding to the reasoning acceleration card, and the parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card is then determined. When a model reasoning request is received and there are multiple reasoning acceleration cards, the accelerator card execution sequence corresponding to each reasoning acceleration card is determined, and the current reasoning acceleration card and its current accelerator card execution sequence are identified. If the current accelerator card execution sequence is the first execution sequence, model reasoning of the current reasoning model is executed through the current reasoning acceleration card according to the model segmentation parameters stored in its corresponding parameter storage space; if it is not the first execution sequence, the accelerator card reasoning result of the reasoning acceleration card corresponding to the previous accelerator card execution sequence is obtained, and model reasoning is executed through the current reasoning acceleration card according to that reasoning result and the model segmentation parameters stored in its corresponding parameter storage space. This reduces the video memory overhead of the reasoning acceleration card and improves its performance, thereby improving the model reasoning performance of the current reasoning model.
Example 4
In order for those skilled in the art to better understand the model reasoning method of the present embodiment, a specific example will be described below. Fig. 4 is an exemplary flowchart of a model reasoning method provided in the fourth embodiment of the present invention, as shown in fig. 4, where the method may specifically include:
(1) Initializing the large model parameter slices, including cross-segment parameters (i.e., the second segmentation parameters) and non-cross-segment parameters (i.e., the first segmentation parameters). The cross-segment parameters are model parameters stored in an accelerator card (i.e., a reasoning accelerator card, a many-core device) that require simultaneous access by multiple core groups. The non-cross-segment parameters are model parameters stored in an accelerator card that do not require simultaneous access by multiple core groups.
(2) Initializing the reasoning framework and the large model video memory space, which comprises the cross-segment video memory (i.e., the shared space) and the core-group-exclusive video memory (i.e., the private space). Fig. 5 is a schematic diagram of the video memory segmentation of an accelerator card according to the fourth embodiment of the present invention. As shown in fig. 5, the cross-segment video memory may be divided out of the many-core video memory in the accelerator card and constructed as a contiguous range of memory addresses. The core-group-exclusive video memory may likewise be divided out of the many-core video memory in the accelerator card, and can be accessed at high speed only by its own core group.
(3) Loading the half-precision or full-precision large model parameters into the different video memory spaces of the accelerator card, i.e., storing the model segmentation parameters in their respective video memory spaces.
(4) When a reasoning request is received, each core group in each accelerator card accesses the parameters in its own exclusive video memory space and the parameters stored in the cross-segment video memory.
(5) Each core group completes its respective reasoning task according to the reasoning request. Specifically, the respective exclusive reasoning tasks are completed independently on the different core groups, data summarization is completed efficiently through communication among the core groups, then each core group reads the shared data in the cross-segment video memory, and the reasoning task is completed according to the shared data and the data summarization result.
(6) The results of the accelerator cards are summarized through network communication, and the rest of the reasoning process is then completed. That is, after the reasoning acceleration card corresponding to the current accelerator card execution sequence completes its model reasoning, model reasoning is continued by the reasoning acceleration card corresponding to the next accelerator card execution sequence.
Fig. 6 is a schematic diagram of a current inference model provided in the fourth embodiment of the present invention. As shown in fig. 6, the current inference model may be a stacked L-layer model, where the white part in each dashed box represents parameters stored in the exclusive video memory of a particular core group in the same accelerator card, and the gray part in each dashed box represents parameters stored in the cross-segment video memory of the same accelerator card. The model reasoning task of each layer of the current reasoning model can be completed by a plurality of reasoning acceleration cards.
For example, assume that the current inference model is a 2-layer model and each layer has 8 steps; that is, the model inference of the current inference model comprises steps M11 (step 1 of the first-layer model) through M18 (step 8 of the first-layer model) and steps M21 through M28 of the second-layer model. Assume further that the inference accelerator cards include accelerator card 1, accelerator card 2, accelerator card 3 and accelerator card 4, each with two core groups. Then M11 and M12 of the first-layer model may be performed by accelerator card 1 according to the corresponding parameters, M13 and M14 by accelerator card 2, M15 and M16 by accelerator card 3, and M17 and M18 by accelerator card 4; the four accelerator cards can execute the model inference tasks of the first-layer model simultaneously, that is, the execution sequence corresponding to all four accelerator cards is sequence 1. After the four accelerator cards have executed the model inference of the first-layer model, their model inference results are summarized, and the summarized result is input into each of the four accelerator cards. The second-layer model is then executed in the same way: M21 and M22 by accelerator card 1 according to the summarized result and the corresponding parameters, M23 and M24 by accelerator card 2, M25 and M26 by accelerator card 3, and M27 and M28 by accelerator card 4; the four accelerator cards can execute the model inference tasks of the second-layer model simultaneously, that is, the execution sequence corresponding to the four accelerator cards is sequence 2.
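The schedule in this example can be reproduced with a short sketch; the even split of steps across cards is taken from the example itself, while the function name and return structure are illustrative assumptions.

```python
# Sketch: 2 layers x 8 steps spread over 4 accelerator cards, 2 steps per card.
def schedule(num_layers: int = 2, steps_per_layer: int = 8, num_cards: int = 4):
    """Return {(layer, card): [step names]} with an even split per card."""
    per_card = steps_per_layer // num_cards
    plan = {}
    for layer in range(1, num_layers + 1):
        for card in range(1, num_cards + 1):
            start = (card - 1) * per_card + 1
            plan[(layer, card)] = [f"M{layer}{s}"
                                   for s in range(start, start + per_card)]
    return plan

for (layer, card), steps in schedule().items():
    print(f"layer {layer}, accelerator card {card}: {steps}")
# layer 1, accelerator card 1: ['M11', 'M12'] ... through
# layer 2, accelerator card 4: ['M27', 'M28']
```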
According to the technical scheme of this embodiment, for a typical trillion-scale parameter model, the parameter scale is too large for all parameters to be stored in a single accelerator card, so model reasoning across multiple accelerator cards (e.g., N accelerator cards) is introduced; the model can be regarded as being expanded from the original M core groups to M x N core groups. In terms of lateral expansion, each accelerator card performs the reasoning task as in the single-card case, except that the communication process among the M core groups within a single accelerator card becomes a communication process among the M x N core groups across accelerator cards. In terms of longitudinal expansion, each accelerator card completes the reasoning task of a single accelerator card according to the original process, then sends the intermediate reasoning result to the next accelerator card, which carries out its part of the reasoning process and sends its result on to the next accelerator card in turn; this is repeated until the whole reasoning process is complete.
According to the technical scheme of this embodiment, by dividing the video memory into cross-segment and non-cross-segment (core-group-exclusive) regions and combining this with parallel model reasoning, the video memory occupation can be reduced to the greatest extent while the reasoning performance of the core groups is guaranteed.
Example 5
Fig. 7 is a schematic diagram of a model inference apparatus provided in a fifth embodiment of the present invention, as shown in fig. 7, where the apparatus includes: an inference accelerator determination module 710, a parameter segmentation module 720, a storage space determination module 730, and a model inference module 740, wherein:
the inference accelerator card determining module 710 is configured to determine an inference accelerator card corresponding to a current inference model, and determine accelerator card model parameters corresponding to the inference accelerator card;
the parameter segmentation module 720 is configured to perform parameter segmentation on the acceleration card model parameter to obtain a model segmentation parameter corresponding to the inference acceleration card;
a storage space determining module 730, configured to determine a parameter storage space corresponding to the model segmentation parameter in the inference accelerator card;
and the model reasoning module 740 is used for executing the model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space through the reasoning acceleration card under the condition of receiving a model reasoning request.
According to the technical scheme of this embodiment, the reasoning acceleration card corresponding to the current reasoning model is determined along with the acceleration card model parameters corresponding to it; parameter segmentation is performed on the acceleration card model parameters to obtain the model segmentation parameters corresponding to the reasoning acceleration card; and the parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card is determined, so that when a model reasoning request is received, model reasoning of the current reasoning model is executed through the reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space. This solves the prior-art problem of low model reasoning capability caused by video memory redundancy, reduces the video memory overhead of the reasoning acceleration card, and improves the performance of the reasoning acceleration card, thereby improving the model reasoning performance of the current reasoning model.
Optionally, the parameter segmentation module 720 may specifically be configured to: determine the core group access type corresponding to the acceleration card model parameters; segment acceleration card model parameters whose core group access type is the single-core access type into first segmentation parameters corresponding to the reasoning acceleration card; and segment acceleration card model parameters whose core group access type is the multi-core access type into second segmentation parameters corresponding to the reasoning acceleration card.
Optionally, the storage space determining module 730 may specifically be configured to: determine the current core group storage space corresponding to each current core group in the reasoning acceleration card, and determine, within each current core group storage space, the current shared space and the current private space corresponding to each current core group, wherein the current shared space is the space in the current core group storage space that other core groups besides the current core group are allowed to access, and the current private space is the space that other core groups besides the current core group are not allowed to access; determine each current private space as the parameter storage space corresponding to the first segmentation parameters; and determine each current shared space as the parameter storage space corresponding to the second segmentation parameters.
Optionally, the model inference module 740 may be specifically configured to: executing a first model reasoning of a current reasoning model according to the first segmentation parameters stored in each current private space through each current core group in the reasoning acceleration card to obtain a first model reasoning result; determining a second segmentation parameter stored in each current shared space as a shared space parameter; and executing second model reasoning of the current reasoning model according to each first model reasoning result and the shared space parameter through each current core group in the reasoning acceleration card to obtain a second model reasoning result, so as to determine the second model reasoning result as an acceleration card reasoning result of the reasoning acceleration card.
Optionally, the model inference module 740 may be further specifically configured to: under the condition that the number of the acceleration cards of the reasoning acceleration cards is determined to be a plurality of, determining the execution sequence of the acceleration cards corresponding to each reasoning acceleration card; and executing model reasoning of the current reasoning model according to the execution sequence of each acceleration card and the model segmentation parameters stored in each parameter storage space through each reasoning acceleration card.
Optionally, the model inference module 740 may be further configured to: determining a current reasoning acceleration card, and determining a current acceleration card execution sequence corresponding to the current reasoning acceleration card; and under the condition that the current acceleration card execution sequence is determined to be the first execution sequence, executing model reasoning of the current reasoning model by the current reasoning acceleration card according to the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning acceleration card.
Optionally, the model inference module 740 may be further configured to: under the condition that the current accelerator card execution sequence is determined to be not the first execution sequence, acquiring an accelerator card reasoning result of a reasoning accelerator card corresponding to the last accelerator card execution sequence of the current accelerator card execution sequence; and executing the model reasoning of the current reasoning model according to the reasoning result of the current reasoning acceleration card and the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning acceleration card by the current reasoning acceleration card.
Optionally, the number of current reasoning acceleration cards corresponding to one current acceleration card execution sequence may be one or more; accordingly, the model inference module 740 may be further specifically configured to: execute, under the condition that the current acceleration card execution sequence corresponds to a plurality of current reasoning acceleration cards, the model reasoning of the current reasoning model through each of those current reasoning acceleration cards according to the current acceleration card execution sequence and the model segmentation parameters stored in each parameter storage space.
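The multi-card behavior described in the preceding paragraphs can be sketched as a simple pipeline; the integer execution sequence keys and the card-level callables below are assumptions, and cards sharing one execution sequence run one after another here purely for brevity, although they could equally run in parallel:

from typing import Callable, Dict, List

def run_multi_card_inference(
    cards_by_sequence: Dict[int, List[Callable[[object], object]]],
    request: object,
) -> object:
    # Execute model reasoning across several reasoning acceleration cards
    # in their acceleration card execution sequence.
    result = request
    for sequence in sorted(cards_by_sequence):
        # The first-sequence card consumes the original request; every later
        # card consumes the preceding sequence's reasoning result.
        outputs = [card(result) for card in cards_by_sequence[sequence]]
        # Forward a single result, or the collected results when several
        # cards share the same execution sequence.
        result = outputs[0] if len(outputs) == 1 else outputs
    return result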
The model reasoning device provided by this embodiment of the invention can execute the model reasoning method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example six
Fig. 8 shows a schematic structural diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in Fig. 8, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12 and a Random Access Memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to one another via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, or microcontroller. The processor 11 performs the various methods and processes described above, such as the model reasoning method.
In some embodiments, the model reasoning method may be implemented as a computer program, which is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the model reasoning method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the model reasoning method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described herein above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and virtual private server (VPS) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method of model reasoning, comprising:
determining an inference acceleration card corresponding to a current inference model, and determining acceleration card model parameters corresponding to the inference acceleration card;
parameter segmentation is carried out on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card;
determining a parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card;
under the condition that a model reasoning request is received, executing, through the reasoning acceleration card, model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space;
the parameter segmentation is performed on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card, and the method comprises the following steps:
determining a core group access type corresponding to the acceleration card model parameters;
dividing the acceleration card model parameters whose core group access type is a single-core access type into first segmentation parameters corresponding to the reasoning acceleration card;
dividing the acceleration card model parameters whose core group access type is a multi-core access type into second segmentation parameters corresponding to the reasoning acceleration card;
the executing, through the reasoning acceleration card, the model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space comprises:
under the condition that it is determined that there are a plurality of reasoning acceleration cards, determining an acceleration card execution sequence corresponding to each reasoning acceleration card;
and executing, through each reasoning acceleration card, the model reasoning of the current reasoning model according to each acceleration card execution sequence and the model segmentation parameters stored in each parameter storage space.
2. The method of claim 1, further comprising, before the determining of the parameter storage space corresponding to the model segmentation parameters in the reasoning acceleration card:
determining a current core group storage space corresponding to each current core group in the reasoning acceleration card, and determining a current shared space and a current private space corresponding to each current core group in each current core group storage space;
the current shared space is a space which is in the current core group storage space and allows other core groups except the current core group to access; the current private space is a space which is not allowed to be accessed by other core groups except the current core group in the storage space of the current core group;
the determining the corresponding parameter storage space of the model segmentation parameter in the reasoning acceleration card comprises the following steps:
determining each current private space as the parameter storage space corresponding to the first segmentation parameters;
and determining each current shared space as the parameter storage space corresponding to the second segmentation parameters.
3. The method according to claim 2, wherein the executing, through the reasoning acceleration card, the model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space comprises:
executing, through each current core group in the reasoning acceleration card, first model reasoning of the current reasoning model according to the first segmentation parameters stored in each current private space, to obtain first model reasoning results;
determining the second segmentation parameters stored in each current shared space as shared space parameters;
and executing, through each current core group in the reasoning acceleration card, second model reasoning of the current reasoning model according to each first model reasoning result and the shared space parameters, to obtain a second model reasoning result, so as to determine the second model reasoning result as the acceleration card reasoning result of the reasoning acceleration card.
4. The method of claim 1, wherein the executing, through each reasoning acceleration card, the model reasoning of the current reasoning model according to each acceleration card execution sequence and the model segmentation parameters stored in each parameter storage space comprises:
determining a current reasoning acceleration card, and determining a current acceleration card execution sequence corresponding to the current reasoning acceleration card;
and under the condition that the current acceleration card execution sequence is determined to be the first execution sequence, executing model reasoning of the current reasoning model by the current reasoning acceleration card according to model segmentation parameters stored in a parameter storage space corresponding to the current reasoning acceleration card.
5. The method of claim 4, wherein the executing, through each reasoning acceleration card, the model reasoning of the current reasoning model according to each acceleration card execution sequence and the model segmentation parameters stored in each parameter storage space further comprises:
under the condition that the current acceleration card execution sequence is determined not to be the first execution sequence, obtaining an acceleration card reasoning result of the reasoning acceleration card corresponding to the execution sequence immediately preceding the current acceleration card execution sequence;
and executing, through the current reasoning acceleration card, the model reasoning of the current reasoning model according to the obtained acceleration card reasoning result and the model segmentation parameters stored in the parameter storage space corresponding to the current reasoning acceleration card.
6. The method of claim 4, wherein the number of current reasoning acceleration cards corresponding to the current acceleration card execution sequence is one or more;
the executing, through each reasoning acceleration card, the model reasoning of the current reasoning model according to each acceleration card execution sequence and the model segmentation parameters stored in each parameter storage space further comprises:
under the condition that the current acceleration card execution sequence corresponds to a plurality of current reasoning acceleration cards, executing, through each of the current reasoning acceleration cards, the model reasoning of the current reasoning model according to the current acceleration card execution sequence and the model segmentation parameters stored in each parameter storage space.
7. A model reasoning apparatus, comprising:
the reasoning acceleration card determining module is used for determining a reasoning acceleration card corresponding to the current reasoning model and determining acceleration card model parameters corresponding to the reasoning acceleration card;
the parameter segmentation module is used for carrying out parameter segmentation on the acceleration card model parameters to obtain model segmentation parameters corresponding to the reasoning acceleration card;
the storage space determining module is used for determining a parameter storage space corresponding to the model segmentation parameter in the reasoning acceleration card;
the model reasoning module is used for executing model reasoning of the current reasoning model according to the model segmentation parameters stored in the parameter storage space through the reasoning acceleration card under the condition of receiving a model reasoning request;
the parameter segmentation module is further used for determining a core group access type corresponding to the acceleration card model parameters; dividing the acceleration card model parameters whose core group access type is a single-core access type into first segmentation parameters corresponding to the reasoning acceleration card; and dividing the acceleration card model parameters whose core group access type is a multi-core access type into second segmentation parameters corresponding to the reasoning acceleration card;
the model reasoning module is further used for: determining, under the condition that there are a plurality of reasoning acceleration cards, the acceleration card execution sequence corresponding to each reasoning acceleration card; and executing, through each reasoning acceleration card, the model reasoning of the current reasoning model according to each acceleration card execution sequence and the model segmentation parameters stored in each parameter storage space.
8. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model reasoning method of any of claims 1-6.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores computer instructions for causing a processor to implement the model reasoning method of any of claims 1-6 when executed.
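As a non-limiting illustration of the parameter segmentation recited in claim 1, the following Python sketch splits acceleration card model parameters by core group access type; the string tags and dictionary layout are assumptions for illustration only:

from typing import Dict, Tuple

# Hypothetical encodings of the two core group access types.
SINGLE_CORE = "single_core_access"
MULTI_CORE = "multi_core_access"

def segment_parameters(
    card_params: Dict[str, object],
    access_types: Dict[str, str],
) -> Tuple[Dict[str, object], Dict[str, object]]:
    # Parameters with single-core access become first segmentation
    # parameters; parameters with multi-core access become second
    # segmentation parameters, per the rule recited in claim 1.
    first, second = {}, {}
    for name, value in card_params.items():
        if access_types.get(name) == SINGLE_CORE:
            first[name] = value   # destined for a core group's private space
        else:  # multi-core access
            second[name] = value  # destined for the shared spaces
    return first, second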
CN202310678319.9A 2023-06-09 2023-06-09 Model reasoning method and device, electronic equipment and storage medium Active CN116402141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310678319.9A CN116402141B (en) 2023-06-09 2023-06-09 Model reasoning method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116402141A CN116402141A (en) 2023-07-07
CN116402141B true CN116402141B (en) 2023-09-05

Family

ID=87014701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310678319.9A Active CN116402141B (en) 2023-06-09 2023-06-09 Model reasoning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116402141B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784989A (en) * 2019-11-08 2021-05-11 阿里巴巴集团控股有限公司 Inference system, inference method, electronic device, and computer storage medium
CN114626531A (en) * 2022-03-29 2022-06-14 北京奇艺世纪科技有限公司 Model reasoning parameter determination method and device, electronic equipment and storage medium
CN114912632A (en) * 2022-05-24 2022-08-16 京东科技控股股份有限公司 Machine learning model reasoning method and device
CN115034402A (en) * 2022-06-20 2022-09-09 寒武纪行歌(南京)科技有限公司 Model reasoning performance optimization method and device and related products
CN116112366A (en) * 2021-11-10 2023-05-12 中国移动通信有限公司研究院 Data processing method and device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Implementation of Reconfigurable CNN-LSTM Accelerator Based on FPGA";Ying Yang等;《IEEE》;第1026-1030页 *

Similar Documents

Publication Publication Date Title
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN116402141B (en) Model reasoning method and device, electronic equipment and storage medium
CN114564149B (en) Data storage method, device, equipment and storage medium
CN116521088A (en) Data processing method, device, equipment and storage medium
CN116126916A (en) Data query method, device and equipment based on intelligent network card
CN114817845B (en) Data processing method, device, electronic equipment and storage medium
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
CN115438007A (en) File merging method and device, electronic equipment and medium
CN114662777A (en) Photovoltaic module serial line arrangement determining method and device, electronic equipment and storage medium
CN112130977B (en) Task scheduling method, device, equipment and medium
CN113377295A (en) Data storage and reading method, device and equipment for multi-producer single-consumer
CN116467235B (en) DMA-based data processing method and device, electronic equipment and medium
CN116579914B (en) Execution method and device of graphic processor engine, electronic equipment and storage medium
CN115292662B (en) Convolution acceleration operation method and device, electronic equipment and storage medium
CN115759260B (en) Reasoning method and device of deep learning model, electronic equipment and storage medium
CN115495312B (en) Service request processing method and device
US20240126610A1 (en) Apparatus and method of processing data, electronic device, and storage medium
CN117610623A (en) Data processing method and device, electronic equipment and storage medium
CN117892050A (en) Matrix operation method, device, equipment and medium based on multi-core hardware
CN115578200A (en) Data processing method and device, electronic equipment and storage medium
CN117992715A (en) Heterogeneous many-core based opposite-boundary fusion batch floating point precision conversion method
CN117892049A (en) Multi-core collaborative computing method, device, computing platform and medium for matrix decomposition
CN116204453A (en) Data access method, device, equipment and storage medium of multi-core system
CN117273069A (en) Reasoning method, device, equipment and medium based on neural network model
CN114283045A (en) Data processing method, data processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant