CN113537475A - Request processing method and device based on neural network model - Google Patents


Info

Publication number
CN113537475A
Authority
CN
China
Prior art keywords
neural network
network model
request
video memory
target
Prior art date
Legal status
Pending
Application number
CN202010294562.7A
Other languages
Chinese (zh)
Inventor
曾魁 (Zeng Kui)
王涛 (Wang Tao)
赵宇 (Zhao Yu)
骆卫华 (Luo Weihua)
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010294562.7A
Publication of CN113537475A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/445: Program loading or initiating
    • G06F9/44521: Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/54: Interprogram communication
    • G06F9/546: Message passing systems or structures, e.g. queues
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00: Indexing scheme relating to G06F9/00
    • G06F2209/54: Indexing scheme relating to G06F9/54
    • G06F2209/548: Queue

Abstract

The invention discloses a request processing method and device based on a neural network model, relates to the field of computer technology, and can solve the prior-art problems that hardware resources cannot change dynamically and rapidly with changes in traffic and that flexibility is poor. The method mainly comprises the following steps: judging whether at least one target neural network model is stored in the video memory of a network processor, where the target neural network model comprises the neural network model corresponding to a request queue to be processed; if the target neural network model is not stored in the video memory, prefetching the target neural network model from a memory and loading it into the video memory so as to cover (overwrite) at least one neural network model stored in the video memory in advance; and processing the requests in the request queue to be processed based on the target neural network model stored in the video memory. The method is mainly suited to scenarios in which different requests are processed based on a plurality of neural network models.

Description

Request processing method and device based on neural network model
Technical Field
The invention relates to the field of computer technology, and in particular to a request processing method and device based on a neural network model.
Background
With the rapid development of computer technology, neural network models have gradually been applied in areas such as computer vision and natural language processing, with good results. As data processing requirements grow, the number of neural network models that need to be deployed keeps increasing. For example, suppose a data processing service needs to support translation among 20 languages. Each language direction (for example, the direction from Chinese to English) needs at least one model to provide the service, so supporting translation between every pair of language directions means the system must support close to 400 models (20 × 19 = 380 directed language pairs, before counting multiple models per direction).
Each neural network model must first be loaded into a Network Processor (NPU) before it can execute, but the video memory (the processor's on-device memory) of a single network processor is very limited, so a large number of neural network models cannot all be deployed on the same network processor. To support large-scale model services, the current approach is to deploy a plurality of network processors to build a service cluster, with each network processor responsible for loading only a part of the neural network models (for example, 10 to 20 models). However, such hardware resources cannot change dynamically and rapidly with traffic; for example, when traffic rises over a short period, network processors cannot be added quickly enough to share the load, so flexibility is poor.
Disclosure of Invention
In view of this, the present invention provides a request processing method and apparatus based on a neural network model, aiming to solve the prior-art problems that hardware resources cannot change dynamically and rapidly with changes in traffic and that flexibility is poor.
In a first aspect, the present invention provides a request processing method based on a neural network model, the method including:
judging whether at least one target neural network model is stored in a video memory of a network processor or not, wherein the target neural network model comprises a neural network model corresponding to a request queue to be processed;
if the target neural network model is not stored in the video memory, prefetching the target neural network model from a memory and loading the target neural network model into the video memory so as to cover at least one neural network model stored in the video memory in advance;
and processing the request in the request queue to be processed based on the target neural network model stored in the video memory.
Optionally, prefetching the target neural network model from the memory and loading the target neural network model into the video memory to cover at least one neural network model pre-stored in the video memory includes:
prefetching the target neural network model from the memory;
judging whether the video memory contains a neural network model in an unexecuted state or not;
and if the video memory contains the neural network model in the non-execution state, enabling the target neural network model to cover at least one neural network model in the non-execution state based on a covering strategy.
Optionally, the method further includes:
if the video memory does not have the neural network model in the unexecuted state, judging whether the size of the residual space of the video memory is larger than or equal to that of the target neural network model;
and if the size of the residual space is larger than or equal to that of the target neural network model, directly loading the target neural network model into the video memory.
Optionally, the enabling the target neural network model to overlay at least one neural network model in an unexecuted state based on an overlay policy includes:
if the reserved space of the video memory is larger than a preset space threshold value, a neural network model in an unexecuted state is randomly selected, and the selected neural network model is covered by using the target neural network model;
and if the reserved space of the video memory is smaller than a preset space threshold, covering at least one neural network model in the unexecuted state by using the target neural network model according to the size of the neural network model in the unexecuted state and the size of the target neural network model.
Optionally, prefetching the target neural network model from the memory and loading the target neural network model into the video memory to cover at least one neural network model pre-stored in the video memory includes:
and the prefetching thread informs the network processor to prefetch the target neural network model from the memory and load the model into the video memory so as to cover at least one neural network model prestored in the video memory.
Optionally, processing the request in the to-be-processed request queue based on the target neural network model stored in the video memory includes:
the prefetching thread adds the name of the target neural network model to a prefetching task queue;
and the prefetching thread informs a working thread to process the request in the request queue corresponding to the name of the target neural network model in the prefetching task queue.
Optionally, after the prefetch thread adds the name of the target neural network model to the prefetch task queue, the method further includes:
if the number of names contained in the prefetch task queue is greater than or equal to a preset number threshold, the prefetch thread waits until the number of names in the prefetch task queue falls below the preset number threshold, and then processes the target neural network model corresponding to the next request queue to be processed;
and if the number of the names contained in the pre-fetching task queue is smaller than a preset number threshold, the pre-fetching thread processes the target neural network model corresponding to the next request queue to be processed.
Optionally, the determining whether the at least one target neural network model is stored in a video memory of the network processor includes:
and if the request exists in the request queue to be processed, judging whether at least one target neural network model corresponding to the request queue to be processed is stored in the video memory.
In a second aspect, the present invention provides a request processing method based on a neural network model, the method including:
judging whether at least one target language translation model is stored in a video memory of a graphics processor, wherein the target language translation model comprises a neural network model which corresponds to a request queue to be processed and is used for language translation;
if the target language translation model is not stored in the video memory, prefetching the target language translation model from a memory and loading the target language translation model into the video memory so as to cover at least one language translation model pre-stored in the video memory;
processing the request in the request queue to be processed based on the target language translation model stored in the video memory;
and outputting a translation result corresponding to the request in the request queue to be processed.
In a third aspect, the present invention provides a request processing method based on a neural network model, the method including:
receiving a request sent by a requester;
determining, from the request, at least one target neural network model required to process the request;
judging whether the target neural network model is stored in a first storage space or not;
if the target neural network model is not stored in the first storage space, prefetching the target neural network model from a second storage space and loading the target neural network model into the first storage space so as to cover at least one neural network model stored in the first storage space in advance;
processing the request based on the target neural network model stored in the first storage space;
and sending the processing result to the requester.
In a fourth aspect, the present invention provides a request processing apparatus based on a neural network model, the apparatus comprising:
the judging unit is used for judging whether at least one target neural network model is stored in a video memory of the network processor or not, and the target neural network model comprises a neural network model corresponding to a request queue to be processed;
the loading unit is used for prefetching the target neural network model from a memory and loading the target neural network model into the video memory if the target neural network model is not stored in the video memory, so as to cover at least one neural network model pre-stored in the video memory;
and the processing request unit is used for processing the request in the request queue to be processed based on the target neural network model stored in the video memory.
Optionally, the loading unit includes:
a prefetch module to prefetch the target neural network model from the memory;
the judging module is used for judging whether the video memory contains a neural network model in an unexecuted state;
and the coverage module is used for enabling the target neural network model to cover at least one neural network model in the non-execution state based on a coverage strategy if the video memory contains the neural network model in the non-execution state.
Optionally, the determining unit is further configured to determine whether the size of the remaining space of the video memory is greater than or equal to the size of the target neural network model if there is no neural network model in the video memory that is in an unexecuted state;
the loading unit is further configured to directly load the target neural network model into the video memory if the size of the remaining space is greater than or equal to the size of the target neural network model.
Optionally, the covering module is configured to randomly select a neural network model in an unexecuted state if the reserved space of the video memory is greater than a preset space threshold, and cover the selected neural network model with the target neural network model; and if the reserved space of the video memory is smaller than a preset space threshold, covering at least one neural network model in the unexecuted state by using the target neural network model according to the size of the neural network model in the unexecuted state and the size of the target neural network model.
Optionally, the loading unit is configured to notify the network processor of prefetching the target neural network model from the memory by a prefetching thread and loading the target neural network model into the video memory, so as to cover at least one neural network model pre-stored in the video memory.
Optionally, the processing request unit includes:
an adding module, configured to add, by a prefetch thread, a name of the target neural network model to a prefetch task queue;
and the notification processing module is used for notifying the working thread to process the request in the request queue corresponding to the name of the target neural network model in the pre-fetching task queue by the pre-fetching thread.
Optionally, the apparatus further comprises:
the waiting processing unit is used for: after the prefetch thread adds the name of the target neural network model to the prefetch task queue, if the number of names contained in the prefetch task queue is greater than or equal to a preset number threshold, having the prefetch thread wait until the number of names in the prefetch task queue falls below the preset number threshold before processing the target neural network model corresponding to the next request queue to be processed;
and the processing unit is used for processing the target neural network model corresponding to the next request queue to be processed by the prefetching thread if the number of the names contained in the prefetching task queue is smaller than a preset number threshold.
Optionally, the determining unit is configured to determine whether at least one target neural network model corresponding to the to-be-processed request queue is stored in the video memory if there is a request in the to-be-processed request queue.
In a fifth aspect, the present invention provides a request processing apparatus based on a neural network model, the apparatus comprising:
the judging unit is used for judging whether at least one target language translation model is stored in a video memory of the graphics processor or not, wherein the target language translation model comprises a neural network model which corresponds to a request queue to be processed and is used for performing language translation;
a loading unit, configured to prefetch the target language translation model from a memory and load the target language translation model into the video memory if the target language translation model is not stored in the video memory, so as to cover at least one language translation model pre-stored in the video memory;
the processing request unit is used for processing the request in the request queue to be processed based on the target language translation model stored in the video memory;
and the output unit is used for outputting the translation result corresponding to the request in the request queue to be processed.
In a sixth aspect, the present invention provides a request processing apparatus based on a neural network model, the apparatus comprising:
the receiving unit is used for receiving a request sent by a requester;
a determining unit, configured to determine, according to the request, at least one target neural network model required for processing the request;
the judging unit is used for judging whether the target neural network model is stored in a first storage space or not;
a loading unit, configured to prefetch the target neural network model from a second storage space and load the target neural network model into the first storage space to cover at least one neural network model pre-stored in the first storage space if the target neural network model is not stored in the first storage space;
a processing unit, configured to process the request based on the target neural network model stored in the first storage space;
and the sending unit is used for sending the processing result to the requesting party.
In a seventh aspect, the present invention provides a storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor so as to execute the request processing method based on a neural network model according to the first or second aspect.
In an eighth aspect, the present invention provides an electronic device comprising a storage medium and a processor;
the processor is adapted to execute instructions;
the storage medium is adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform a method of request processing based on a neural network model as described in the first or second aspect.
By means of the above technical solution, the request processing method and device based on a neural network model provided by the present invention can store at least one neural network model in advance in a memory whose storage space is far larger than that of the video memory. When a request needs to be processed by at least one target neural network model corresponding to a request queue to be processed, and the target neural network model is not stored in the video memory of the network processor, the target neural network model can be prefetched from the memory and successfully loaded into the video memory by covering at least one neural network model stored in the video memory in advance, so that loading never fails because of insufficient video memory space. The memory thus indirectly expands the capacity of the video memory, a single network processor can load a large-scale set of neural network models, and the software resources can change dynamically and rapidly with traffic, which improves flexibility.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a request processing method based on a neural network model according to an embodiment of the present invention;
FIG. 2 is a flow chart of another request processing method based on neural network model according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for processing a request based on a language translation model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an example of a request process based on a neural network model according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating a request processing apparatus based on a neural network model according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating another request processing apparatus based on a neural network model according to an embodiment of the present invention;
FIG. 7 is a block diagram illustrating a request processing apparatus based on a neural network model according to an embodiment of the present invention;
FIG. 8 is a block diagram illustrating a request processing apparatus based on a neural network model according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
When deploying neural network models at large scale, a service cluster can be constructed from a plurality of network processors, each of which loads a part of the neural network models; however, as noted above, such hardware resources cannot scale dynamically and rapidly with traffic. To solve this technical problem, an embodiment of the present invention provides a request processing method based on a neural network model. The method temporarily stores the required neural network models in a memory with relatively large storage space, and loads a neural network model from the memory into the video memory when it needs to be used, so that the indirect capacity expansion of the video memory can be realized with only a single network processor. As shown in fig. 1, the specific implementation flow of the method includes:
101. Judge whether at least one target neural network model is stored in the video memory of the network processor.
The target neural network model comprises the neural network model corresponding to the request queue to be processed. The NPU includes, but is not limited to, a Graphics Processing Unit (GPU), a Tensor Processing Unit (TPU), and the like.
After a new request is received, it may be determined from the content of the request (for example, a model identifier) which neural network model or models are needed to process it, and the request may then be added to the request queue corresponding to the determined neural network model to wait to be processed. In the embodiment of the invention, some neural network models are temporarily stored in the memory and are loaded into the video memory when they need to be used; therefore, to improve request processing efficiency, a required neural network model can be loaded into the video memory in advance so that it can be used directly when needed. One request queue may correspond to one neural network model or to a plurality of neural network models; for example, some requests need to be processed by several neural network models in sequence.
Specifically, all request queues may be polled in turn. For the currently polled request queue to be processed, judge whether it contains any request. If it contains a request, judge whether at least one target neural network model corresponding to that queue is stored in the video memory, and decide from the result whether the target neural network model needs to be preloaded. If it contains no request, the queue does not need to be processed, and polling continues with the next request queue.
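By way of illustration only, the routing and polling logic of step 101 might be sketched as follows in Python; the queue structures, model names, and the `route_request`/`poll_once` helpers are hypothetical stand-ins, not the patent's actual implementation.

```python
from collections import deque

# Hypothetical bookkeeping: model name -> pending request queue,
# plus the set of model names currently resident in video memory.
request_queues = {"model_a": deque(), "model_b": deque()}
models_in_video_memory = {"model_b"}

def route_request(request, model_name):
    """Append a request to the queue of the model that must process it."""
    request_queues[model_name].append(request)

def poll_once():
    """One polling pass over all request queues (step 101)."""
    for model_name, queue in request_queues.items():
        if not queue:
            continue                        # no pending requests; skip this queue
        if model_name in models_in_video_memory:
            yield ("process", model_name)   # step 103 can run directly
        else:
            yield ("prefetch", model_name)  # step 102 must run first

route_request("translate: hello", "model_a")
print(list(poll_once()))                    # [('prefetch', 'model_a')]
```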
102. If the target neural network model is not stored in the video memory, prefetch the target neural network model from a memory and load it into the video memory so as to cover at least one neural network model stored in the video memory in advance.
When the network processor is first used, the video memory has enough free space, and at least one neural network model to be used can be loaded directly into the video memory. Neural network models to be used later can first be stored in the memory; when a target neural network model in the memory is about to be used, in order for it to load successfully, it can be prefetched from the memory and loaded into the video memory so as to cover at least one neural network model stored in the video memory in advance.
A specific implementation of prefetching the target neural network model from the memory and loading it into the video memory to cover at least one pre-stored neural network model may be as follows. Prefetch the target neural network model from the memory, then judge whether the video memory contains any neural network model in the unexecuted state. If it does, have the target neural network model cover at least one neural network model in the unexecuted state based on a covering strategy. If it does not (that is, every neural network model in the video memory is executing), the covering can be performed once some model in the video memory finishes executing. In addition, to make full use of the video memory's storage space, if the video memory contains no neural network model in the unexecuted state, it can be judged whether the remaining space of the video memory is larger than or equal to the size of the target neural network model: if so, the target neural network model is loaded into the video memory directly; if not, the target neural network model can be loaded by covering other neural network models once a model in the video memory finishes executing.
Likewise, to make full use of the video memory's storage space, when it is determined that the target neural network model is not stored in the video memory, it can first be judged whether the remaining space of the video memory is larger than or equal to the size of the target neural network model. If the remaining space is larger than or equal to the size of the target neural network model, the model is prefetched from the memory and loaded directly into the video memory; if the remaining space is smaller, the target neural network model is prefetched from the memory and loaded into the video memory so as to cover at least one neural network model pre-stored in the video memory.
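The loading decision just described can be sketched as follows, assuming sizes are tracked in bytes; `load_target_model` and its parameters are illustrative names only, and the victim choice here is a placeholder for the covering strategy sketched below.

```python
def load_target_model(target, sizes, free_space, resident, executing):
    """Decide how to bring `target` into video memory (step 102).

    sizes:      dict, model name -> size in bytes (resident models and target)
    free_space: remaining video memory in bytes
    resident:   set of model names currently loaded in video memory
    executing:  subset of `resident` that is currently executing
    """
    if free_space >= sizes[target]:
        resident.add(target)                # enough remaining space: load directly
        return "loaded directly"
    idle = resident - executing             # models in the unexecuted state
    if not idle:
        # Every resident model is executing: defer until one finishes.
        return "deferred: waiting for an executing model to finish"
    victim = max(idle, key=sizes.get)       # placeholder choice; see the
    resident.discard(victim)                # covering-strategy sketch below
    resident.add(target)
    return f"loaded by covering {victim}"

resident = {"m2", "m3", "m4", "m5"}
sizes = {"m1": 300, "m2": 300, "m3": 300, "m4": 300, "m5": 300}
print(load_target_model("m1", sizes, 0, resident, executing={"m2"}))
# loaded by covering one of the idle models m3/m4/m5
```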
A specific implementation of having the target neural network model cover at least one neural network model in the unexecuted state based on a covering strategy may be as follows. If the reserved space of the video memory is larger than a preset space threshold, randomly select one neural network model in the unexecuted state and cover it with the target neural network model; alternatively, a neural network model whose execution order comes after that of the target neural network model may be selected from the models in the unexecuted state and covered. If the reserved space of the video memory is smaller than the preset space threshold, cover at least one neural network model in the unexecuted state with the target neural network model according to the sizes of the unexecuted models and the size of the target neural network model. Note that the reserved space is not the real-time remaining space of the video memory, but storage space deliberately set aside when neural network models are loaded.
For example, if the video memory could hold 5 neural network models, the implementation may actually store at most 3 and keep the space of 2 models reserved. In this case, regardless of the size of the next target neural network model, loading succeeds by randomly selecting one model in the unexecuted state to cover.
When the reserved space is smaller than the preset space threshold, random covering may not work, and the model or models to cover must be selected according to the actual situation. Specifically, first determine the size of each neural network model in the unexecuted state and the size of the target neural network model. If some unexecuted model is larger than or equal to the target neural network model, the target model can cover it directly; if every unexecuted model is smaller than the target neural network model, the target model can cover two of them together.
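A minimal sketch of that covering strategy, under the assumption that model sizes and the reserved space are tracked in bytes; `choose_victims` and its parameters are hypothetical names, not the patent's implementation.

```python
import random

def choose_victims(target_size, idle_models, reserved_space, space_threshold):
    """Pick which unexecuted model(s) the target model should cover.

    idle_models: dict, name -> size, for models in the unexecuted state
    reserved_space / space_threshold: bytes, per the covering strategy above
    Returns a list of model names to cover (empty list = cannot decide yet).
    """
    if reserved_space > space_threshold:
        # Enough slack is reserved: covering any one idle model succeeds,
        # so a random choice is acceptable.
        return [random.choice(list(idle_models))]
    # Tight reserve: select by size. Prefer a single idle model at least
    # as large as the target; otherwise combine the two largest ones.
    for name, size in idle_models.items():
        if size >= target_size:
            return [name]
    by_size = sorted(idle_models, key=idle_models.get, reverse=True)
    if len(by_size) >= 2 and idle_models[by_size[0]] + idle_models[by_size[1]] >= target_size:
        return by_size[:2]
    return []  # not enough idle space even when combining two models

idle = {"m3": 200, "m4": 250, "m5": 180}
print(choose_victims(target_size=400, idle_models=idle,
                     reserved_space=100, space_threshold=512))  # ['m4', 'm3']
```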
103. Process the requests in the request queue to be processed based on the target neural network model stored in the video memory.
Once the target neural network model has been preloaded into the video memory, it can be invoked to process the corresponding requests one by one whenever it needs to be called.
The request processing method based on a neural network model provided by the embodiment of the invention can store at least one neural network model in advance in a memory whose storage space is far larger than that of the video memory. When a request needs to be processed by at least one target neural network model corresponding to a request queue to be processed, and the target neural network model is not stored in the video memory of the network processor, the target neural network model can be prefetched from the memory and successfully loaded into the video memory by covering at least one neural network model stored in the video memory in advance, so that loading never fails because of insufficient video memory space. The memory thus indirectly expands the capacity of the video memory, a single network processor can load a large-scale set of neural network models, and the software resources can change dynamically and rapidly with traffic, which improves flexibility.
Optionally, the method embodiment shown in fig. 1 may be implemented by a prefetch thread and a worker thread, as shown in fig. 2, the specific implementation process includes:
201. The prefetch thread judges whether at least one target neural network model is stored in the video memory of the network processor; if it is stored in the video memory, step 202 is executed; if not, steps 203 to 205 are executed.
Specifically, the prefetch thread polls all the request queues and judges whether the currently polled request queue to be processed contains a request. If it contains a request, the prefetch thread judges whether at least one target neural network model is stored in the video memory of the network processor; if it contains no request, the prefetch thread continues polling the next request queue.
The specific implementation manner of the prefetch thread determining whether at least one target neural network model is stored in the video memory of the network processor is the same as that in step 101, and is not described herein again.
202. The prefetch thread informs the working thread to process the requests in the request queue to be processed based on the target neural network model.
The embodiment of the invention may use one working thread to process the request queues corresponding to different neural network models in a time-division-multiplexed manner, or start a plurality of working threads, with different working threads processing the request queues corresponding to different neural network models.
203. The prefetch thread informs the network processor to prefetch the target neural network model from the memory and load it into the video memory so as to cover at least one neural network model pre-stored in the video memory.
The specific implementation manner of the network processor prefetching the target neural network model from the memory and loading the target neural network model into the video memory to cover at least one neural network model pre-stored in the video memory is the same as that in step 102, and is not described herein again.
204. The prefetch thread adds the name of the target neural network model to a prefetch task queue.
Because the target neural network model is preloaded into the video memory and may wait for a period of time before being used, the name of the target neural network model can be added to the prefetch task queue after the model has been loaded into the video memory, so that the working thread can later determine from the prefetch task queue which neural network model's request queue needs to be processed.
In practice, the prefetch thread preloads neural network models while a working thread is processing a request queue. If a request queue contains many requests, the working thread takes a relatively long time to finish it, while the number of models preloaded by the prefetch thread keeps growing; preloading then far outpaces request processing, and prefetch-thread resources are wasted. To avoid this waste, after the prefetch thread adds the name of the target neural network model to the prefetch task queue, it can judge whether the number of names contained in the prefetch task queue is greater than or equal to a preset number threshold. If it is, the prefetch thread waits until the number of names in the prefetch task queue falls below the threshold before processing the target neural network model corresponding to the next request queue to be processed; if it is smaller than the threshold, the prefetch thread processes the target neural network model corresponding to the next request queue immediately. The preset number threshold may be 2, or some other value determined from practical experience.
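A minimal threaded sketch of this back-pressure rule, assuming the threshold value of 2 mentioned above; the class and method names are hypothetical, not taken from the patent.

```python
import threading
from collections import deque

class PrefetchTaskQueue:
    """Prefetch task queue with the back-pressure rule described above: the
    prefetch thread may not push a new model name while the queue already
    holds `max_names` names, and resumes once the working thread has
    drained it below that threshold."""

    def __init__(self, max_names=2):
        self._names = deque()
        self._max = max_names
        self._cond = threading.Condition()

    def add(self, model_name):              # called by the prefetch thread
        with self._cond:
            self._cond.wait_for(lambda: len(self._names) < self._max)
            self._names.append(model_name)
            self._cond.notify_all()         # wake a waiting working thread

    def take(self):                         # called by the working thread
        with self._cond:
            self._cond.wait_for(lambda: len(self._names) > 0)
            name = self._names.popleft()
            self._cond.notify_all()         # unblock the prefetch thread
            return name

q = PrefetchTaskQueue(max_names=2)
worker = threading.Thread(
    target=lambda: [print("worker got", q.take()) for _ in range(3)])
worker.start()
for name in ["model_1", "model_2", "model_3"]:
    q.add(name)                             # blocks while 2 names are pending
worker.join()
```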
205. The prefetch thread informs a working thread to process the requests in the request queue corresponding to the name of the target neural network model in the prefetch task queue.
The working thread judges whether the prefetch task queue contains the name of a neural network model. If it does, the working thread obtains a batch of requests from the request queue corresponding to that name and submits them to the network processor to be processed based on the target neural network model; after the processing finishes, the working thread notifies the prefetch thread and returns the processing result.
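The working-thread side might be sketched like this; `npu_process` is a hypothetical stand-in for submitting a batch to the network processor, and the other names are likewise illustrative assumptions.

```python
def worker_loop(prefetch_tasks, request_queues, npu_process, notify_prefetcher,
                batch_size=8):
    """Working-thread loop for step 205 (all helper names hypothetical)."""
    for name in prefetch_tasks:             # model names ready in video memory
        queue = request_queues.get(name, [])
        while queue:
            # Take a batch of requests from the queue for this model.
            batch = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
            npu_process(name, batch)        # process based on the target model
        notify_prefetcher(name)             # report completion to the prefetcher

worker_loop(
    prefetch_tasks=["model_a"],
    request_queues={"model_a": ["req1", "req2", "req3"]},
    npu_process=lambda name, batch: print(name, "processed", batch),
    notify_prefetcher=lambda name: print("prefetch thread notified for", name),
    batch_size=2,
)
```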
According to the request processing method based on a neural network model provided by the embodiment of the invention, the prefetch thread can prefetch at least one target neural network model from the memory and load it into the video memory so as to cover at least one neural network model pre-stored there, and then notify the working thread to process the request queue corresponding to the target neural network model. Loading therefore never fails because of insufficient video memory space: the memory indirectly expands the capacity of the video memory, a single network processor can load a large-scale set of neural network models, and the software resources can change dynamically and rapidly with traffic, which improves flexibility.
Further, the method embodiments described above may be applied in a variety of scenarios, such as a language translation scenario, an image processing scenario, and so on. The method is explained below by taking a language translation scenario as an example, and as shown in fig. 3, the method includes:
301. Judge whether at least one target language translation model is stored in the video memory of the graphics processor.
The target language translation model comprises a neural network model which corresponds to the request queue to be processed and is used for language translation.
302. If the target language translation model is not stored in the video memory, prefetch the target language translation model from the memory and load it into the video memory so as to cover at least one language translation model pre-stored in the video memory.
303. Process the requests in the request queue to be processed based on the target language translation model stored in the video memory.
The specific implementation manner of steps 301-303 can refer to the embodiment shown in fig. 1 or 2, and will not be described herein again.
304. Output the translation results corresponding to the requests in the request queue to be processed.
The embodiment of the invention does not limit how the translation result is output; for example, the request may be displayed on the left side of the page and the translation result on the right side, with each translation result labeled to show which request it corresponds to.
The request processing method based on a neural network model provided by the embodiment of the invention can store at least one language translation model in advance in a memory whose storage space is far larger than that of the video memory. When a request needs to be processed by at least one target language translation model corresponding to a request queue to be processed, and the target language translation model is not stored in the video memory of the graphics processor, the target language translation model can be prefetched from the memory and successfully loaded into the video memory by covering at least one language translation model stored in the video memory in advance, so that the request can be processed and the translation result output, and loading never fails because of insufficient video memory space. The memory thus indirectly expands the capacity of the video memory, a single graphics processor can load a large-scale set of language translation models, and the software resources can change dynamically and rapidly with traffic, which improves flexibility.
Further, according to the above method embodiment, another embodiment of the present invention further provides a request processing method based on a neural network model, where the method is applied to a receiving party, and the method includes:
S1. Receiving a request sent by a requester;
S2. Determining, according to the request, at least one target neural network model required to process the request; when a model identifier is included in the request, the target neural network model may be determined from the model identifier.
S3. Judging whether the target neural network model is stored in a first storage space;
S4. If the target neural network model is not stored in the first storage space, prefetching the target neural network model from a second storage space and loading it into the first storage space so as to cover at least one neural network model stored in the first storage space in advance;
The first storage space is the storage space required when the target neural network model is used to process the request, and the second storage space may be a storage space other than the first storage space. For example, when a network processor needs to process a request using the target neural network model, the first storage space is the video memory and, correspondingly, the second storage space may be the memory.
S5. Processing the request based on the target neural network model stored in the first storage space;
S6. Sending the processing result to the requester.
For example, as shown in fig. 4, suppose the sender is a client, the receiver is a server, the first storage space is the video memory, and the second storage space is the memory. After the client sends a request to the server, the server determines from the content of the request that the target neural network model is model 1 and judges whether model 1 is stored in the video memory. Since only models 2 to 5 are stored in the video memory, model 1 is not there. Model 1 can therefore be prefetched from the memory, which stores models 1 to 6, and loaded into the video memory so as to cover model 5, after which the video memory holds models 1 to 4; the request is processed using model 1, and finally the processing result is fed back to the client.
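The fig. 4 walk-through can be reproduced with a toy script; the capacity, model names, and the `handle_request` helper are illustrative assumptions rather than the claimed implementation.

```python
# Toy end-to-end run of the fig. 4 scenario: video memory holds models 2-5,
# main memory holds models 1-6, and a request arrives that needs model 1.
main_memory = {f"model_{i}" for i in range(1, 7)}       # models 1-6
video_memory = ["model_2", "model_3", "model_4", "model_5"]
CAPACITY = 4                                            # fits four models

def handle_request(request, target):
    if target not in video_memory:
        assert target in main_memory, "model must at least exist in memory"
        if len(video_memory) >= CAPACITY:
            victim = video_memory.pop()                 # cover model_5
            print(f"{target} covers {victim}")
        video_memory.insert(0, target)                  # prefetch + load
    print(f"video memory now: {video_memory}")
    return f"result of {target} on {request!r}"

print(handle_request("translate: hello", "model_1"))
# model_1 covers model_5
# video memory now: ['model_1', 'model_2', 'model_3', 'model_4']
# result of model_1 on 'translate: hello'
```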
Further, according to the above method embodiment, another embodiment of the present invention further provides a request processing apparatus based on a neural network model, as shown in fig. 5, the apparatus includes:
a determining unit 41, configured to determine whether at least one target neural network model is stored in a video memory of the network processor, where the target neural network model includes a neural network model corresponding to a request queue to be processed;
a loading unit 42, configured to prefetch the target neural network model from a memory and load the target neural network model into the video memory if the target neural network model is not stored in the video memory, so as to cover at least one neural network model that is pre-stored in the video memory;
and a processing request unit 43, configured to process the request in the to-be-processed request queue based on the target neural network model stored in the video memory.
Optionally, as shown in fig. 6, the loading unit 42 includes:
a prefetching module 421, configured to prefetch the target neural network model from the memory;
a judging module 422, configured to judge whether the video memory contains a neural network model in an unexecuted state;
the overlay module 423 is configured to, if the video memory contains the neural network model in the non-execution state, enable the target neural network model to overlay at least one neural network model in the non-execution state based on an overlay policy.
Optionally, the determining unit 41 is further configured to determine whether the size of the remaining space of the video memory is greater than or equal to the size of the target neural network model if there is no neural network model in the video memory that is in an unexecuted state;
the loading unit 42 is further configured to directly load the target neural network model into the video memory if the size of the remaining space is greater than or equal to the size of the target neural network model.
Optionally, the covering module 423 is configured to randomly select a neural network model in an unexecuted state if the reserved space of the video memory is greater than a preset space threshold, and cover the selected neural network model with the target neural network model; and if the reserved space of the video memory is smaller than a preset space threshold, covering at least one neural network model in the unexecuted state by using the target neural network model according to the size of the neural network model in the unexecuted state and the size of the target neural network model.
Optionally, the loading unit 42 is configured to have the prefetch thread notify the network processor to prefetch the target neural network model from the memory and load it into the video memory, so as to cover at least one neural network model pre-stored in the video memory.
Optionally, as shown in fig. 6, the processing request unit 43 includes:
an adding module 431, configured to add, by the prefetch thread, the name of the target neural network model to a prefetch task queue;
and the notification processing module 432 is configured to notify, by the prefetch thread, a worker thread to process a request in a request queue corresponding to the name of the target neural network model in the prefetch task queue.
Optionally, as shown in fig. 6, the apparatus further includes:
a waiting processing unit 44, configured so that, after the prefetch thread adds the name of the target neural network model to the prefetch task queue, if the number of names in the prefetch task queue is greater than or equal to a preset number threshold, the prefetch thread waits until the number of names in the prefetch task queue falls below the preset number threshold before processing the target neural network model corresponding to the next request queue to be processed;
and a processing unit 45, configured to have the prefetch thread process the target neural network model corresponding to the next request queue to be processed if the number of names in the prefetch task queue is smaller than the preset number threshold.
Optionally, the determining unit 41 is configured to determine whether a target neural network model corresponding to the to-be-processed request queue is stored in the video memory if there is a request in the to-be-processed request queue.
The request processing device based on a neural network model provided by the embodiment of the invention can store at least one neural network model in advance in a memory whose storage space is far larger than that of the video memory. When a request needs to be processed by at least one target neural network model corresponding to a request queue to be processed, and the target neural network model is not stored in the video memory of the network processor, the target neural network model can be prefetched from the memory and successfully loaded into the video memory by covering at least one neural network model stored in the video memory in advance, so that loading never fails because of insufficient video memory space. The memory thus indirectly expands the capacity of the video memory, a single network processor can load a large-scale set of neural network models, and the software resources can change dynamically and rapidly with traffic, which improves flexibility.
Further, according to the above method embodiment, another embodiment of the present invention further provides a request processing apparatus based on a neural network model, as shown in fig. 7, the apparatus includes:
a judging unit 51, configured to judge whether at least one target language translation model is stored in a video memory of the graphics processor, where the target language translation model includes a neural network model for performing language translation corresponding to a request queue to be processed;
a loading unit 52, configured to prefetch the target language translation model from a memory and load the target language translation model into the video memory if the target language translation model is not stored in the video memory, so as to cover at least one language translation model pre-stored in the video memory;
a processing request unit 53, configured to process a request in the to-be-processed request queue based on the target language translation model stored in the video memory;
and the output unit 54 is configured to output a translation result corresponding to the request in the pending request queue.
The request processing device based on a neural network model provided by the embodiment of the invention can store at least one language translation model in advance in a memory whose storage space is far larger than that of the video memory. When a request needs to be processed by at least one target language translation model corresponding to a request queue to be processed, and the target language translation model is not stored in the video memory of the graphics processor, the target language translation model can be prefetched from the memory and successfully loaded into the video memory by covering at least one language translation model stored in the video memory in advance, so that the request can be processed and the translation result output, and loading never fails because of insufficient video memory space. The memory thus indirectly expands the capacity of the video memory, a single graphics processor can load a large-scale set of language translation models, and the software resources can change dynamically and rapidly with traffic, which improves flexibility.
Further, according to the above method embodiment, another embodiment of the present invention further provides a request processing apparatus based on a neural network model, as shown in fig. 8, the apparatus includes:
a receiving unit 61, configured to receive a request sent by a requester;
a determining unit 62, configured to determine, according to the request, at least one target neural network model required for processing the request;
a judging unit 63, configured to judge whether the target neural network model is stored in a first storage space;
a loading unit 64, configured to prefetch the target neural network model from a second storage space and load the target neural network model into the first storage space to cover at least one neural network model pre-stored in the first storage space if the target neural network model is not stored in the first storage space;
a processing unit 65, configured to process the request based on the target neural network model stored in the first storage space;
a sending unit 66, configured to send the processing result to the requesting party.
Further, another embodiment of the present invention also provides a storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor and to execute the request processing method based on the neural network model as described above.
Further, another embodiment of the present invention also provides an electronic device including a storage medium and a processor;
the processor is adapted to execute instructions;
the storage medium is adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform a neural network model-based request processing method as described above.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the related features of the method and apparatus described above may be cross-referenced. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not indicate that any embodiment is better or worse than another.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any references above to specific languages are provided for disclosure of enablement and practice of the present invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some, but not other, features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the neural network model-based request processing method and apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, and the like does not indicate any ordering. These words may be interpreted as names.

Claims (15)

1. A request processing method based on a neural network model, the method comprising:
judging whether at least one target neural network model is stored in a video memory of a network processor, wherein the target neural network model comprises a neural network model corresponding to a request queue to be processed;
if the target neural network model is not stored in the video memory, prefetching the target neural network model from a memory and loading the target neural network model into the video memory so as to overwrite at least one neural network model pre-stored in the video memory;
and processing the request in the request queue to be processed based on the target neural network model stored in the video memory.
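As a hedged reading of claim 1, the sketch below checks model residency before draining a request queue to be processed and prefetches on a miss; video_memory and host_memory are ordinary dicts used as stand-ins for the video memory and the memory, and all names in the snippet are invented for illustration.

    from collections import deque

    host_memory = {"model_a": "weights_a", "model_b": "weights_b"}
    video_memory = {"model_a": "weights_a"}  # one model pre-stored

    # A request queue to be processed, bound to the model "model_b".
    pending = ("model_b", deque(["req-1", "req-2"]))

    def drain(queue_entry):
        model_name, requests = queue_entry
        if model_name not in video_memory:        # residency check (step 1)
            video_memory.clear()                  # overwrite pre-stored model
            video_memory[model_name] = host_memory[model_name]  # prefetch (step 2)
        while requests:                           # process the queue (step 3)
            print(f"served {requests.popleft()} with {video_memory[model_name]}")

    drain(pending)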
2. The method of claim 1, wherein prefetching the target neural network model from the memory and loading the target neural network model into the video memory to overwrite at least one neural network model pre-stored in the video memory comprises:
prefetching the target neural network model from the memory;
judging whether the video memory contains a neural network model in an unexecuted state;
and if the video memory contains a neural network model in the unexecuted state, overwriting at least one neural network model in the unexecuted state with the target neural network model based on an overwrite policy.
3. The method of claim 2, further comprising:
if the video memory does not contain a neural network model in the unexecuted state, judging whether the size of the remaining space of the video memory is larger than or equal to the size of the target neural network model;
and if the size of the remaining space is larger than or equal to the size of the target neural network model, directly loading the target neural network model into the video memory.
4. The method of claim 2, wherein overwriting at least one neural network model in the unexecuted state with the target neural network model based on the overwrite policy comprises:
if the reserved space of the video memory is larger than a preset space threshold, randomly selecting one neural network model in the unexecuted state and overwriting the selected neural network model with the target neural network model;
and if the reserved space of the video memory is smaller than the preset space threshold, overwriting at least one neural network model in the unexecuted state with the target neural network model according to the size of the neural network model in the unexecuted state and the size of the target neural network model.
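Read together, claims 2-4 amount to an eviction routine for the video memory. The sketch below is one possible reading under stated assumptions, not the patented implementation: sizes are in arbitrary units, RESERVED_THRESHOLD stands in for the preset space threshold, and all helper names are invented.

    import random

    RESERVED_THRESHOLD = 512  # assumed preset space threshold, arbitrary units

    def load_model(video_memory, capacity, executing, target, target_size):
        """video_memory maps model name -> size; executing holds running models."""
        idle = [n for n in video_memory if n not in executing]
        free = capacity - sum(video_memory.values())
        if not idle:
            # claim 3: nothing is in the unexecuted state, so load directly
            # only when the remaining space fits the target model
            if free < target_size:
                return False
            video_memory[target] = target_size
            return True
        if free > RESERVED_THRESHOLD:
            # claim 4, first branch: ample reserved space, so overwrite one
            # randomly chosen unexecuted model (assumed to free enough room)
            video_memory.pop(random.choice(idle))
        else:
            # claim 4, second branch: overwrite unexecuted models, largest
            # first, until the target model fits
            for name in sorted(idle, key=video_memory.get, reverse=True):
                if capacity - sum(video_memory.values()) >= target_size:
                    break
                video_memory.pop(name)
        video_memory[target] = target_size
        return True

    vm = {"m1": 300, "m2": 200}
    print(load_model(vm, capacity=1024, executing={"m1"}, target="m3", target_size=400))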
5. The method of claim 1, wherein prefetching the target neural network model from the memory and loading the target neural network model into the video memory to overwrite at least one neural network model pre-stored in the video memory comprises:
a prefetch thread notifies the network processor to prefetch the target neural network model from the memory and load the model into the video memory so as to overwrite at least one neural network model pre-stored in the video memory.
6. The method of claim 1, wherein processing the request in the request queue to be processed based on the target neural network model stored in the video memory comprises:
a prefetch thread adds the name of the target neural network model to a prefetch task queue;
and the prefetch thread notifies a worker thread to process the requests in the request queue corresponding to the name of the target neural network model in the prefetch task queue.
7. The method of claim 6, wherein after the prefetch thread adds the name of the target neural network model to the prefetch task queue, the method further comprises:
if the number of names contained in the prefetch task queue is larger than or equal to a preset number threshold, the prefetch thread processes the target neural network model corresponding to the next request queue to be processed only after the number of names contained in the prefetch task queue drops below the preset number threshold;
and if the number of names contained in the prefetch task queue is smaller than the preset number threshold, the prefetch thread directly processes the target neural network model corresponding to the next request queue to be processed.
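One way to realize the threshold behaviour of claims 6 and 7 is a bounded queue: a put blocks while the queue holds the threshold number of names and resumes once workers drain it below that count. The Python sketch below is illustrative only; MAX_PENDING stands in for the preset number threshold and the thread bodies are stubs.

    import queue
    import threading

    MAX_PENDING = 4  # assumed preset number threshold
    prefetch_tasks = queue.Queue(maxsize=MAX_PENDING)

    def prefetch_thread(model_names):
        for name in model_names:
            # put() blocks while MAX_PENDING names are queued, i.e. the
            # prefetch thread waits until the count drops below the threshold
            prefetch_tasks.put(name)
        prefetch_tasks.put(None)  # sentinel: no more models to prefetch

    def worker_thread():
        while True:
            name = prefetch_tasks.get()
            if name is None:
                break
            # a real worker would run the requests queued for this model
            print(f"processing request queue for model {name}")

    t1 = threading.Thread(target=prefetch_thread, args=(["m1", "m2", "m3"],))
    t2 = threading.Thread(target=worker_thread)
    t1.start(); t2.start()
    t1.join(); t2.join()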
8. The method of any one of claims 1-7, wherein judging whether at least one target neural network model is stored in the video memory of the network processor comprises:
if a request exists in the request queue to be processed, judging whether at least one target neural network model corresponding to the request queue to be processed is stored in the video memory.
9. A request processing method based on a neural network model, the method comprising:
judging whether at least one target language translation model is stored in a video memory of a graphics processor, wherein the target language translation model comprises a neural network model which corresponds to a request queue to be processed and is used for language translation;
if the target language translation model is not stored in the video memory, prefetching the target language translation model from a memory and loading the target language translation model into the video memory so as to overwrite at least one language translation model pre-stored in the video memory;
processing the request in the request queue to be processed based on the target language translation model stored in the video memory;
and outputting a translation result corresponding to the request in the request queue to be processed.
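The translation variant of claim 9 specializes the same flow, so a toy version only swaps the payload: the stub "models" below are plain functions, and the model names and outputs are invented for illustration.

    host_models = {
        "en-zh": lambda s: f"[zh] {s}",
        "en-fr": lambda s: f"[fr] {s}",
    }
    gpu_models = {}  # stands in for the graphics processor's video memory

    def translate(model_name, texts):
        if model_name not in gpu_models:                      # residency check
            gpu_models.clear()                                # overwrite pre-stored model
            gpu_models[model_name] = host_models[model_name]  # prefetch from memory
        return [gpu_models[model_name](t) for t in texts]     # output translations

    print(translate("en-zh", ["hello", "world"]))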
10. A request processing method based on a neural network model, the method comprising:
receiving a request sent by a requester;
determining, from the request, at least one target neural network model required to process the request;
judging whether the target neural network model is stored in a first storage space;
if the target neural network model is not stored in the first storage space, prefetching the target neural network model from a second storage space and loading the target neural network model into the first storage space so as to overwrite at least one neural network model pre-stored in the first storage space;
processing the request based on the target neural network model stored in the first storage space;
and sending the processing result to the requester.
11. A request processing apparatus based on a neural network model, the apparatus comprising:
a judging unit, configured to judge whether at least one target neural network model is stored in a video memory of a network processor, wherein the target neural network model comprises a neural network model corresponding to a request queue to be processed;
a loading unit, configured to, if the target neural network model is not stored in the video memory, prefetch the target neural network model from a memory and load it into the video memory so as to overwrite at least one neural network model pre-stored in the video memory;
and a request processing unit, configured to process the request in the request queue to be processed based on the target neural network model stored in the video memory.
12. A request processing apparatus based on a neural network model, the apparatus comprising:
a judging unit, configured to judge whether at least one target language translation model is stored in a video memory of a graphics processor, wherein the target language translation model comprises a neural network model which corresponds to a request queue to be processed and is used for language translation;
a loading unit, configured to, if the target language translation model is not stored in the video memory, prefetch the target language translation model from a memory and load it into the video memory so as to overwrite at least one language translation model pre-stored in the video memory;
a request processing unit, configured to process the request in the request queue to be processed based on the target language translation model stored in the video memory;
and an output unit, configured to output the translation result corresponding to the request in the request queue to be processed.
13. A request processing apparatus based on a neural network model, the apparatus comprising:
a receiving unit, configured to receive a request sent by a requester;
a determining unit, configured to determine, according to the request, at least one target neural network model required for processing the request;
a judging unit, configured to judge whether the target neural network model is stored in a first storage space;
a loading unit, configured to, if the target neural network model is not stored in the first storage space, prefetch the target neural network model from a second storage space and load it into the first storage space so as to overwrite at least one neural network model pre-stored in the first storage space;
a processing unit, configured to process the request based on the target neural network model stored in the first storage space;
and a sending unit, configured to send the processing result to the requester.
14. A storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor and to perform the request processing method based on a neural network model according to any one of claims 1 to 10.
15. An electronic device, comprising a storage medium and a processor;
the processor is adapted to implement the instructions;
the storage medium is adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform the neural network model-based request processing method as claimed in any one of claims 1 to 10.
CN202010294562.7A 2020-04-15 2020-04-15 Request processing method and device based on neural network model Pending CN113537475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010294562.7A CN113537475A (en) 2020-04-15 2020-04-15 Request processing method and device based on neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010294562.7A CN113537475A (en) 2020-04-15 2020-04-15 Request processing method and device based on neural network model

Publications (1)

Publication Number Publication Date
CN113537475A true CN113537475A (en) 2021-10-22

Family

ID=78088264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010294562.7A Pending CN113537475A (en) 2020-04-15 2020-04-15 Request processing method and device based on neural network model

Country Status (1)

Country Link
CN (1) CN113537475A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination