CN113674137A - Model loading method for maximizing and improving video memory utilization rate based on LRU (least recently used) strategy - Google Patents

Info

Publication number
CN113674137A
Authority
CN
China
Prior art keywords
time
utilization rate
model
period
video memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111001401.5A
Other languages
Chinese (zh)
Inventor
钟靖
吴小炎
吴名朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Whale Cloud Technology Co Ltd
Original Assignee
Whale Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Whale Cloud Technology Co Ltd filed Critical Whale Cloud Technology Co Ltd
Priority to CN202111001401.5A priority Critical patent/CN113674137A/en
Publication of CN113674137A publication Critical patent/CN113674137A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a model loading method for maximally improving video memory utilization based on an LRU (least recently used) policy, comprising the following steps: constructing and deploying three models (face recognition, portrait comparison and human body analysis) and configuring instances; starting a timing task that acquires the real-time GPU utilization every 10 minutes and calculates the average GPU utilization over the period; calculating the moving average video memory usage rate through scheduling by the optimal resource scheduling policy; predicting, from the data gathered over the period, the number of instances required in the next period through the optimal resource scheduling policy; and adjusting the number of instances according to the number the model requires in the next period and the number it currently uses. Beneficial effects: by dynamically starting and stopping models through the LRU scheduling policy, the method addresses the pain point of low utilization when multiple models share a video memory, improves video memory utilization and saves resources.

Description

Model loading method for maximizing and improving video memory utilization rate based on LRU (least recently used) strategy
Technical Field
The invention relates to the technical field of video memory, in particular to a model loading method for maximizing the utilization rate of the video memory based on an LRU (least recently used) strategy.
Background
When a large enterprise undergoes digital transformation, it inevitably faces AI scenarios and the demands of AI applications and AI capabilities. In the production use of real AI capabilities, those capabilities must be called; typically an AI capability open platform exposes them externally as APIs, and capabilities are uploaded and deployed by model version. In capability deployment, both single-model and multi-model joint deployment exist, and joint deployment of multiple models clearly reflects the value of resource utilization better; on this basis, the problem of sharing CPU, GPU, memory and video memory resources must be solved. In the daily production of AI with multiple models, applications inevitably place differentiated demands on model call volume in different time periods. Within the same AI capability, model A may be called intensively while model B is called sparsely or not at all, leaving model A short of resources while model B's resources are wasted. There is also a need to replace models at runtime: the same capability may comprise several models (A, B, C), each starting several instances; resources may initially only support the call volumes of A and B, while later in production B receives no calls and C does, again causing resources to be occupied and wasted.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a model loading method for maximizing the utilization rate of the video memory based on the LRU strategy, so as to overcome the technical problems in the prior related art.
Therefore, the invention adopts the following specific technical scheme:
the model loading method for maximizing and improving the utilization rate of the video memory based on the LRU policy comprises the following steps:
constructing and deploying three models of face recognition, portrait comparison and human body analysis, and configuring instances;
starting a timing task, acquiring the real-time GPU utilization within the period every 10 minutes, and calculating the average GPU utilization over the period;
calculating the moving average video memory usage rate through scheduling by the optimal resource scheduling policy;
predicting, according to the data information within the period, the number of instances required in the next period through the optimal resource scheduling policy;
adjusting the number of instances according to the number of instances the model requires in the next period and the number of instances the model currently uses;
and finally maximizing the video memory utilization through the optimal resource scheduling policy.
Further, constructing and deploying the three models of face recognition, portrait comparison and human body analysis and configuring instances comprises the following steps:
configuring the three model capabilities of face recognition, portrait comparison and human body analysis through an AI platform;
configuring six elastically scalable instances for each of the three models of face recognition, portrait comparison and human body analysis;
configuring the three models of face recognition, portrait comparison and human body analysis onto the same graphics card;
and deploying and starting the three models of face recognition, portrait comparison and human body analysis through a container management platform.
Further, the starting of the timing task, obtaining the real-time utilization rate of the GPU in the time period every 10 minutes, and calculating the average GPU utilization rate in the time period includes the following steps:
starting a timing task, and acquiring the real-time resource utilization rate of the GPU in the period of time by a resource monitoring tool every 10 minutes;
storing the acquired GPU real-time utilization rate for the scheduling of the optimal resource scheduling policy (LRU);
the optimal resource scheduling strategy scheduling center circularly obtains data of a certain period of time from the remote dictionary service, samples the real-time utilization rate of the GPU in the period of time, and obtains the average GPU utilization rate in the period of time through calculation.
Further, the step of obtaining the real-time resource utilization rate of the GPU in the period of time by the resource monitoring tool every 10 minutes includes the following steps:
respectively acquiring the number of pictures analyzed by the three models in a first time period and a second time period;
and respectively obtaining the number of the pictures analyzed by the three models in the first time period, the number of the pictures analyzed by the three models in the second time period and the maximum number of the pictures analyzed by the three models in 1 second, and calculating to obtain the GPU real-time resource utilization rate.
Further, the formula for calculating the real-time GPU resource utilization is as follows:
A = (Ci + Cj) / ((i + j) × M)
wherein A represents the real-time GPU resource utilization, i and j are the durations (in seconds) of the first and second time periods respectively, Ci represents the number of pictures the model analyzed during the first time period, Cj represents the number analyzed during the second time period, and M represents the maximum number of pictures the model can analyze in 1 second.
Further, the calculation formula for obtaining the average GPU utilization over the period is as follows:
V = ( Σ(i=1..I) Σ(j=1..J) A(i,j) ) / (I × J)
wherein V represents the average GPU utilization, I represents the number of times the real-time GPU utilization is sampled within the period, J represents the number of running model instances, and A(i,j) is the i-th sample for the j-th instance.
Further, the formula for calculating the moving average video memory usage rate through scheduling by the optimal resource scheduling policy is as follows:
Ut = β × U(t-1) + (1 - β) × Vt
wherein Ut is the moving average video memory usage rate of the model in period t and Vt is the average GPU utilization of the model in period t; when a moving average model is not used, Ut = Vt; β is a weighting factor between 0 and 1, here set to 0.9.
The above formula can be expanded as follows:
Ut = (1 - β)Vt + (1 - β)βV(t-1) + (1 - β)β^2 V(t-2) + ... + (1 - β)β^(t-1) V1
Substituting the usage rate of each period from t down to 1 into the formula yields Ut, the moving average video memory usage over periods t through 1.
Further, the data information includes an average resource utilization rate, a number of used instances of each model, a maximum GPU utilization rate, and a minimum GPU utilization rate.
Further, the calculation formula for predicting the number of instances required in the next period through the optimal resource scheduling policy (LRU policy) is as follows:
Z = Zo × Ut / ((pmax + pmin) / 2)
wherein Z represents the number of instances the model requires in the next period, Ut represents the moving average video memory usage rate, Zo is the number of pods the model has used, pmax represents the maximum utilization, and pmin represents the minimum utilization.
The invention has the following beneficial effects: for the scenario of multiple models sharing a video memory, models are dynamically started and stopped through an LRU (least recently used) scheduling policy, solving the pain point of low utilization of shared video memory; that is, the video memory occupied by the models is allocated effectively, with fewer video memory resources given to models with low utilization and more to models with high utilization, improving video memory utilization and saving resources. Real-time monitoring through Glances improves the responsiveness of container switching, and the fast redis cache speeds up model switching.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description are obviously only some embodiments of the invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a flowchart of a model loading method for maximizing utilization of a graphics memory based on an LRU policy according to an embodiment of the present invention;
fig. 2 is a flowchart of a technical implementation of a model loading method for maximally increasing utilization of a video memory based on an LRU policy according to an embodiment of the present invention.
Detailed Description
For further explanation of the embodiments, reference is made to the accompanying drawings, which form part of the disclosure. The drawings illustrate embodiments and, together with the description, explain their principles of operation, enabling those of ordinary skill in the art to understand the embodiments and the advantages of the invention. The figures are not drawn to scale, and like reference numerals generally refer to like elements.
According to the embodiment of the invention, a model loading method for maximally improving the utilization rate of a video memory based on an LRU (least recently used) strategy is provided.
The present invention is further described below with reference to the drawings and the detailed description. As shown in fig. 1, the model loading method for maximally improving video memory utilization based on the LRU policy according to an embodiment of the present invention includes the following steps:
S1, constructing and deploying three models of face recognition, portrait comparison and human body analysis, and configuring instances;
wherein, step S1 includes the following steps:
S11, configuring the three model capabilities of face recognition, portrait comparison and human body analysis through an AI platform;
S12, configuring six elastically scalable instances for each of the three models of face recognition, portrait comparison and human body analysis;
S13, configuring the three models of face recognition, portrait comparison and human body analysis onto the same graphics card;
and S14, deploying and starting the three models of face recognition, portrait comparison and human body analysis through a container management platform (rancher).
S2, starting a timing task, acquiring the real-time utilization rate of the GPU in the time period every 10 minutes, and calculating the average GPU utilization rate in the time period;
wherein, step S2 includes the following steps:
S21, starting the timing task, and acquiring the real-time GPU resource utilization within the period through a resource monitoring tool (Glances) every 10 minutes;
further, step S21 includes the steps of:
s211, respectively obtaining the number of the pictures analyzed by the three models in a first time period and a second time period;
the number of pictures processed in 1-10 minutes in the face recognition model is C1: 12021, number of pictures processed in 10-20 minutes C2: 8782 sheets;
figure contrast model, number of pictures processed in 1-10 minutes C1: 49389, number of pictures processed in 10-20 min C2: 30287 sheets of paper;
human analytical model, number of pictures processed in 1-10 minutes C1: 120789 sheets, number of pictures processed in 10-20 minutes C2: 152573 pieces.
S212, respectively obtaining the number of the pictures analyzed by the three models in the first time period, the number of the pictures analyzed by the three models in the second time period and the maximum number of the pictures analyzed by the three models in 1 second, and calculating to obtain the GPU real-time resource utilization rate, wherein the calculation formula is as follows:
Figure 633169DEST_PATH_IMAGE001
wherein A represents the real-time resource utilization rate of the GPU, i, j are respectively a first time period and a second time period, and i>j,CiRepresenting the number of pictures, C, analyzed by the model during a first time periodjRepresenting the number of pictures j that the model analyzed during the second time period, and M representing the maximum number of pictures that the model can analyze in 1 second.
Furthermore, the maximum number of pictures M that can be processed per second is 50 for the face recognition model, 112 for the portrait comparison model, and 258 for the human body analysis model.
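Under the reading of the utilization formula given above (pictures actually processed over the two windows, divided by the maximum the model could have processed), a minimal sketch of this calculation follows; the class and method names are illustrative, not from the patent:
public class GpuUtilizationSketch {
    /**
     * Real-time GPU resource utilization A = (c1 + c2) / ((t1 + t2) * m),
     * where c1, c2 are the pictures analyzed in the two periods, t1, t2 the
     * period lengths in seconds, and m the maximum pictures per second.
     */
    static double realTimeUtilization(long c1, long c2, long t1, long t2, long m) {
        return (double) (c1 + c2) / ((t1 + t2) * m);
    }

    public static void main(String[] args) {
        // Human body analysis figures from the description: two 10-minute windows, M = 258
        double a = realTimeUtilization(120789, 152573, 600, 600, 258);
        System.out.printf("A = %.2f%%%n", a * 100); // about 88.3%, close to the 88.29% reported
    }
}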
S22, storing the acquired GPU real-time utilization rate for the scheduling of the following optimal resource scheduling policy (LRU);
s23, circularly obtaining data of a certain period of time from a remote dictionary service (redis) by an optimal resource scheduling policy (LRU) scheduling center, sampling the real-time utilization rate of the GPU in the period of time, and obtaining the average GPU utilization rate in the period of time through calculation, wherein the calculation formula is as follows:
Figure 138667DEST_PATH_IMAGE002
wherein the content of the first and second substances,
Figure 209391DEST_PATH_IMAGE003
the average GPU utilization rate is represented, I represents the sampling times of the real-time GPU utilization rate in a period of time, and J represents the number of model operation instances.
In addition, the average GPU resource utilization V of the face recognition model is 35.20%;
the average GPU resource utilization V of the portrait comparison model is 81.67%;
and the average GPU resource utilization V of the human body analysis model is 88.29%.
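A brief sketch of this averaging step, under the assumption (made explicit above) that the stored samples form an I × J array of per-instance utilizations; the layout and names are illustrative:
public class AverageUtilizationSketch {
    /** Mean of I × J sampled real-time utilizations (I sampling times, J instances). */
    static double average(double[][] samples) {
        double sum = 0;
        int count = 0;
        for (double[] perInstance : samples) { // I sampling times
            for (double a : perInstance) {     // J running instances
                sum += a;
                count++;
            }
        }
        return count == 0 ? 0 : sum / count;
    }
}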
S3, calculating the moving average video memory usage rate through scheduling by the optimal resource scheduling policy, where the calculation formula is:
Ut = β × U(t-1) + (1 - β) × Vt
wherein Ut is the moving average video memory usage rate of the model in period t and Vt is the average GPU utilization of the model in period t; when a moving average model is not used, Ut = Vt; β is a weighting factor between 0 and 1, here set to 0.9.
The above formula can be expanded as follows:
Ut = (1 - β)Vt + (1 - β)βV(t-1) + (1 - β)β^2 V(t-2) + ... + (1 - β)β^(t-1) V1
Substituting the usage rate of each period from t down to 1 into the formula yields Ut, the moving average video memory usage over periods t through 1.
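A minimal sketch of this exponentially weighted moving average, assuming β = 0.9 as stated above; the class and method names are illustrative:
public class MovingAverageUsage {
    private static final double BETA = 0.9; // weighting factor from the description
    private double u = Double.NaN;          // Ut; undefined until the first period arrives

    /** Feed the average GPU utilization Vt of the latest period and return the updated Ut. */
    public double update(double v) {
        // First period: no moving average yet, so Ut = Vt; afterwards Ut = B*U(t-1) + (1-B)*Vt
        u = Double.isNaN(u) ? v : BETA * u + (1 - BETA) * v;
        return u;
    }
}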
S4, predicting, according to the data information within the period, the number of instances required in the next period through the optimal resource scheduling policy (LRU policy);
the data information comprises the average resource utilization, the number of instances used by each model, the maximum GPU utilization and the minimum GPU utilization.
The calculation formula for predicting the number of instances required in the next period through the optimal resource scheduling policy (LRU policy) is:
Z = Zo × Ut / ((pmax + pmin) / 2)
wherein Z represents the number of instances the model requires in the next period, Ut represents the moving average video memory usage rate, Zo is the number of pods the model has used, pmax represents the maximum utilization, and pmin represents the minimum utilization.
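A sketch of this scaling decision. The exact formula appears only as an image in the original publication, so the midpoint-target rule below is an assumption consistent with the stated inputs (moving average usage, current pod count, and the utilization bounds):
public class InstancePredictorSketch {
    /** Predicted instance count Z for the next period (assumed midpoint-target rule). */
    static int predictInstances(int currentPods, double movingAvgUsage,
                                double pMin, double pMax) {
        double target = (pMin + pMax) / 2; // aim for the middle of the allowed utilization band
        return Math.max((int) Math.ceil(currentPods * movingAvgUsage / target), 0);
    }
}
With illustrative values Zo = 6 pods, Ut = 35.2% and a band of pmin = 40% to pmax = 80%, this yields ceil(6 × 0.352 / 0.6) = 4 instances, freeing video memory for busier models.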
S5, adjusting the number of the instances according to the number of the instances required by the model in the next period of time and the number of the instances used by the model;
and S6, finally realizing the maximization of the utilization rate of the video memory through an optimal resource scheduling strategy (LRU).
As shown in fig. 2, the method is further explained below through the following specific technical means and procedures:
and calling the Glances interface every 10 minutes through the timing task to obtain the video memory use condition of each model. Glanches can well monitor the use condition of the model video memory and provide an interface for real-time feedback to an application end.
The Glances return value is obtained and written into the redis cache. Java's LinkedHashMap can implement an LRU algorithm: it keeps its entries on a doubly linked list that is re-threaded whenever elements are inserted or accessed. By default LinkedHashMap orders entries by insertion; setting accessOrder to true orders them by access instead, so it behaves like a HashMap that tracks access order. Internally, the insertion and access ordering is maintained mainly by newNode, afterNodeAccess and afterNodeInsertion, which operate on the doubly linked list: newly inserted entries are linked at the tail, accessed entries are moved to the tail, and the least recently used entry therefore sits at the head, where it is evicted first.
The timing task obtains the video memory occupancy of each model from the LRU cache every minute, calls the rancher interface, and reduces the number of instances of, or even stops, the model whose video memory has been used least recently or least often, so as to achieve optimal video memory utilization. Rancher itself forms a set of container services comprising networking, storage, load balancing and DNS; these run on Linux, provide unified infrastructure services to the upper layers, and expose an interface and a UI through which containers can be managed very conveniently.
The monitoring task code is implemented as follows:
package com.iwhalecloud.aiFactory.aiinference;
import com.iwhalecloud.aiFactory.aiGateway.common.RancherUtil;
import com.iwhalecloud.aiFactory.aiGateway.common.interceptor.GpuUseInfo;
import com.iwhalecloud.aiFactory.aiResource.aiCmdb.host.vo.GpuData;
import com.iwhalecloud.aiFactory.aiinference.AirModelService;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import java.util.List;
/**
* @author zj
* @Description: periodically monitor the model video memory usage and start or stop models according to the video memory occupancy
* @since 2021/5/20 14:24
*/
public class LRUJob implements Job {
/**
* Periodically monitor the model video memory usage and start or stop models according to the video memory occupancy
**/
@Override
public void execute(JobExecutionContext context) throws JobExecutionException {
// 1. query all GPUs (video memory) in use
List<GpuData> gpuDataList = getGpuList();
for (GpuData gpuData : gpuDataList) {
// 2. query the list of models sharing the same video memory
List<AirModelService> airModelServiceList = getModelByGpu(gpuData);
for (AirModelService airModelService : airModelServiceList) {
// 3. call the Glances interface to query the model's video memory occupancy
GpuUseInfo gpuUseInfo = getModelGpuInfoByGlances(airModelService);
// 4. write the model's video memory occupancy into the redis cache
putModelGpuUseInfo(gpuData.getId().toString() + "-" + airModelService.getId().toString(), gpuUseInfo);
}
// 5. start or stop models according to their recent usage
dealModelByGpu(gpuData, airModelServiceList);
}
}
/**
* Start or stop models according to their recent usage
**/
private void dealModelByGpu(GpuData gpuData, List<AirModelService> airModelServiceList) {
for (AirModelService airModelService : airModelServiceList) {
if (!isStart(airModelService) && isLRUStart(gpuData, airModelService)) { // the model is stopped and the start condition is met
// 5.1 start the model
RancherUtil.start(airModelService);
}
else if (isStart(airModelService) && isLRUStop(gpuData, airModelService)) { // the model is running and the stop condition is met
// 5.2 stop the model
RancherUtil.stop(airModelService);
}
}
}
}
The Glances monitoring data and interfaces are shown in Table 1 (reproduced only as an image in the original publication).
Glances provides a monitoring data acquisition interface; the container video memory usage obtained by calling it is stored in the redis cache, providing data support for the subsequent LRU scheduling.
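As an illustration of this step only, the sketch below polls a Glances REST endpoint over HTTP and caches the raw JSON in redis through the Jedis client; the endpoint path, key name and expiry are assumptions rather than details taken from the patent:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import redis.clients.jedis.Jedis;

public class GlancesPollerSketch {
    private final HttpClient http = HttpClient.newHttpClient();

    /** Fetch GPU statistics from a Glances REST endpoint and cache them in redis. */
    public void pollOnce(String glancesBaseUrl, Jedis jedis) throws Exception {
        // Glances 3.x exposes plugin data under /api/3/<plugin>; a "gpu" plugin is assumed here
        HttpRequest req = HttpRequest.newBuilder(URI.create(glancesBaseUrl + "/api/3/gpu"))
                .GET()
                .build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        // Keep the sample for 20 minutes so that two 10-minute windows remain available
        jedis.setex("gpu:usage:latest", 1200, resp.body());
    }
}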
The LRU cache is implemented as follows:
package com.iwhalecloud.aiFactory.aiinference;
import java.util.LinkedHashMap;
import java.util.Map;
/**
* @author zj
* @Description: LRU cache
* @since 2021/5/20 15:11
*/
public class LRUCache {
private int cacheSize;
private LinkedHashMap<Integer,Integer> linkedHashMap;
public LRUCache(int capacity) {
this.cacheSize = capacity;
linkedHashMap = new LinkedHashMap<Integer,Integer>(capacity,0.75F,true){
@Override
protected boolean removeEldestEntry(Map.Entry eldest) {
return size()>cacheSize;
}
};
}
public int get(int key) {
return this.linkedHashMap.getOrDefault(key,-1);
}
public void put(int key,int value) {
this.linkedHashMap.put(key,value);
}
}
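A brief usage sketch of the cache above; the keys and values are illustrative (a model id mapped to its video memory usage):
public class LRUCacheDemo {
    public static void main(String[] args) {
        // Capacity 2: inserting a third entry evicts the least recently used key
        LRUCache cache = new LRUCache(2);
        cache.put(1, 35);
        cache.put(2, 81);
        cache.get(1);      // touching key 1 leaves key 2 as the eldest entry
        cache.put(3, 88);  // size exceeds capacity, so key 2 is evicted
        System.out.println(cache.get(2)); // -1: evicted
        System.out.println(cache.get(1)); // 35: still cached
    }
}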
The start-stop decision based on the video memory utilization, using the LRU policy cache, is implemented as follows:
package com.iwhalecloud.aiFactory.aiinference;
import com.iwhalecloud.aiFactory.aiinference.AirModelService;
public class RancherUtil {
// start the model
public static boolean start(AirModelService airModelService) {
// call the rancher interface to start the model
return startModelByRancher(airModelService);
}
// stop the model
public static boolean stop(AirModelService airModelService) {
// call the rancher interface to stop the model
return stopModelByRancher(airModelService);
}
}
In summary, with the above technical solution of the invention, for the scenario of multiple models sharing a video memory, models are dynamically started and stopped through the LRU scheduling policy, solving the pain point of low utilization of shared video memory: the video memory occupied by the models is allocated effectively, with fewer video memory resources given to models with low utilization and more to models with high utilization, thereby improving video memory utilization and saving resources. Real-time monitoring through Glances improves the responsiveness of container switching, and the fast redis cache speeds up model switching.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. The model loading method for maximizing and improving the utilization rate of the video memory based on the LRU policy, characterized by comprising the following steps:
constructing and deploying three models of face recognition, portrait comparison and human body analysis, and configuring instances;
starting a timing task, acquiring the real-time GPU utilization within the period every 10 minutes, and calculating the average GPU utilization over the period;
calculating the moving average video memory usage rate through scheduling by the optimal resource scheduling policy;
predicting, according to the data information within the period, the number of instances required in the next period through the optimal resource scheduling policy;
adjusting the number of instances according to the number of instances the model requires in the next period and the number of instances the model currently uses;
and finally maximizing the video memory utilization through the optimal resource scheduling policy.
2. The model loading method for maximally improving video memory utilization rate based on the LRU policy of claim 1, wherein constructing and deploying the three models of face recognition, portrait comparison and human body analysis and configuring instances comprises the following steps:
configuring the three model capabilities of face recognition, portrait comparison and human body analysis through an AI platform;
configuring six elastically scalable instances for each of the three models of face recognition, portrait comparison and human body analysis;
configuring the three models of face recognition, portrait comparison and human body analysis onto the same graphics card;
and deploying and starting the three models of face recognition, portrait comparison and human body analysis through a container management platform.
3. A model loading method for maximizing and improving video memory utilization rate based on LRU strategy as claimed in claim 2, wherein said starting the timing task to obtain the real-time GPU utilization rate in the time period every 10 minutes and calculating the average GPU utilization rate in the time period comprises the following steps:
starting a timing task, and acquiring the real-time resource utilization rate of the GPU in the period of time by a resource monitoring tool every 10 minutes;
storing the acquired GPU real-time utilization rate for scheduling and using a subsequent optimal resource scheduling strategy;
the optimal resource scheduling strategy scheduling center circularly obtains data of a certain period of time from the remote dictionary service, samples the real-time utilization rate of the GPU in the period of time, and obtains the average GPU utilization rate in the period of time through calculation.
4. A model loading method for maximally improving utilization rate of a video memory based on an LRU policy according to claim 3, wherein the step of obtaining real-time resource utilization rate of the GPU in the period of time by the resource monitoring tool every 10 minutes comprises the following steps:
respectively acquiring the number of pictures analyzed by the three models in a first time period and a second time period;
and respectively obtaining the number of the pictures analyzed by the three models in the first time period, the number of the pictures analyzed by the three models in the second time period and the maximum number of the pictures analyzed by the three models in 1 second, and calculating to obtain the GPU real-time resource utilization rate.
5. The model loading method for maximally improving video memory utilization rate based on the LRU policy according to claim 4, wherein the formula for calculating the real-time GPU resource utilization is as follows:
A = (Ci + Cj) / ((i + j) × M)
wherein A represents the real-time GPU resource utilization, i and j are the durations (in seconds) of the first and second time periods respectively, Ci represents the number of pictures the model analyzed during the first time period, Cj represents the number analyzed during the second time period, and M represents the maximum number of pictures the model can analyze in 1 second.
6. The model loading method for maximally improving video memory utilization rate based on the LRU policy according to claim 5, wherein the calculation formula for obtaining the average GPU utilization over the period is as follows:
V = ( Σ(i=1..I) Σ(j=1..J) A(i,j) ) / (I × J)
wherein V represents the average GPU utilization, I represents the number of times the real-time GPU utilization is sampled within the period, and J represents the number of running model instances.
7. The model loading method for maximally improving video memory utilization rate based on the LRU policy according to claim 6, wherein the formula for calculating the moving average video memory usage rate through scheduling by the optimal resource scheduling policy is as follows:
Ut = β × U(t-1) + (1 - β) × Vt
wherein Ut is the moving average video memory usage rate of the model in period t and Vt is the average GPU utilization of the model in period t; when a moving average model is not used, Ut = Vt; β is a weighting factor between 0 and 1, here set to 0.9.
The above formula can be expanded as follows:
Ut = (1 - β)Vt + (1 - β)βV(t-1) + (1 - β)β^2 V(t-2) + ... + (1 - β)β^(t-1) V1
Substituting the usage rate of each period from t down to 1 into the formula yields Ut, the moving average video memory usage over periods t through 1.
8. The model loading method for maximally improving video memory utilization rate based on the LRU policy according to claim 7, wherein the data information comprises the average resource utilization, the number of instances used by each model, the maximum GPU utilization and the minimum GPU utilization.
9. The model loading method for maximally improving video memory utilization rate based on the LRU policy according to claim 8, wherein the calculation formula for predicting the number of instances required in the next period through the optimal resource scheduling policy is as follows:
Z = Zo × Ut / ((pmax + pmin) / 2)
wherein Z represents the number of instances the model requires in the next period, Ut represents the moving average video memory usage rate, Zo is the number of pods the model has used, pmax represents the maximum utilization, and pmin represents the minimum utilization.
CN202111001401.5A 2021-08-30 2021-08-30 Model loading method for maximizing and improving video memory utilization rate based on LRU (least recently used) strategy Pending CN113674137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111001401.5A CN113674137A (en) 2021-08-30 2021-08-30 Model loading method for maximizing and improving video memory utilization rate based on LRU (least recently used) strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111001401.5A CN113674137A (en) 2021-08-30 2021-08-30 Model loading method for maximizing and improving video memory utilization rate based on LRU (least recently used) strategy

Publications (1)

Publication Number Publication Date
CN113674137A true CN113674137A (en) 2021-11-19

Family

ID=78547341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111001401.5A Pending CN113674137A (en) 2021-08-30 2021-08-30 Model loading method for maximizing and improving video memory utilization rate based on LRU (least recently used) strategy

Country Status (1)

Country Link
CN (1) CN113674137A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170195247A1 (en) * 2015-12-31 2017-07-06 EMC IP Holding Company LLC Method and apparatus for cloud system
CN111158908A (en) * 2019-12-27 2020-05-15 重庆紫光华山智安科技有限公司 Kubernetes-based scheduling method and device for improving GPU utilization rate
CN111506404A (en) * 2020-04-07 2020-08-07 上海德拓信息技术股份有限公司 Kubernetes-based shared GPU (graphics processing Unit) scheduling method
CN113051060A (en) * 2021-04-10 2021-06-29 作业帮教育科技(北京)有限公司 GPU dynamic scheduling method and device based on real-time load and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117687802A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform
CN117687802B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform

Similar Documents

Publication Publication Date Title
US10990540B2 (en) Memory management method and apparatus
US7665090B1 (en) System, method, and computer program product for group scheduling of computer resources
CN108848039B (en) Server, message distribution method and storage medium
US6442661B1 (en) Self-tuning memory management for computer systems
US8195798B2 (en) Application server scalability through runtime restrictions enforcement in a distributed application execution system
US8078574B1 (en) Network acceleration device cache supporting multiple historical versions of content
CN113674133B (en) GPU cluster shared video memory system, method, device and equipment
CN105512053B (en) The mirror cache method of mobile transparent computing system server end multi-user access
US9086920B2 (en) Device for managing data buffers in a memory space divided into a plurality of memory elements
EP1782205A2 (en) Autonomically tuning the virtual memory subsystem of a computer operating system
US6286088B1 (en) Memory management system and method for relocating memory
CN100361094C (en) Method for saving global varible internal memory space
US7904688B1 (en) Memory management unit for field programmable gate array boards
CN113674137A (en) Model loading method for maximizing and improving video memory utilization rate based on LRU (least recently used) strategy
CN111984425A (en) Memory management method, device and equipment for operating system
CN108038062B (en) Memory management method and device of embedded system
US6631446B1 (en) Self-tuning buffer management
CN111857992A (en) Thread resource allocation method and device in Radosgw module
US6807588B2 (en) Method and apparatus for maintaining order in a queue by combining entry weights and queue weights
CN117271137A (en) Multithreading data slicing parallel method
US20090019097A1 (en) System and method for memory allocation management
CN114327862B (en) Memory allocation method and device, electronic equipment and storage medium
CN107924363A (en) Use the automated storing device management of memory management unit
CN117435343A (en) Memory management method and device
CN109408412B (en) Memory prefetch control method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211119

RJ01 Rejection of invention patent application after publication