CN117435350B - Method, device, terminal and storage medium for running algorithm model - Google Patents

Method, device, terminal and storage medium for running algorithm model

Info

Publication number
CN117435350B
CN117435350B (application CN202311750295.XA)
Authority
CN
China
Prior art keywords
target
processor unit
algorithm model
running
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311750295.XA
Other languages
Chinese (zh)
Other versions
CN117435350A (en)
Inventor
徐士立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311750295.XA priority Critical patent/CN117435350B/en
Publication of CN117435350A publication Critical patent/CN117435350A/en
Application granted granted Critical
Publication of CN117435350B publication Critical patent/CN117435350B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5022Workload threshold
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/504Resource capping
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a method, an apparatus, a terminal and a storage medium for running an algorithm model, applied to the field of terminal technology. The method comprises: acquiring the current running states of a plurality of processor units in the terminal, and calculating, according to those current running states, the expected time consumption of running a target algorithm model through each processor unit; the target processor unit is then determined based on the expected time consumption. In this way, the target algorithm model is accurately and intuitively matched to a processor unit that can process it efficiently, and resource conflicts between the target algorithm model and other applications of the terminal can be avoided. In the scheme provided by the application, the terminal performs dynamic scheduling according to the actual running state of each processor unit and completes the running of the algorithm model through the cooperation of multiple processor units, which improves the running efficiency of the model at the terminal, ensures that the algorithm model runs efficiently at the terminal, and avoids resource conflicts between the algorithm model and other applications of the terminal.

Description

Method, device, terminal and storage medium for running algorithm model
Technical Field
The embodiment of the application relates to the technical field of terminals, in particular to an algorithm model operation method, an algorithm model operation device, a terminal and a computer readable storage medium.
Background
With the increasing computing power of terminal artificial intelligence (AI) chips and the widening application of algorithm models (such as large models for AI-generated content (AIGC)), running large models on the terminal side helps improve user experience. However, the running environment of the terminal is complex: each processor unit of the terminal may be occupied by other applications or by the operating system, which hinders the running of the algorithm model on the terminal. Therefore, the related art needs a scheme that guarantees efficient running of the algorithm model at the terminal.
Disclosure of Invention
The application provides an algorithm model operation method, an algorithm model operation device, a terminal and a computer readable storage medium, which can ensure that the algorithm model operates at the terminal with high efficiency.
In a first aspect, the present application provides a method for operating an algorithm model, applied to a terminal, where the method includes: acquiring the current running states of a plurality of processor units in the terminal; calculating first expected time consumption for running the target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units; and determining a target processor unit from the plurality of processor units for running the target algorithm model according to the first expected time consumption.
According to the operation method of the algorithm model, dynamic scheduling is carried out according to the actual operation states of the processor units, and the operation of the algorithm model is completed through cooperation of the processor units, so that the operation efficiency of the model at the terminal is improved, the high-efficiency operation of the algorithm model at the terminal can be further guaranteed, and meanwhile resource conflicts between the algorithm model and other applications of the terminal can be avoided.
In a second aspect, the present application provides an apparatus for running an algorithm model, configured in a terminal, where the apparatus includes: a state acquisition module, a time-consuming calculation module, and a unit determination module;
the state acquisition module is used for acquiring the current running states of the plurality of processor units in the terminal; the time consumption calculation module is used for calculating first expected time consumption for running the target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units; and the unit determining module is used for determining a target processor unit from the plurality of processor units according to the first expected time consumption, so as to run the target algorithm model.
In some embodiments, based on the above scheme, the running device of the algorithm model includes: a limit value determining module and a model running module;
Wherein the limit value determining module is configured to: after the unit determining module determines the target processor unit from the plurality of processor units according to the expected time consumption, determine that the resource limit corresponding to the target processor unit is a first upper limit value; and the model running module is configured to: control the amount of resources provided by the target processor unit to the target algorithm model so that it does not exceed the first upper limit value while the target processor unit runs the target algorithm model.
In some embodiments, based on the above scheme, the running device of the algorithm model further includes: a state acquisition module; the state acquisition module is used for: acquiring the running states of the plurality of processor units; the limit value determining module is used for: adjusting the resource limit corresponding to the target processor unit to a second upper limit value when the load rate of the target processor unit decreases by more than a first threshold value, and the model running module is further configured to: continue to run the target algorithm model through the target processor unit; the second upper limit value is greater than the first upper limit value, and the load rate is the load rate of the target processor unit converted to its highest operating frequency.
In some embodiments, based on the above scheme, the above model running module is further configured to: increase the resource occupation of the target algorithm model on the target processor unit when the load rate of the target processor unit decreases by more than the first threshold value.
In some embodiments, based on the above scheme, the above limit determination module is further configured to: after the state acquisition module acquires the running states of the plurality of processor units, adjust the resource limit corresponding to the target processor unit to a third upper limit value if the load rate of the target processor unit becomes larger than a second threshold value; wherein the third upper limit value is smaller than the first upper limit value.
In some embodiments, based on the above scheme, the time-consuming calculation module is further configured to: after the limit determining module adjusts the resource limit corresponding to the target processor unit to the third upper limit value, calculate a second expected time consumption for running the target algorithm model through the target processor unit; and, if the second expected time consumption does not meet a preset condition, calculate third expected time consumptions for running the target algorithm model through the plurality of other processor units according to their current running states; the model running module is further configured to: run the target algorithm model through a replacement processor unit if such a unit exists among the plurality of other processor units; wherein the replacement processor unit is a processor unit whose third expected time consumption satisfies the preset condition.
In some embodiments, based on the above scheme, the above model running module is further configured to: continue to run the target algorithm model through the target processor unit and generate early-warning information if no replacement processor unit exists among the plurality of other processor units.
In some embodiments, based on the above scheme, the above model running module is further configured to: continue to run the target algorithm model through the target processor unit if the second expected time consumption meets the preset condition.
In some embodiments, based on the above scheme, the above unit determining module is further configured to: after the state acquisition module acquires the running states of the plurality of processor units, determine another processor unit as a selectable processor unit when that unit's load rate has decreased by more than the first threshold value; the model running module is further configured to: run the target algorithm model through the selectable processor unit if the second expected time consumption does not meet the preset condition.
In some embodiments, based on the above scheme, the above model running module is further configured to: after the state acquisition module acquires the running states of the plurality of processor units, increase the amount of the target algorithm model's intermediate-state data buffered at the terminal if the available memory of the target processor unit becomes larger than a third threshold value.
In some embodiments, based on the above scheme, the running device of the algorithm model further includes: a request module and a cache module; the request module is used for: before the time-consuming calculation module calculates the expected time consumption of running the target model through the plurality of processor units according to their current running states, sending a policy acquisition request to a server so that the server determines a target policy according to the identity of the terminal and the identity of the target algorithm model contained in the policy acquisition request; the cache module is used for: caching the target policy at the terminal; and the time-consuming calculation module is specifically configured to: calculate the first expected time consumption for running the target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units and the target policy; wherein the target policy includes: historical running records of running the target algorithm model through an ith processor unit, each historical running record including: a historical progress, a historical load rate corresponding to the historical progress, and a historical time consumption from that historical progress to completion of the target model.
In some embodiments, based on the foregoing, the current state of the ith processor unit includes the current load rate of the ith processor unit; the time-consuming calculation module includes: a judging submodule and a time-consuming determination submodule;
the judging submodule is used for: determining whether a reference history record exists among the historical running records of the ith processor unit according to the current progress of the target algorithm model and the current load rate of the ith processor unit; the time-consuming determination submodule is used for: determining, when a reference history record exists, the time consumption for the ith processor unit to complete the target model based on the historical time consumption in that reference history record; the difference between the historical progress in the reference history record and the current progress is smaller than a fourth threshold value, and the difference between the historical load rate in the reference history record and the current load rate is smaller than a fifth threshold value.
In some embodiments, based on the foregoing, each of the foregoing historical running records further includes: a time-consuming reliability corresponding to the historical progress; the time-consuming calculation module includes: a screening submodule; the screening submodule is used for: screening multiple target history records out of the historical running records of the ith processor unit according to the time-consuming reliability corresponding to the historical progress when no reference history record exists; and the time-consuming determination submodule is used for: determining the time consumption for the ith processor unit to complete the target model according to the historical time consumption contained in each of the multiple target history records; the difference between the historical load rate contained in a target history record and the current load rate is smaller than a second preset value and larger than a first preset value, and the time-consuming reliability contained in the target history record meets a preset condition.
According to the running device of the algorithm model, dynamic scheduling is carried out according to the actual running states of the processor units, and running of the algorithm model is completed through cooperation of the processor units, so that running efficiency of the model at a terminal is improved, efficient running of the algorithm model at the terminal can be guaranteed, and resource conflicts between the algorithm model and other applications of the terminal can be avoided.
In a third aspect, a terminal is provided that includes a processor and a memory. The memory is used for storing a computer program, and the processor is used for calling and running the computer program stored in the memory to execute the method provided by the first aspect.
In a fourth aspect, a chip is provided for implementing the method in any one of the first aspects or each implementation thereof. Specifically, the chip includes: a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method as provided in the first aspect above.
In a fifth aspect, a computer-readable storage medium is provided for storing a computer program that causes a computer to perform the method provided in the first aspect above.
In a sixth aspect, there is provided a computer program product comprising computer program instructions for causing a computer to perform the method provided in the first aspect above.
In a seventh aspect, there is provided a computer program which, when run on a computer, causes the computer to perform the method provided by the first aspect described above.
In summary, in the solution provided in the embodiments of the present application, the current running states of a plurality of processor units in a terminal are obtained, and the expected time consumption of running a target algorithm model through each of the plurality of processor units is calculated according to those current running states. Taking a neural-network processing unit (NPU) as an example of a processor unit of the terminal, the embodiment of the present application obtains the current running state of the NPU and calculates, according to that state, the time required to run the target algorithm model through the NPU. In this way, a processor unit capable of efficient processing can be accurately and intuitively matched to the target algorithm model, and resource conflicts between the target algorithm model and other applications of the terminal can be avoided. Further, in the embodiment of the application, during the running of the target algorithm model, the processor unit capable of running it efficiently may be re-determined continuously by executing the above process multiple times; for example, the time-consumption calculation may determine that running through the NPU is more efficient early in the run and that running through the graphics processing unit (GPU) is more efficient later in the run, in which case the terminal coordinates the NPU and the GPU to complete the running of the algorithm model at the terminal. Therefore, in the scheme provided by the embodiment of the application, the terminal performs dynamic scheduling according to the actual running state of each processor unit and completes the running of the algorithm model through the cooperation of multiple processor units, which improves the running efficiency of the model at the terminal, ensures efficient running of the algorithm model at the terminal, and avoids resource conflicts between the algorithm model and other applications of the terminal.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an application scenario schematic diagram of an operation scheme of an algorithm model provided in an embodiment of the present application;
FIG. 2 is a flow chart of an algorithm model operation method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of information interaction providing a method for determining expected time consumption according to an embodiment of the present application;
FIG. 4 is a flow chart of a method for determining expected time consumption according to an embodiment of the present application;
FIG. 5 is a flow chart of a method of operating an algorithm model according to another embodiment of the present application;
FIG. 6 is a schematic diagram of initializing interactions when running a large language model task according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an initial scheduling process of resources when running a task of a large language model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a resource rescheduling procedure in a process of running a large language model according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of an apparatus for running an algorithm model provided by an embodiment of the present application;
fig. 10 is a schematic block diagram of a terminal provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or otherwise described herein. In the present embodiment, "B corresponding to a" means that B is associated with a. In one implementation, B may be determined from a. It should also be understood that determining B from a does not mean determining B from a alone, but may also determine B from a and/or other information. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. In the description of the present application, unless otherwise indicated, "a plurality" means two or more than two.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function, which works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by software, hardware (such as a processing circuit or a memory), or a combination thereof. Likewise, a processor (or multiple processors or memories) may be used to implement one or more modules or units. In addition, each module or unit may be part of an overall module or unit that includes the functionality of that module or unit.
The terminal includes various processor units such as a central processing unit (CPU), an NPU, a GPU, and the like. Inevitably, the operating system of the terminal, and possibly one or more running applications, occupy the resources of the terminal's different processor units. Therefore, when the terminal runs an algorithm model, resource utilization conflicts arise.
In the solutions provided by the related art, terminal resource scheduling is generally performed according to a default policy. For example, the terminal system may assign an algorithm model to a fixed processor unit. Under such a policy, image-processing-related algorithm models are all allocated to the GPU; if there are currently several image-processing-related algorithm models, they are all still allocated to the GPU according to the policy, even though the terminal's NPU is meanwhile in an idle state.
As can be seen from the above description, in the solution provided by the related art, an algorithm model may be assigned to a relatively inefficient processor unit or to a heavily loaded processor unit, which is detrimental to the running efficiency of the algorithm model; at the same time, the resources of the terminal's processor units are not fully utilized.
The running scheme of the algorithm model provided by the embodiments of the present application can solve these technical problems. Specifically, in the scheme provided by the embodiments of the application, the current running states of a plurality of processor units in the terminal are obtained, and the expected time consumption of running the target algorithm model through each of the plurality of processor units is calculated according to those current running states. By determining the expected time consumption of each processor unit, it can be accurately and intuitively determined which processor unit would currently run the target algorithm model most efficiently, and resource conflicts between the target algorithm model and other applications of the terminal can be avoided. Further, during the running of the target algorithm model, the processor unit capable of running it efficiently can be re-determined continuously by executing the above process multiple times. Therefore, in the scheme provided by the embodiments of the application, the terminal performs dynamic scheduling according to the actual load and other conditions of each processor unit and completes the running of the algorithm model through the cooperation of multiple processor units, which improves the running efficiency of the model at the terminal, guarantees efficient running of the algorithm model at the terminal, avoids resource conflicts between the algorithm model and other applications of the terminal, and helps make full use of terminal resources.
The application scenario of the embodiment of the present application is described below:
Fig. 1 is a schematic view of an application scenario of the running scheme of the algorithm model according to an embodiment of the present application. As shown in fig. 1, the embodiment of the present application takes a game as an example of the front-end task. Specifically, the terminal 102 exchanges information with the server 104, so that an algorithm model can run on the terminal 102. In some embodiments, an application dedicated to model processing may run on the terminal (e.g., an application for training or pre-training a model, or an application that fine-tunes a model after pre-training); in other embodiments, the algorithm model may be a functional module in another application (e.g., a game application), for example a trained large language model such as an AIGC model that runs while the game is running to perform intelligent content generation.
Through the scheme provided by the embodiment of the application, the terminal 102 obtains the current running states of a plurality of processor units, such as the load rate of the CPU, the load rate of the NPU, the available memory of the GPU, and the like. Further, the terminal 102 calculates the expected time consumption of running the target algorithm model through the processor unit based on the current running state of the processor unit. Thus, the terminal 102 may determine a target processor unit for running the target algorithm model based on the expected time consumption corresponding to each processor unit. The technical scheme provided by the embodiment of the application can adapt the target algorithm model to the target processor unit capable of efficiently running, and the terminal dynamically schedules according to the actual load condition of each calculation module, so that the efficient calculation of the algorithm model is facilitated while the normal running of the system is not affected.
The terminal 102 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent audio/video interaction device, a smart speaker, a smart watch, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, a wearable intelligent device, a medical device, or the like, but is not limited thereto. The terminal 102 is often configured with a display device, such as a display, a display screen, or a touch screen (also called a touch panel). The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 102 and the server 104 may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment.
It should be noted that, the application scenario of the embodiment of the present application includes, but is not limited to, the scenario shown in fig. 1.
The following describes the technical solutions of the embodiments of the present application in detail through some embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 2 is a flowchart of an operation method P200 of the algorithm model provided in the embodiment of the present application. The execution body of the method P200 is a terminal, such as the terminal 102 in fig. 1. In this embodiment, the task executed in the background of the terminal is an algorithm model task. Referring to fig. 2, the method P200 includes: s210 to S230.
In S210, the current operation states of the plurality of processor units in the terminal are acquired.
The above-mentioned processor unit refers to a module having computing power in the terminal. Such as the CPU, GPU, and NPU deployed by the terminal. The operating states of the processor unit include, but are not limited to, a load rate of the processor unit, available memory, and the like.
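To make S210 concrete, the following minimal Python sketch shows one way the per-unit running state could be represented and collected; the ProcessorState fields and the system interface (system.load_of, system.free_mem_of, and so on) are assumptions for illustration, not an interface defined by this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProcessorState:
    """Snapshot of one processor unit's current running state (S210)."""
    unit: str            # e.g. "CPU", "GPU", "NPU"
    load_rate: int       # current load rate, in parts per ten thousand
    free_mem_mb: int     # currently available memory, in MB
    freq_mhz: int        # current operating frequency, in MHz
    max_freq_mhz: int    # highest operating frequency, in MHz

def collect_states(system) -> List[ProcessorState]:
    """Poll every processor unit through a hypothetical system interface."""
    return [
        ProcessorState(u.name, system.load_of(u), system.free_mem_of(u),
                       system.freq_of(u), u.max_freq)
        for u in system.processor_units()
    ]
```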
In S220, according to the current operation states of the plurality of processor units, a first expected time consumption for running the target algorithm model through the plurality of processor units is calculated.
In the embodiment of the present application, the algorithm model may specifically include a machine learning (ML) model, a deep learning model, and the like; for example, it may be a large language model, an AIGC model, and so on. The target algorithm model may be any of these algorithm models. In the embodiment of the application, running the target algorithm model at the terminal may be a pre-training task of the model, a fine-tuning task after pre-training, a model application task running in the background of the terminal after training, and the like. For example, running a trained AIGC model on the user terminal lets the user generate content on demand anytime and anywhere, which helps improve user experience. The tasks related to the algorithm model are computation tasks involving the model and are not limited to the specific tasks listed above.
Fig. 3 is a schematic information interaction diagram of a determination method for expected time consumption according to an embodiment of the present application.
Referring to fig. 3, when the terminal 102 runs a task related to the above-described target algorithm model, S30 is performed: sending a policy acquisition request to the server. The request contains the identity of the terminal and the identity of the target algorithm model.
In S32, the server 104 determines a target policy according to the identity of the terminal and the identity of the target algorithm model included in the policy acquisition request. The server 104, illustratively, determines the processor unit disposed at the terminal, e.g., including processor unit a-processor unit d, based on the identity of the terminal (e.g., the handset model). Further, the server 104 may obtain a history of the operation of the algorithm model X on the processor unit a-processor unit d according to the identity X of the target algorithm model.
Exemplarily, the target policy includes: historical running records of running the target algorithm model through an ith processor unit, each historical running record including: a historical progress, a historical load rate corresponding to the historical progress, and a historical time consumption from that historical progress to completion of the target model. For example, processor unit a processes algorithm model X with a current progress of 20%, a current load rate of 4541 parts per ten thousand (i.e., 45.41%), a historical time consumption of 100 milliseconds, and so on.
For example, to determine the target policy, historical operating information of the target algorithm model may be collected. By way of example, table 1 shows a table of historical operating information that may be collected about a target algorithm model.
TABLE 1
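The table body is not reproduced in this text. As an illustration only, the following Python sketch models one historical running record using the field names discussed in the description (model, unit, unit_load, mem, process, cost, cost_rate); the class name and the sample values are assumptions drawn loosely from the examples in this description.

```python
from dataclasses import dataclass

@dataclass
class HistoryRecord:
    """One historical running record, using the field names mentioned in the text."""
    model: str        # identity of the algorithm model
    unit: str         # processor unit that ran it, e.g. "NPU"
    unit_load: int    # load rate scaled to the highest operating frequency (per ten thousand)
    mem: int          # available memory space at the time, in MB
    process: int      # progress when the record was taken, in percent
    cost: int         # time from this progress to completion of the model, in ms
    cost_rate: float  # share of the total time consumed by this stage

# Illustrative values only.
record = HistoryRecord(model="X", unit="NPU", unit_load=4541,
                       mem=500, process=20, cost=100, cost_rate=0.0943)
```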
1. Regarding "unit_load" in table 1:
The terminal or the server can directly acquire, through the system interface, the actual load rate of the processor unit at its actual operating frequency. However, during actual operation of the terminal, the processor unit may not run at its highest frequency because of the system's temperature-control policy; therefore, after the current operating frequency and load rate of the processor unit are obtained, the terminal or the server needs to convert them into the load at the highest operating frequency. This establishes a common comparison standard across the terminal's processor units and helps accurately determine the target processor unit that processes the algorithm model efficiently. Specifically, by maintaining information such as that in Table 2, the load rate of each processor unit converted to its highest operating frequency can be obtained.
TABLE 2
The "load" in table 2 indicates the available load ratio of the algorithm model at the processor unit, for example, the processor unit may carry 100 units of load, wherein the system and the user mode application occupy 70 units of load, and the algorithm model may theoretically be three thousand (i.e., thirty percent) ten thousand of the available load of the algorithm model at the processor unit. In table 2 "max_load" represents a considerable value of the available load duty cycle, i.e. the available load duty cycle s1 at the actual operating frequency f1 of the processor unit, scaled to the available load duty cycle of the processor unit at the highest operating frequency fmax, can be expressed as: (fmax×s1)/f 1. In the same way, the load conditions of all the processor units of the terminal are converted into the equivalent values under the highest operation frequency, so that the load capacity between the processor units is favorably provided as a reference, and the target processor unit of the high-efficiency operation algorithm model is favorably and accurately determined.
It will be appreciated that the "load" of the processor unit at the actual operating frequency is converted to the particular embodiment of the model user load ratio "max_load" at the highest operating frequency in Table 2. The same applies to converting the load factor of the processor unit at the actual operating frequency to the load factor "unit_load" at the highest operating frequency. Also, either the model available load ratio "max_load" or the load ratio "unit_load" may be used to characterize how busy the processor unit is running. Specifically, the higher the load rate "unit_load", the smaller the model available load ratio "max_load", and the more busy the processor unit running state; conversely, the lower the load rate "unit_load", the greater the model available load ratio "max_load", the more idle the processor unit operating state. Therefore, in the embodiments described below, the load rate "unit_load" or the model available load ratio "max_load" may be used to measure the busyness of the processor unit, and further different operations may be performed according to the load situation of the processor unit.
In the embodiment of the application, ratios are expressed in parts per ten thousand, so that the values can be represented exactly as integers (int), which reduces data size while preserving accuracy.
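As an illustration of the conversion described above, the following Python sketch scales an available load ratio measured at the actual operating frequency to the highest operating frequency using the formula stated above, (fmax × s1)/f1, and stores ratios as integer parts per ten thousand; the function names and the sample frequencies are assumptions.

```python
def to_parts_per_ten_thousand(ratio: float) -> int:
    """Represent a ratio as an integer number of parts per ten thousand."""
    return round(ratio * 10000)

def max_load(s1: int, f1: int, fmax: int) -> int:
    """Convert the available load ratio s1 (parts per ten thousand), measured at the
    actual operating frequency f1, into "max_load" at the highest operating
    frequency fmax, applying the formula given above: (fmax * s1) / f1."""
    return round(fmax * s1 / f1)

# Illustrative numbers only: 30% available at 2.0 GHz, highest frequency 2.4 GHz.
print(max_load(to_parts_per_ten_thousand(0.30), 2000, 2400))  # -> 3600
```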
2. Regarding "cost" and "cost_rate" in table 1:
"cost" represents the time taken by the model from the current progress "process" to completion of the model calculation. If "process" is 20%, then "cost" represents the length of time required for the model to complete the model calculation from 20% of the current progress.
Specifically, while actually running the algorithm model, the terminal may report its progress to the server multiple times. Since the available system resources may differ at each report, the time consumption "cost" cannot be calculated directly. In the embodiment of the application, the server or the terminal may use a conversion to improve the accuracy of the determined "cost". The calculation is illustrated by the following example.
Suppose the system resources changed 4 times while algorithm model X was running, giving the following 5 stages:
progress 0-10%: unit: NPU; load: 1000; mem: 500 MB; time: 500 ms;
progress 10-30%: unit: NPU; load: 5000; mem: 600 MB; time: 200 ms;
progress 30-35%: unit: NPU; load: 2000; mem: 700 MB; time: 300 ms;
progress 35-60%: unit: NPU; load: 3000; mem: 800 MB; time: 400 ms;
progress 60-100%: unit: NPU; load: 500; mem: 900 MB; time: 4000 ms;
For example, consider the time required for model X to go from 0% progress to completing the model calculation. The run is divided into the 5 stages above. The first stage's 500 ms is taken at its actual value. In the second stage the available load becomes higher and the available memory also becomes higher; since memory already meets the minimum running condition (i.e., memory space is not a core influencing factor), the following conversion can be performed based on the processor unit load:
the completion time of the second stage is converted from 200 ms into: 200 ms × 5000/1000 = 1000 ms;
the completion time of the third stage is converted from 300 ms into: 300 ms × 2000/1000 = 600 ms;
the completion time of the fourth stage is converted from 400 ms into: 400 ms × 3000/1000 = 1200 ms;
the completion time of the fifth stage is converted from 4000 ms into: 4000 ms × 500/1000 = 2000 ms;
so the time required for the model to go from 0% progress to completing the calculation is: 500 ms + 1000 ms + 600 ms + 1200 ms + 2000 ms = 5300 ms.
The "cost_rate" can then be expressed as: at a "load" of 1000, the time-consumption rate of the first stage is 500 ms / 5300 ms = 9.43%.
Tables 1 and 2 above present an embodiment of determining the target policy. The directly collected information is reported to the server by different terminals during model running and is thereby acquired by the server. The conversion described above may be performed by either the server or the terminal, which is not limited in this embodiment. In the embodiment shown in fig. 3, the server 104 performs the information collection and conversion to determine the target policy, and the terminal 102 obtains the target policy directly from the server 104, which reduces the amount of computation on the terminal 102.
In S34, the terminal 102 caches the target policy locally. In S36, the first expected time consumption for running the target algorithm model through the plurality of processor units is calculated according to the current running states of the plurality of processor units and the target policy.
FIG. 4 is a flow chart of a method for determining expected time consumption according to an embodiment of the present application. Referring to fig. 4, as a specific embodiment of S220, the method includes: S2202-S2208.
In S2202, it is determined, according to the current progress of the target algorithm model and the current load rate of the ith processor unit, whether a reference history record exists among the historical running records of the ith processor unit.
If N processor units exist in the terminal 102, the i-th processor unit is taken as an example in this embodiment. Wherein the value of N is an integer greater than 1, and the value of i is a positive integer not greater than N.
1) All the historical running records about the ith processor unit in the target strategy are expressed as a historical running record set A;
2) Screening the historical operation record set A according to the current progress of the target algorithm model and the current load rate of the ith processor unit, and judging whether the reference historical record can be screened out in the historical operation record set A; the difference value between the historical progress in the reference historical record and the current progress is smaller than a fourth threshold value, and the difference value between the historical load rate in the reference historical record and the current load rate is smaller than a fifth threshold value.
For example, suppose the fourth threshold and the fifth threshold are both 5%, the current progress of the target algorithm model is 0, and the current load rate of the ith processor unit is x. It is then judged whether a reference history record Y can be screened out of the historical running record set A, where the historical progress in Y is not more than 5% and the historical load rate is within x ± 5%; if such a record exists, it is determined that the reference history record Y can be screened out of set A.
It should be noted that the current load rate of the ith processor unit is also the load rate converted to the ith processor unit's highest operating frequency; using the same comparison standard improves the accuracy of the determined target processor unit.
In S2204, in the case where a reference history exists, an expected time consumption for completing the target model by the i-th processor unit is determined according to the historical time consumption in the reference history.
If a reference history record exists in the target policy, the target policy already contains historical data in which the ith processor unit processed the algorithm model at a progress similar to the current one, i.e., data that can be referenced; therefore the historical time consumption contained in the reference history record can be directly taken as the expected time consumption for the ith processor unit to complete the target model. Determining the expected time consumption of the ith processor unit from the reference history record in this way is quick, convenient, and saves computation.
In S2206, if no reference history record exists, multiple target history records are screened out of the historical running records of the ith processor unit according to the time-consuming reliability corresponding to the historical progress.
In an exemplary embodiment, if no reference history record exists in the target policy, the target policy contains no historical data in which the ith processor unit processed the algorithm model at a similar progress. In that case, the embodiment of the application calculates the expected time consumption based on the time-consuming reliability contained in the historical running records.
A historical running record contains the fields: model, unit, unit_load, mem, process and cost. Currently, "model" is known to be the target algorithm model, "process" may be 0, and "unit" is given as the NPU, so the data range can be narrowed directly through these three fields, yielding a data set over the remaining fields [unit_load, mem, cost]. For the field "mem" (available memory space), all records larger than the current free memory space are preferentially selected, further narrowing the data set. As described above, there is a gap between the historical load rate in these records and the current load rate. In the embodiment of the present application, the records with the closest data are determined as follows:
Target history records are determined within the screened data set. A target history record is one whose historical load rate differs from the current load rate by less than a second preset value and more than a first preset value, the first preset value being the above fifth threshold; at the same time, the time-consuming reliability contained in the target history record meets a preset condition. For example, the preset condition may be: being among the 100 records with the highest time-consuming reliability.
In S2208, the expected time consumption for the ith processor unit to complete the target model is determined according to the historical time consumption contained in each of the multiple target history records.
In the embodiment of the application, the historical time consumption in the multiple target history records is statistically aggregated, and the statistical result is taken as the expected time consumption for the ith processor unit to complete the target model. For example, the 100 records with the highest time-consuming reliability are averaged over the cost (computation time) field, and the average is taken as the expected time consumption for the ith processor unit to complete the above-described target model.
The expected time consumption of each processor unit in the terminal to run the target algorithm model described above, respectively, can be determined by the method shown in fig. 4 from the historical running records contained in the target policy described above.
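A compressed Python sketch of the Fig. 4 flow is given below. The threshold values, the record key names, and the simplifications (the mem filter and the first/second preset value range on the load rate are omitted) are assumptions for illustration; only the overall branching follows the description above: use a close reference history record if one exists, otherwise average the cost of the most reliable records.

```python
FOURTH_THRESHOLD = 500   # max progress gap, parts per ten thousand (5%); illustrative
FIFTH_THRESHOLD = 500    # max load-rate gap, parts per ten thousand (5%); illustrative
TOP_K = 100              # number of most-reliable records averaged in the fallback

def expected_cost(records, progress, load_rate):
    """Estimate how long the i-th processor unit needs to finish the target model.

    `records` holds this unit's historical running records as dicts with keys
    "process", "unit_load", "cost" and "reliability" (key names are assumptions).
    """
    # S2202/S2204: use a reference history record if one is close enough.
    for r in records:
        if (abs(r["process"] - progress) < FOURTH_THRESHOLD and
                abs(r["unit_load"] - load_rate) < FIFTH_THRESHOLD):
            return r["cost"]

    # S2206/S2208: otherwise average the cost of the most reliable records
    # (the additional mem and load-range screening described above is omitted).
    best = sorted(records, key=lambda r: r["reliability"], reverse=True)[:TOP_K]
    return sum(r["cost"] for r in best) / len(best) if best else None
```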
With continued reference to FIG. 2, in S230, a target processor unit is determined from the plurality of processor units for running the target algorithm model according to the first expected time consumption.
In this embodiment of the present application, the processor unit with the smallest first expected time consumption is determined as the target processor unit, and the target algorithm model is run through that target processor unit, which helps improve the overall running efficiency of the model.
In an exemplary embodiment, if at least two processor units have equal expected time consumption, the processor unit occupying the least memory space may be selected as the target processor unit, saving memory while maintaining the model's running efficiency. If at least two processor units have both the same expected time consumption and the same occupied memory space, the target processor unit may be determined by processor unit type, for example selecting the NPU, the GPU and the CPU in that order.
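The selection and tie-breaking rules of S230 can be summarized in a short Python sketch; the tuple layout and the numeric example are assumptions for illustration, while the ordering (least expected time, then least occupied memory, then NPU, GPU, CPU) follows the description above.

```python
UNIT_PRIORITY = {"NPU": 0, "GPU": 1, "CPU": 2}  # preference order given above

def pick_target_unit(candidates):
    """Pick the target processor unit per S230.

    `candidates` is a list of (unit_type, expected_cost_ms, occupied_mem_mb) tuples.
    Order of preference: least expected time, then least occupied memory,
    then unit type (NPU, GPU, CPU).
    """
    return min(candidates,
               key=lambda c: (c[1], c[2], UNIT_PRIORITY.get(c[0], 99)))

# Illustrative numbers: two units tie on expected time, so memory decides.
print(pick_target_unit([("NPU", 500, 700), ("GPU", 500, 600), ("CPU", 900, 300)]))
# -> ('GPU', 500, 600)
```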
In the solution provided in embodiment P200 of the present application, current running states of a plurality of processor units in a terminal are obtained, and expected time consumption for running a target algorithm model through the plurality of processor units is calculated according to the current running states of the plurality of processor units. Therefore, the processor unit capable of efficiently processing can be accurately and intuitively adapted to the target algorithm model, and resource conflict between the target algorithm model and other applications of the terminal can be avoided. Further, in the running process of the target algorithm model in the embodiment of the application, the processor unit capable of running the target algorithm model with high efficiency can be continuously determined by executing the above process multiple times. Therefore, in the scheme provided by the embodiment of the application, the terminal performs dynamic scheduling according to the actual running state of each processor unit, and the running of the algorithm model is completed by cooperating with a plurality of processor units, so that the running efficiency of the model at the terminal is improved, the high-efficiency running of the algorithm model at the terminal can be ensured, and meanwhile, the resource conflict between the algorithm model and other applications of the terminal can be avoided.
Fig. 5 is a flowchart of an operation method P400 of an algorithm model according to another embodiment of the present application.
The specific embodiments of S410-S430 are identical to the specific embodiments of S210-S230, and will not be described herein.
In S440, it is determined that the resource limit corresponding to the target processor unit is the first upper limit value. In S450, the amount of resources that the target processor unit provides to the target algorithm model is controlled so as not to exceed the first upper limit value while the target processor unit runs the target algorithm model.
While the terminal runs the target algorithm model, it inevitably also runs other front-end applications. To avoid degrading the terminal user's experience as much as possible, the priority of the front-end applications can be set higher than that of the algorithm model, reducing as much as possible the impact of the resources occupied by the algorithm model on front-end applications, such as stutter in front-end video playback. In this embodiment of the present application, an upper limit value is set for the core resources of the target processor unit. The core resources may include load occupancy, memory, and the like; for example, the load occupancy of the target algorithm model on the target processor unit may be limited to less than 20%, and the target algorithm model's usage of the memory corresponding to the target processor unit may be limited to less than 1 GB. Thus, while the target processor unit runs the target algorithm model, the terminal can ensure that the amount of resources the target processor unit provides to the target algorithm model does not exceed the first upper limit value, minimizing adverse effects on front-end applications and preserving the user's experience of the terminal.
In an exemplary embodiment, if the currently set limit value does not satisfy the front-end applications, the limit value may be adjusted according to the front-end applications' resource occupation of the target processor unit, so that the model's running does not preempt front-end application resources. Specifically, while the target algorithm model is running, the terminal system notifies the client responsible for model running whenever it determines that the front-end applications' resource demand on the target processor unit has changed, and the client dynamically adjusts the upper limit of the resource limit according to the notification. For example, if the client responsible for model running has set the memory upper limit of the target processor unit to 800 MB and, during model running, receives a message from the terminal system that a front-end application needs more of the target processor unit's memory (e.g., 200 MB), the upper limit may be lowered to 600 MB. Conversely, if the client has set the memory upper limit of the target processor unit to 800 MB and receives a message from the terminal system that the free memory corresponding to the target processor unit has reached 1200 MB (including the 800 MB allocated to the model calculation), the memory upper limit may be raised to 1200 MB so that more data can be cached, speeding up the running of the algorithm model.
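A minimal Python sketch of this dynamic cap adjustment is shown below, reproducing the 800 MB to 600 MB and 1200 MB examples above; the class, its method names and the callback style are assumptions, since the application does not prescribe a concrete interface.

```python
class ModelResourceCap:
    """Memory cap for the model task on the target processor unit (a sketch of the
    dynamic adjustment above; method names and the callback model are assumptions)."""

    def __init__(self, mem_limit_mb: int = 800):
        self.mem_limit_mb = mem_limit_mb

    def on_foreground_needs(self, extra_mb: int) -> None:
        # The terminal system reports that a front-end application needs more
        # memory of the target processor unit: shrink the model's cap.
        self.mem_limit_mb -= extra_mb

    def on_free_memory(self, free_mb: int) -> None:
        # The terminal system reports grown free memory (including what is already
        # allocated to the model): raise the cap so more data can be cached.
        if free_mb > self.mem_limit_mb:
            self.mem_limit_mb = free_mb

cap = ModelResourceCap(800)
cap.on_foreground_needs(200)
print(cap.mem_limit_mb)   # 600, matching the 800 MB -> 600 MB example
cap.on_free_memory(1200)
print(cap.mem_limit_mb)   # 1200, matching the 1200 MB example
```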
In S460, the following is executed: the operating states of the plurality of processor units are acquired.
During the operation of the algorithm model, the operating state of each processor unit changes dynamically. Therefore, in this embodiment of the present application, the operating state of each processor unit is continuously acquired, so that the operation of the algorithm model is completed by dynamically coordinating the plurality of processor units, and the algorithm model runs efficiently without preempting front-end application resources.
In this embodiment of the present application, S460 may be executed periodically during the operation of the target algorithm model to determine the operating state of each processor unit, and the resource allocation or the upper limit value is then adjusted according to the latest operating state, so that the operating efficiency of the model is maximized while the front-end application experience is not affected.
In S470, it is determined whether the load rate of the target processor unit has changed by a large amplitude.
It will be appreciated that the load rate can be expressed as the ratio of the actual load to the capacity of the device, where the actual load refers to the load actually borne over a period of time, and the capacity of the device refers to the maximum load the device can bear. Because the target processor unit is responsible for running the target algorithm model, a change in its load has a large impact on the operating efficiency of the model. Specifically, when the load rate of the target processor unit changes by a small amplitude, the algorithm model can keep running with the current parameters; when the load rate of the target processor unit changes by a large amplitude, different operations need to be executed depending on whether the load rate has increased or decreased.
It should be noted that, in addition to determining whether the load rate of the target processor unit changes, it may also be determined whether the available load rate of the target processor unit changes, where the available load rate may be understood as the ratio between the load the processor unit can still take on, apart from the load imposed by the model, and the capacity of the device.
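For illustration only, the two ratios can be sketched as below; the reading of the available load rate (capacity minus the load not coming from the model, divided by capacity) is an assumption based on the description above, and all names are hypothetical.

```python
# Illustrative sketch only: load rate and one possible reading of the available load rate.
def load_rate(actual_load: float, capacity: float) -> float:
    return actual_load / capacity  # ratio of the actual load to the device capacity

def available_load_rate(actual_load: float, model_load: float, capacity: float) -> float:
    # capacity left over once the load not coming from the model is accounted for
    return max(0.0, capacity - (actual_load - model_load)) / capacity
```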
If it is determined in S470 that the load rate of the target processor unit has become smaller by more than a first threshold value, it indicates that the resources of the target processor unit are relatively abundant, or that the amount of resources available for running the target algorithm model in the target processor unit has increased, so S480 is executed: the resource limit value corresponding to the target processor unit is adjusted to a second upper limit value, where the second upper limit value is greater than the first upper limit value.
When it is determined that the resources of the target processor unit are relatively abundant, the upper limit value of the resource limit corresponding to the target processor unit is raised, so that more computing resources are provided to the target algorithm model and the operating efficiency of the model is improved.
Illustratively, in S480, in addition to adjusting the upper limit value, the terminal increases the amount of resources of the target processor unit occupied by the target algorithm model. By actively increasing the resource occupation of the target algorithm model on the target processor unit, the terminal further improves the operating efficiency of the model. For example, where the target processor unit is a CPU, additional threads may be allocated to the target algorithm model.
With continued reference to fig. 5, if it is determined in S470 that the load rate of the target processor unit has become larger by more than a second threshold value, it indicates that the resources of the target processor unit are relatively scarce, or that the amount of resources available for running the target algorithm model in the target processor unit has decreased, so S490 is executed: the resource limit value corresponding to the target processor unit is adjusted to a third upper limit value, where the third upper limit value is smaller than the first upper limit value.
When it is determined that the resources of the target processor unit are relatively scarce, the upper limit value of the resource limit corresponding to the target processor unit is lowered, so that the occupation of the computing resources of the target processor unit by the target algorithm model is restricted and resource contention between the target algorithm model and the front-end applications is avoided.
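The two branches of S480 and S490 can be summarized by the following sketch; the thresholds and limit values are parameters here and the function itself is hypothetical, intended only to restate the logic above.

```python
# Illustrative sketch only: adjust the upper limit value according to the load-rate change.
def adjust_upper_limit(first_limit: float, load_drop: float, load_rise: float,
                       first_threshold: float, second_threshold: float,
                       second_limit: float, third_limit: float) -> float:
    if load_drop > first_threshold:      # S480: resources became abundant
        return second_limit              # second upper limit value > first upper limit value
    if load_rise > second_threshold:     # S490: resources became scarce
        return third_limit               # third upper limit value < first upper limit value
    return first_limit                   # small change: keep running with the current parameters
```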
As the amount of resources available to the target algorithm model decreases, the expected time for running the algorithm model through the target processor unit increases. In order to avoid adverse effects caused by an excessively long running time, this embodiment of the present application further executes S4100: a second expected time consumption for running the target algorithm model through the target processor unit is calculated.
The specific implementation of determining the expected time consumption has been described in detail in the foregoing embodiments and is not repeated here.
In S4110, it is determined whether the second expected time consumption satisfies a preset condition. For example, the preset run time of the algorithm model is 90 milliseconds and the time already executed is 40 milliseconds. If the second expected time consumption is 80 milliseconds, continuing to run the algorithm model through the target processor unit would make the total time consumption exceed the preset run time, that is, the second expected time consumption does not satisfy the preset condition. If the second expected time consumption is 45 milliseconds, continuing to run the algorithm model through the target processor unit keeps the total time consumption within the preset run time, that is, the second expected time consumption satisfies the preset condition.
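A minimal sketch of this check, using the 90/40/80/45 millisecond figures from the example above (the function name is hypothetical):

```python
# Illustrative sketch only: does the remaining expected time still fit the preset run time?
def meets_deadline(elapsed_ms: float, expected_remaining_ms: float, preset_run_time_ms: float) -> bool:
    return elapsed_ms + expected_remaining_ms <= preset_run_time_ms

print(meets_deadline(40, 80, 90))  # False: 120 ms would exceed the 90 ms budget
print(meets_deadline(40, 45, 90))  # True: 85 ms stays within the 90 ms budget
```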
If the second expected time consumption satisfies the preset condition, it indicates that continuing to run the target algorithm model through the current target processor unit still meets the overall preset run time of the model, and S4150 is executed: the target algorithm model continues to be run through the target processor unit.
If the second expected time consumption does not satisfy the preset condition, it indicates that continuing to run the target algorithm model through the current target processor unit will not meet the overall preset run time of the model, and S4120 is executed: third expected time consumptions for running the target algorithm model through the plurality of other processor units are calculated according to the current operating states of the plurality of other processor units.
The specific implementation of determining the expected time consumption has been described in detail in the foregoing embodiments and is not repeated here.
In S4130, it is determined whether there is an alternative processor unit among the other processor units according to the third expected time consumption.
The alternative processor unit is a processor unit whose third expected time consumption satisfies the preset condition. For example, the preset run time of the algorithm model is 300 milliseconds and the time already executed is 100 milliseconds. If the third expected time consumption of running the algorithm model through processor unit A is 250 milliseconds, running the algorithm model through processor unit A would make the total time consumption exceed the preset run time, that is, the third expected time consumption of processor unit A does not satisfy the preset condition, so processor unit A cannot be determined as an alternative processor unit. If the third expected time consumption of running the algorithm model through processor unit B is 150 milliseconds, the total time consumption will be smaller than the preset run time, that is, the third expected time consumption of processor unit B satisfies the preset condition, so processor unit B may be determined as the alternative processor unit.
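For illustration, the selection in S4130 could be sketched as follows, reusing the 300/100/250/150 millisecond figures above; choosing the fastest qualifying unit is an assumption, since the embodiment only requires that the preset condition be satisfied.

```python
# Illustrative sketch only: pick an alternative processor unit whose third expected
# time consumption still fits the preset run time. Names are hypothetical.
from typing import Optional

def pick_alternative(elapsed_ms: float, preset_run_time_ms: float,
                     third_expected_ms: dict[str, float]) -> Optional[str]:
    candidates = {unit: t for unit, t in third_expected_ms.items()
                  if elapsed_ms + t <= preset_run_time_ms}
    return min(candidates, key=candidates.get) if candidates else None

print(pick_alternative(100, 300, {"A": 250, "B": 150}))  # "B" qualifies, "A" does not
```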
Illustratively, in the case where the above-described alternative processor unit exists among the other processor units other than the above-described target processor unit, S4140 is executed: the target algorithm model is run through the alternative processor unit.
Illustratively, in the case where the above-described alternative processor unit does not exist among the other processor units other than the above-described target processor unit, S4150 is executed: the target algorithm model continues to be run through the target processor unit. In addition, early warning information may also be generated.
It can be seen that, in the embodiments of the present application, if it is determined that the resources the target processor unit can provide to the target algorithm model have decreased and will cause the total time consumption to exceed the longest run time allowed for the model, the expected time consumptions of the other processor units are calculated. If another processor unit can finish on time, the terminal schedules the algorithm model to that processor unit; if no other processor unit can finish on time, the terminal keeps the computing task running on the current processor unit and at the same time issues an early warning to the system and the user, so that the user can release more system resources by intervening, thereby shortening the running time of the model as much as possible.
In an exemplary embodiment, after the operating states of the respective processor units of the terminal are acquired through S460, if the load rate of another processor unit has become smaller by more than the first threshold value, that processor unit may be determined as an optional processor unit. Thus, in some embodiments, if it is determined through S4110 that the second expected time consumption of continuing through the target processor unit does not satisfy the preset condition, the target algorithm model may also be run through the optional processor unit.
It can be seen that, in this embodiment of the present application, a load change of a processor unit not occupied by the target algorithm model (i.e., another processor unit) does not directly affect the operation of the computing task. However, if the load of the processor unit where the target algorithm model is located (i.e., the target processor unit) is too high and the computation cannot be completed within the specified time, the terminal recalculates whether another processor unit with more idle resources (i.e., an optional computing resource) can complete the computation faster, and if so, reschedules the computing task of the target algorithm model to that optional computing resource. In this way, the operation of the target algorithm model is completed by coordinating multiple computing resources of the terminal, and terminal resources are fully utilized while the operating efficiency of the model is improved.
In an exemplary embodiment, after the operating states of the respective processor units of the terminal are acquired through S460, if the available memory of the target processor unit has become larger by more than a third threshold value, the amount of intermediate-state data of the target algorithm model cached in the terminal is increased.
If the memory space of the target processor unit is sufficient, the computing task related to the algorithm model can locally cache more intermediate-state data, which reduces the time consumed in reloading data from external storage; conversely, if the memory space of the target processor unit is insufficient, the computing task related to the algorithm model cannot cache intermediate-state data and has to load data from external storage frequently, which prolongs the overall computation time. Therefore, when the available memory of the target processor unit has become larger by more than the third threshold value, the amount of intermediate-state data of the target algorithm model cached in the terminal is increased, and the operating efficiency of the model is improved.
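As a hypothetical sketch of this behaviour (the threshold, step size, and names are illustrative, not taken from the embodiments):

```python
# Illustrative sketch only: grow the intermediate-state data cache when the target
# unit's available memory has increased by more than the third threshold.
def adjust_cache_size(cache_mb: int, memory_gain_mb: int, third_threshold_mb: int,
                      step_mb: int = 256) -> int:
    if memory_gain_mb > third_threshold_mb:
        return cache_mb + step_mb   # cache more intermediate-state data, fewer reloads from external storage
    return cache_mb
```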
In the solution provided in embodiment P400 of the present application, the operating states of the processor units of the terminal are continuously acquired during the operation of the target algorithm model. The upper limit value of the computing resources of the target processor unit is dynamically adjusted according to the change in the load rate of the target processor unit, so that as many resources as possible are provided to the target algorithm model without the running algorithm model preempting front-end application resources, which improves the operating efficiency of the model. Meanwhile, when continuing to run the target algorithm model on the target processor unit cannot meet the preset requirement, whether an alternative processor unit exists is determined according to the current operating states of the other processor units. In addition, an optional processor unit may exist where the load rate of another processor unit has become smaller. Therefore, in the solution provided in this embodiment of the present application, the terminal performs dynamic scheduling according to the actual operating state of each processor unit and completes the operation of the algorithm model by coordinating the plurality of processor units, which improves the operating efficiency of the model at the terminal, ensures that the algorithm model runs efficiently at the terminal, avoids resource conflicts between the algorithm model and other applications of the terminal, and makes full use of terminal resources.
The method for scheduling terminal resources provided in the embodiment of the present application is described above in its entirety. The method for running the large language model in the terminal provided in the embodiment of the present application is further described below through the embodiments provided in fig. 6 to 8.
FIG. 6 is a schematic diagram of initialization interactions performed during execution of a large language model task according to an embodiment of the present application. Referring to fig. 6, the embodiment shown in fig. 6 includes information interaction between the terminal 102 and the server 104.
In S1, an application client related to the large language model is started; the application client may be an application for model training and fine-tuning, or a game application in which a trained large language model, such as an AIGC model, is run during the game to generate content intelligently in the course of play.
In S2, the terminal 102 running the application client requests the server 104 to issue the target policy after the client is started.
In S3 and S4, after receiving the request, the server 104 reads the scheduling policy from the storage module, determines a target policy according to the identity of the terminal and the identity of the model, and further issues the target policy to the terminal 102.
In S5, the terminal 102 caches the received target policy locally for use in the subsequent running of the large language model. Meanwhile, the terminal 102 performs S6: a request for the scaling relation table used to normalize the processor unit load is sent to the server 104. For example, when normalizing the load of a processor unit, the conversion of the available load duty cycle s1 at the actual operating frequency f1 into the available load duty cycle at the highest operating frequency fmax may be expressed as (fmax × s1)/f1, where fmax/f1 may be regarded as the scaling relation information about the load rate. The terminal 102 multiplies the scaling relation information fmax/f1 by the actual load rate s1 at the current operating frequency f1, thereby normalizing the current load rate of the processor unit.
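A minimal sketch of this normalization, following the conversion stated above (the 2 GHz / 3 GHz figures in the example call are assumptions used only for illustration):

```python
# Illustrative sketch only: scale the load figure s1 observed at the actual operating
# frequency f1 by fmax/f1 to express it at the highest operating frequency fmax.
def normalize_load(s1: float, f1: float, fmax: float) -> float:
    return s1 * (fmax / f1)   # (fmax × s1) / f1, with fmax/f1 as the scaling relation information

print(normalize_load(0.30, 2.0e9, 3.0e9))  # 0.45, assuming 2 GHz actual and 3 GHz highest frequency
```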
In S7 and S8, after receiving the processor unit load conversion table request sent by the terminal client, the server 104 queries the storage module and returns the result to the terminal client.
In S9, the client of the terminal 102 receives the data returned by the server and caches the performance correspondence locally.
In S10, after the data caching is completed, the initialization flow ends and the terminal waits for the large language model to start.
Fig. 7 is a schematic diagram of an initial scheduling flow of resources when a large language model task is executed according to an embodiment of the present application. Referring to fig. 7, the embodiment shown in fig. 7 includes information interaction between an application client and an operating system within the terminal 102.
In S1, in response to starting a calculation task regarding the large language model, the application client obtains the current running state of each processor unit of the terminal from the operating system.
In S2, after obtaining the current running states of the respective processor units, the terminal operating system returns them to the application client.
In S3, the application client performs normalization processing on the load information of each processor unit according to the conversion relation table.
In S4, the application client determines, according to the normalized information, whether the terminal has a target processor unit capable of running the large language model, that is, whether there is a suitable processor unit, and selects a processor unit that can complete the calculation within the prescribed time. The specific embodiments are described in detail in the foregoing embodiments and are not repeated here.
If there is no target processor unit currently, S5 is executed: abnormality information is generated and the flow ends.
If a target processor unit currently exists, the initial parameters required for running the large model are acquired in preparation for starting the calculation of the large model, and S6 is performed: the corresponding resource upper limit of the target processor unit is set; specifically, the memory space and the processor unit load upper limit are set according to the idle resource condition of the terminal obtained in the previous step. After completing the preparation, the client reports the related information to the terminal operating system.
In S7, the terminal operating system runs the large language model through the target processor unit. Meanwhile, S8 is also performed: the operating states of the processor units are continuously acquired while the model is running, and if the operating state of the target processor unit or of another processor unit changes greatly, the related information is sent to the application client.
In S9, the resource upper limit is dynamically adjusted or an alternative processor unit is determined based on the latest operating state of each processor unit. A related embodiment is described in the embodiment shown in fig. 8.
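The initial scheduling of fig. 7 can be summarized, purely as a non-limiting sketch, by the following routine; the state dictionary, scaling table, and estimator callback are hypothetical stand-ins for the data described in S1-S6.

```python
# Illustrative sketch only: normalize each unit's load, estimate the first expected
# time consumption, and pick a unit that can finish within the prescribed time.
from typing import Callable, Optional

def select_target_unit(states: dict[str, dict], scaling: dict[str, float],
                       estimate_ms: Callable[[str, float], float],
                       prescribed_ms: float) -> Optional[str]:
    expected = {}
    for unit, state in states.items():
        normalized_load = state["load"] * scaling[unit]        # per the cached conversion table
        expected[unit] = estimate_ms(unit, normalized_load)    # first expected time consumption
    feasible = {u: t for u, t in expected.items() if t <= prescribed_ms}
    return min(feasible, key=feasible.get) if feasible else None  # None -> generate abnormality information
```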
Fig. 8 is a schematic diagram of a resource rescheduling procedure in a process of running a large language model according to an embodiment of the present application. Referring to fig. 8, the embodiment shown in fig. 8 includes information interaction between an application client and an operating system within the terminal 102.
In S1, the operating system of the terminal 102 determines resource change events; the terminal system monitors core resource change events, which include changes in the load and the memory space of each processor unit, and notifies the application client if the change amplitude is large.
In S2, an application client performs normalization processing on system state data; the specific embodiments are described in detail in the foregoing embodiments, and are not repeated herein.
In S3, the application client updates the state data; that is, it updates the system state data for subsequent reading and processing by the timing module.
In S4, the application client starts checking the system resource status at regular intervals.
In S5, the application client checks for changes in system resources against the data stored in S3.
In S6, if the target processor unit has not changed or has changed only slightly, it is determined whether the large language model has completed its operation; if not, the system resource change condition continues to be acquired.
In S7 and S8, if the large language model has completed the calculation, the application client stops acquiring system resource changes at regular intervals and also notifies the terminal system to stop determining resource change events.
In S9, if the terminal system has released more resources, the resource upper limit corresponding to the target processor unit is raised and the terminal system is notified so that the adjustment takes effect.
In S10, when the available core resources of the terminal system decrease, the application client determines whether the computing task of the large language model can still be completed on the target processor unit within the prescribed time.
In S11, if the computing task cannot be completed on time on the target processor unit, the application client calculates the expected completion time on each of the other processor units.
In S12, it is confirmed from the calculation result of S11 whether another processor unit can complete the calculation within the prescribed time, that is, whether an alternative processor unit exists among the other processor units.
In S13, if no alternative processor unit exists, that is, none of the other processor units can complete the calculation within the prescribed time, the computing task of the large language model keeps running on the current processor unit; at the same time, an early warning is given to the user that the computing task cannot be completed on time, and the user decides whether to take further measures (for example, stopping other applications to release more resources, or manually scheduling the computing task to another terminal or a cloud computing pool for execution). Further, after the user takes such measures, the resource upper limit of the computing task can be adjusted according to the latest idle system resources to speed up the operation of the large language model.
In S14, if an alternative processor unit exists, the computing task of the large language model is scheduled from the target processor unit to the alternative processor unit.
In S15, the terminal system runs the large language model on the alternative processor unit and notifies the application client to continue determining the system resource change condition at regular intervals.
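Steps S9-S15 amount to a small decision procedure, sketched below for illustration only (the branch labels are hypothetical names, not terms from the embodiments):

```python
# Illustrative sketch only: the rescheduling decision of fig. 8.
from typing import Optional

def on_resource_change(resources_freed: bool, can_finish_on_target: bool,
                       alternative_unit: Optional[str]) -> str:
    if resources_freed:
        return "raise-upper-limit"                  # S9: more resources released by the system
    if can_finish_on_target:
        return "keep-running"                       # the prescribed time is still reachable
    if alternative_unit is not None:
        return f"reschedule-to-{alternative_unit}"  # S14: schedule the task to the alternative unit
    return "keep-running-and-warn-user"             # S13: early warning, the user may intervene
```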
According to the technical solution described above, while ensuring that the user's front-end applications have sufficient system resources, the idle computing power of each processor unit of the terminal, such as the CPU, the GPU, and the NPU (neural-network processing unit), can be fully utilized; the most efficient processor unit is selected according to the actual idle computing power of each processor unit and its degree of adaptation to the large model, and all computing units of the terminal are coordinated to ensure that the large language model finishes running at the terminal as quickly as possible.
An embodiment of a method of operating the algorithm model of the present application is described in detail above in connection with fig. 1 to 8, and an embodiment of the apparatus of the present application is described in detail below in connection with fig. 9.
Fig. 9 is a schematic block diagram of an operation device 900 of an algorithm model provided in an embodiment of the present application. The apparatus 900 is configured at a terminal. As shown in fig. 9, the running apparatus 900 of the algorithm model includes: a state acquisition module 910, a time consuming calculation module 920, and a unit determination module 930;
The state obtaining module 910 is configured to obtain current running states of the plurality of processor units in the terminal; the time consumption calculation module 920, configured to calculate first expected time consumption for running the target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units; and the unit determining module 930 configured to determine a target processor unit from the plurality of processor units for running the target algorithm model according to the first expected time consumption.
In some embodiments, based on the above scheme, the running device 900 of the algorithm model includes: the limit value determining module and the model running module;
The limit value determining module is configured to determine, after the unit determining module 930 determines a target processor unit from the plurality of processor units according to the expected time consumption, that the resource limit corresponding to the target processor unit is a first upper limit value; and the model running module is configured to control the amount of resources provided by the target processor unit to the target algorithm model not to exceed the first upper limit value while the target processor unit runs the target algorithm model.
In some embodiments, based on the foregoing solution, the running apparatus 900 of the algorithm model further includes: a status acquisition module 910;
the state acquisition module 910 is configured to: acquiring the running states of the plurality of processor units; the limit value determining module is used for: and when the load factor of the target processor unit becomes smaller by more than a first threshold value, adjusting the resource limit value corresponding to the target processor unit to be a second upper limit value, and the model running module is further configured to: continuing to run the target algorithm model through the target processor unit; the second upper limit value is greater than the first upper limit value, and the load factor is converted to the load factor of the target processor unit at the highest operating frequency.
In some embodiments, based on the above scheme, the above model running module is further configured to: and when the load factor of the target processor unit becomes smaller by more than a first threshold value, increasing the resource occupation amount of the target algorithm model on the target processor unit.
In some embodiments, based on the above scheme, the above limit determination module is further configured to: after the state obtaining module 910 obtains the operation states of the plurality of processor units, if the load factor of the target processor unit becomes greater than a second threshold value, the resource limit value corresponding to the target processor unit is adjusted to a third upper limit value; wherein the third upper limit value is smaller than the first upper limit value.
In some embodiments, based on the above scheme, the time-consuming calculation module 920 is further configured to: after the limit determining module adjusts the resource limit value corresponding to the target processor unit to be a third upper limit value, calculating a second expected time consumption for running the target algorithm model through the target processor unit; and calculating third expected time consumption for running the target algorithm model through the plurality of other processor units according to the current running states of the plurality of other processor units under the condition that the second expected time consumption does not meet the preset condition; the model operation module is further used for: running the target algorithm model by the replacement processor unit if there is a replacement processor unit among the plurality of other processor units; wherein the alternative processor unit is a processor unit that satisfies the predetermined condition for the third expected time consumption.
In some embodiments, based on the above scheme, the above model running module is further configured to: and continuing to run the target algorithm model through the target processor unit and generating early warning information in the case that the alternative processor unit is not present in the plurality of other processor units.
In some embodiments, based on the above scheme, the above model running module is further configured to: and if the second expected time consumption meets the preset condition, continuing to run the target algorithm model through the target processor unit.
In some embodiments, based on the above scheme, the unit determining module 930 is further configured to: after the state acquiring module 910 acquires the operation states of the plurality of processor units, if the load factor of the other processor unit becomes smaller by more than the first threshold value, the other processor unit is determined as an optional processor unit; the model operation module is also used for: and running, by the optional processor unit, the target algorithm model if the second expected time consumption does not meet the preset condition.
In some embodiments, based on the above scheme, the above model running module is further configured to: after the state acquisition module 910 acquires the operating states of the plurality of processor units, if the available memory amount of the target processor unit becomes larger by more than a third threshold value, increase the amount of intermediate-state data of the target algorithm model cached in the terminal.
In some embodiments, based on the foregoing solution, the running apparatus 900 of the algorithm model further includes: a request module and a cache module; the request module is used for: before the time-consuming calculation module 920 calculates the expected time consumption of running the target model through the plurality of processor units according to their current running states, sending a policy acquisition request to a server so that the server determines a target policy according to the identity of the terminal and the identity of the target algorithm model contained in the policy acquisition request; the cache module is used for: caching the target policy at the terminal; and the time-consuming calculation module 920 is specifically configured to: calculate the first expected time consumption of running the target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units and the target policy; wherein the target policy includes historical running records of running the target algorithm model through an i-th processor unit, each of the historical running records including: a historical progress, a historical load rate corresponding to the historical progress, and a historical time consumption from the historical progress to completion of the target model.
In some embodiments, based on the foregoing, the current state of the i-th processor unit includes the current load rate of the i-th processor unit; the time-consuming calculation module 920 includes: a judging submodule and a time-consumption determining submodule;
the judging submodule is used for: determining, according to the current progress of the target algorithm model and the current load rate of the i-th processor unit, whether a reference history record exists among the historical reference running records of the i-th processor unit; the time-consumption determining submodule is used for: determining, in the presence of a reference history record, the time consumption for completing the target model through the i-th processor unit based on the historical time consumption in the reference history record; wherein the difference between the historical progress in the reference history record and the current progress is smaller than a fourth threshold value, and the difference between the historical load rate in the reference history record and the current load rate is smaller than a fifth threshold value.
In some embodiments, based on the foregoing, each of the historical running records further includes: a time-consumption credibility corresponding to the historical progress; the time-consuming calculation module 920 includes a screening submodule; the screening submodule is used for: in the absence of a reference history record, screening out a plurality of target history records from the historical reference running records of the i-th processor unit according to the time-consumption credibility corresponding to the historical progress; and the time-consumption determining submodule is used for: determining the time consumption for completing the target model through the i-th processor unit according to the historical time consumption respectively contained in the plurality of target history records; wherein the difference between the historical load rate contained in a target history record and the current load rate is smaller than a second preset value and larger than a first preset value, and the time-consumption credibility contained in the target history record satisfies a preset condition.
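As a hypothetical sketch of this two-stage estimate (the field names, thresholds, and the choice to average the fallback records are assumptions made only for illustration):

```python
# Illustrative sketch only: estimate the time consumption from historical running records.
from typing import Optional

def estimate_time(records: list[dict], progress: float, load_rate: float,
                  fourth_th: float, fifth_th: float, min_credibility: float) -> Optional[float]:
    for r in records:  # reference record: close to the current progress and load rate
        if abs(r["progress"] - progress) < fourth_th and abs(r["load_rate"] - load_rate) < fifth_th:
            return r["time_ms"]
    targets = [r for r in records if r["credibility"] >= min_credibility]  # fallback: credible records
    if targets:
        return sum(r["time_ms"] for r in targets) / len(targets)  # e.g. average their historical time consumption
    return None
```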
It should be understood that the apparatus embodiments and the method embodiments of the operation of the algorithm model may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the operation device of the algorithm model shown in fig. 9 may execute the embodiments of the operation method of the algorithm model, and the foregoing and other operations and/or functions of each module in the operation device of the algorithm model are respectively for implementing the corresponding method embodiments, which are not described herein again for brevity.
The apparatus of the embodiments of the present application are described above in terms of functional modules in conjunction with the accompanying drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in software form, and the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 10 is a schematic block diagram of a terminal provided in an embodiment of the present application, where the terminal of fig. 10 may be used to execute the operation method of the algorithm model described above.
As shown in fig. 10, the terminal 1000 may include:
a memory 1010 and a processor 1020, the memory 1010 being configured to store a computer program 1030 and to transmit the computer program 1030 to the processor 1020. In other words, the processor 1020 may invoke and run the computer program 1030 from the memory 1010 to implement the methods in the embodiments of the present application.
For example, the processor 1020 may be configured to perform the steps of the methods described above in accordance with instructions in the computer program 1030.
In some embodiments of the present application, the processor 1020 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the present application, the memory 1010 includes, but is not limited to:
volatile memory and/or nonvolatile memory. The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable EPROM (EEPROM), or a flash Memory. The volatile memory may be random access memory (Random Access Memory, RAM) which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (Double Data Rate SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and Direct memory bus RAM (DR RAM).
In some embodiments of the present application, the computer program 1030 may be partitioned into one or more modules that are stored in the memory 1010 and executed by the processor 1020 to perform the methods of operating the algorithm model of the present application. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are included to describe the execution of the computer program 1030 in the terminal.
As shown in fig. 10, the terminal 1000 may further include:
a transceiver 1040, the transceiver 1040 being connectable to the processor 1020 or the memory 1010.
The processor 1020 may control the transceiver 1040 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. The transceiver 1040 may include a transmitter and a receiver. The transceiver 1040 may further include antennas, the number of which may be one or more.
It should be appreciated that the various components in terminal 1000 can be connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
According to an aspect of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the method of the above-described method embodiments.
In other words, the foregoing may be implemented in whole or in part in the form of a computer program product when implemented in software. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for operating an algorithm model, the method being applied to a terminal, the method comprising:
acquiring the current running states of a plurality of processor units in the terminal;
Calculating first expected time consumption for running the target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units;
determining a target processor unit from the plurality of processor units for running the target algorithm model according to the first expected time consumption;
after said determining a target processor unit from said plurality of processor units based on said expected time consumption, said method further comprises:
determining that the resource limit corresponding to the target processor unit is a first upper limit value;
controlling the amount of resources provided to the target algorithm model by the target processor unit not to exceed the first upper limit value during execution of the target algorithm model by the target processor unit;
the method further comprises the steps of:
acquiring the running states of the plurality of processor units;
when the load rate of the target processor unit becomes smaller by more than a first threshold value, adjusting the resource limit value corresponding to the target processor unit to be a second upper limit value, and continuing to run the target algorithm model through the target processor unit;
the second upper limit value is larger than the first upper limit value, and the load rate is converted to the load rate of the target processor unit under the highest operating frequency.
2. The method according to claim 1, wherein the method further comprises:
and increasing the resource occupation amount of the target algorithm model on the target processor unit under the condition that the load rate of the target processor unit becomes smaller by more than a first threshold value.
3. The method of claim 1, wherein after said acquiring the operating states of the plurality of processor units, the method further comprises:
when the load rate of the target processor unit becomes larger than a second threshold value, adjusting the resource limit value corresponding to the target processor unit to be a third upper limit value;
wherein the third upper limit value is smaller than the first upper limit value.
4. The method of claim 3, wherein after said adjusting the resource limit value corresponding to the target processor unit to a third upper limit value, the method further comprises:
calculating a second expected time consumption for running the target algorithm model through the target processor unit;
calculating third expected time consumption for running the target algorithm model through the plurality of other processor units according to the current running states of the plurality of other processor units under the condition that the second expected time consumption does not meet the preset condition;
running the target algorithm model through a replacement processor unit if the replacement processor unit is present in the plurality of other processor units;
wherein the replacement processor unit is the processor unit for which the third expected time consumption satisfies the preset condition.
5. The method according to claim 4, wherein the method further comprises:
and continuing to run the target algorithm model through the target processor unit and generating early warning information in the absence of the replacement processor unit from the plurality of other processor units.
6. The method according to claim 4, wherein the method further comprises:
and continuing to run the target algorithm model through the target processor unit if the second expected time consumption meets a preset condition.
7. The method of claim 4, wherein after said acquiring the operating states of the plurality of processor units, the method further comprises:
determining the other processor units as optional processor units in the case that the load rates of the other processor units become smaller by more than the first threshold value;
And running, by the optional processor unit, the target algorithm model if the second expected time consumption does not meet the preset condition.
8. The method of claim 1, wherein after said acquiring the operating states of the plurality of processor units, the method further comprises:
and increasing the buffer storage amount of the intermediate state data of the target algorithm model in the terminal under the condition that the available memory amount of the target processor unit is larger than a third threshold value.
9. The method according to any one of claims 1 to 8, wherein before said calculating the expected time consumption for running the target model through the plurality of processor units, respectively, based on the current running states of the plurality of processor units, the method further comprises:
sending a strategy acquisition request to a server so that the server determines a target strategy according to the identity of the terminal and the identity of the target algorithm model contained in the strategy acquisition request;
caching the target strategy to the terminal;
the calculating, according to the current running states of the plurality of processor units, first expected time consumption of running the target algorithm model through the plurality of processor units respectively includes:
Calculating first expected time consumption for running a target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units and the target strategy;
wherein the target policy comprises: historical running records of running the target algorithm model through an i-th processor unit, each of the historical running records comprising: a historical progress, a historical load rate corresponding to the historical progress, and a historical time consumption from the historical progress to completion of the target model, wherein i is a positive integer.
10. The method of claim 9, wherein the current state of the i-th processor unit comprises a current load rate of the i-th processor unit;
calculating a first expected time consumption for running a target algorithm model through the plurality of processor units, respectively, according to the current running states of the plurality of processor units and the target policy, including:
determining whether a reference history record exists in the history reference operation records about the ith processor unit according to the current progress of the target algorithm model and the current load rate of the ith processor unit;
determining, in the presence of a reference history record, a time consumption for completing the target model by the i-th processor unit based on historical time consumption in the reference history record;
The difference value between the historical progress in the reference historical record and the current progress is smaller than a fourth threshold value, and the difference value between the historical load rate in the reference historical record and the current load rate is smaller than a fifth threshold value.
11. The method of claim 10, wherein each of the historical operating records further comprises: time-consuming credibility corresponding to the history progress;
the method further comprises the steps of:
under the condition that the reference history record does not exist, screening out a plurality of target history records from the historical reference running records of the i-th processor unit according to the time-consuming credibility corresponding to the history progress;
determining the time consumption of completing the target model through the i-th processor unit according to the historical time consumption respectively contained in the plurality of target history records;
the difference between the historical load rate contained in the target historical record and the current load rate is smaller than a second preset value and larger than a first preset value, and the time-consuming reliability contained in the target historical record meets preset conditions.
12. An apparatus for running an algorithm model, configured in a terminal, the apparatus comprising:
The state acquisition module is used for acquiring the current running states of the plurality of processor units in the terminal;
the time consumption calculation module is used for calculating first expected time consumption for running the target algorithm model through the plurality of processor units according to the current running states of the plurality of processor units;
a unit determination module for determining a target processor unit among the plurality of processor units for running the target algorithm model based on the first expected time consumption;
the apparatus further comprises: the limit value determining module and the model running module;
the limit value determining module is configured to determine, after the unit determining module determines the target processor unit among the plurality of processor units according to the expected time consumption, that the resource limit corresponding to the target processor unit is a first upper limit value;
the model running module is used for controlling the resource amount provided to the target algorithm model by the target processor unit not to exceed the first upper limit value in the process of running the target algorithm model by the target processor unit;
the apparatus further comprises: a state acquisition module;
The state acquisition module is used for acquiring the running states of the plurality of processor units;
the limit value determining module is further configured to adjust a resource limit value corresponding to the target processor unit to be a second upper limit value when the load rate of the target processor unit becomes smaller by an amount greater than a first threshold value; the model running module is further used for continuing to run the target algorithm model through the target processor unit;
the second upper limit value is larger than the first upper limit value, and the load rate is converted to the load rate of the target processor unit under the highest operating frequency.
13. An electronic device includes a processor and a memory;
the memory is used for storing a computer program;
the processor being configured to execute the computer program to implement a method of operating an algorithm model according to any one of the preceding claims 1 to 11.
14. A computer-readable storage medium storing a computer program;
the computer program causes a computer to perform a method of operating an algorithm model according to any one of the preceding claims 1 to 11.
CN202311750295.XA 2023-12-19 2023-12-19 Method, device, terminal and storage medium for running algorithm model Active CN117435350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311750295.XA CN117435350B (en) 2023-12-19 2023-12-19 Method, device, terminal and storage medium for running algorithm model

Publications (2)

Publication Number Publication Date
CN117435350A CN117435350A (en) 2024-01-23
CN117435350B true CN117435350B (en) 2024-04-09

Family

ID=89556891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311750295.XA Active CN117435350B (en) 2023-12-19 2023-12-19 Method, device, terminal and storage medium for running algorithm model

Country Status (1)

Country Link
CN (1) CN117435350B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015470B (en) * 2020-09-09 2022-02-01 平安科技(深圳)有限公司 Model deployment method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN111340237A (en) * 2020-03-05 2020-06-26 腾讯科技(深圳)有限公司 Data processing and model operation method, device and computer equipment
CN112286658A (en) * 2020-10-28 2021-01-29 北京字节跳动网络技术有限公司 Cluster task scheduling method and device, computer equipment and storage medium
CN115940159A (en) * 2022-11-29 2023-04-07 中国南方电网有限责任公司 Power grid operation control section monitoring method, system, device and storage medium
CN116611476A (en) * 2023-05-23 2023-08-18 维沃移动通信有限公司 Performance data prediction method, performance data prediction device, electronic device, and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dynamic resource management model based on dynamic monitoring period in cloud environment; Yuan Jian et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2016-03-15 (Issue 03); pp. 474-478 *

Also Published As

Publication number Publication date
CN117435350A (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN110489447B (en) Data query method and device, computer equipment and storage medium
CN107066332B (en) Distributed system and scheduling method and scheduling device thereof
US8176037B2 (en) System and method for SQL query load balancing
US10101910B1 (en) Adaptive maximum limit for out-of-memory-protected web browser processes on systems using a low memory manager
CN109995669B (en) Distributed current limiting method, device, equipment and readable storage medium
CN102075554B (en) Service processing method and system based on SOA (Service Oriented Architecture)
US10862992B2 (en) Resource cache management method and system and apparatus
CN110990138B (en) Resource scheduling method, device, server and storage medium
CN104243405A (en) Request processing method, device and system
CN105656810B (en) Method and device for updating application program
CN110858843A (en) Service request processing method and device and computer readable storage medium
CN113364697A (en) Flow control method, device, equipment and computer readable storage medium
EP4177745A1 (en) Resource scheduling method, electronic device, and storage medium
CN115617520A (en) Resource parameter configuration method and device, electronic equipment and storage medium
US10348814B1 (en) Efficient storage reclamation for system components managing storage
CN117435350B (en) Method, device, terminal and storage medium for running algorithm model
CN115550354A (en) Data processing method and device and computer readable storage medium
CN115562841B (en) Cloud video service self-adaptive resource scheduling system and method
WO2022111466A1 (en) Task scheduling method, control method, electronic device and computer-readable medium
CN110457130B (en) Distributed resource elastic scheduling model, method, electronic equipment and storage medium
CN113904940A (en) Resource adjusting method and device, electronic equipment and computer readable storage medium
CN109413489B (en) Serial multi-thread bullet screen distribution method, device, equipment and storage medium
CN114443262A (en) Computing resource management method, device, equipment and system
CN117519953B (en) Separated memory management method for server-oriented non-perception calculation
CN114579842B (en) Method for providing associated word on search interface and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant