CN113342538A - Inference engine design method for improving GPU (graphics processing unit) computation throughput by separating script and model


Info

Publication number: CN113342538A
Application number: CN202110894802.1A
Authority: CN (China)
Prior art keywords: gpu, processing, cpu, script, model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 唐伟鹏, 吴小炎, 吴名朝
Applicant / Assignee: Whale Cloud Technology Co Ltd
Priority date / Filing date: 2021-08-05
Publication date: 2021-09-03

Classifications

    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU] (G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F9/00 Arrangements for program control; G06F9/06 using stored programs; G06F9/46 Multiprogramming arrangements)
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5016 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals, the resource being the memory
    • G06F9/5083 Techniques for rebalancing the load in a distributed system


Abstract

The invention discloses an inference engine design method that improves GPU computational throughput by separating the script from the model, comprising the following steps: splitting and abstracting the CPU-processing and GPU-processing logic into modules; serializing the data transmitted between CPU processing and GPU processing; containerized inter-process communication; containerized multi-instance deployment of the modules; reverse proxy and load balancing; and adjusting the number of instances in real time through elastic scaling. Beneficial effects: by abstracting and decoupling CPU processing and GPU processing, reuse is increased and the serialization problem is solved; CPU preprocessing and GPU neural-network computation become truly independent distributed computations; and containerized instances can be configured in different ratios for different models according to the actual environment, for example 20 CPU containers paired with 4 GPU containers, so that GPU resources are fully utilized and GPU throughput is improved.

Description

Inference engine design method for improving GPU (graphics processing unit) computation throughput by separating script and model
Technical Field
The invention relates to the technical field of GPUs (graphics processing units), in particular to a method for designing an inference engine for improving GPU computing throughput by separating scripts and models.
Background
Nowadays, a single AI capability usually contains several algorithms that depend on and exchange data with one another; for example, the output of algorithm A is used as the input of algorithm B. Moreover, each algorithm typically involves a number of pre- and post-processing steps on the data, such as picture size normalization.
At present, because the script and the model are not separated, CPU computation and GPU computation are bound together into a single whole; since the computing capacity of the CPU is far lower than that of the GPU, GPU throughput becomes very low.
For example, when an enterprise undergoes digital transformation it inevitably faces AI scenarios and needs AI applications and AI capabilities. Putting a real AI capability into production requires calling that capability, and the call is usually exposed as an API through an AI capability open platform. Such platforms host video- and image-type capabilities, which demand hardware-accelerated computing resources; this demand is usually met with GPUs, which are expensive computing resources.
Because data preprocessing between algorithms is usually CPU-based while the visual matrix neural networks rely on GPU computation, the two must communicate. When they are not separated, CPU computation and GPU computation run serially, and under such a mechanism GPU throughput drops. The problem is particularly severe when one AI capability contains multiple interdependent algorithms; GPU throughput is then typically below 40%.
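As a rough illustration of why coupled execution caps throughput, a few lines of arithmetic (the 60 ms / 20 ms per-request timings below are assumed for illustration and are not taken from the patent):
import math

# Illustrative only: assumed per-request timings, not measurements from the patent.
cpu_ms, gpu_ms = 60.0, 20.0                            # CPU preprocessing vs GPU inference time

# Coupled (serial) execution: the GPU sits idle while the CPU preprocesses.
serial_gpu_utilization = gpu_ms / (cpu_ms + gpu_ms)    # 20 / 80 = 0.25, i.e. ~25% busy

# Decoupled execution: enough CPU instances can keep one GPU instance saturated.
cpu_instances_per_gpu = math.ceil(cpu_ms / gpu_ms)     # 3 CPU containers feed 1 GPU container

print(f"serial GPU utilization ~{serial_gpu_utilization:.0%}, "
      f"decoupled needs ~{cpu_instances_per_gpu} CPU instances per GPU")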
In the prior art, the model and the script are not separated. As shown in Fig. 2, while the script performs operations such as data preprocessing, the GPU logic sits idle; since CPU processing usually takes longer than GPU processing, this causes great waste. As another example, a face comparison algorithm generally includes face detection, face verification, face alignment and face recognition, and its actual flow is shown in Fig. 3.
Therefore, to address this phenomenon, the script (CPU preprocessing) and the model (GPU neural-network computation) can be separated and decoupled from each other, so that CPU preprocessing and GPU neural-network computation truly become independent distributed computations, the mutually dependent serial behaviour is eliminated, and GPU throughput is improved.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
In view of the problems in the related art, the invention provides an inference engine design method that improves GPU computing throughput by separating the script and the model, so as to overcome the technical problems in the related prior art.
Therefore, the invention adopts the following specific technical scheme:
the inference engine design method for improving GPU computational throughput by separating scripts and models comprises the following steps:
carrying out module splitting and abstraction on the logics of CPU processing and GPU processing;
serializing data transmitted between CPU processing and GPU processing;
containerized process communication;
modular containerized multi-instance deployments;
reverse proxy and load balancing;
the number of instances is adjusted in real time using elastic expansion.
Further, the module splitting and abstracting of the logic of the CPU processing and the GPU processing includes the following steps:
aiming at the logics of the CPU processing and the GPU processing, module splitting and abstraction are carried out according to a splitting principle, and the coupled logic is separated and decoupled;
performing second abstraction on the CPU processing;
the GPU processing is abstracted a second time.
Further, the splitting principle is a modularization, atom and multiplexing principle.
Furthermore, the containerized process communication method is to use a remote process call to transmit serialized data to realize data communication between different processes.
Further, the module containerized multi-instance deployment includes the steps of:
a container platform is adopted, and the script and the model are separately deployed based on containerization multi-copy deployment;
deploying the number of instances in a corresponding proportion according to the difference proportion of the computing forces between the CPU and the GPU script logic mutual calling module;
the controller automatically maintains the corresponding instance.
Further, the reverse proxy and load balancing comprises the steps of:
carrying out reverse proxy on multiple instances of the same module and distributing a request;
the reverse proxy server uses different load balancing strategies to dynamically distribute the requests according to the performance of the instance nodes, so as to achieve the optimal performance of the service nodes.
Further, the real-time adjustment of the number of instances by using elastic expansion and contraction comprises the following steps:
monitoring the indexes of the examples by utilizing the elastic telescopic function of the container platform;
setting specific comparison threshold values and jitter duration;
and the number of copies is adjusted in real time, so that the blockage caused by insufficient instances and the resource waste caused by idle instances are avoided.
Further, the index of the example includes the number of calls, the CPU usage and the memory usage.
The invention has the beneficial effects that: through abstracting and decoupling CPU processing and GPU processing, multiplexing is increased, the problem of serialization is solved, independent distributed computing can be achieved through CPU preprocessing and GPU neural network computing, containerization examples with different proportions are configured according to actual environments and different models, all modules can be multiplexed and used in parallel, for example, 20 CPU containers are matched with 4 GPU containers, GPU resources are fully utilized, GPU throughput is improved, and efficiency and convenience are improved through data transmission by using a gPC.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow diagram of an inference engine design method that separates the script and the model to improve GPU computational throughput, according to an embodiment of the present invention;
FIG. 2 is a logic flow diagram of an existing algorithm capability in which the model and the script are not separated;
FIG. 3 is a flow diagram of a face comparison algorithm, as an existing algorithm capability in which the model and the script are not separated;
FIG. 4 is a flowchart of the overall system call process in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention;
FIG. 5 is a structural diagram of the module splitting and abstraction of the CPU-processing and GPU-processing logic in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention;
FIG. 6 is a gRPC communication architecture diagram in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention;
FIG. 7 is a diagram of the multi-container replica deployment structure in the inference engine design method for improving GPU computational throughput by separating the script and the model, according to an embodiment of the present invention.
Detailed Description
For further explanation of the various embodiments, reference is made to the accompanying drawings, which form a part of the disclosure. The drawings illustrate embodiments and, together with the description, serve to explain their principles of operation and to enable others of ordinary skill in the art to understand the embodiments and the advantages of the invention. The figures are not to scale, and like reference numerals generally refer to like elements.
According to an embodiment of the invention, an inference engine design method for improving GPU computing throughput by separating the script and the model is provided.
The invention is further explained below with reference to the drawings and the detailed description. As shown in Fig. 1, in an embodiment of the invention, the inference engine design method for improving GPU computation throughput by separating the script and the model comprises the following steps:
s1, carrying out module splitting and abstraction on the logic of CPU processing and GPU processing;
specifically, as shown in fig. 5, the module splitting and abstracting the logic of the CPU processing and the GPU processing includes the following steps:
s11, carrying out module splitting and abstraction according to a splitting principle aiming at the logics of the CPU processing and the GPU processing, and separating and decoupling the coupled logic;
s12, performing second abstraction on the CPU processing;
s13, abstract the GPU processing a second time (re-abstract is needed because many operations in the CPU processing logic are duplicated and multiple different inference capabilities may use the same operations).
The splitting principle is a modularization, atom and multiplexing principle.
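As a minimal sketch of what such a split can look like (the class and function names below are hypothetical and not taken from the patent code), the CPU-side preprocessing and the GPU-side model call are placed behind separate, reusable interfaces:
from abc import ABC, abstractmethod

import cv2
import numpy as np

class CpuPreprocessor(ABC):
    """CPU-bound logic (decoding, resizing, normalization), reusable across capabilities."""
    @abstractmethod
    def run(self, raw: bytes) -> np.ndarray: ...

class GpuModelService(ABC):
    """GPU-bound logic: one neural-network forward pass, exposed as a remote service."""
    @abstractmethod
    def call_model(self, model_type: str, data: np.ndarray): ...

class ResizeNormalize(CpuPreprocessor):
    """Example atomic CPU module: the picture size normalization mentioned above."""
    def run(self, raw: bytes) -> np.ndarray:
        img = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
        img = cv2.resize(img, (224, 224))          # size normalization on the CPU
        return img.astype(np.float32) / 255.0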
S2, serializing the data transmitted between CPU processing and GPU processing;
Once CPU processing and GPU processing are separated, data must be transferred between different processes and even different machines, so the data have to be serialized. Considering that the input and output parameters of a model are mostly multi-dimensional numpy matrices with a relatively large data volume, a tool such as pickle, which has advantages over plain files in both storage format and read speed (it is more efficient), can be used for the conversion.
S3, containerized process communication;
Specifically, as shown in Fig. 6, containerized process communication uses remote procedure calls (gRPC) to transfer the serialized data, realizing data communication between different processes. That is, because CPU processing and GPU processing are separated, different processes and even different machines have to communicate with each other. gRPC is a high-performance, open-source, general-purpose RPC framework designed for mobile and HTTP/2. Because it is built on the HTTP/2 standard, it brings features such as bidirectional streaming, flow control, header compression and multiplexing of many requests over a single TCP connection. These characteristics make it perform better on mobile devices and save power and space.
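A hedged client-side sketch of such a call is shown below. The proto service and the generated modules inference_pb2 / inference_pb2_grpc are assumed names used for illustration only; the patent does not publish its actual interface definition.
# Assumed to be generated by protoc from a .proto roughly like:
#   service ModelService { rpc CallModel (ModelRequest) returns (ModelReply); }
#   message ModelRequest { string model_type = 1; bytes payload = 2; }
#   message ModelReply   { bytes payload = 1; }
import pickle

import grpc
import numpy as np

import inference_pb2          # hypothetical generated module
import inference_pb2_grpc     # hypothetical generated module

def call_gpu_model(target: str, model_type: str, matrix: np.ndarray):
    """Send a pickled numpy matrix to a GPU model container and return its result."""
    with grpc.insecure_channel(target) as channel:
        stub = inference_pb2_grpc.ModelServiceStub(channel)
        request = inference_pb2.ModelRequest(
            model_type=model_type,
            payload=pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL))
        reply = stub.CallModel(request, timeout=30)
        return pickle.loads(reply.payload)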
S4, containerized multi-instance deployment of the modules;
Specifically, as shown in Fig. 7, the containerized multi-instance deployment of the modules comprises the following steps:
S41, because the computing power of CPUs and GPUs is unequal, the inference loads differ, and the same piece of logic is often reused, a container platform is adopted, deployment is based on containerized multi-copy deployment, and the script and the model are deployed separately;
S42, deploying instance counts in a ratio that matches the difference in computing power between the mutually calling CPU-script and GPU modules;
S43, letting the controller automatically maintain the corresponding instances (for example, automatically restarting an instance when it fails).
S5, reverse proxy and load balancing;
Specifically, the reverse proxy and load balancing comprise the following steps:
S51, reverse-proxying the multiple instances of the same module and distributing requests (a call to a module no longer has to target a specific instance; it only needs to reach the corresponding reverse proxy server, which simplifies the client-side calling logic);
S52, having the reverse proxy server dynamically distribute requests according to the performance of the instance nodes, using different load-balancing strategies, to achieve the optimal performance of the service nodes.
In an actual environment this requirement is met by the Service of the container platform: when calling the same module, the client does not need to know each specific instance; it only needs to request the corresponding Service, which then automatically dispatches the request to a back-end instance according to the configured load-balancing strategy (round robin, specified weights, hashing, and so on).
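The three strategies named above can be sketched in a few lines (illustrative only; the backend addresses and weights are assumed values, and in practice the container platform's Service or the reverse proxy implements this):
import hashlib
import itertools
import random

backends = ["worker-0:50051", "worker-1:50051", "worker-2:50051"]
weights  = [3, 1, 1]                         # assumed relative node capacities

_rr = itertools.cycle(backends)

def round_robin() -> str:
    return next(_rr)                         # polling: each instance in turn

def weighted() -> str:
    return random.choices(backends, weights=weights, k=1)[0]   # favour stronger nodes

def by_hash(key: str) -> str:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return backends[h % len(backends)]       # the same key always hits the same instance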
S6, adjusting the number of instances in real time through elastic scaling.
Specifically, adjusting the number of instances in real time through elastic scaling comprises the following steps:
S61, monitoring the instance metrics using the elastic scaling function of the container platform;
S62, setting specific comparison thresholds and an anti-jitter duration;
S63, adjusting the number of replicas in real time, avoiding both blocking caused by too few instances and resource waste caused by idle instances.
The instance metrics include the number of calls, CPU usage and memory usage.
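A simplified sketch of the scaling decision in steps S61 to S63 (the threshold, anti-jitter window and replica bounds are assumed values; a real container platform's autoscaler performs this automatically):
from typing import Optional

TARGET_CPU = 0.70            # comparison threshold on average CPU usage
JITTER_SECONDS = 120         # metric must stay out of band this long before acting
MIN_REPLICAS, MAX_REPLICAS = 2, 20

def decide(replicas: int, cpu_usage: float,
           breach_since: Optional[float], now: float):
    """Return (new_replica_count, breach_since) for one evaluation tick."""
    over, under = cpu_usage > TARGET_CPU, cpu_usage < TARGET_CPU * 0.5
    if not (over or under):
        return replicas, None                # inside the band: keep the replica count
    if breach_since is None:
        return replicas, now                 # start the anti-jitter timer
    if now - breach_since < JITTER_SECONDS:
        return replicas, breach_since        # ignore transient spikes
    step = 1 if over else -1
    return max(MIN_REPLICAS, min(MAX_REPLICAS, replicas + step)), None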
In summary, combining the above six steps realizes the separation and independent operation of CPU processing and GPU processing, and thereby improves GPU throughput. As shown in Fig. 4, the overall call flow of the system is as follows:
First, the inference service receives the service call;
Second, CPU processing, for example decoding the base64-encoded image and performing picture size normalization;
Third, GPU processing (RNN, CNN, PNN): the processed image matrix is passed through the engine API to the GPU neural-network computation stage, and the inference result is output; this stage contains several rounds of neural-network computation, such as image detection, image alignment and image object recognition;
Fourth, CPU post-processing: the result of each single neural-network computation in the GPU inference process is packaged into a standardized form;
Fifth, result integration: the multiple standardized results from the CPU post-processing are merged and output to the caller as a whole.
The following are concrete conventions and an example:
1) Script structure
(1) Entry function
Take demo_face_test as an example. A face_test package exists in the code directory; this package is the entry point for calling the whole model capability, and the script combines several algorithms to produce the final result.
Its entry .py file contains a method predict:
# Imports needed by this entry function; getLogger, parse_image_file, face_detect_api,
# check_face, FaceStatus, face_align, face_feature_api and matrix_distance are helpers
# provided by the capability's own packages.
import json
import threading
import time
import traceback

import numpy as np

def predict(params=None):
    start = time.time()
    time.sleep(10)
    logger = getLogger('video')
    logger.info("%s predict start at %s" % (threading.currentThread().getName(), start))
    ret = {'data': {}, 'code': 0, 'message': 'success'}
    # time.sleep(1)
    # return json.dumps(ret)
    try:
        # read picture
        if params and 'image' in params:
            image = params['image']
            img = parse_image_file(image, 'cv')
        else:
            raise RuntimeError('`image` not in params.')
        # img = cv2.imread(get_model_own(__file__, 'image.jpg'))
        rand = np.random.randint(-5, 6, size=img.shape)
        img_new = img + rand
        img_new = img_new.astype(np.uint8)
        # face detection (GPU)
        start2 = time.time()
        logger.info("%s face_detect_api start at %s" % (threading.currentThread().getName(), start2))
        face_out = face_detect_api(img_new)
        end2 = time.time()
        logger.info("%s face_detect_api finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), end2, round((end2 - start2) * 1000)))
        logger.info('%s shape of face_detect_api: %s ' % (threading.currentThread().getName(), len(face_out)))
        ret['data'] = {'len': len(face_out)}
        # ret['data'] = '1'
        # face verification (GPU)
        box_list = []
        landmarks_list = []
        start3 = time.time()
        logger.info("%s check_face start at %s" % (threading.currentThread().getName(), start3))
        for i in range(len(face_out)):
            x_min = face_out[i]['ext_boxes'][0]
            x_max = face_out[i]['ext_boxes'][2]
            y_min = face_out[i]['ext_boxes'][1]
            y_max = face_out[i]['ext_boxes'][3]
            if check_face('video', img, face_out[i]) == FaceStatus.SUCCESS:
                box_list.append([x_min, y_min, x_max, y_max])
                landmarks_list.append(face_out[i]['landmarks'])
        end3 = time.time()
        logger.info("%s check_face finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), end3, round((end3 - start3) * 1000)))
        # face alignment
        aligned_list = []
        for i, j in zip(box_list, landmarks_list):
            aligned_list.append(face_align(img, [i], [j]))
        aligneds = np.concatenate(aligned_list)
        # face recognition (GPU)
        start4 = time.time()
        logger.info("%s face_api2 start at %s" % (threading.currentThread().getName(), start4))
        face_feature_output = face_feature_api(aligneds)
        end4 = time.time()
        logger.info("%s face_api2 finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), end4, round((end4 - start4) * 1000)))
        # gather results
        face_feature_output = np.array([item['feature_value'] for item in face_feature_output])
        ret['data']['num'] = len(face_feature_output)
        time1 = time.time()
        logger.info("%s matrix_distance start at %s" % (threading.currentThread().getName(), time1))
        if len(face_feature_output) > 0:
            face_num = 10000
            faces_new = face_feature_output.repeat(face_num // face_feature_output.shape[0], 0)
            matches = matrix_distance(face_feature_output, faces_new) < 0.5
            logger.info('num of matches:%s' % np.sum(matches))
        time2 = time.time()
        logger.info("%s matrix_distance finish at %s, cost totally %s ms." % (
            threading.currentThread().getName(), time2, round((time2 - time1) * 1000)))
    except Exception as e:
        traceback.print_exc()
        ret['code'] = -1
        ret['message'] = str(e)
    return json.dumps(ret)
The data returned by this method is a json string.
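A hypothetical call of this entry function might look as follows (whether parse_image_file expects base64-encoded content is assumed here, based on the base64 decoding mentioned in the call flow above):
import base64
import json

with open("face.jpg", "rb") as f:
    params = {"image": base64.b64encode(f.read()).decode()}

result = json.loads(predict(params))
print(result["code"], result["message"], result["data"])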
(2) Algorithm structure
Take one of its algorithm implementation packages, face_detection, as an example.
Its model_service module (the entrypoint configured in the model configuration below) is where the model is loaded and an inference instance is provided. It exposes a get_instance() method that returns the unique instance of the inference object; this is the function through which the gRPC service obtains the model-call instance:
def get_instance(gpu_device_id):
    return MtcnnModelCaller.instance(gpu_device_id)
Taking MtcnnModelCaller.instance(gpu_device_id), shown in the figure, as an example, the function is implemented as follows:
@classmethod
def instance(cls, *args, **kwargs):
    if not hasattr(MtcnnModelCaller, "_instance"):
        with MtcnnModelCaller._instance_lock:
            gpu_device_id = args[0]
            MtcnnModelCaller.gpu_device_id = gpu_device_id
            MtcnnModelCaller._instance = MtcnnModelCaller()
    return MtcnnModelCaller._instance
The class MtcnnModelCaller provides one method, call_model(params):
def call_model(self, params):
    if params['type'] == 'pnet':
        return self.call_PNet(params['data'])
    if params['type'] == 'rnet':
        return self.call_RNet(params['data'])
    if params['type'] == 'onet':
        return self.call_ONet(params['data'])
    if params['type'] == 'lnet':
        return self.call_LNet(params['data'])
    return None
This is the entry method exposed to the gRPC service: when an RPC request arrives, the service selects the appropriate model instance according to the parameters and calls its call_model(self, params) method.
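For illustration, a server-side counterpart to the client sketch given earlier could wire get_instance() into the gRPC service as follows. inference_pb2 / inference_pb2_grpc are the same assumed generated modules, and the import path of model_service mirrors the entrypoint configured below; the patent's actual service wiring is not shown verbatim.
import pickle
from concurrent import futures

import grpc

import inference_pb2            # hypothetical generated module
import inference_pb2_grpc       # hypothetical generated module
from script.face_detection import model_service   # provides get_instance(gpu_device_id)

class ModelServiceServicer(inference_pb2_grpc.ModelServiceServicer):
    def __init__(self, gpu_device_id: int = 0):
        # singleton model holder on this GPU container
        self._caller = model_service.get_instance(gpu_device_id)

    def CallModel(self, request, context):
        params = {"type": request.model_type, "data": pickle.loads(request.payload)}
        result = self._caller.call_model(params)
        return inference_pb2.ModelReply(payload=pickle.dumps(result))

def serve(port: int = 50051):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    inference_pb2_grpc.add_ModelServiceServicer_to_server(ModelServiceServicer(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()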
(3) Model configuration
The YAML configuration under the model/face_test package is as follows:
model_info:
  - model_service: 'face_detection'
    entrypoint: script.face_detection.model_service
  - model_service: 'face_feature_extract'
    entrypoint: script.face_feature_extract.model_service
  - model_service: 'face_structure_angle'
    entrypoint: script.face_structure_angle.model_service
The model_service of every GPU model required by the face_test capability is configured in this file. For example, the face_detection model_service from the algorithm-structure example above has an entry in this configuration, under the field name model_service.
(4) Model invocation
The script/model_separate_common/model_controller.py script provides a remote model-calling instance. To use it, first acquire the instance; taking face_detection in face_test as an example, the model_service name 'face_detection' is passed in when the instance is obtained. Then call the call_model method directly, passing in the parameters for the model call.
model_caller = ModelCaller.instance('face_detection')
data = {
    'sliced_index': sliced_index,
    'img': img,
    'scales': scales
}
total_boxes = model_caller.call_model('pnet', data)
In this example the model type 'pnet' is also passed in, because several models are used inside face_detection and a type is needed to determine which model to call.
In the mtcnn_detector of this example, the data is received, processed and transformed first, then the models are called repeatedly, and finally the result is returned.
After the script and the model are separated and refactored in this way, GPU utilization is greatly improved.
In summary, with the technical scheme of the invention, CPU processing and GPU processing are abstracted and decoupled, reuse is increased, the serialization problem is solved, and CPU preprocessing and GPU neural-network computation become truly independent distributed computations; containerized instances are configured in different ratios for different models according to the actual environment, for example 20 CPU containers paired with 4 GPU containers, so that GPU resources are fully utilized and GPU throughput is improved.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (8)

1. An inference engine design method for improving GPU computational throughput by separating the script and the model, characterized by comprising the following steps:
splitting and abstracting the CPU-processing and GPU-processing logic into modules;
serializing the data transmitted between CPU processing and GPU processing;
containerized inter-process communication;
containerized multi-instance deployment of the modules;
reverse proxy and load balancing;
adjusting the number of instances in real time through elastic scaling.
2. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that splitting and abstracting the CPU-processing and GPU-processing logic into modules comprises the following steps:
splitting and abstracting the CPU-processing and GPU-processing logic into modules according to the splitting principle, separating and decoupling the coupled logic;
abstracting the CPU processing a second time;
abstracting the GPU processing a second time.
3. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 2, characterized in that the splitting principle consists of modularization, atomization and reuse.
4. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that the containerized process communication uses remote procedure calls to transfer the serialized data, realizing data communication between different processes.
5. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that the containerized multi-instance deployment of the modules comprises the following steps:
adopting a container platform and deploying the script and the model separately, based on containerized multi-copy deployment;
deploying instance counts in a ratio that matches the difference in computing power between the mutually calling CPU-script and GPU modules;
letting the controller automatically maintain the corresponding instances.
6. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that the reverse proxy and load balancing comprise the following steps:
reverse-proxying the multiple instances of the same module and distributing requests;
having the reverse proxy server dynamically distribute requests according to the performance of the instance nodes, using different load-balancing strategies, to achieve the optimal performance of the service nodes.
7. The inference engine design method for improving GPU computational throughput by separating the script and the model according to claim 1, characterized in that adjusting the number of instances in real time through elastic scaling comprises the following steps:
monitoring the instance metrics using the elastic scaling function of the container platform;
setting specific comparison thresholds and an anti-jitter duration;
adjusting the number of replicas in real time, avoiding both blocking caused by too few instances and resource waste caused by idle instances.
8. The method of claim 7, wherein the instance metrics include the number of calls, CPU usage and memory usage.
Priority, Publications and Family

Application number: CN202110894802.1A
Priority date / Filing date: 2021-08-05
Publication number: CN113342538A (status: Pending)
Publication date: 2021-09-03
Family ID: 77480751
Country: CN (China)
Title: Inference engine design method for improving GPU (graphics processing unit) computation throughput by separating script and model



Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20210903)