CN110955461A - Processing method, device and system of computing task, server and storage medium - Google Patents

Processing method, device and system of computing task, server and storage medium

Info

Publication number
CN110955461A
CN110955461A
Authority
CN
China
Prior art keywords
task
computing
input
cache region
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911159702.3A
Other languages
Chinese (zh)
Other versions
CN110955461B (en)
Inventor
黄亮
钟辉
江子豪
蔡云飞
贾宜彬
刘凌志
马宁宁
刘理
杨超
姜春阳
吴俊
衣敏
白晓航
包能辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Reach Best Technology Co Ltd
Original Assignee
Reach Best Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Reach Best Technology Co Ltd filed Critical Reach Best Technology Co Ltd
Priority to CN201911159702.3A priority Critical patent/CN110955461B/en
Publication of CN110955461A publication Critical patent/CN110955461A/en
Application granted granted Critical
Publication of CN110955461B publication Critical patent/CN110955461B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44568Immediately runnable code
    • G06F9/44578Preparing or optimising for loading

Abstract

The present disclosure provides a processing method, apparatus, system, server, and storage medium for computing tasks, where the server is provided with a plurality of input cache regions. The method includes: generating a plurality of computing tasks based on data processing requests sent by clients; determining an input cache region for each computing task, and respectively inputting the task data corresponding to the computing tasks into the input cache regions; under the condition that an input cache region meeting the starting condition exists among the input cache regions, starting to execute the computing task corresponding to that input cache region; and returning the computation result of the completed computing task to the corresponding client. The method and device can reduce the consumption caused by data transmission, keep the throughput capacity of the server stable and dependent only on the hardware computing performance of the slave device, and improve the hardware utilization rate of the slave device.

Description

Processing method, device and system of computing task, server and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, a system, a server, and a storage medium for processing a computing task.
Background
Heterogeneous computing is a parallel and distributed computing technique that makes the best use of various computing resources by matching the type of parallelism (code type) of a computing task to the type of computation that a machine can efficiently support (i.e., machine capability). Neural network model inference contains a large number of basic computing units such as matrix multiplication and convolution, and these units are highly amenable to parallel computing. By adopting heterogeneous computing, dedicated parallel acceleration hardware (such as GPUs, FPGAs, and ASICs) can exploit its computing advantages and greatly improve the throughput capacity of back-end services.
Referring to fig. 1, a schematic diagram of acceleration of heterogeneous computation of a server is shown, in a backend service, parallel acceleration hardware is usually used as a "slave device" of a server (hereinafter referred to as a "host") to offload a computation task of a client, and an acceleration scheme of heterogeneous computation of the server has the following characteristics:
1. parallel acceleration hardware usually adopts a single instruction multiple data stream mode to carry out parallel acceleration, and if the computing advantage of slave equipment is to be exerted, a larger batch size needs to be set;
2. the slave device can only access its own memory when executing the calculation task to read data, so the host data needs to be transmitted to the memory of the slave device by means of memory copy, and after the calculation task is completed, the calculation result on the slave device needs to be copied to the memory of the host.
Given these characteristics, a single call to the neural network model involves the time of two data transfers plus the time of one model inference. In the multi-concurrency scenario of back-end services, the data processing requests issued by clients are scattered and frequent, and copying such small, scattered batches of data back and forth between the host and the slave device is very inefficient. How to make full use of the computing power of heterogeneous computing hardware while reducing the consumption caused by data transmission therefore becomes an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a system, a server, and a storage medium for processing a computing task, so as to at least solve the problem of consumption caused by data transmission in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a method for processing a computing task, where a server is provided with a plurality of input buffers, the method including:
generating a plurality of computing tasks based on a data processing request sent by a client;
determining an input cache region of each computing task, and respectively inputting task data corresponding to the computing tasks into the input cache regions;
under the condition that an input cache region meeting the starting condition exists in each input cache region, starting to execute a calculation task corresponding to the input cache region meeting the starting condition;
and returning the calculation result of the executed calculation task to the corresponding client.
Optionally, the determining an input buffer area of each of the computing tasks, and respectively inputting the task data corresponding to the computing tasks into the input buffer area includes:
searching an idle input cache region, and reading a memory address corresponding to the idle input cache region;
and respectively inputting the task data of the computing task into the idle input cache region according to the memory address.
Optionally, before starting to execute the computing task corresponding to the input cache region meeting the starting condition, the method further includes:
if the task data of the computing task belonging to the input cache region is input completely and the computing resource used for executing the computing task is not occupied, determining that the input cache region meets the starting condition.
Optionally, the starting and executing the computing task corresponding to the input cache region meeting the starting condition includes:
generating a calculation instruction by adopting the memory address of the input cache region which accords with the starting condition;
and reading task data from the input cache region which meets the starting condition according to the memory address in the calculation instruction so as to execute the calculation task and obtain a calculation result.
Optionally, after the returning the computation result of executing the completed computation task to the corresponding client, the method further includes:
freeing computing resources for performing the computing task.
Optionally, after generating a plurality of computing tasks based on the data processing request sent by the client, the method further includes:
for a given computing task, acquiring the sending time of the data processing request corresponding to the computing task;
and generating a list task sequence according to the sending time.
Optionally, a plurality of output cache regions are disposed in the server, and returning the computation result of the executed computation task to the corresponding client includes:
inputting the calculation result of the executed calculation task into the output cache region;
and sending the calculation result of the output cache region to a corresponding client according to the list task sequence.
Optionally, the computing task is a neural network model inference computing task, and the server has parallel acceleration hardware using heterogeneous computing techniques.
According to a second aspect of the embodiments of the present disclosure, there is provided a processing apparatus for a computing task, which is applied to a server, where a plurality of input buffers are disposed, the apparatus including:
the computing task generating module is configured to generate a plurality of computing tasks based on the data processing request sent by the client;
the task data input module is configured to determine an input cache region of each computing task and input task data corresponding to the computing tasks into the input cache regions respectively;
the computing task starting module is configured to start and execute a computing task corresponding to the input cache region meeting the starting condition under the condition that the input cache region meeting the starting condition exists in each input cache region;
and the calculation result returning module is configured to return the calculation result of the executed calculation task to the corresponding client.
Optionally, the task data input module is configured to: searching an idle input cache region, and reading a memory address corresponding to the idle input cache region; and respectively inputting the task data of the computing task into the idle input cache region according to the memory address.
Optionally, the apparatus further comprises:
a starting condition determining module configured to determine that the input cache region meets a starting condition if input of task data of a computing task belonging to the input cache region is completed and a computing resource for executing the computing task is not occupied.
Optionally, the computing task initiation module is configured to: generating a calculation instruction by adopting the memory address of the input cache region which accords with the starting condition; and reading task data from the input cache region which meets the starting condition according to the memory address in the calculation instruction so as to execute the calculation task and obtain a calculation result.
Optionally, the apparatus further comprises:
a computing resource release module configured to release computing resources for performing the computing task.
Optionally, the apparatus further comprises:
the sending time acquisition module is configured to acquire the sending time of the data processing request corresponding to each computing task;
a list task order generation module configured to generate a list task order according to the transmission time.
Optionally, the computation result returning module is configured to input the computation result of the completed computation task into the output cache region; and sending the calculation result of the output cache region to a corresponding client according to the list task sequence.
Optionally, the computing task is a neural network model inference computing task, and the server has parallel acceleration hardware using heterogeneous computing techniques.
According to a third aspect of the embodiments of the present disclosure, there is provided a processing system of a computing task, the system including a client and a server, wherein:
the client is used for sending a data processing request to the server;
the server is provided with a plurality of input cache regions and output cache regions and is used for generating a plurality of computing tasks based on data processing requests sent by the client; determining an input cache region of each computing task, and respectively inputting task data corresponding to the computing tasks into the input cache regions; under the condition that an input cache region meeting the starting condition exists in each input cache region, starting to execute a calculation task corresponding to the input cache region meeting the starting condition; and returning the calculation result of the executed calculation task to the corresponding client.
Optionally, the server comprises a host and a slave, wherein,
the host is used for receiving a data processing request sent by the client, generating a plurality of computing tasks based on the received data processing request, determining an input cache region of each computing task, and starting to execute the computing task corresponding to the input cache region meeting the starting condition under the condition that the input cache region meeting the starting condition exists in each input cache region;
the slave device is provided with a hardware computing unit and a plurality of the input buffers, wherein,
the input cache region is used for inputting task data of a corresponding computing task;
the hardware computing unit is used for executing the computing task corresponding to the input cache region meeting the starting condition and returning the computing result of the executed computing task to the host machine.
Optionally, a register is further provided in the slave device, wherein,
the register is used for recording whether the computing resources of the hardware computing unit are released or not and informing the host machine based on the released state of the computing resources of the computing unit;
and the host machine is used for generating a computing instruction for starting and executing the computing task corresponding to the input cache region meeting the starting condition under the condition that the computing resources of the computing unit are released.
Optionally, an output buffer is further provided in the slave device, wherein,
the output cache region is used for caching the calculation result of the executed calculation task and returning the cached calculation result to the host machine;
and the host machine is also used for sequentially returning the calculation results to the client.
According to a fourth aspect of embodiments of the present disclosure, there is provided a server, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method as in an embodiment of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a storage medium including: the instructions in the storage medium, when executed by a processor of a server, enable the server to perform a method as in an embodiment of the first aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product comprising: computer program code which, when run by a computer, causes the computer to perform the method of the above aspects.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the server is provided with a plurality of input cache regions, generates a plurality of computing tasks based on a data processing request sent by a client, determines the input cache regions of the computing tasks, and respectively inputs task data corresponding to the computing tasks into the input cache regions of the server, wherein under the condition that the input cache regions meeting starting conditions exist in the input cache regions, the computing tasks corresponding to the input cache regions meeting the starting conditions are started to be executed, and finally, the computing results of the executed computing tasks are returned to the corresponding client. According to the method and the device, the plurality of memory cache regions are arranged on the server to cache the task data sent by the server, so that the process of transmitting the task data to the slave device is free from blocking and waiting, the consumption caused by data transmission is reduced, the throughput capacity of the stable server is only related to the hardware computing performance of the server, and the hardware utilization rate of the server is improved.
The present disclosure is suitable for a server using heterogeneous computing technology to perform neural network model inference tasks: neural network model inference contains a large number of basic computing units such as matrix multiplication and convolution, and the server's parallel acceleration hardware has dedicated hardware circuit designs for these basic computing units, so they can be processed in parallel and the computation can be completed in less time than on a CPU.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a diagram illustrating heterogeneous computation acceleration of a server.
FIG. 2 is a schematic diagram illustrating a flow of batch processing data.
Fig. 3 is a schematic diagram illustrating a synchronous data transmission scheme.
Fig. 4 is a schematic diagram illustrating an asynchronous data transmission method.
Fig. 5 is an exception diagram illustrating an asynchronous data transfer.
FIG. 6 is a flow diagram illustrating a method of processing a computing task in accordance with an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a workflow of a memory cache according to an example embodiment.
FIG. 8 is a diagram illustrating an asynchronous call using a memory cache, according to an example embodiment.
Fig. 9 is a schematic diagram illustrating a host-side and slave-side interaction workflow according to an example embodiment.
FIG. 10 is a schematic diagram illustrating a client-side guaranteed input-output consistency in accordance with an illustrative embodiment.
FIG. 11 is a block diagram illustrating a processing device for computing tasks in accordance with an exemplary embodiment.
FIG. 12 is a block diagram illustrating a processing system for computing tasks, according to an example embodiment.
FIG. 13 is an internal block diagram of a computer device, shown in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of systems and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
To solve the inefficiency, noted in the background art, of copying small batches of scattered data back and forth, a scheme for processing data in large batches has been proposed. Referring to fig. 2, which shows the flow of batch data processing, the current large-batch scheme includes the following steps (a minimal sketch of this pipeline follows the list):
1. collecting data processing requests sent from a client side on a host machine in a task queue mode;
2. after the data in the task queue is gathered to a scale, the data processing requests of the client are combined into a large batch of data (task data);
3. copying a large batch of data from a host machine to a slave device;
4. the slave equipment carries out parallel calculation aiming at mass data;
5. after the slave equipment completes the calculation, copying the calculation result obtained after batch processing to the host machine;
6. and the host machine splits the calculation results obtained after batch processing and returns the calculation results to the client.
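The following is a minimal, illustrative Python sketch of this prior large-batch pipeline. All names are hypothetical stand-ins (not from the original), and NumPy array copies stand in for the host-to-slave and slave-to-host memory transfers.

```python
# Illustrative sketch of the prior large-batch scheme (steps 1-6 above).
import queue
import numpy as np

request_queue = queue.Queue()            # step 1: requests collected in a task queue
BATCH_SIZE = 8                           # assumed batch scale

def slave_compute(batch):                # step 4: parallel computation on the slave device
    return batch * 2.0                   # placeholder for model inference

def process_one_batch():
    # step 2: merge the clients' requests into one large batch of task data
    requests = [request_queue.get() for _ in range(BATCH_SIZE)]
    clients, inputs = zip(*requests)
    batch = np.stack(inputs)
    device_batch = batch.copy()          # step 3: copy host -> slave device (simulated)
    device_result = slave_compute(device_batch)
    host_result = device_result.copy()   # step 5: copy slave device -> host (simulated)
    # step 6: split the batched result and return each part to its client
    return dict(zip(clients, host_result))

if __name__ == "__main__":
    for i in range(BATCH_SIZE):
        request_queue.put((f"client-{i}", np.full((4,), float(i))))
    print(process_one_batch())
```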
In the design of the slave device, all computing resources are generally devoted to neural network model inference as far as possible in order to achieve maximum computing performance, so the slave device can be understood as the single consumer of the task queue. In this case, data transmission can follow either a synchronous or an asynchronous mode, depending on the design of the slave device.
Referring to fig. 3, which shows the synchronous data transmission mode, the utilization rate of hardware computation can be represented as T1/(T0+T1+T2), where T0 is the data input time, T1 is the hardware computation time, and T2 is the data output time. In the synchronous data transmission mode, the slave device must wait for data input and output, which wastes slave-device performance and lowers its processing efficiency.
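As a worked illustration with assumed timings (not from the original), take T0 = 2 ms for data input, T1 = 6 ms for hardware computation, and T2 = 2 ms for data output; the synchronous-mode utilization would then be:

```latex
\frac{T_1}{T_0 + T_1 + T_2} = \frac{6}{2 + 6 + 2} = 0.6 = 60\%
```

Under these assumed timings, the slave device would spend 40% of each cycle waiting on data transfers.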
Referring to fig. 4, which shows the asynchronous data transmission mode: compared with the synchronous mode it can make full use of the computing resources of the slave device without waiting for data input and output, and in practice the asynchronous mode is widely used.
However, the asynchronous mode fully utilizes the computing resources of the slave device only if the time required for data input is less than the hardware computation time of the slave device; otherwise the slave device must wait for data input, which lowers its processing efficiency. Specifically, referring to fig. 5, which shows an exception case of the asynchronous data transmission mode, after the slave device finishes one hardware computation it has to wait for data input, so hardware computation on the slave device cannot proceed continuously and processing efficiency drops.
In practical application scenarios, this situation of the slave device waiting for data input is very likely to occur; in that case the utilization rate of hardware computation can be represented as T1/T0. There are several reasons for this:
1. the whole process of data input comprises the following parts:
a) unpacking the data processing request;
b) preprocessing data;
c) merging in batches;
d) memory copy between the host and the slave.
Each of these links can introduce uncertainty into the data input delay.
2. The computing power of slave devices has grown faster than memory transfer speeds. In neural network inference scenarios in particular, the input data volume is large (e.g., pictures, videos), and for such large volumes of data the time spent on memory copies between the host and the slave device is not negligible.
In the current asynchronous scheme for the slave device, since the time consumed by data input is uncertain, the actual utilization rate of the slave device's hardware computation can be expressed as T1/max(T0, T1), that is, min(T1/T0, 1).
If T1 ≥ T0, that is, if the hardware computation time is greater than or equal to the data input time, the hardware utilization rate can be considered to reach 100% and the computing resources of the slave device are fully utilized; conversely, if the hardware computation time is less than the data input time, the hardware computing resources are not fully utilized.
To solve the above problems, the present disclosure statically allocates "memory cache regions" in the memory of the slave device to cache the task data from the host and to cache the computation results, so that the slave device does not need to wait for data input and its utilization rate is improved.
Fig. 6 is a flowchart illustrating a processing method of a computing task, according to an exemplary embodiment, and as shown in fig. 6, the processing method of the computing task is applied to a server, where the server is divided into a host and a slave device using heterogeneous computing technology, and a plurality of input buffers and a plurality of output buffers are divided in a memory of the slave device. The host machine is used for sending out a calculation instruction, and the slave equipment is used for executing the calculation instruction sent by the host machine.
In the present disclosure, the slave device is parallel acceleration hardware using heterogeneous computing technology, such as a GPU, FPGA, or ASIC; it is part of the server and belongs to the server as an accessory. A memory cache region is statically allocated in the memory of the slave device to cache the task data of computing tasks input by the host and to cache the computation results for that task data.
Assuming the memory requirement of the task data is MemIn bytes and the memory requirement of the corresponding output computation result is MemOut bytes, the present disclosure allocates N times that amount of memory on the slave device, that is, a memory area of (MemIn + MemOut) × N bytes, to serve as the input cache regions and output cache regions, where N may be set according to actual requirements; typically N = 3 is sufficient.
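A minimal sketch of this sizing rule, with illustrative values for MemIn, MemOut, and N; a real implementation would allocate these regions in the slave device's own memory through its driver interface rather than in host byte arrays.

```python
# Sketch of the (MemIn + MemOut) x N cache-region layout described above.
MEM_IN = 4 * 1024 * 1024     # bytes needed by one batch of task data (assumed value)
MEM_OUT = 1 * 1024 * 1024    # bytes needed by the corresponding result (assumed value)
N = 3                        # number of buffer pairs; N = 3 is usually sufficient per the text

total_bytes = (MEM_IN + MEM_OUT) * N
input_buffers = [bytearray(MEM_IN) for _ in range(N)]    # input cache regions 0..N-1
output_buffers = [bytearray(MEM_OUT) for _ in range(N)]  # output cache regions 0..N-1

print(f"statically reserved {total_bytes} bytes for {N} input/output cache region pairs")
```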
Exemplarily, referring to fig. 7, a schematic diagram of a work flow of a memory buffer is shown, in fig. 7, three input buffers (input data buffers) are divided on a slave device (acceleration hardware), respectively an input data buffer 0, an input data buffer 1 and an input data buffer 2, and three output buffers (output data buffers), respectively an output data buffer 0, an output data buffer 1 and an output data buffer 2.
Specifically, the processing method of the computing task disclosed by the invention comprises the following steps:
in step S11, a plurality of calculation tasks are generated based on the data processing request transmitted by the client.
In the present disclosure, the server collects the data processing requests sent by clients in a task queue. Referring to fig. 7, when the data processing requests in the queue accumulate to a defined count threshold, or when the time spent collecting them exceeds a time threshold, a host thread is triggered to merge the task data of that batch of data processing requests into one computing task; when the queue accumulates the next batch of data processing requests, the next host thread continues the same operation and forms a new computing task. A minimal sketch of this trigger logic follows.
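The sketch below uses assumed threshold values; the queue holds incoming data processing requests, and a batch closes when either the request count or the collection-time limit is reached.

```python
# Sketch of the batching trigger: form one computing task when the queue reaches
# COUNT_THRESHOLD requests, or when MAX_WAIT seconds have passed since collection began.
import queue
import time

COUNT_THRESHOLD = 8      # assumed request-count threshold
MAX_WAIT = 0.01          # assumed collection-time threshold, in seconds

def collect_batch(request_queue):
    batch = []
    start = time.monotonic()
    while len(batch) < COUNT_THRESHOLD:
        remaining = MAX_WAIT - (time.monotonic() - start)
        if remaining <= 0:
            break                                  # time threshold exceeded
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch                                   # this batch becomes one computing task
```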
In step S12, an input buffer area for each of the calculation tasks is determined, and task data corresponding to the calculation tasks is input into the input buffer area, respectively.
After a computing task is formed, its task data can be input into the memory cache region to await execution by the slave device. Until the task data already in an input cache region has been fully processed by the slave device, no other task data can be written into that input cache region.
When the server writes task data into the input cache regions, the writes are parallel and free of thread blocking: because multiple input cache regions are provided, a host thread can finish inputting its task data without waiting for the input of the previous computing task's data to finish.
In the previous scheme, before the task data of a new computing task could be input, the task data of another computing task had to be released from the device memory first, so the input of task data spent a long time waiting.
It should be noted that the present disclosure is applicable to slave devices with heterogeneous computing technology performing neural network model inference tasks: neural network model inference contains a large number of basic computing units such as matrix multiplication and convolution, the parallel acceleration hardware (slave device) has dedicated hardware circuit designs for these basic computing units, the units can be processed in parallel, and the computation can be completed in less time than on a CPU.
In an alternative embodiment, the step S12 may include: searching an idle input cache region, and reading a memory address corresponding to the idle input cache region; and respectively inputting the task data of the computing task into the idle input cache region according to the memory address.
In the present disclosure, a corresponding modification is made to the slave device so that its hardware logic can read from a memory address directly specified by the host. Specifically, the slave device has a first register in which the memory address (head address) of each memory cache region is recorded, and the host can query the memory address of each input cache region.
Whether an input cache region is free, with no other task data written into it, can be determined by reading the input cache regions cyclically. Specifically, the memory cache regions of the present disclosure are designed as a ring of cache regions. For example, when N = 10 there are 10 cache regions numbered 0 to 9; a host thread reads the cache regions cyclically in sequence, and if the i-th region is occupied it moves on to the (i+1)-th region, wrapping back to region 0 after region 9. A minimal sketch of this search follows.
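The sketch assumes a simple per-region occupancy flag; on real hardware, availability would be tracked through the slave device's registers instead.

```python
# Sketch of cyclically scanning N input cache regions for a free one.
from dataclasses import dataclass

@dataclass
class InputCacheRegion:
    address: int             # head memory address recorded in the slave device register
    occupied: bool = False

def find_free_region(regions, start_index=0):
    n = len(regions)
    for step in range(n):
        i = (start_index + step) % n       # wrap from region N-1 back to region 0
        if not regions[i].occupied:
            return i, regions[i].address
    return None                            # all regions occupied: caller retries later

regions = [InputCacheRegion(address=0x1000 * (i + 1)) for i in range(10)]
regions[0].occupied = True
print(find_free_region(regions))           # -> (1, 8192)
```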
In step S13, if there is an input buffer satisfying the start condition in each of the input buffers, the execution of the calculation task corresponding to the input buffer satisfying the start condition is started.
The method adopts an asynchronous calling mode: after a single host thread finishes inputting task data, that is, after the memory copy to the slave device completes, the computing task can be executed only once the computing resources of the slave device have been released.
In an optional embodiment, if the task data input of the computing task belonging to the input buffer is completed and the computing resource for executing the computing task is not occupied, it is determined that the input buffer meets the start condition. That is, the input buffer may be determined to be eligible for startup if the task data input for the computing task at the input buffer is complete and the computing resources of the slave device are unoccupied.
Referring to fig. 8, which shows an asynchronous call using the memory cache regions, the slave device contains a hardware computing unit; after task data has been input into an input cache region, if the hardware computing unit is still executing another computing task, the new task must wait for the computing resources of the slave device to be released. A global thread lock is therefore created on the host side, which guarantees each host thread exclusive use of the slave device's computing resources while keeping those resources fully utilized.
Specifically, when the global thread lock is held, the slave device is still executing another computing task and the new task must wait; when the global thread lock is free, the slave device is allowed to process the task data that has been written into the input cache region.
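A minimal sketch of the start condition and the host-side global thread lock; `hardware_compute` is a hypothetical stand-in for the slave device's computing unit, not the patent's actual interface.

```python
# Sketch: a region whose data input is complete may start only when the single
# hardware computing unit is free; a global lock enforces exclusive use of it.
import threading

global_compute_lock = threading.Lock()       # guards the slave device's computing unit

def hardware_compute(data):                  # placeholder for the slave device computation
    return [x * 2 for x in data]

def meets_start_condition(region):
    # start condition: task data input finished AND computing resources not occupied
    return region["input_complete"] and not global_compute_lock.locked()

def run_task(region):
    with global_compute_lock:                # blocks while another task is computing
        return hardware_compute(region["data"])

region = {"input_complete": True, "data": [1, 2, 3]}
if meets_start_condition(region):
    print(run_task(region))                  # -> [2, 4, 6]
```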
In an alternative embodiment, the step S13 may include: generating a calculation instruction by adopting the memory address of the input cache region which accords with the starting condition; and reading task data from the input cache region which meets the starting condition according to the memory address in the calculation instruction so as to execute the calculation task and obtain a calculation result.
For an input cache region meeting the starting condition, a computing instruction for the slave device can be generated according to the memory address of the input cache region.
After the computing instruction for the slave device is generated, it is sent to the slave device; the slave device locates the corresponding input cache region according to the memory address carried in the instruction, reads the task data from that cache region to execute the computing task, and obtains the computation result once the task is finished.
In step S14, the calculation result of the completed calculation task is returned to the corresponding client.
In the present disclosure, each input cache region has a corresponding output cache region. When the slave device completes a computing task and obtains the computation result, the result is written into the output cache region and then copied to the memory of the host, where it is split into per-request results and fed back to the corresponding clients. The output of computation results is likewise parallel and non-blocking.
In an optional embodiment, the method further comprises: freeing computing resources for performing the computing task.
In a specific implementation, after the task data of an input cache region has been processed by the slave device, the computing resources used for that computing task can be released, and the state recorded in the slave device's register is changed from occupied to released, indicating that the slave device's computing resources are free and the computing task of the next host thread can be executed.
To make the embodiments of the present disclosure easier to understand, a specific example follows. Fig. 9 shows the interaction workflow between the host side and the acceleration-hardware side (slave device). Note that fig. 9 depicts the workflow of a single host thread interacting with the slave device during that thread's execution; in actual use the host runs a plurality of identical host threads, and although the host uses the global thread lock to guarantee exclusive access to the slave device's computing resources, once a computation completes the resources are released and immediately taken over by another host thread. The first address (memory address) of each input cache region is recorded in a register on the slave device side.
In each host thread, before its loop starts, the memory address of the memory cache region bound to that thread must be acquired. The execution order within the loop is as follows (a minimal sketch of this loop is given after the list):
1. copying the host task data to an input cache region of the slave device;
2. acquiring a global thread lock; if the global thread lock is occupied, the host machine continuously inquires until the global thread lock is obtained;
3. instructing the slave device to execute a computing instruction, where the memory address of the memory cache region bound to the thread is passed into the hardware execution logic as a parameter of the instruction, so that the slave device reads the task data from the corresponding input data cache region at that address and executes the computing task;
4. waiting for a signal of computing resource release, and releasing a global thread lock; specifically, after the computing resources are released, the state data of the register of the slave device is modified, the host can read the state data of the register of the slave device, and when the state data is in a release state, the global thread lock can be released;
5. copying the computation result from the output cache region to the host memory.
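A minimal Python sketch of one iteration of this loop; the SlaveDevice methods are hypothetical stand-ins for a vendor-specific host/slave interface, with a dictionary simulating device memory.

```python
# Sketch of the single host-thread loop following steps 1-5 above.
import threading

global_lock = threading.Lock()                 # exclusive access to the slave computing unit

class SlaveDevice:
    def __init__(self):
        self.memory = {}                       # simulated device memory, keyed by address
        self.released = True                   # release register state
    def copy_to_device(self, address, data):   # step 1: host -> input cache region
        self.memory[address] = list(data)
    def issue_compute(self, address):          # step 3: compute instruction carries the address
        self.released = False
        self.memory[address] = [x * 2 for x in self.memory[address]]
        self.released = True                   # register returns to the released state
    def read_release_register(self):           # step 4: host polls the release register
        return self.released
    def copy_from_device(self, address):       # step 5: output cache region -> host memory
        return self.memory[address]

def host_thread_iteration(device, buffer_address, task_data):
    device.copy_to_device(buffer_address, task_data)
    with global_lock:                          # step 2: acquire the global thread lock
        device.issue_compute(buffer_address)
        while not device.read_release_register():
            pass                               # wait for the resource-release signal
    return device.copy_from_device(buffer_address)

print(host_thread_iteration(SlaveDevice(), 0x1000, [1, 2, 3]))   # -> [2, 4, 6]
```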
By applying the embodiments of the present disclosure, the utilization rate of the computing resources of the server's heterogeneous-computing slave device is guaranteed, computing efficiency is effectively maintained in highly concurrent server scenarios, and service capacity is thereby improved; the situation in which the slave device sits idle because of delayed input data is avoided, and the utilization rate of the slave device is improved. It can be seen that the embodiments of the present disclosure make effective use of the slave device's computing resources, and the stable service throughput depends only on the computing performance of the heterogeneous computing hardware.
In an optional embodiment, the method for processing the computing task further includes: aiming at each computing task, acquiring the sending time of a data processing request corresponding to the computing task; generating a list task sequence according to the sending time; inputting the calculation result of the executed calculation task into the output cache region; and sending the calculation result of the output cache region to a corresponding client according to the list task sequence.
In the present disclosure, to ensure consistency between the output computation results and the clients' data processing requests, each host thread maintains a list task order for its current batch (computing task). Data within a single batch is guaranteed to be first-in first-out, so when the host receives the batched computation results from the slave device, it can split them and find the corresponding client for each result according to the list task order.
Specifically, when the clients' data processing requests in the queue accumulate to a defined threshold, or when the time spent collecting them exceeds a time threshold, the requests are merged into the task data of one computing task in first-in first-out order (for example, by sending time), and a list task order is then generated according to the sending times; the list task order records the clients' identifiers in sequence, and the computation results can be fed back to the corresponding clients through those identifiers.
After the list task sequence is generated, the task data are sent to the slave equipment, and the slave equipment executes according to the first-in first-out sequence during execution, so that the output calculation results after processing are in one-to-one correspondence with the input data processing requests in sequence, and thus after the calculation results are output in batches, the host can split the calculation results and find the corresponding client according to the list task sequence to inform the client that the task is finished.
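A minimal sketch of the list task order: requests are merged in sending-time (first-in first-out) order, the client identifiers are recorded, and the batched result is split back to clients positionally. All names are illustrative.

```python
# Sketch of keeping input and output consistent via a per-batch list task order.
def build_batch(requests):
    # requests: list of (client_id, send_time, data); merge first-in first-out by send time
    ordered = sorted(requests, key=lambda r: r[1])
    task_order = [client_id for client_id, _, _ in ordered]   # the list task order
    task_data = [data for _, _, data in ordered]              # merged task data
    return task_order, task_data

def split_results(task_order, batched_results):
    # the slave device preserves FIFO order, so results map back to clients positionally
    return dict(zip(task_order, batched_results))

order, data = build_batch([("B", 2, 20), ("A", 1, 10), ("C", 3, 30)])
results = [x * 2 for x in data]                               # stand-in for the slave output
print(split_results(order, results))                          # -> {'A': 20, 'B': 40, 'C': 60}
```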
Referring to fig. 10, a schematic diagram of ensuring input and output consistency of a client is shown, taking a single host thread as an example, assuming that in one host thread, task data of a client a, a client B, a client C, and a client D are used as computing tasks of a same batch according to a time sequence, and then a list task sequence is generated according to a sending time of the task data, where the list task sequence records a flag for the client.
When the computing resources of the slave device are idle, the task data in the queue are sequentially processed according to a first-in first-out mode, and then the data (namely the computing result) is output in batches, wherein when the computing result is output by the slave device, the data are output in batches according to the first-in first-out mode, so that after the host receives the computing result, the marks of the corresponding clients can be found in the list task sequence after splitting, and then the corresponding clients are informed of the completion of the session task according to the marks of the clients, so that the clients are guaranteed to receive the computing result corresponding to the task data sent by the clients, and the consistency of input and output is guaranteed.
FIG. 11 is a block diagram illustrating a processing device for a computing task in accordance with an exemplary embodiment. Referring to fig. 11, a plurality of input buffers are provided in the server, and the apparatus includes:
a computation task generation module 111 configured to generate a plurality of computation tasks based on the data processing request sent by the client;
a task data input module 112 configured to determine an input buffer area of each of the computing tasks, and input task data corresponding to the computing tasks into the input buffer areas, respectively;
a computation task starting module 113 configured to start to execute a computation task corresponding to an input cache region meeting a starting condition when the input cache region meeting the starting condition exists in each of the input cache regions;
and a calculation result returning module 114 configured to return the calculation result of the completed calculation task to the corresponding client.
Optionally, the task data input module 112 is configured to: searching an idle input cache region, and reading a memory address corresponding to the idle input cache region; and respectively inputting the task data of the computing task into the idle input cache region according to the memory address.
Optionally, the apparatus further comprises: a starting condition determining module configured to determine that the input cache region meets a starting condition if input of task data of a computing task belonging to the input cache region is completed and a computing resource for executing the computing task is not occupied.
Optionally, the computing task initiation module is configured to: generating a calculation instruction by adopting the memory address of the input cache region which accords with the starting condition; and reading task data from the input cache region which meets the starting condition according to the memory address in the calculation instruction so as to execute the calculation task and obtain a calculation result.
Optionally, the apparatus further comprises: a computing resource release module configured to release computing resources for performing the computing task.
Optionally, the method further comprises: the sending time acquisition module is configured to acquire the sending time of the data processing request corresponding to each computing task; a list task order generation module configured to generate a list task order according to the transmission time.
Optionally, the computation result returning module 114 is configured to input the computation result of the completed computation task into the output cache region; and sending the calculation result of the output cache region to a corresponding client according to the list task sequence.
Optionally, the computing task is a neural network model inference computing task, and the server has parallel acceleration hardware using heterogeneous computing techniques.
FIG. 12 is a block diagram illustrating a processing system for a computing task in accordance with an exemplary embodiment. Referring to fig. 12, the system includes a client 121 and a server 122.
The client 121 is configured to send a data processing request to the server 122;
the server 122, in which a plurality of input cache regions and output cache regions are arranged, where the server 122 is configured to generate a plurality of computing tasks based on a data processing request sent by the client 121; determine an input cache region of each computing task, and respectively input task data corresponding to the computing tasks into the input cache regions; under the condition that an input cache region meeting the starting condition exists in each input cache region, start to execute a calculation task corresponding to the input cache region meeting the starting condition; and return the calculation result of the executed calculation task to the corresponding client 121.
The server 122 optionally includes a host and a slave, wherein,
the host is used for receiving a data processing request sent by the client, generating a plurality of computing tasks based on the received data processing request, determining an input cache region of each computing task, and starting to execute the computing task corresponding to the input cache region meeting the starting condition under the condition that the input cache region meeting the starting condition exists in each input cache region;
the slave device is provided with a hardware computing unit and a plurality of the input buffers, wherein,
the input cache region is used for inputting task data of a corresponding computing task;
the hardware computing unit is used for executing the computing task corresponding to the input cache region meeting the starting condition and returning the computing result of the executed computing task to the host machine.
Optionally, a register is further provided in the slave device, wherein,
the register is used for recording whether the computing resources of the hardware computing unit are released or not and informing the host machine based on the released state of the computing resources of the computing unit;
and the host machine is used for generating a computing instruction for starting and executing the computing task corresponding to the input cache region meeting the starting condition under the condition that the computing resources of the computing unit are released.
Optionally, an output buffer is further provided in the slave device, wherein,
the output cache region is used for caching the calculation result of the executed calculation task and returning the cached calculation result to the host machine;
and the host machine is also used for sequentially returning the calculation results to the client.
With regard to the system in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 13 is a diagram illustrating a computer device, which may be a server, according to an example embodiment, an internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of processing a computing task. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The present disclosure also provides a server, including: a processor; a memory for storing processor-executable instructions; wherein, the processor is configured to execute the instructions to implement the corresponding steps and/or flows in the processing method embodiment of the above computing task.
The present disclosure also provides a storage medium comprising: the instructions in the storage medium, when executed by the processor of the server, enable the server to perform the respective steps and/or flows corresponding to the processing method embodiments of the computing task described above.
The present disclosure also provides a computer program product comprising: computer program code which, when run by a computer, causes the computer to perform the respective steps and/or flows corresponding to the above-described method embodiments of processing of computing tasks.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A processing method of a computing task is applied to a server, and is characterized in that a plurality of input cache regions are arranged in the server, and the method comprises the following steps:
generating a plurality of computing tasks based on a data processing request sent by a client;
determining an input cache region of each computing task, and respectively inputting task data corresponding to the computing tasks into the input cache regions;
under the condition that an input cache region meeting the starting condition exists in each input cache region, starting to execute a calculation task corresponding to the input cache region meeting the starting condition;
and returning the calculation result of the executed calculation task to the corresponding client.
2. The method according to claim 1, wherein the determining an input buffer area for each of the computing tasks and inputting the task data corresponding to the computing task into the input buffer area respectively comprises:
searching an idle input cache region, and reading a memory address corresponding to the idle input cache region;
and respectively inputting the task data of the computing task into the idle input cache region according to the memory address.
3. The method for processing the computing task according to claim 1, wherein before starting execution of the computing task corresponding to the input buffer meeting the starting condition, the method further comprises:
if the task data of the computing task belonging to the input cache region is input completely and the computing resource used for executing the computing task is not occupied, determining that the input cache region meets the starting condition.
4. The method for processing the computing task according to claim 3, wherein the starting and executing the computing task corresponding to the input buffer meeting the starting condition comprises:
generating a calculation instruction by adopting the memory address of the input cache region which accords with the starting condition;
and reading task data from the input cache region which meets the starting condition according to the memory address in the calculation instruction so as to execute the calculation task and obtain a calculation result.
5. The method for processing the computing task according to claim 3, wherein after the returning the computing result of executing the completed computing task to the corresponding client, the method further comprises:
freeing computing resources for performing the computing task.
6. The method according to claim 1, wherein after generating the plurality of computing tasks based on the data processing request sent by the client, the method further comprises:
aiming at each computing task, acquiring the sending time of a data processing request corresponding to the computing task;
and generating a list task sequence according to the sending time.
7. A processing device of computing task, applied to a server, wherein a plurality of input buffers are arranged in the server, the device comprising:
the computing task generating module is configured to generate a plurality of computing tasks based on the data processing request sent by the client;
the task data input module is configured to determine an input cache region of each computing task and input task data corresponding to the computing tasks into the input cache regions respectively;
the computing task starting module is configured to start and execute a computing task corresponding to the input cache region meeting the starting condition under the condition that the input cache region meeting the starting condition exists in each input cache region;
and the calculation result returning module is configured to return the calculation result of the executed calculation task to the corresponding client.
8. A system for processing computing tasks, the system comprising a client and a server, wherein:
the client is used for sending a data processing request to the server;
the server is provided with a plurality of input cache regions and output cache regions and is used for generating a plurality of computing tasks based on data processing requests sent by the client; determining an input cache region of each computing task, and respectively inputting task data corresponding to the computing tasks into the input cache regions; under the condition that an input cache region meeting the starting condition exists in each input cache region, starting to execute a calculation task corresponding to the input cache region meeting the starting condition; and returning the calculation result of the executed calculation task to the corresponding client.
9. A server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a method of processing a computing task as claimed in any one of claims 1 to 6.
10. A storage medium in which instructions, when executed by a processor of a server, enable the server to perform a method of processing a computing task as claimed in any one of claims 1 to 6.
CN201911159702.3A 2019-11-22 2019-11-22 Processing method, device, system, server and storage medium for computing task Active CN110955461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911159702.3A CN110955461B (en) 2019-11-22 2019-11-22 Processing method, device, system, server and storage medium for computing task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911159702.3A CN110955461B (en) 2019-11-22 2019-11-22 Processing method, device, system, server and storage medium for computing task

Publications (2)

Publication Number Publication Date
CN110955461A true CN110955461A (en) 2020-04-03
CN110955461B CN110955461B (en) 2024-01-12

Family

ID=69978310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911159702.3A Active CN110955461B (en) 2019-11-22 2019-11-22 Processing method, device, system, server and storage medium for computing task

Country Status (1)

Country Link
CN (1) CN110955461B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112491963A (en) * 2020-11-03 2021-03-12 泰康保险集团股份有限公司 Data transmission method, device, equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130046811A1 (en) * 2011-08-18 2013-02-21 International Business Machines Corporation Stream processing using a client-server architecture
CN108415771A (en) * 2018-02-01 2018-08-17 深圳市安信智控科技有限公司 Multi-chip distributed parallel computing acceleration system
US10109030B1 (en) * 2016-12-27 2018-10-23 EMC IP Holding Company LLC Queue-based GPU virtualization and management system
CN109376004A (en) * 2018-08-20 2019-02-22 中国平安人寿保险股份有限公司 Data batch processing method, device, electronic equipment and medium based on PC cluster
CN109933429A (en) * 2019-03-05 2019-06-25 北京达佳互联信息技术有限公司 Data processing method, device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130046811A1 (en) * 2011-08-18 2013-02-21 International Business Machines Corporation Stream processing using a client-server architecture
US10109030B1 (en) * 2016-12-27 2018-10-23 EMC IP Holding Company LLC Queue-based GPU virtualization and management system
CN108415771A (en) * 2018-02-01 2018-08-17 深圳市安信智控科技有限公司 Multi-chip distributed parallel computing acceleration system
CN109376004A (en) * 2018-08-20 2019-02-22 中国平安人寿保险股份有限公司 Data batch processing method, device, electronic equipment and medium based on PC cluster
CN109933429A (en) * 2019-03-05 2019-06-25 北京达佳互联信息技术有限公司 Data processing method, device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112491963A (en) * 2020-11-03 2021-03-12 泰康保险集团股份有限公司 Data transmission method, device, equipment and readable storage medium
CN112491963B (en) * 2020-11-03 2023-11-24 泰康保险集团股份有限公司 Data transmission method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN110955461B (en) 2024-01-12

Similar Documents

Publication Publication Date Title
CN108647104B (en) Request processing method, server and computer readable storage medium
WO2019223596A1 (en) Method, device, and apparatus for event processing, and storage medium
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
CN109783255B (en) Data analysis and distribution device and high-concurrency data processing method
CN111338769B (en) Data processing method, device and computer readable storage medium
CN113515320A (en) Hardware acceleration processing method and device and server
CN116069493A (en) Data processing method, device, equipment and readable storage medium
CN115878301A (en) Acceleration framework, acceleration method and equipment for database network load performance
CN110955461A (en) Processing method, device and system of computing task, server and storage medium
CN115391053B (en) Online service method and device based on CPU and GPU hybrid calculation
WO2023160484A1 (en) Image processing method, related apparatus and system
CN111831408A (en) Asynchronous task processing method and device, electronic equipment and medium
CN111443898A (en) Method for designing flow program control software based on priority queue and finite-state machine
CN114741166A (en) Distributed task processing method, distributed system and first equipment
CN114741165A (en) Processing method of data processing platform, computer equipment and storage device
CN113076180B (en) Method for constructing uplink data path and data processing system
CN113296972A (en) Information registration method, computing device and storage medium
EP4191413A1 (en) Message management method, device, and serverless system
CN111416872A (en) High-speed cache file system communication method and system based on MP and RDMA
CN113923212B (en) Network data packet processing method and device
CN113395302B (en) Asynchronous data distributor, related apparatus and method
CN111782482B (en) Interface pressure testing method and related equipment
CN116010126B (en) Service aggregation method, device and system
US11467836B2 (en) Executing cross-core copy instructions in an accelerator to temporarily store an operand that cannot be accommodated by on-chip memory of a primary core into a secondary core
CN111143078B (en) Data processing method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant