CN116431315B - Batch processing task processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116431315B
CN116431315B
Authority
CN
China
Prior art keywords
subtask
task
batch
list
access address
Prior art date
Legal status
Active
Application number
CN202310665911.5A
Other languages
Chinese (zh)
Other versions
CN116431315A (en)
Inventor
杨媛静
刘军
王鸥
段茗
Current Assignee
Chengdu Denglin Technology Co ltd
Original Assignee
Chengdu Denglin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Denglin Technology Co ltd filed Critical Chengdu Denglin Technology Co ltd
Priority to CN202310665911.5A
Publication of CN116431315A
Application granted
Publication of CN116431315B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a method and device for processing batch processing tasks, electronic equipment, and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring a subtask list corresponding to a total task to be executed, where the subtask list comprises a plurality of subtasks, the sum of whose task amounts is consistent with the task amount of the total task, and different subtasks correspond to different batch sizes; determining a base coordinate and a relative coordinate of each subtask in the subtask list along the batch dimension, and determining the access address of the instructions in the subtask from the base coordinate and the relative coordinate; and distributing the instructions carrying access addresses contained in each subtask in the subtask list to an AI chip, so that hardware devices in the AI chip execute those instructions and acquire the corresponding data from the access addresses to complete the corresponding subtasks. The method improves the processing efficiency of AI computing tasks and avoids errors when the total task amount is executed as a combination of multiple subtasks.

Description

Batch processing task processing method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for processing batch processing tasks, electronic equipment and a storage medium.
Background
Typically, the execution batch and the compilation batch of an AI (Artificial Intelligence) network differ in size. A maximum batch size (max batch) is provided at compile time, and the compiler generates execution files containing task amounts for different batch sizes based on this max batch. At run time, the user operates the AI network by submitting an execution batch whose size is less than or equal to max batch. So that a task amount matching the submitted execution batch can always be found in the execution files generated by the compiler, the compiler generally has to generate, in advance, execution files covering every batch size from 1 batch to max batch. For example, if max batch is 5 batch, execution files containing task amounts of 1 batch, 2 batch, 3 batch, 4 batch, and 5 batch must all be generated. This approach leads to long compilation times, and the larger the value of max batch, the more task variants must be generated at compile time. If max batch is 2000 batch, execution files covering 2000 different batch sizes are needed, and the disk may be unable to store execution files occupying so much space.
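A small sketch makes the growth in compiled variants concrete. The naive count follows the scheme described above; the power-of-two tiling is purely an illustrative alternative (an assumption, not something the text specifies):

```python
import math

def variants_naive(max_batch):
    """One execution file per batch size 1..max_batch (scheme described above)."""
    return max_batch

def variants_power_of_two(max_batch):
    """Illustrative alternative: compile only power-of-two tile sizes and
    compose any total from them (assumption for comparison only)."""
    return math.floor(math.log2(max_batch)) + 1
```

For max batch = 2000, the naive scheme needs 2000 compiled variants, while a power-of-two tiling would need only 11.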
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, an apparatus, an electronic device, and a storage medium for processing batch processing tasks, so as to solve the problems of low processing efficiency and large storage-space requirements that arise when executing the computing tasks corresponding to an AI network.
Embodiments of the present application are implemented as follows:
In a first aspect, an embodiment of the present application provides a method for processing a batch processing task, including: acquiring a subtask list corresponding to a total task to be executed, where the subtask list comprises a plurality of subtasks selected from the subtasks of different batch sizes in an execution file, the sum of their task amounts is consistent with the task amount of the total task to be executed, and different subtasks correspond to different batch sizes; determining a base coordinate and a relative coordinate of each subtask in the subtask list along the batch dimension, and determining the access address of the instructions in the subtask according to the base coordinate and the relative coordinate; and distributing the instructions carrying access addresses contained in each subtask in the subtask list to an AI chip, so that a hardware device in the AI chip executes those instructions and acquires the corresponding batch task data from the access addresses to complete the corresponding subtask.
In the embodiment of the application, the total task is pieced together from a plurality of subtasks, so that at compile time there is no need to generate execution files covering every batch size; only execution files containing the specified batch sizes need to be generated. This greatly improves processing efficiency, and the reduced number of execution files saves considerable storage space. Meanwhile, after the subtask list corresponding to the total task is obtained, the access address of the instructions in each subtask is determined along the batch dimension (that is, the absolute coordinates, within all the batch data required by the total task, of the batch data required by each subtask are determined), so that each subtask acquires the correct batch task data from its access address. This ensures that the total task amount can be executed as a combination of multiple subtasks without errors.
With reference to a possible implementation manner of the first aspect embodiment, determining, from the batch dimension, the base coordinate and the relative coordinate of each subtask in the subtask list, and determining the access address of the instructions in the subtask according to the base coordinate and the relative coordinate, includes: determining the base coordinate and the relative coordinate of the instructions in each subtask in the subtask list from the batch dimension according to a preset strategy, where the preset strategy comprises calculation formulas for determining the base coordinate and the relative coordinate of the instructions in each subtask in the subtask list; and determining the access address of the instructions in each subtask in the subtask list according to the base coordinate and the relative coordinate of the instructions in that subtask.
In the embodiment of the application, the base coordinate and the relative coordinate of the instructions in each subtask in the subtask list are determined according to the calculation formulas contained in the preset strategy, and the access address is then determined from them, for example as access address = base coordinate + relative coordinate. In this way the access address of the instructions in each subtask can be determined rapidly, which improves the flexibility of the scheme.
With reference to a possible implementation manner of the embodiment of the first aspect, the preset strategy further includes the number of hardware devices executing the subtask list and the execution strategy, where different execution strategies correspond to different calculation formulas for the base coordinate and the relative coordinate.
In the embodiment of the application, setting the number of hardware devices determines whether the subtask list is executed by one hardware device or by multiple hardware devices in parallel. Different execution modes can be selected by changing the execution strategy, and different execution modes correspond to different calculation formulas, which improves the flexibility of the scheme.
With reference to one possible implementation manner of the embodiment of the first aspect, if the number of hardware devices is one, the base coordinate and the relative coordinate may be calculated as:

base coordinate = Σ_{i=1}^{n-1} (Ai × Bi);  relative coordinate = An × Bn

where Ai represents the batch size of the i-th subtask type in the subtask list, Bi represents the number of times the i-th subtask type has been executed, and n represents the index of the subtask type whose access address is to be determined.
In the embodiment of the application, when each subtask in the subtask list is executed by one hardware device, this calculation formula allows the access address of the instructions in each subtask to be determined quickly and accurately, ensuring that the total task amount is executed with multiple subtasks without errors.
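A minimal sketch of such a single-device address computation, assuming the subtask types run in order, with the base coordinate being the batches consumed by earlier subtask types and the relative coordinate being batch size times the executions already done for the current type (one plausible reading of the scheme; the helper name is hypothetical):

```python
def single_device_addresses(tiles):
    """tiles: list of (batch_size, repeat_count) pairs in execution order.
    Returns half-open batch-index ranges [start, end), one per execution:
    base  = batches consumed by earlier subtask types,
    relative = batch_size * executions already done for this type (Bi)."""
    ranges, base = [], 0
    for size, repeats in tiles:
        for b in range(repeats):          # b plays the role of Bi
            start = base + size * b       # base coordinate + relative coordinate
            ranges.append((start, start + size))
        base += size * repeats            # this type is finished; advance the base
    return ranges
```

For the 12-batch example in Table 1 (tile1 of 2 batch run 4 times, tile2 of 1 batch run 4 times), `single_device_addresses([(2, 4), (1, 4)])` covers batch0 through batch11 exactly once.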
With reference to one possible implementation manner of the embodiment of the first aspect, if the number of hardware devices is m, where m is an integer greater than or equal to 2, and the execution strategy requires that the addresses accessed by each hardware device across its multiple accesses are consecutive, the base coordinate and the relative coordinate may be calculated as:

base coordinate = idx × avgbatch + Σ_{i=1}^{n-1} (Ai × Bi);  relative coordinate = An × Bn

where Ai represents the batch size of the i-th subtask type in the subtask list, Bi represents the number of times the i-th subtask type has been executed, n represents the index of the subtask type whose access address is to be determined, avgbatch is the total batch size of all subtasks that each hardware device is required to execute, and idx represents the number x of the hardware device currently preparing to execute the subtask, where x takes values in [0, m-1].
In the embodiment of the application, this calculation formula quickly and accurately determines the access address of the instructions in each subtask when the subtask list is executed by multiple hardware devices, while ensuring that the addresses accessed by each hardware device across its multiple accesses are consecutive. This improves access and storage efficiency and balances the load across the hardware devices.
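As a sketch of the contiguous-per-device layout, assuming each device runs the same tile schedule over its own slice of avgbatch batches starting at idx × avgbatch (an assumed layout consistent with the description; the function name is hypothetical):

```python
def contiguous_device_addresses(per_device_tiles, avgbatch, idx):
    """Addresses for device number idx when each device must touch one
    contiguous slice of the batch axis: its whole schedule is the
    single-device schedule shifted by idx * avgbatch."""
    ranges, base = [], idx * avgbatch
    for size, repeats in per_device_tiles:
        for b in range(repeats):
            start = base + size * b       # base coordinate + relative coordinate
            ranges.append((start, start + size))
        base += size * repeats
    return ranges
```

With 12 batch split over 4 devices (avgbatch = 3, each device running one 2-batch tile and one 1-batch tile), device 1 touches batches 3, 4, 5: one contiguous slice.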
With reference to one possible implementation manner of the embodiment of the first aspect, if the number of hardware devices is m, where m is an integer greater than or equal to 2, and the execution strategy requires that the addresses accessed by the different hardware devices within one access are consecutive, the base coordinate and the relative coordinate may be calculated as:

base coordinate = m × Σ_{i=1}^{n-1} (Ai × Bi) + idx × An;  relative coordinate = m × An × Bn

where Ai represents the batch size of the i-th subtask type in the subtask list, Bi represents the number of times the i-th subtask type has been executed, and n represents the index of the subtask type whose access address is to be determined; idx represents the number x of the hardware device currently preparing to execute the subtask, where x takes values in [0, m-1].
In the embodiment of the application, this calculation formula quickly and accurately determines the access address of the instructions in each subtask when the subtask list is executed by multiple hardware devices, while ensuring that the addresses accessed by the different hardware devices within one access are consecutive. Because this mode tolerates unbalanced task amounts across the hardware devices, the total task does not need to be adjusted upward beforehand, which reduces the total amount of data to be executed and avoids processing the extra task amount that an upward adjustment would introduce.
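A sketch of this interleaved layout, assuming that in each round the m devices execute the same subtask type and their slices sit side by side, so a round of a size-A subtask covers m × A consecutive batches and device idx takes the slice starting at idx × A (one plausible reading; the function name is hypothetical):

```python
def interleaved_device_addresses(tiles, m, idx):
    """Addresses for device idx when, within one round, the m devices'
    addresses are consecutive.  tiles: (batch_size, repeat_count) pairs
    giving the rounds each device executes, in order."""
    ranges, base = [], 0
    for size, repeats in tiles:
        for b in range(repeats):
            # round base grows by m * size per completed round (b == Bn);
            # device offset within the round is idx * size
            start = base + m * size * b + idx * size
            ranges.append((start, start + size))
        base += m * size * repeats        # all m devices finished this type
    return ranges
```

For 12 batch on m = 4 devices (one 2-batch round, then one 1-batch round), the tile1 round covers batches 0 through 7 with the four devices' slices adjacent, and the tile2 round covers batches 8 through 11.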
In one possible implementation manner of the embodiment of the first aspect, the total task to be executed is either the original task or the original task after upward adjustment: if the original task is to be executed by multiple hardware devices and cannot be divided evenly among them, the size of the original task is adjusted upward so that the adjusted task divides evenly across the hardware devices.
In the embodiment of the application, if the original task is executed by multiple hardware devices and cannot be distributed evenly among them, adjusting its size upward allows the adjusted task to be distributed evenly, so that every hardware device executes the same amount of computation. This makes full use of the hardware resources, and because each hardware device then takes the same time to execute its share, synchronization control is simplified.
In a second aspect, an embodiment of the present application further provides a device for processing batch processing tasks, including an acquisition module, a processing module, and a sending module. The acquisition module is used for acquiring a subtask list corresponding to the total task to be executed; the subtask list comprises a plurality of subtasks selected from the subtasks of different batch sizes in an execution file, the sum of their task amounts is consistent with the task amount of the total task, and different subtasks correspond to different batch sizes. The processing module is used for determining the base coordinate and the relative coordinate of each subtask in the subtask list along the batch dimension and determining the access address of the instructions in the subtask according to the base coordinate and the relative coordinate. The sending module is used for distributing the instructions carrying access addresses contained in each subtask in the subtask list to the AI chip, so that hardware devices in the AI chip execute those instructions and acquire the corresponding batch task data from the access addresses to complete the corresponding subtask.
In a third aspect, an embodiment of the present application further provides a processor, including a core and a transceiver. The core is used for acquiring a subtask list corresponding to a total task to be executed, determining the base coordinate and the relative coordinate of each subtask in the subtask list along the batch dimension, and determining the access address of the instructions in the subtask according to the base coordinate and the relative coordinate; the subtask list comprises a plurality of subtasks selected from the subtasks of different batch sizes in an execution file, the sum of their task amounts is consistent with the task amount of the total task, and different subtasks correspond to different batch sizes. The transceiver is used for distributing the instructions carrying access addresses contained in each subtask in the subtask list to the AI chip, so that hardware devices in the AI chip execute those instructions and acquire the corresponding batch task data from the access addresses to complete the corresponding subtask.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor connected to the memory. The memory is used for storing a program; the processor is configured to invoke the program stored in the memory to perform the method provided by the embodiment of the first aspect and/or any possible implementation manner thereof.
In a fifth aspect, embodiments of the present application further provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided by the embodiment of the first aspect and/or any of its possible implementation manners.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
the method is beneficial to shortening the pre-compiling time and reducing the storage space occupied by an execution file, and can determine the absolute coordinates of batch data required by each subtask in the subtask list in the batch data required by completing the total task to be executed under the condition that a part of subtasks are selected based on the execution file to form the subtask list so as to complete the total task to be executed, so that the corresponding batch task data are acquired from the corresponding access address to complete the corresponding subtask, and therefore, the total task quantity can be executed by utilizing a plurality of subtasks without errors.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. The above and other objects, features and advantages of the present application will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the several views of the drawings. The drawings are not intended to be drawn to scale, with emphasis instead being placed upon illustrating the principles of the application.
Fig. 1 is a schematic flow chart of a method for processing a batch task according to an embodiment of the present application.
Fig. 2 is a flow chart illustrating a processing method of another batch processing task according to an embodiment of the present application.
FIG. 3 is a block diagram of an apparatus for processing batch processing tasks according to an embodiment of the present application.
Fig. 4 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
It should be noted that like reference numerals and letters denote like items in the following figures; once an item is defined in one figure, it need not be defined or explained again in subsequent figures. In the description of the present application, relational terms such as "first" and "second" are used solely to distinguish one description from another and do not necessarily require or imply any actual relationship or order. Furthermore, the term "and/or" in the present application merely describes an association relationship, indicating that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone.
At present, when an AI network is compiled, execution files containing task amounts of every batch size (1 batch to max batch) are generated sequentially according to max batch, so that when the computing task corresponding to the AI network is executed, a task amount (tile) whose batch size matches the execution batch (i.e., the total task amount actually to be executed) can be found in the execution file generated by the compiler. This greatly increases compilation time, produces very large execution files, results in low processing efficiency, and may leave the disk unable to store the large execution files.
Therefore, the embodiments of the present application provide a processing method, a device, a processor, an electronic device, and a storage medium for batch processing tasks, to address the low processing efficiency and the large storage space required by compiled execution files when executing the computing tasks corresponding to an AI network. Based on the principle of the application, assuming max batch is 50 batch, only execution files containing task amounts of specified batch sizes need to be generated at compile time, for example execution files containing subtask amounts of 1 batch, 2 batch, 4 batch, 10 batch, and so on, without sequentially generating execution files for all 50 batch sizes from 1 batch to 50 batch. In this way, when the computing task corresponding to the AI network is executed, a subtask list corresponding to the total task amount is acquired according to the total task to be executed (which may be uploaded by the user); the subtask list comprises a plurality of subtasks whose task amounts sum to the task amount of the total task. That is, no matter how large an execution batch the user actually specifies, the required total task amount can be pieced together from subtasks of different batch sizes in the execution file, and the instructions of each subtask making up the total task amount are distributed to hardware devices in the AI chip for execution.
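A rough sketch of how a total task might be pieced together from compiled tiles. The largest-first greedy policy below is an assumption for illustration; the text does not specify how subtasks are selected:

```python
def split_total_task(total_batch, tile_sizes):
    """Pick subtask tiles whose batch sizes sum exactly to total_batch,
    greedily using the largest compiled tile first.  Assumes a 1-batch
    tile exists so that every total can be pieced together."""
    remaining = total_batch
    chosen = []
    for size in sorted(tile_sizes, reverse=True):
        count, remaining = divmod(remaining, size)  # how many of this tile fit
        chosen += [size] * count
    assert remaining == 0, "no exact decomposition"
    return chosen
```

With compiled tiles of 1, 2, 4, and 10 batch, a 50-batch total becomes five 10-batch subtasks, and a 37-batch total becomes three 10-batch subtasks plus one each of 4, 2, and 1 batch.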
Different subtasks correspond to different task amounts; that is, the batch processing tasks corresponding to different subtasks differ in size, and the size of a subtask indicates how many batches of task amount it contains.
Generally, AI networks process the data sets corresponding to a total task in batches (such as batch training or batch inference); therefore, in the AI field, task amounts are usually described in terms of the batch task size, referred to as the batch size. The batch size is also a common parameter in machine learning, representing the number of samples used to train an AI network during training: a training data set is typically divided into multiple batches of training data, and a batch of sample data to be processed may be, for example, the number of images to be processed at one time. The data processed by an AI network has dimension attributes, such as a batch dimension. Dimensions are also called axes; for example, AI networks typically process Tensor data, and when tensor data with only one batched dimension is processed, there is only one axis in the batch processing, which may be the batch axis, for example the axis used to represent the number of images.
In some application scenarios, various task data (for example, two-dimensional pictures, three-dimensional pictures, video, or text to be processed) may be processed through an AI network. The compiler compiles the operations in the AI network or its sub-networks into multiple small templates (i.e., the subtasks of different task amounts available for selection in the execution file) that can be invoked by the driver, each template corresponding to a batch data shape of a different size. When the AI network formally runs, the driver selects suitable templates from the compiled templates according to the size of the pictures or videos actually to be processed (i.e., the total task amount provided or specified by the user) and pieces together the data amount actually to be processed (corresponding to obtaining a subtask list). Before each template (i.e., a subtask containing instructions) is sent to the hardware for execution, the coordinates for that execution (which describe the position and access address of the data to be processed by the subtask) are configured; after the hardware device obtains the computing instructions in the template, it can compute on the data (such as images or videos) at the received or determined positions and provide the processing result to the user through the driver. For the division of task templates, if a whole picture is taken as the partition granularity and no single picture is split, only the dimension representing the number of images needs to be considered.
Because an AI network executes tasks in batches of at least 1 batch, when a data processing task is executed, the total task amount must be executed along the batch dimension (which may also be called the batch processing task dimension) according to the selected subtask amounts. To ensure that the total task can still be executed accurately after being pieced together from multiple subtasks, after the subtask list corresponding to the total task is obtained, the access address of the instructions in each subtask is determined along the batch dimension; that is, the absolute coordinates, within all the batch data required by the total task, of the batch data required by each subtask are determined. The instructions carrying access addresses contained in each subtask in the subtask list are then distributed to the AI chip, so that hardware devices in the AI chip execute those instructions and complete the corresponding subtask based on the access addresses (for example, the hardware device directly computes on the corresponding image data according to the instructions carrying the access addresses, or indirectly implements the processing of that data). In this way, execution scheduling of each subtask in the subtask list is achieved, ensuring that the total task amount is executed with multiple subtasks without errors.
To facilitate understanding of the solution provided by the embodiments of the present application, the method for processing a batch processing task is described below with reference to Fig. 1. The method can be applied to any scenario that needs to run an AI network for data processing and can run on a processor, which may be a homogeneous processor or a heterogeneous processor. The processor may be any conventional processor, such as a central processing unit (CPU).
S1: Acquire a subtask list corresponding to the total task to be executed.
The required subtasks can be selected from the subtasks of different batch sizes in the pre-generated execution file to obtain a subtask list corresponding to the total task to be executed. The subtask list comprises a plurality of subtasks, the sum of whose task amounts is consistent with the task amount of the total task, and different subtasks correspond to different batch sizes.
For example, assuming that the total task to be executed is 12 batch, the access addresses corresponding to the access data required for completing the 12-batch task are denoted as batch0 to batch11. In some application scenarios, the total task may be composed of two subtasks, tile1 and tile2, where the task amount of tile1 is 2 batch, the task amount of tile2 is 1 batch, and each subtask needs to be executed 4 times. The corresponding subtask list is shown in Table 1, and batch0 to batch11 in Table 1 may be regarded as absolute coordinates.
TABLE 1
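Since the rows of Table 1 follow directly from the description above, the subtask list can also be reconstructed programmatically. The following is an illustrative sketch (the function name and tuple layout are assumptions for illustration, not part of the patent); each entry records a subtask name and the absolute batch coordinates it covers.

```python
# Illustrative sketch: build the Table 1 subtask list for a 12-batch total
# task from a 2-batch tile1 and a 1-batch tile2, each executed 4 times.
# The batch indices are absolute coordinates within the total task.

def build_subtask_list(tiles):
    """tiles: list of (name, batch_size, times_executed)."""
    subtasks, offset = [], 0
    for name, size, times in tiles:
        for _ in range(times):
            # each run covers [offset, offset + size - 1] in absolute batches
            subtasks.append((name, offset, offset + size - 1))
            offset += size
    return subtasks

table1 = build_subtask_list([("tile1", 2, 4), ("tile2", 1, 4)])
# tile1 runs cover batch0..batch7, tile2 runs cover batch8..batch11
```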
The total task to be executed can be executed by one hardware device in the AI chip, or by a plurality of hardware devices in the AI chip at the same time to improve execution efficiency. For example, if the subtasks in Table 1 above are executed by 1 hardware device, 8 executions are needed (tile1 executed 4 times and tile2 executed 4 times). If they are executed in parallel by 4 hardware devices, only 2 executions per device are needed (tile1 executed once and tile2 executed once).
The number of hardware devices performing the computing task may be manually configured; for example, the number of hardware devices needed to run the network model may be determined according to the task amount of the total task to be executed: when the total task amount is small, invoking one hardware device may be sufficient, and when it is large, multiple hardware devices may need to run in parallel.
When the total task to be executed is executed by a plurality of hardware devices, the task amounts executed by the hardware devices may be the same or different. For example, assuming that the total task to be executed is 31 batch and is executed by 4 hardware devices, 3 hardware devices may each execute a task amount of 8 batch while one hardware device executes a task amount of 7 batch.
If the original task is executed by a plurality of hardware devices and cannot be evenly distributed to each hardware device, the size of the original task can be adjusted upward according to the number of hardware devices, so that the adjusted task is evenly distributed. For example, if the original task of 31 batch is executed by 4 hardware devices, since it cannot be equally divided among them, it can be adjusted upward to 32 batch, so that the task amount to be executed by each hardware device is 8 batch.
It will be appreciated that when deciding whether to adjust the size of the total task upward, factors such as whether execution efficiency is improved after the adjustment may be considered; if a positive benefit is achieved, the total task may be adjusted upward. Positive benefits of the adjustment may include, for example, balancing the load across the hardware devices and/or reducing the number of times the hardware devices must be started.
Suppose the subtasks available in the execution file for composing the total task are a 4-batch subtask (tile0), a 2-batch subtask (tile1), and a 1-batch subtask (tile2). When the 31-batch total task is distributed to 4 hardware devices without being adjusted upward to 32 batch, the task cannot be evenly distributed: for example, 3 hardware devices are each assigned an 8-batch computing task, and 1 hardware device is assigned a 7-batch computing task. For an 8-batch computing task, two 4-batch tile0 subtasks can be selected, so the minimum number of times the hardware device is started is 2; for a 7-batch computing task, even if the combination of one 4-batch tile0, one 2-batch tile1, and one 1-batch tile2 is selected, the minimum number of times the hardware device is started is 3. It can be seen that adjusting the total task upward can bring positive benefits, such as reducing the number of times the hardware devices are started.
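The launch-count comparison above can be checked with a small sketch. This assumes a greedy composition (largest tile first), which is an assumption for illustration rather than the patent's stated selection rule, but it reproduces the counts in the example for tile sizes 4, 2, and 1.

```python
# Hypothetical sketch: minimum hardware launches needed to cover `total`
# batches using tile sizes 4, 2, 1, composing greedily from the largest tile.
def min_launches(total, tile_sizes=(4, 2, 1)):
    launches = 0
    for size in sorted(tile_sizes, reverse=True):
        launches += total // size   # how many tiles of this size fit
        total %= size               # remaining batches to cover
    return launches

# 31 batch on 4 devices without up-adjustment: three devices take 8 batch
# (2 launches each), one takes 7 batch (3 launches); with up-adjustment to
# 32 batch, every device takes 8 batch (2 launches each).
without_adjustment = 3 * min_launches(8) + min_launches(7)  # 3*2 + 3 = 9
with_adjustment = 4 * min_launches(8)                       # 4*2 = 8
```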
When compiling the AI network, the various operations included in the AI network (such as convolution, pooling, activation, normalization, and classification) are compiled into execution files that can be executed by the target hardware device. An execution file includes a plurality of subtasks (tiles) of specified batch sizes, and each subtask includes the instructions required to complete it. After the AI network is compiled, the instructions of the subtasks in the resulting execution file can be provided to the corresponding hardware device for execution whenever the network model corresponding to the execution file needs to be run. The execution file (executable) contains a computing task, and the computing task contains the instructions required for the computation; these instructions can be executed by the corresponding hardware devices in the AI chip to complete the computing task. The AI chip may be a dedicated computing accelerator chip (or accelerator) designed to take on heavy computing tasks, such as a graphics processor (Graphics Processing Unit, GPU) or a tensor processor (Tensor Processing Unit, TPU), although other processors oriented to AI computing tasks are also possible.
Alternatively, one AI chip may include a plurality of hardware devices, and a hardware device in the AI chip that can be used to perform a computing task corresponding to the subtask list may be referred to as a target computing device. Alternatively, one hardware device may include a plurality of hardware execution units, and one hardware device in the AI chip may also be regarded as a computing cluster including a plurality of hardware execution units. The number of hardware execution units contained by different types of hardware devices may be different and the variety may also be different.
S2: an access address for an instruction in each subtask in the subtask list is determined from the batch dimension.
After the subtask list corresponding to the total task to be executed is obtained, the access address of the instruction in each subtask in the subtask list can be determined from the batch dimension, so that the corresponding batch processing task data can be accurately obtained from the access address to complete the corresponding subtask.
Alternatively, the implementation procedure of S2 may be: the base coordinates and the relative coordinates of each subtask in the subtask list are determined from the batch dimension, and the access address of the instruction in the subtask is determined according to the base coordinates and the relative coordinates, for example, the base coordinates and the relative coordinates of the instruction in each subtask in the subtask list can be determined from the batch dimension according to a preset strategy, and the access address of the instruction in the subtask is determined according to the base coordinates and the relative coordinates of the instruction in the subtask for each subtask in the subtask list. For example, access address=base coordinates+relative coordinates.
The preset strategy is a strategy formulated in advance, and may include: and a calculation formula for determining the base coordinates and the relative coordinates of the instructions in each subtask in the subtask list. In addition, the method can also comprise the number of hardware devices for executing the subtask list and corresponding execution strategies, wherein the calculation formulas of the base coordinates and the relative coordinates corresponding to different execution strategies are different, so that different execution modes are corresponding.
It will be appreciated that in some embodiments the preset policy may not include the number of hardware devices executing the subtask list and the corresponding execution policy, so the above should not be construed as the only possible form of the preset policy. Adding these two parameters makes the scheme more flexible: by setting the number of hardware devices, the subtask list can be executed by multiple hardware devices, and by setting the execution policy, different execution modes can be selected.
The number of hardware devices executing the subtasks in the subtask list may differ, and the corresponding execution policy differs accordingly. For example, if one hardware device executes the subtasks in the subtask list, the base coordinates and relative coordinates may be calculated (in a form consistent with the worked example below) as:

base coordinate of the n-th type of subtask = initial value + Σ (i = 1 to n−1) Ai × Ci

relative coordinate = An × Bn

Wherein Ai represents the batch size of the i-th type of subtask in the subtask list, Bi represents the number of times the i-th type of subtask has already been executed, Ci represents the total number of times the i-th type of subtask is executed, and n represents the type number of the subtask whose access address is to be determined, the maximum value of n being the number of subtask types in the subtask list.
For better understanding, the following description is given in connection with Table 1. The subtask list of Table 1 involves 2 types of subtasks (tile1 and tile2). For tile1, the type number in the subtask list is 1, and the corresponding base coordinate is the initial value, for example batch0. For tile2, the type number in the subtask list is 2; because tile1 has a batch size of 2 and has been executed 4 times before tile2 is executed, the base coordinate corresponding to tile2 is batch0 (initial value) + batch8 = batch8.
When the 2-batch tile1 is executed for the 1st time (the 1st tile1 in Table 1), the base coordinate corresponding to tile1 is batch0, Ai is 2 and Bi is 0, so the relative coordinate is 2×0 = batch0 and the access address is batch0 (base coordinate) + batch0 (relative coordinate) = batch0; since tile1 has a size of 2 batch, batch0 to batch1 are executed this time.

When tile1 is executed for the 2nd time (the 2nd tile1 in Table 1), Bi is 1, so the relative coordinate is 2×1 = batch2 and the access address is batch0 + batch2 = batch2; batch2 to batch3 are executed this time.

When tile1 is executed for the 3rd time (the 3rd tile1 in Table 1), Bi is 2, so the relative coordinate is 2×2 = batch4 and the access address is batch0 + batch4 = batch4; batch4 to batch5 are executed this time.

When tile1 is executed for the 4th time (the 4th tile1 in Table 1), Bi is 3, so the relative coordinate is 2×3 = batch6 and the access address is batch0 + batch6 = batch6; batch6 to batch7 are executed this time.

At this point, the access coordinates of the four tile1 entries in Table 1 have been determined. Next, the access coordinates of tile2 are determined.

When the 1-batch tile2 is executed for the 1st time (the 1st tile2 in Table 1), the base coordinate corresponding to tile2 is batch8, Ai is 1 and Bi is 0, so the relative coordinate is batch0 and the access address is batch8 + batch0 = batch8; since tile2 has a size of 1 batch, batch8 is executed this time.

When tile2 is executed for the 2nd time (the 2nd tile2 in Table 1), Bi is 1, so the relative coordinate is batch1 and the access address is batch8 + batch1 = batch9; batch9 is executed this time.

When tile2 is executed for the 3rd time (the 3rd tile2 in Table 1), Bi is 2, so the relative coordinate is batch2 and the access address is batch8 + batch2 = batch10; batch10 is executed this time.

When tile2 is executed for the 4th time (the 4th tile2 in Table 1), Bi is 3, so the relative coordinate is batch3 and the access address is batch8 + batch3 = batch11; batch11 is executed this time.

At this point, the access coordinates of all four tile1 entries and all four tile2 entries in Table 1 have been determined. The corresponding data can be acquired in turn according to these access coordinates to execute the corresponding subtasks, thereby executing the corresponding total task.
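The single-device address calculation walked through above can be sketched as follows. This is an illustrative sketch consistent with the worked example (the function name and return layout are assumptions): the relative coordinate is Ai × Bi, and the base coordinate of each subtask type is the initial value plus the batches consumed by all earlier types.

```python
# Sketch of single-device access-address calculation from the batch dimension.
def single_device_addresses(tiles):
    """tiles: list of (name, Ai, times). Returns (name, access_address) per run."""
    addresses, base = [], 0
    for name, ai, times in tiles:
        for bi in range(times):           # Bi = number of already-completed runs
            addresses.append((name, base + ai * bi))  # address = base + relative
        base += ai * times                # base coordinate for the next type
    return addresses

addrs = single_device_addresses([("tile1", 2, 4), ("tile2", 1, 4)])
# tile1 at batch0, 2, 4, 6; tile2 at batch8, 9, 10, 11 — matching the walkthrough
```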
The above describes the case where every subtask in the subtask list is executed by only one hardware device. If the number of hardware devices executing the subtasks in the subtask list is m (m being an integer greater than or equal to 2), the corresponding execution policy may be either of the following: one execution policy requires that the access addresses of the multiple accesses by each hardware device be consecutive (referred to as longitudinally continuous), and the other requires that the access addresses across the individual hardware devices within one access be consecutive (referred to as laterally continuous).
For better understanding, again taking Table 1 above as an example, assume that the 4 hardware devices executing the subtasks in Table 1 are denoted hardware device 0, hardware device 1, hardware device 2, and hardware device 3. In this case, each hardware device only needs to execute the 2-batch tile1 once and the 1-batch tile2 once. The relationship between each hardware device and the subtasks it executes may be as shown in Table 2 or Table 3.
TABLE 2
For Table 2, the access addresses of the multiple accesses by each hardware device are consecutive, i.e., longitudinally continuous. For example, for hardware device 0, the access addresses of the subtasks it accesses across its two executions are consecutive, running from batch0 to batch2.
TABLE 3
For table 3, the access addresses between the individual hardware devices in one access are consecutive, i.e. laterally consecutive.
For longitudinal continuity, the corresponding base coordinates and relative coordinates may be calculated (in a form consistent with the worked example below) as:

base coordinate of the n-th type of subtask = initial value + Σ (i = 1 to n−1) Ai × Di

relative coordinate = An × Bn + avgbatch × idx

wherein Ai represents the batch size of the i-th type of subtask in the subtask list, Bi represents the number of times the i-th type of subtask has already been executed by the current hardware device, Di represents the number of times the i-th type of subtask is executed per hardware device, n represents the type number of the subtask whose access address is to be determined, avgbatch is the batch size of all the subtasks executed by each hardware device (that is, the per-device task amount when the subtasks in the subtask list are evenly distributed to the m hardware devices), and idx represents the number x of the hardware device currently ready to execute the subtask, with x taking values in [0, m−1].
For better understanding, the following description combines the subtask list of Table 1 with the task execution relationship shown in Table 2. For Table 2, the subtask list involves 2 types of subtasks (tile1 and tile2). Each hardware device executes a task amount of 3 batch, i.e., avgbatch is 3. For tile1, the type number in the subtask list is 1, and the corresponding base coordinate is the initial value, for example batch0. For tile2, the type number in the subtask list is 2; when tile2 is executed, the 2-batch tile1 has already been executed once on that device, so based on the base coordinate formula above, the base coordinate corresponding to tile2 is batch0 (initial value) + 2×1 = batch2.
For hardware device 0 (idx is 0), the base coordinate corresponding to tile1 is batch0, and since A1 is 2, B1 is 0 and avgbatch is 3, the relative coordinate corresponding to tile1 is 2×0 + 3×0 = batch0. Therefore, the access address corresponding to tile1 is batch0 (base coordinate) + batch0 (relative coordinate) = batch0, and since tile1 has a size of 2 batch, batch0 to batch1 are executed; that is, hardware device 0 accesses the two batches of data batch0 to batch1 based on the instruction of tile1.

When hardware device 0 executes tile2, the base coordinate corresponding to tile2 is batch2; since the batch size of tile2 is 1 (A2 is 1), B2 is 0 and avgbatch is 3, the relative coordinate corresponding to tile2 is 1×0 + 3×0 = batch0. Therefore, the access address corresponding to tile2 is batch2 + batch0 = batch2, and since tile2 has a size of 1 batch, batch2 is executed this time.

For hardware device 1 (idx is 1), the base coordinate corresponding to tile1 is batch0, and since A1 is 2, B1 is 0 and avgbatch is 3, the relative coordinate corresponding to tile1 is 2×0 + 3×1 = batch3. Therefore, the access address corresponding to tile1 is batch0 + batch3 = batch3, and since tile1 has a size of 2 batch, batch3 to batch4 are executed.

When hardware device 1 executes tile2, the base coordinate corresponding to tile2 is batch2; since A2 is 1, B2 is 0 and avgbatch is 3, the relative coordinate corresponding to tile2 is 1×0 + 3×1 = batch3. Therefore, the access address corresponding to tile2 is batch2 + batch3 = batch5, and since tile2 has a size of 1 batch, batch5 is executed this time.

Similarly, for hardware device 2 (idx is 2), the relative coordinate corresponding to tile1 is 2×0 + 3×2 = batch6, so the access address is batch0 + batch6 = batch6, and batch6 to batch7 are executed. For tile2 (A2 is 1, B2 is 0), the relative coordinate is 1×0 + 3×2 = batch6, so the access address is batch2 + batch6 = batch8, and batch8 is executed.

Similarly, for hardware device 3 (idx is 3), the relative coordinate corresponding to tile1 is 2×0 + 3×3 = batch9, so the access address is batch0 + batch9 = batch9, and batch9 to batch10 are executed. For tile2 (A2 is 1, B2 is 0), the relative coordinate is 1×0 + 3×3 = batch9, so the access address is batch2 + batch9 = batch11, and batch11 is executed.
It will be appreciated that for vertical continuation, the amount of tasks performed by the individual hardware devices needs to be consistent, and if the original overall task cannot be equally divided by the individual hardware devices, the original overall task needs to be adjusted upward so that the amount of tasks performed by each hardware device is consistent. In the embodiment, the access addresses of the multiple accesses of the hardware devices are continuous, so that the access efficiency is improved, the storage efficiency is improved, and the subtask data which are required to be accessed by each hardware device can be stored in a centralized manner.
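The longitudinally continuous addressing walked through above can be sketched as follows. This is an illustrative sketch matching the Table 2 walkthrough (the function name is an assumption): each type's base coordinate adds the per-device batches consumed by earlier types, and the relative coordinate is Ai × Bi + avgbatch × idx.

```python
# Sketch of "longitudinally continuous" access-address calculation.
def vertical_address(base, ai, bi, avgbatch, idx):
    # base: base coordinate of this subtask type
    # ai: batch size, bi: runs already done on this device
    # avgbatch: per-device task amount, idx: device number
    return base + ai * bi + avgbatch * idx

avgbatch = 3  # 12-batch total task evenly split over 4 devices
# Table 2 setup: tile1 (2 batch, base 0) and tile2 (1 batch, base 0 + 2*1 = 2),
# each executed once per device (so Bi = 0 for that single run).
tile1_addrs = [vertical_address(0, 2, 0, avgbatch, idx) for idx in range(4)]
tile2_addrs = [vertical_address(2, 1, 0, avgbatch, idx) for idx in range(4)]
# tile1 -> batch0, 3, 6, 9 ; tile2 -> batch2, 5, 8, 11, as in the walkthrough
```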
For lateral continuity, the corresponding base coordinates and relative coordinates may be calculated (in a form consistent with the worked example below) as:

base coordinate of the n-th type of subtask = initial value + Σ (i = 1 to n−1) Ai × Di × m

relative coordinate = An × (m × Bn + idx)

wherein Ai represents the batch size of the i-th type of subtask in the subtask list, Bi represents the number of times the i-th type of subtask has already been executed by the current hardware device, Di represents the number of times the i-th type of subtask is executed per hardware device, n represents the type number of the subtask whose access address is to be determined, m is the number of hardware devices, and idx represents the number x of the hardware device currently ready to execute the subtask, with x taking values in [0, m−1].
For better understanding, the following description is given in connection with Table 3. For Table 3, a total of 2 types of subtasks (tile1 and tile2) are involved. For tile1, the corresponding base coordinate is the initial value, for example batch0. For tile2, by the time tile2 is executed, tile1 has already been executed once per device, and since tile1 has a size of 2 batch, the base coordinate corresponding to tile2 is 0 + 2×1×4 = batch8 based on the base coordinate formula above.
For hardware device 0 (idx is 0), the base coordinate corresponding to tile1 is batch0, and since the batch size of tile1 is 2 (A1 is 2), B1 is 0 and idx is 0, the relative coordinate corresponding to tile1 is 2×(4×0+0) = batch0. Therefore, the access address corresponding to tile1 is batch0 + batch0 = batch0, and since tile1 has a size of 2 batch, batch0 to batch1 are executed.

For hardware device 0, the base coordinate corresponding to tile2 is batch8, and since the batch size of tile2 is 1 (A2 is 1) and B2 is 0, the relative coordinate corresponding to tile2 is 1×(4×0+0) = batch0. Therefore, the access address corresponding to tile2 is batch8 + batch0 = batch8, and since tile2 has a size of 1 batch, batch8 is executed this time.

For hardware device 1 (idx is 1), the base coordinate corresponding to tile1 is batch0, and since A1 is 2, B1 is 0 and idx is 1, the relative coordinate of tile1 is 2×(4×0+1) = batch2. Therefore, the access address corresponding to tile1 is batch0 + batch2 = batch2, and since tile1 has a size of 2 batch, batch2 to batch3 are executed.

For hardware device 1, the base coordinate corresponding to tile2 is batch8, and since A2 is 1, B2 is 0 and idx is 1, the relative coordinate of tile2 is 1×(4×0+1) = batch1. Therefore, the access address corresponding to tile2 is batch8 + batch1 = batch9, and since tile2 has a size of 1 batch, batch9 is executed this time.

For hardware device 2 (idx is 2), the base coordinate corresponding to tile1 is batch0, and since A1 is 2, B1 is 0 and idx is 2, the relative coordinate of tile1 is 2×(4×0+2) = batch4. Therefore, the access address corresponding to tile1 is batch0 + batch4 = batch4, and since tile1 has a size of 2 batch, batch4 to batch5 are executed.

For hardware device 2, the base coordinate corresponding to tile2 is batch8, and since A2 is 1, B2 is 0 and idx is 2, the relative coordinate of tile2 is 1×(4×0+2) = batch2. Therefore, the access address corresponding to tile2 is batch8 + batch2 = batch10, and since tile2 has a size of 1 batch, batch10 is executed this time.

For hardware device 3 (idx is 3), the base coordinate corresponding to tile1 is batch0, and since A1 is 2, B1 is 0 and idx is 3, the relative coordinate of tile1 is 2×(4×0+3) = batch6. Therefore, the access address corresponding to tile1 is batch0 + batch6 = batch6, and since tile1 has a size of 2 batch, batch6 to batch7 are executed.

For hardware device 3, the base coordinate corresponding to tile2 is batch8, and since A2 is 1, B2 is 0 and idx is 3, the relative coordinate of tile2 is 1×(4×0+3) = batch3. Therefore, the access address corresponding to tile2 is batch8 + batch3 = batch11, and since tile2 has a size of 1 batch, batch11 is executed this time.
It will be appreciated that for lateral continuity, the task amounts executed by the individual hardware devices are not required to be uniform; if the original total task cannot be equally divided among the hardware devices, it is not necessary to adjust the original total task upward to make the per-device task amounts equal.
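The laterally continuous addressing can likewise be sketched. This is an illustrative sketch matching the Table 3 walkthrough (the function name is an assumption): the relative coordinate is Ai × (m × Bi + idx), and each type's base coordinate adds Ai × Di × m for all earlier types.

```python
# Sketch of "laterally continuous" access-address calculation.
def horizontal_address(base, ai, bi, m, idx):
    # base: base coordinate of this subtask type
    # ai: batch size, bi: runs already done on this device
    # m: number of devices, idx: device number
    return base + ai * (m * bi + idx)

m = 4
# Table 3 setup: tile1 (2 batch, base 0) and tile2 (1 batch,
# base 0 + 2*1*4 = 8), each executed once per device (Bi = 0 for that run).
tile1_addrs = [horizontal_address(0, 2, 0, m, idx) for idx in range(m)]
tile2_addrs = [horizontal_address(8, 1, 0, m, idx) for idx in range(m)]
# tile1 -> batch0, 2, 4, 6 ; tile2 -> batch8, 9, 10, 11, as in the walkthrough
```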
To better understand the above calculation formulas for the base coordinates and relative coordinates under the laterally continuous and longitudinally continuous access mechanisms, the following example uses a 2-batch subtask (e.g., tile3) and a 3-batch subtask (e.g., tile4) to complete a total task with a size of 112 batch. The total task can be executed by 4 hardware devices, with each device executing the 2-batch tile3 five times and the 3-batch tile4 six times. According to the actual situation, scheduling and access can be performed according to Table 4 or Table 5; batch0 to batch111 in Tables 4 and 5 represent absolute coordinates within the total task amount and can be used as the access-address basis for obtaining batch task data and computing.
TABLE 4
The corresponding execution policy of table 4 is that the access addresses for multiple accesses by each hardware device are consecutive, i.e. longitudinally consecutive accesses.
TABLE 5
The corresponding execution policy of table 5 is that the access addresses between the individual hardware devices in one access are consecutive, i.e. laterally consecutive accesses.
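The 112-batch decomposition above can be checked arithmetically (a simple sanity-check sketch; the tile names tile3/tile4 follow the example above):

```python
# Sanity check of the 112-batch decomposition: 4 hardware devices, each
# executing a 2-batch tile3 five times and a 3-batch tile4 six times.
devices = 4
per_device = 5 * 2 + 6 * 3    # 28 batches handled by each device
total = devices * per_device  # 112 batches in the total task
```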
S3: and distributing the instruction carrying the access address contained in each subtask in the subtask list to an AI chip.
After the access address of the instructions in each subtask in the subtask list is determined from the batch dimension, the instructions carrying access addresses contained in each subtask in the subtask list can be distributed to the AI chip, so that the hardware devices in the AI chip execute those instructions and acquire the corresponding batch task data based on the access addresses to complete the corresponding subtasks.
If each subtask in the subtask list is determined according to the upward-adjusted total task, the method further includes, after S3: controlling the AI chip to store the execution result of the subtask list into a temporary cache, and controlling the AI chip to acquire the execution result of the original total task size from the temporary cache and store it into the cache space designated for the original total task.
In this embodiment, because the total task is adjusted upward, the adjusted total task may be larger than the original one; for example, a 31-batch total task is adjusted to a 32-batch total task. This introduces an extra 1 batch of computation that the user is unaware of, so this extra task amount needs to be handled. Since the user only allocates execution cache space for the 31-batch total task while the actually executed task becomes 32 batch, the execution result of the AI chip is first stored in a temporary cache, and then the execution result of the original total task size (e.g., 31 batch) is obtained from the temporary cache and stored into the cache space designated for the original total task (31 batch).

It can be appreciated that if the total task is not adjusted, the execution result of the AI chip is stored directly into the cache space designated for the original total task.
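The copy-back step can be sketched as follows. This is a hypothetical illustration (the buffer sizes and the 16-byte per-batch result size are assumptions, not values from the patent): the full up-adjusted result goes to a temporary buffer, and only the original task's portion reaches the user-designated cache space.

```python
# Hypothetical sketch: discard the extra up-adjusted batches when copying
# results back to the user's buffer (e.g. 31 batch adjusted up to 32 batch).
original_batches, adjusted_batches = 31, 32
batch_bytes = 16  # assumed size of one batch's result

temp_buffer = bytearray(adjusted_batches * batch_bytes)  # holds all 32 batches
# ... the AI chip writes all 32 batches of results into temp_buffer ...
user_buffer = bytes(temp_buffer[: original_batches * batch_bytes])
# only the first 31 batches reach the user's designated cache space;
# the extra 1-batch result remaining in temp_buffer is simply discarded
```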
In order to better understand the above-mentioned method for processing a batch task, in one embodiment, a flow chart of the method is shown in fig. 2. It will be appreciated that the method shown in fig. 2 is only one of many examples provided for embodiments of the present application and, therefore, the application is not to be construed as being limited by the method shown in fig. 2.
Based on the same inventive concept, the embodiment of the application also provides a processor, which comprises a kernel and a transceiver.
The kernel is used for acquiring a subtask list corresponding to a total task to be executed, determining the base coordinate and the relative coordinate of each subtask in the subtask list from batch dimensions, and determining the access address of an instruction in the subtask according to the base coordinate and the relative coordinate; the subtask list comprises a plurality of subtasks, the plurality of subtasks are selected from the subtasks with different batch sizes in the execution file, the sum of the task amounts is consistent with the task amount of the total task to be executed, and the different subtasks correspond to different batch sizes.
And the transceiver is used for distributing the instruction with the access address contained in each subtask in the subtask list to the AI chip so that the hardware equipment in the AI chip executes the instruction with the access address contained in each subtask and acquires corresponding batch processing task data based on the access address to complete the corresponding subtask.
Optionally, the kernel is configured to determine, according to a preset policy, a base coordinate and a relative coordinate of an instruction in each subtask in the subtask list from a batch dimension; and determining the access address of the instruction in each subtask in the subtask list according to the base coordinates and the relative coordinates of the instruction in the subtask.
The processor provided in the embodiments of the present application has the same implementation principle and technical effects as those of the foregoing method embodiments, and for a brief description, reference may be made to corresponding matters in the foregoing method embodiments where the processor embodiment is not mentioned.
Based on the same inventive concept, the application also provides a batch processing task processing device 100, as shown in fig. 3. The processing apparatus 100 for batch processing tasks includes: an acquisition module 110, a processing module 120, and a transmission module 130.
An obtaining module 110, configured to obtain a subtask list corresponding to a total task to be executed; the subtask list comprises a plurality of subtasks, the plurality of subtasks are selected from the subtasks with different batch sizes in the execution file, the sum of the task amounts is consistent with the task amount of the total task to be executed, and the different subtasks correspond to different batch sizes.
Processing module 120 is configured to determine, from the batch dimension, an access address for an instruction in each subtask in the subtask list. Specifically, the processing module 120 is configured to determine, from a batch dimension, a base coordinate and a relative coordinate of each subtask in the subtask list, and determine, according to the base coordinate and the relative coordinate, an access address of an instruction in the subtask.
And the sending module 130 is configured to distribute an instruction with an access address included in each subtask in the subtask list to an AI chip, so that a hardware device in the AI chip executes the instruction with the access address included in each subtask, and acquire corresponding batch task data based on the access address to complete the corresponding subtask.
Optionally, the processing module 120 is configured to determine, according to a preset policy, a base coordinate and a relative coordinate of an instruction in each subtask in the subtask list from a batch dimension; and determining the access address of the instruction in each subtask in the subtask list according to the base coordinates and the relative coordinates of the instruction in the subtask.
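The base-plus-relative-coordinate scheme above can be sketched as a simple address computation. Note this assumes a linear mapping with a fixed per-batch stride; the patent's actual formulas are given by the preset policy (see claims 3-5), and the names `base_addr` and `batch_stride` are hypothetical.

```python
def access_address(base_addr, base_coord, rel_coord, batch_stride):
    """Derive an instruction's access address along the batch dimension
    from a base coordinate and a relative coordinate. A minimal sketch
    assuming a linear layout with a constant stride per batch; the
    preset policy in the patent defines the actual computation."""
    return base_addr + (base_coord + rel_coord) * batch_stride
```

For instance, with a buffer starting at 0x1000, a base coordinate of 3 batches, a relative coordinate of 1, and 256 bytes per batch, the instruction would access offset (3 + 1) * 256 past the base.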
The apparatus 100 for processing batch processing tasks according to the embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, where a detail is not mentioned in the apparatus embodiment, reference may be made to the corresponding content in the foregoing method embodiment.
As shown in fig. 4, fig. 4 shows a block diagram of an electronic device 200 according to an embodiment of the present application. The electronic device 200 includes: a transceiver 210, a memory 220, a communication bus 230, and a processor 240. The transceiver 210, the memory 220, and the processor 240 are electrically connected to one another, directly or indirectly, to enable data transmission or interaction. For example, these components may be electrically connected to one another via the communication bus 230 or signal lines. The transceiver 210 is configured to transmit and receive data. The memory 220 is configured to store a computer program, for example, the software functional modules shown in fig. 3, i.e., the processing device 100 for batch processing tasks. The processing device 100 for batch processing tasks includes at least one software functional module that may be stored in the memory 220 in the form of software or firmware (Firmware) or solidified in the operating system (OS) of the electronic device 200. The processor 240 is configured to execute executable modules stored in the memory 220, such as the software functional modules or computer programs included in the processing device 100 for batch processing tasks.
For example, the processor 240 is configured to obtain a subtask list corresponding to a total task to be executed; the subtask list comprises a plurality of subtasks, wherein the subtasks are selected from the subtasks with different batch sizes in the execution file, the sum of the task amounts is consistent with the task amount of the total task to be executed, and the different subtasks correspond to different batch sizes; determining an access address of an instruction in each subtask in the subtask list from the batch dimension; and distributing the instruction with the access address contained in each subtask in the subtask list to the AI chip so that hardware equipment in the AI chip executes the instruction with the access address contained in each subtask, and acquiring corresponding batch task data based on the access address to complete the corresponding subtask.
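The end-to-end flow the processor 240 performs can be sketched as follows. This is a hedged illustration, not the patent's implementation: `Subtask`, `dispatch`, the linear stride addressing, and the `send` callback (standing in for distribution to the AI chip) are all hypothetical names and assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    batch_size: int                    # Ai: batch size of this subtask
    executed: int = 0                  # Bi: times this subtask type has run
    addresses: list = field(default_factory=list)

def dispatch(subtasks, base_addr, stride, send):
    """Sketch of the described flow: walk the subtask list, derive each
    instruction's access address along the batch dimension from a base
    coordinate (batches already covered) plus a relative coordinate
    (position within the subtask), then hand the addressed subtask to
    the AI chip via `send`. Illustrative single-device case only."""
    base = 0                           # base coordinate, in batches
    for st in subtasks:
        for rel in range(st.batch_size):   # relative coordinate
            st.addresses.append(base_addr + (base + rel) * stride)
        send(st)                       # distribute to the AI chip
        base += st.batch_size          # advance past this subtask's batches
        st.executed += 1
```

With subtasks of batch sizes 2 and 1 and a 4-byte stride from address 0, the first subtask covers addresses 0 and 4 and the second starts at 8, so consecutive subtasks read consecutive batch data.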
The memory 220 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), etc.
The processor 240 may be an integrated circuit chip with signal processing capabilities. It may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor 240 may be any conventional processor or the like.
The electronic device 200 includes, but is not limited to, a smart phone, a tablet, a computer, an industrial personal computer, a vehicle-mounted device, a server, an intelligent wearable device, an edge box, and the like.
The embodiment of the present application further provides a non-volatile computer readable storage medium (hereinafter referred to as a storage medium) storing a computer program, where the computer program, when executed by a computer such as the electronic device 200, performs the above-described processing method for batch processing tasks.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described embodiments are merely illustrative, and each block in a flowchart or block diagram may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, each module may exist alone, or two or more modules may be integrated to form a single part. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a computer-readable storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, an electronic device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned computer-readable storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application.

Claims (10)

1. A method for processing a batch processing task, comprising:
acquiring a subtask list corresponding to a total task to be executed; the subtask list comprises a plurality of subtasks, wherein the subtasks are selected from the subtasks with different batch sizes in an execution file, the sum of the task amounts is consistent with the task amount of the total task to be executed, and the different subtasks correspond to different batch sizes;
determining a base coordinate and a relative coordinate of each subtask in the subtask list from a batch dimension, and determining an access address of an instruction in the subtask according to the base coordinate and the relative coordinate;
distributing the instruction with the access address contained in each subtask in the subtask list to an AI chip so that hardware equipment in the AI chip executes the instruction with the access address contained in each subtask, and acquiring corresponding batch task data based on the access address to complete the corresponding subtask;
The method for determining the base coordinates and the relative coordinates of each subtask in the subtask list from the batch dimension, and determining the access address of the instruction in the subtask according to the base coordinates and the relative coordinates comprises the following steps:
determining the base coordinates and the relative coordinates of instructions in each subtask in the subtask list from batch dimensions according to a preset strategy, wherein the preset strategy comprises a calculation formula for determining the base coordinates and the relative coordinates of instructions in each subtask in the subtask list;
and determining the access address of the instruction in each subtask in the subtask list according to the base coordinates and the relative coordinates of the instruction in the subtask.
2. The method of claim 1, wherein the preset policy further comprises: the number of hardware devices executing the subtask list and the execution strategies, and the calculation formulas of the base coordinates and the relative coordinates corresponding to different execution strategies are different.
3. The method of claim 2, wherein if the number of hardware devices is one, the base coordinates and the relative coordinates are calculated as follows:
wherein Ai represents the batch size of the i-th type of subtask in the subtask list, Bi represents the number of times the i-th type of subtask has been executed, and n represents the number of subtask types in the subtask list whose access addresses are to be determined.
4. The method according to claim 2, wherein if the number of hardware devices is m, where m is an integer greater than or equal to 2, the execution policy requires that access addresses of multiple accesses by each hardware device be consecutive; the calculation formula of the base coordinates and the relative coordinates is as follows:
wherein Ai represents the batch size of the i-th type of subtask in the subtask list, Bi represents the number of times the i-th type of subtask has been executed, n represents the number of subtask types in the subtask list whose access addresses are to be determined, avgbatch is the batch size of all the subtasks executed by each hardware device, and idx represents the number x of the hardware device currently preparing to execute the subtask whose access address is to be determined, where x takes a value in [0, m-1].
5. The method according to claim 2, wherein if the number of hardware devices is m, where m is an integer greater than or equal to 2, the execution policy requires that access addresses between the respective hardware devices in one access be consecutive; the calculation formula of the base coordinates and the relative coordinates is as follows:
wherein Ai represents the batch size of the i-th type of subtask in the subtask list, and Bi represents the number of times the i-th type of subtask has been executed; n represents the number of subtask types in the subtask list whose access addresses are to be determined, and idx represents the number x of the hardware device currently preparing to execute the subtask whose access address is to be determined, where x takes a value in [0, m-1].
6. The method according to any one of claims 1-5, wherein the total task to be performed is an original task or an original task after upward adjustment; if the original task is executed by a plurality of hardware devices and the original task cannot be equally divided to each hardware device, the size of the original task is adjusted upwards, so that the adjusted original task is equally divided to each hardware device.
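The upward adjustment in claim 6 is a round-up to the nearest multiple of the device count, which can be sketched as follows (function and parameter names are illustrative, not from the patent):

```python
def adjust_total_task(original_batch, num_devices):
    """Round the original task size up to the nearest multiple of the
    number of hardware devices, so the adjusted task divides evenly
    across the devices. Illustrative sketch of the claim-6 adjustment."""
    m = num_devices
    return ((original_batch + m - 1) // m) * m
```

For example, an original task of 10 batches on 4 devices is adjusted upward to 12, giving each device 3 batches; a task of 8 batches already divides evenly and is left unchanged.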
7. A batch processing task processing apparatus, comprising:
the acquisition module is used for acquiring a subtask list corresponding to the total task to be executed; the subtask list comprises a plurality of subtasks, wherein the subtasks are selected from the subtasks with different batch sizes in an execution file, the sum of the task amounts is consistent with the task amount of the total task to be executed, and the different subtasks correspond to different batch sizes;
the processing module is used for determining the base coordinate and the relative coordinate of each subtask in the subtask list from the batch dimension and determining the access address of the instruction in the subtask according to the base coordinate and the relative coordinate;
the sending module is used for distributing the instruction with the access address contained in each subtask in the subtask list to the AI chip so that hardware equipment in the AI chip executes the instruction with the access address contained in each subtask and acquires corresponding batch processing task data based on the access address to complete the corresponding subtask;
The processing module is used for determining the base coordinates and the relative coordinates of the instructions in each subtask in the subtask list from batch dimensions according to a preset strategy, wherein the preset strategy comprises a calculation formula for determining the base coordinates and the relative coordinates of the instructions in each subtask in the subtask list; and determining the access address of the instruction in each subtask in the subtask list according to the base coordinates and the relative coordinates of the instruction in the subtask.
8. A processor, comprising:
the kernel is used for acquiring a subtask list corresponding to a total task to be executed, determining the base coordinate and the relative coordinate of each subtask in the subtask list from batch dimensions, and determining the access address of an instruction in the subtask according to the base coordinate and the relative coordinate; the subtask list comprises a plurality of subtasks, wherein the subtasks are selected from the subtasks with different batch sizes in an execution file, the sum of the task amounts is consistent with the task amount of the total task to be executed, and the different subtasks correspond to different batch sizes;
the transceiver is used for distributing the instruction with the access address contained in each subtask in the subtask list to the AI chip so that hardware equipment in the AI chip executes the instruction with the access address contained in each subtask and acquires corresponding batch processing task data based on the access address to complete the corresponding subtask;
The kernel is used for determining the base coordinates and the relative coordinates of the instructions in each subtask in the subtask list from batch dimensions according to a preset strategy, and the preset strategy comprises a calculation formula for determining the base coordinates and the relative coordinates of the instructions in each subtask in the subtask list; and determining the access address of the instruction in each subtask in the subtask list according to the base coordinates and the relative coordinates of the instruction in the subtask.
9. An electronic device, comprising:
the device comprises a memory and a processor, wherein the processor is connected with the memory;
the memory is used for storing programs;
the processor is configured to invoke a program stored in the memory to perform the method of any of claims 1-6.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, performs the method according to any of claims 1-6.
CN202310665911.5A 2023-06-07 2023-06-07 Batch processing task processing method and device, electronic equipment and storage medium Active CN116431315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310665911.5A CN116431315B (en) 2023-06-07 2023-06-07 Batch processing task processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310665911.5A CN116431315B (en) 2023-06-07 2023-06-07 Batch processing task processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116431315A CN116431315A (en) 2023-07-14
CN116431315B true CN116431315B (en) 2023-08-29

Family

ID=87089328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310665911.5A Active CN116431315B (en) 2023-06-07 2023-06-07 Batch processing task processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116431315B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076092B (en) * 2023-10-13 2024-01-19 成都登临科技有限公司 Multi-dimensional data task processing method and device, electronic equipment and storage medium

Citations (8)

Publication number Priority date Publication date Assignee Title
JP2003196156A (en) * 2001-12-28 2003-07-11 Fujitsu Ltd Information processor and information processing method
CN101563674A (en) * 2006-12-21 2009-10-21 国际商业机器公司 A method and system to manage memory accesses from multithread programs on multiprocessor systems
CN107643943A (en) * 2016-07-20 2018-01-30 大唐移动通信设备有限公司 The management method and device of a kind of task stack
CN114721726A (en) * 2022-06-10 2022-07-08 成都登临科技有限公司 Method for obtaining instructions in parallel by multithread group, processor and electronic equipment
CN115379420A (en) * 2021-05-21 2022-11-22 华为技术有限公司 Communication method and communication device for executing perception task
CN115576699A (en) * 2022-11-25 2023-01-06 成都登临科技有限公司 Data processing method, data processing device, AI chip, electronic device and storage medium
CN116048742A (en) * 2022-05-30 2023-05-02 荣耀终端有限公司 Data processing method and electronic equipment
CN116126646A (en) * 2023-04-13 2023-05-16 紫金诚征信有限公司 Method and device for determining execution progress of batch data

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11416733B2 (en) * 2018-11-19 2022-08-16 Google Llc Multi-task recurrent neural networks

Patent Citations (8)

Publication number Priority date Publication date Assignee Title
JP2003196156A (en) * 2001-12-28 2003-07-11 Fujitsu Ltd Information processor and information processing method
CN101563674A (en) * 2006-12-21 2009-10-21 国际商业机器公司 A method and system to manage memory accesses from multithread programs on multiprocessor systems
CN107643943A (en) * 2016-07-20 2018-01-30 大唐移动通信设备有限公司 The management method and device of a kind of task stack
CN115379420A (en) * 2021-05-21 2022-11-22 华为技术有限公司 Communication method and communication device for executing perception task
CN116048742A (en) * 2022-05-30 2023-05-02 荣耀终端有限公司 Data processing method and electronic equipment
CN114721726A (en) * 2022-06-10 2022-07-08 成都登临科技有限公司 Method for obtaining instructions in parallel by multithread group, processor and electronic equipment
CN115576699A (en) * 2022-11-25 2023-01-06 成都登临科技有限公司 Data processing method, data processing device, AI chip, electronic device and storage medium
CN116126646A (en) * 2023-04-13 2023-05-16 紫金诚征信有限公司 Method and device for determining execution progress of batch data

Non-Patent Citations (1)

Title
Influence of equipment deployment modes on subtask scheduling in cloud manufacturing; Li Wenxiang et al.; Computer Engineering and Design; pp. 2857-2861 *

Also Published As

Publication number Publication date
CN116431315A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN110262901B (en) Data processing method and data processing system
CN111488205B (en) Scheduling method and scheduling system for heterogeneous hardware architecture
CN116431315B (en) Batch processing task processing method and device, electronic equipment and storage medium
CN116382880B (en) Task execution method, device, processor, electronic equipment and storage medium
US9104491B2 (en) Batch scheduler management of speculative and non-speculative tasks based on conditions of tasks and compute resources
CN108021449B (en) Coroutine implementation method, terminal equipment and storage medium
US9471387B2 (en) Scheduling in job execution
WO2015099562A1 (en) Methods and apparatus for data-parallel execution of operations on segmented arrays
Abbasi et al. A preliminary study of incorporating GPUs in the Hadoop framework
CN116701001B (en) Target task allocation method and device, electronic equipment and storage medium
CN117271136A (en) Data processing method, device, equipment and storage medium
CN115775199A (en) Data processing method and device, electronic equipment and computer readable storage medium
US20020156611A1 (en) Performance simulation process, and multiprocessor application production process, and devices for implementing said processes
CN112433847B (en) OpenCL kernel submitting method and device
WO2022252091A1 (en) Model processing method and apparatus
CN112783651A (en) Load balancing scheduling method, medium and device for vGPU of cloud platform
WO2024055168A1 (en) Resource allocation method, processor, and computing platform
CN110543351A (en) Data processing method and computer device
US10630957B2 (en) Scalable distributed computation framework for data-intensive computer vision workloads
CN111858036B (en) Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
Zhao et al. ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management
Xie et al. Parallel Acceleration of ELAS on ARM
US20130091504A1 (en) Data flows and their interaction with control flows
CN117873712A (en) Scientific computing task processing method, device and computing system
CN114860433A (en) Method for performing pooling calculation operation on simulator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant