CN115878250A - Method for managing AI training task and related product - Google Patents

Method for managing AI training task and related product

Info

Publication number
CN115878250A
CN115878250A (application CN202110893221.6A)
Authority
CN
China
Prior art keywords: training, task, tasks, response, training task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110893221.6A
Other languages
Chinese (zh)
Inventor
Name not published at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110893221.6A
Publication of CN115878250A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a method for managing AI training tasks and a related product. The method may be implemented by a computing processing device, which may be included in a combined processing device that also includes a universal interconnect interface and other processing devices. The computing processing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device connected to the computing processing device and the other processing devices, respectively, for storing data of the computing processing device and the other processing devices.

Description

Method for managing AI training task and related product
Technical Field
The present disclosure relates to the field of computers, and more particularly, to management of Artificial Intelligence (AI) training tasks.
Background
AI model training usually requires strong expertise and is a complex process: an algorithm engineer must repeatedly train with different algorithms and parameters and compare the training results to finally obtain the model with the best effect. This is a continuously repeated, iterative process that comprises multiple model training tasks. The algorithm engineer needs to compare the training tasks that solve the same problem side by side and select the optimal model; an isolated training task is of limited value for the final model. At present, no computing cluster provides algorithm engineers with an automated, systematic way to run such training experiments.
In existing computing clusters there is no systematic means to help an algorithm engineer manage training tasks. The engineer usually has to repeatedly create training tasks with different algorithm-parameter combinations on a data set that describes one problem, so a batch of training tasks that should belong to a single model training process is split into independent tasks and the relationship among them is ignored. The engineer must repeat similar task-creation operations many times when building a training model, and in the result-screening stage must repeat similar result-processing and visualization operations. During training, the engineer cannot visually compare the training effect of each task.
Existing privatized clusters also lack a notion of priority: high-priority training tasks are not guaranteed to be scheduled or started first, and an algorithm engineer cannot tell when a group of training tasks aimed at the same problem will be scheduled.
Disclosure of Invention
One object of the present disclosure is to overcome the drawback of the prior art that AI training tasks are not well managed.
According to a first aspect of the present disclosure, there is provided a method for managing an AI training task, comprising: receiving a request to create a batch of training tasks, the request including at least one algorithm-parameter combination as a variable and a dataset identification for the algorithm-parameter combination; in response to receiving the request, creating a memory area for each algorithm-parameter combination to facilitate storing training results; and creating an AI training task for each algorithm-parameter combination in response to the creation of the memory area.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
At least one beneficial effect of the disclosed technical solution is that a plurality of AI training tasks can be managed in batches.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:
FIG. 1 illustrates a method for managing AI training tasks in accordance with one embodiment of the disclosure;
FIG. 2 shows a schematic diagram of a batch training task constructed in accordance with one embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of a method of cloning a training task to facilitate modification of an AI training task, according to one embodiment of the present disclosure;
FIG. 4 illustrates a combined processing device; and
FIG. 5 provides an exemplary board card.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
The present disclosure may be based on Kubernetes distributed clusters. Kubernetes is an open-source system for managing containerized applications across multiple hosts in a cloud platform; it aims to make deploying containerized applications simple and efficient and provides mechanisms for application deployment, planning, updating, and maintenance.
FIG. 1 illustrates a method for managing AI training tasks according to one embodiment of the present disclosure, including: in operation S110, receiving a request to create a batch of training tasks, the request including at least one algorithm-parameter combination as a variable and a dataset identification for the algorithm-parameter combination; in operation S120, in response to receiving the request, creating a storage area for each algorithm-parameter combination so as to store the training results; and in operation S130, in response to the creation of the storage area, creating an AI training task for each algorithm-parameter combination.
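By way of illustration only, the flow of FIG. 1 can be sketched in Python under some assumptions: the names BatchRequest, create_storage_area and create_training_task, the task dictionary fields, and the path layout are hypothetical and are not defined by the disclosure.

```python
# Minimal sketch of operations S110-S130; all names and fields are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlgorithmParameterCombination:
    algorithm: str                                     # e.g. "algorithm-1"
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class BatchRequest:
    batch_name: str
    dataset_id: str                                    # name or storage address of the shared data set
    combinations: List[AlgorithmParameterCombination] = field(default_factory=list)


def create_storage_area(batch_name: str, index: int) -> str:
    """S120: create one storage area (e.g. a directory on a persistent volume) per combination."""
    return f"/volumes/{batch_name}/task-{index}"       # illustrative path layout


def create_training_task(combo: AlgorithmParameterCombination,
                         dataset_id: str, storage_path: str) -> dict:
    """S130: create one AI training task bound to its combination, data set and storage area."""
    return {
        "algorithm": combo.algorithm,
        "parameters": combo.parameters,
        "dataset": dataset_id,
        "results_dir": storage_path,
        "state": "waiting",
    }


def handle_batch_request(request: BatchRequest) -> List[dict]:
    """S110: one training task per algorithm-parameter combination, all sharing one data set."""
    tasks = []
    for i, combo in enumerate(request.combinations, start=1):
        storage_path = create_storage_area(request.batch_name, i)                     # S120
        tasks.append(create_training_task(combo, request.dataset_id, storage_path))   # S130
    return tasks
```

In this sketch the request carries the variables (algorithm-parameter combinations) and the shared data set identification, while the runtime environment and resource configuration would be held elsewhere as invariants.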
According to an embodiment of the present disclosure, a plurality of independent training tasks may be managed together within one training process as a batch of training tasks; in other words, each batch of training tasks includes a plurality of training tasks, and these training tasks are linked together through a specific relationship.
Each training task may take an "algorithm-parameter" combination as its variable, while the invariants of the training task are the running environment of the system, the resource configuration, and so on. These algorithm-parameter combinations may share or point to the same data set and are trained on that same data set.
More specifically, data sets for training may be stored in the server, and the data sets may be various data sets classified according to tasks. For example, for flower identification, the data set may be pictures of various angles, various time periods of various flowers; for vehicle identification, the data set may be various colors, various vehicle types, various angles, pictures of various sizes of vehicles, and the like.
The data set may also be continuously updated, thereby providing richer data for training. The data set identification may be the name, storage address, or the like of the data set that a task needs to use, so that the task can accurately locate and use the specified data set.
When the server receives the request, a storage area may be created for each algorithm-parameter combination to store the training results. The storage area described herein may be a Persistent Volume (PV). In the present disclosure, a data set is stored in a storage called a "volume", which may exist on a local physical machine or on remote network storage. When a "volume" is used, it can be mounted to a predetermined position based on its absolute path. In a Kubernetes cluster, each path of each volume has a corresponding name, and when the volume is used it can be mounted directly, by that name, into the environment that runs the AI training task.
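For illustration, the following sketch mounts a named volume into the environment that runs one AI training task, assuming the official kubernetes Python client is available; the claim name batch-a-results, the image, command, namespace and mount path are all hypothetical.

```python
# Sketch only: run one training task in a pod with its persistent volume claim mounted by name.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

volume = client.V1Volume(
    name="results",
    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
        claim_name="batch-a-results"          # the name that identifies the volume
    ),
)
container = client.V1Container(
    name="train-a1",
    image="example.com/trainer:latest",       # hypothetical training image
    command=["python", "train.py", "--algorithm", "algo1", "--params", "set1"],
    volume_mounts=[client.V1VolumeMount(name="results", mount_path="/results")],
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="task-a1"),
    spec=client.V1PodSpec(containers=[container], volumes=[volume], restart_policy="Never"),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The point of mounting by name is that the training code only ever sees the fixed mount path (here /results), regardless of where the volume physically lives.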
Further, the server may create a respective task for each algorithm-parameter combination, either upon receiving the request or in response to the creation of a storage area. A plurality of such tasks combined together is called a training task set.
It should be understood that the creation of the task and the creation of the storage area may be performed in parallel, or the corresponding task may be created after the storage area is created.
FIG. 2 shows a schematic diagram of a batch training task constructed according to one embodiment of the present disclosure.
As shown in FIG. 2, if a user needs to create a task set A, the task set may include a plurality of tasks, such as task A1, task A2, and task A3, where task A1 includes algorithm 1 and parameter set 1, task A2 includes algorithm 1 and parameter set 2, and task A3 includes algorithm 2 and parameter set 3. The three tasks A1-A3 share the same data set; a minimal example of how such a set could be expressed is sketched below.
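Using the hypothetical helpers sketched after FIG. 1, task set A of FIG. 2 could be expressed as follows; the data set name and parameter values are illustrative.

```python
# Task set A: three tasks, two algorithms, three parameter sets, one shared data set.
request = BatchRequest(
    batch_name="task-set-a",
    dataset_id="flower-dataset-v1",                                   # hypothetical data set identification
    combinations=[
        AlgorithmParameterCombination("algorithm-1", {"lr": 0.01}),   # task A1: algorithm 1, parameter set 1
        AlgorithmParameterCombination("algorithm-1", {"lr": 0.001}),  # task A2: algorithm 1, parameter set 2
        AlgorithmParameterCombination("algorithm-2", {"lr": 0.01}),   # task A3: algorithm 2, parameter set 3
    ],
)
tasks_a1_a2_a3 = handle_batch_request(request)   # three tasks trained on the same data set
```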
According to one embodiment of the present disclosure, the batch training tasks are a set of AI training tasks for solving the same technical problem; and/or a set of AI training tasks that share the same data set.
The same technical problem as described herein may be, for example, flower recognition, vehicle recognition, automatic driving, automatic translation, etc., and the same data set is essentially set for the same or similar technical problem.
According to an embodiment of the present disclosure, the method of the present disclosure further comprises: triggering a task monitor in response to creating the AI training tasks, wherein the task monitor is used for monitoring the running state of the AI training tasks.
In this embodiment, the task monitor may be triggered after the AI training task is created. The task monitor may be one or more visual interfaces. The visualization interface may display the results generated when each task is performed individually or may display a plurality of results generated when a plurality of tasks are performed in a batch. Displaying multiple results in batches facilitates a user's timely comparison of results produced by multiple tasks.
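A minimal sketch of such a monitor is given below; the accessor callbacks for a task's state and latest metric are hypothetical, and a real monitor would feed a visual interface rather than print.

```python
# Poll every task of one batch and show the running states side by side for comparison.
import time
from typing import Callable, Dict, List


def monitor_batch(tasks: List[dict],
                  read_state: Callable[[dict], str],      # hypothetical: "waiting" / "running" / "done"
                  read_metric: Callable[[dict], float],   # hypothetical: latest training metric
                  interval_s: float = 30.0,
                  rounds: int = 10) -> None:
    for _ in range(rounds):
        snapshot: Dict[str, str] = {
            f"task-{i}": f"{read_state(t)} (metric={read_metric(t):.4f})"
            for i, t in enumerate(tasks, start=1)
        }
        print(snapshot)          # a real monitor would render this in the visualization interface
        time.sleep(interval_s)
```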
According to an embodiment of the present disclosure, further comprising: in response to a request to create the batch of training tasks, a priority is assigned to each AI training task.
In existing schemes there is no priority difference between tasks, which may leave some tasks unable to obtain computing resources, and therefore unable to execute, for a long time.
In actual operation, a company may run multiple lines of business at the same time, each business may have multiple tasks, and each task may need the computing cluster to solve a real problem. Even tasks belonging to the same business can be ordered according to their urgency, so the corresponding AI training tasks need different priorities. A training task with a high priority can obtain the computing resources it needs more promptly: when resources are sufficient, high-priority training tasks are allocated computing resources first; when resources are insufficient, a high-priority training task obtains computing resources as soon as resources are released; and during distributed training, the training task expects every computing node to be able to satisfy the high-priority task's demand for computing resources at the same time.
The granularity of the priority can vary; for example, priorities may be divided into three levels (high, medium, and low) or into ten levels numbered 1-10.
According to an embodiment of the present disclosure, further comprising: in response to creating the AI training task, placing the created AI training task in a wait queue; and allocating resources for the AI training task according to the priority of the AI training task.
After the server creates AI training tasks as requested, the created tasks may be placed in a waiting queue, where they wait for the AI training tasks that are still executing to finish or for sufficient computing resources to be released. It should be understood that AI training tasks with different priorities are in different queues, while AI training tasks in the same queue have the same priority, so different AI training tasks in the same queue are executed sequentially; in other words, a first-in-first-out strategy is adopted for AI training tasks with the same priority. Thus, based on the priority and creation time of the current AI training task, the algorithm engineer can form an expectation of when the training task will run.
For example, when the server does not have enough computing resources to support the AI training tasks in the queue, the queued training tasks remain in a waiting state; when computing resources are released, resources are allocated first to the AI training task with the higher priority, and once that task has been executed, the AI training task with the lower priority is executed next. Thus, in actual operation, the queued AI training tasks are executed in order of priority from high to low.
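The queueing behaviour described above can be sketched as one FIFO queue per priority level; the three-level granularity and the "cards" resource model are illustrative assumptions.

```python
# One waiting queue per priority; FIFO within a priority; highest priority served first.
from collections import deque
from typing import Deque, Dict, Optional

PRIORITIES = ("high", "medium", "low")                 # one possible granularity


class WaitQueues:
    def __init__(self) -> None:
        self.queues: Dict[str, Deque[dict]] = {p: deque() for p in PRIORITIES}

    def enqueue(self, task: dict, priority: str) -> None:
        task["priority"] = priority
        self.queues[priority].append(task)             # first in, first out within one priority

    def next_runnable(self, free_cards: int) -> Optional[dict]:
        """The oldest task of the highest non-empty priority starts first when resources are released."""
        for priority in PRIORITIES:                    # high before medium before low
            queue = self.queues[priority]
            if queue:
                if queue[0].get("cards_needed", 1) <= free_cards:
                    return queue.popleft()
                return None                            # the high-priority head must be served first
        return None                                    # nothing is waiting
```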
According to one embodiment of the present disclosure, all AI training tasks of the same batch of training tasks may be set to the same priority, which makes it possible to set the priority of multiple AI training tasks at once rather than one by one. It also avoids the situation in which a key training task of the batch cannot be started for a long time.
Further, when the priority of one AI training task is adjusted, the priorities of other non-running AI training tasks in the same batch of training tasks are adjusted identically.
In this embodiment, the priority of each AI training task may be adjusted at any time, and when the priority of one AI training task is adjusted, the priorities of the other AI training tasks in the same batch are adjusted accordingly. Therefore, when the priority of the AI training tasks in a batch is adjusted, the queues in which those tasks wait also change: each AI training task whose priority was adjusted is added to the waiting queue of the new priority and removed from the waiting queue of its current priority.
It should also be understood that, according to the embodiments of the present disclosure, the priority of a running AI training task may or may not be adjusted; in other words, a running AI training task may be unaffected by an adjustment of its priority and keeps executing until it has finished completely.
From the perspective of resource occupation, the resources occupied by a running AI training task are kept until that task finishes. Not interrupting a running AI training task helps maintain its continuity and avoids the resource waste caused by re-running an AI training task after an interruption.
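A sketch of the batch-wide adjustment described above, reusing the illustrative WaitQueues class: non-running tasks of the same batch move to the waiting queue of the new priority, while running tasks keep their resources untouched.

```python
# Adjust the priority of every non-running task of one batch and re-queue it accordingly.
def adjust_batch_priority(batch_tasks: list, new_priority: str, queues: WaitQueues) -> None:
    for task in batch_tasks:
        if task.get("state") == "running":
            continue                                   # a running task keeps executing until it ends
        old_priority = task.get("priority")
        if old_priority in queues.queues and task in queues.queues[old_priority]:
            queues.queues[old_priority].remove(task)   # leave the waiting queue of the old priority
        queues.enqueue(task, new_priority)             # join the waiting queue of the new priority
```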
According to one embodiment of the present disclosure, a copy of an AI training task is generated in response to receiving a clone request for the AI training task.
Cloning refers to replicating one or more AI training tasks to produce AI training tasks identical to the ones being cloned. For example, when a user needs to run a training task again to verify that the training result is stable and reproducible, the task does not have to be created from scratch; the existing AI training task can simply be cloned and re-run. According to one embodiment of the present disclosure, a single AI training task may be cloned, and a batch of training tasks may be cloned in batch.
The generated clone of an AI training task can be stored directly on the server without being displayed to the user, or it can be sent to the user for viewing.
Further, according to one embodiment of the present disclosure, the AI training task may also be cloned to facilitate modifications to the original AI training task.
FIG. 3 illustrates a flow diagram of a method of cloning a training task to facilitate modification of an AI training task, according to one embodiment of the disclosure.
As shown in FIG. 3, the method of the present disclosure further comprises: in operation S310, in response to receiving a clone request for a batch of training tasks, sending copies of the batch of training tasks for modification; in operation S320, receiving a new request to create a modified batch of training tasks, the new request including at least one modified algorithm-parameter combination as a new variable and a dataset identification for the modified algorithm-parameter combination; in operation S330, in response to receiving the new request, creating a new storage area for each modified algorithm-parameter combination so as to store new training results; and in operation S340, in response to the creation of the new storage area, creating a new AI training task for each modified algorithm-parameter combination.
After the server receives the clone request, the generated copy may be returned to the user, so that the user can view the cloned AI training task as needed, as shown in operation S310.
Next, the user may send a request to modify the AI training task and send the modified variables (e.g., algorithms, parameters, etc.) to the server. A new dataset identification for the cloned AI training task may also be sent to the server. It should be understood that the data set identification may by default be the same as that of the original AI training task, or it may differ; for example, the cloned AI training task may by default use the same name, storage path, and so on as the original AI training task, but may also use a different name and storage path at the user's request.
Next, when the new AI training task is created, a new storage area may be allocated for the new algorithm-parameter combination, or the storage area of the original AI training task may be used by default. The storage area may also be a persistent volume as described above.
When all the preparation work is ready, the cloned AI training task can be created according to the request, with the new algorithm-parameter combination and the like used as its variables, as sketched below.
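The FIG. 3 flow can be sketched by reusing the hypothetical BatchRequest and handle_batch_request helpers introduced earlier; the "-clone" naming and the default of keeping the original data set identification are assumptions.

```python
# Clone a batch of training tasks, apply modified algorithm-parameter combinations,
# and create new storage areas and new tasks for the modified batch.
import copy
from typing import List, Optional


def clone_batch(original_request: BatchRequest) -> BatchRequest:
    """S310: hand back a modifiable copy of the batch of training tasks."""
    return copy.deepcopy(original_request)


def recreate_modified_batch(clone: BatchRequest,
                            modified_combinations: List[AlgorithmParameterCombination],
                            new_dataset_id: Optional[str] = None) -> List[dict]:
    """S320-S340: apply the modified combinations, then create new storage areas and tasks."""
    clone.combinations = modified_combinations
    if new_dataset_id is not None:
        clone.dataset_id = new_dataset_id              # otherwise the original data set is kept by default
    clone.batch_name = clone.batch_name + "-clone"     # new storage areas, hence new training results
    return handle_batch_request(clone)                 # S330 and S340, once per modified combination
```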
In practice, an algorithm engineer may find during AI training that some combinations will not yield the expected results; such AI tasks should not keep occupying computing resources and should be stopped or deleted promptly so as to release valuable computing resources. In the present disclosure, each AI training task in a batch of training tasks may be stopped or deleted individually without affecting the other AI tasks in the same batch.
Through cloning, certain AI training tasks can be modified to a certain extent without being recreated, which saves the time of creating training tasks and improves the user's working efficiency.
The results of AI training can also be presented visually. According to one embodiment of the disclosure, a visualization component is created in response to a received visualization request, and the contents of the storage areas of at least some of the AI training tasks in the batch of training tasks are mounted into the visualization component.
In embodiments of the present disclosure, the storage space for the batch of training tasks may essentially be a "volume". When a batch of training tasks is created, the name of a "volume" can be specified, and a directory corresponding to each subtask can be created in the "volume"; that is, the AI training tasks of the same batch can share one "volume", but the actual path where each AI training task stores its data is a subfolder created under the "volume", and these subfolders are independent of each other and do not interfere with one another. By mounting these volumes into the corresponding visualization components, the content in the volumes can be displayed visually.
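For illustration, the following sketch shows how a visualization component could aggregate results from the per-task subfolders of one shared volume; the mount point /volumes/task-set-a and the file name metrics.json are assumptions, not mandated by the disclosure.

```python
# Read one result file per task subfolder under the shared volume and gather them for display.
import json
from pathlib import Path
from typing import Dict


def collect_batch_results(volume_root: str) -> Dict[str, dict]:
    """One subfolder per task; each holds only that task's training results."""
    results: Dict[str, dict] = {}
    for task_dir in sorted(Path(volume_root).iterdir()):
        if not task_dir.is_dir():
            continue
        metrics_file = task_dir / "metrics.json"
        if metrics_file.exists():
            results[task_dir.name] = json.loads(metrics_file.read_text())
    return results


# e.g. collect_batch_results("/volumes/task-set-a") -> {"task-1": {...}, "task-2": {...}, "task-3": {...}}
```

Re-running this collection whenever the volume content changes is one simple way to keep the visualization in step with updates to the storage area.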
The intermediate training files and training result files of all training tasks in the same batch can be integrated, and with the help of an AI training visualization tool the training effect of every training task in the batch can be displayed in a visual chart, helping the user compare the training effects of the tasks intuitively.
Further, in response to an update to the content of the storage area, the updated content is presented in the visualization component. In this embodiment, if the content in the volume changes, the content presented by the visualization component changes accordingly, so that updated results can be presented to the user in real time.
In actual operation, a template can be provided at the user end for the user to select from or extend as needed. The user can design an experimental scheme for screening the optimal model. In addition, thanks to cloning and reuse, the user's repetitive work in the model-screening process is eliminated, and comparative experiments can be extended flexibly.
Managing AI training tasks in the manner of a scientific experiment gives originally independent training tasks an association relationship, eliminates as much repetitive labor as possible for the algorithm engineer in the process of selecting the optimal model, helps the user obtain training results promptly and effectively, and allows the training effects of different training tasks in the experiment to be compared intuitively.
Because the AI training tasks in the same batch have the same scheduling priority, the training tasks can be started in an orderly manner: high-priority training can be carried out in time, and the training tasks of the same batch can be started one after another within a short time range.
According to an embodiment of the present disclosure, there is also provided an electronic device including: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The technical solution of the present disclosure can be applied to the field of artificial intelligence and implemented in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
FIG. 4 shows a combined processing device 400 that includes the above-described computing device 402, a universal interconnect interface 404, and other processing devices 406. The computing device according to the present disclosure interacts with the other processing devices to jointly complete operations specified by the user. FIG. 4 is a schematic view of the combined processing device.
The other processing devices include one or more types of general-purpose or special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors; the number of processors they include is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfer and basic control such as starting and stopping the machine learning computing device; they may also cooperate with the machine learning computing device to complete computing tasks.
The universal interconnect interface is used to transfer data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device obtains the required input data from the other processing devices and writes it into on-chip storage of the computing device; it can obtain control instructions from the other processing devices and write them into an on-chip control cache; and it can read the data in its own storage module and transmit it to the other processing devices.
Optionally, the architecture may further comprise a storage device 408, which is connected to the computing device and the other processing devices, respectively. The storage device is used to store data of the computing device and the other processing devices, and it is especially suitable for data that cannot be held entirely in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as the SoC (system on chip) of devices such as mobile phones, robots, unmanned aerial vehicles, and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
Referring to FIG. 5, an exemplary board card is provided that may include, in addition to the chip 502, other components including but not limited to: a memory device 504, an interface apparatus 506, and a control device 508.
The memory device is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of memory cells 510, each group being connected to the chip through a bus. Each group of memory cells may be DDR SDRAM (double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency, because data can be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the memory device may include four groups of memory cells, and each group may include a plurality of DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. In one embodiment, each group of memory cells includes a plurality of double-data-rate synchronous dynamic random access memories arranged in parallel; DDR can transfer data twice in one clock cycle. A controller for the DDR is arranged in the chip to control the data transmission and data storage of each memory cell.
The interface apparatus is electrically connected to the chip in the chip package structure and is used to enable data transfer between the chip and an external device 512, such as a server or a computer. For example, in one embodiment, the interface apparatus may be a standard PCIe interface, and the data to be processed is transmitted to the chip by the server through the standard PCIe interface to realize the data transfer. In another embodiment, the interface apparatus may be another interface; the present disclosure does not limit the concrete form of that other interface, provided that the interface unit can realize the transfer function. In addition, the computation results of the chip are transmitted back to the external device (e.g., the server) by the interface apparatus.
The control device is electrically connected to the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, processing cores, or processing circuits and may drive a plurality of loads; it can therefore be in different working states such as heavy load and light load. The control device can regulate and control the working states of the plurality of processing chips, processing cores, and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, cell phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and the medical devices include nuclear magnetic resonance apparatuses, B-mode ultrasound machines, and/or electrocardiographs.
It should be noted that for simplicity of description, the above-described method embodiments are shown as a series of combinations of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, optical, acoustic, magnetic or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. With this understanding, the technical solution of the present disclosure may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. For persons skilled in the art, there may be variations in the specific embodiments and the application scope based on the ideas of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (17)

1. A method for managing AI training tasks, comprising:
receiving a request to create a batch of training tasks, the request including at least one algorithm-parameter combination as a variable and a dataset identification for the algorithm-parameter combination;
in response to receiving the request, creating a memory area for each algorithm-parameter combination to facilitate storing training results; and
creating an AI training task for each algorithm-parameter combination in response to the creation of the memory area.
2. The method of claim 1, wherein the batch of training tasks is a set of AI training tasks to solve the same technical problem; and/or a set of AI training tasks that share the same data set.
3. The method of claim 1 or 2, further comprising: triggering a task monitor in response to creating the AI training task, wherein the task monitor is used for monitoring the running state of the AI training task.
4. The method of any of claims 1-3, further comprising:
in response to a request to create the batch of training tasks, a priority is assigned to each AI training task.
5. The method as recited in claim 4, further comprising:
in response to creating the AI training task, placing the created AI training task in a wait queue; and
allocating resources for the AI training task according to the priority of the AI training task.
6. The method of claim 4 or 5, wherein all AI training tasks of the same batch of training tasks have the same priority.
7. The method according to any of claims 4-6, wherein resources are allocated preferentially for AI training tasks with high priority.
8. The method of any of claims 4-7, wherein when the priority of one AI training task is adjusted, the priorities of other non-running AI training tasks in the same batch of training tasks are adjusted identically.
9. The method according to any of claims 4-8, wherein the resources occupied by a running AI training task remain until the end of the running AI training task.
10. The method according to any one of claims 4-9, wherein a first-in-first-out strategy is employed for AI training tasks with the same priority.
11. The method of any of claims 1-10, wherein a copy of the AI training task is generated in response to receiving a clone request for the AI training task.
12. The method of any of claims 1-10, further comprising:
in response to receiving a clone request for an AI training task, sending a copy of the AI training task for modification;
receiving a new request to create a modified AI training task, the new request including at least one modified algorithm-parameter combination as a new variable and a dataset identification for the modified algorithm-parameter combination;
in response to receiving the new request, creating a new memory area for each modified algorithm-parameter combination to facilitate storing new training results; and
in response to the creation of the new memory area, a new AI training task is created for each modified algorithm-parameter combination.
13. The method of any of claims 1-12, wherein, in response to the received visualization request, a visualization component is created and contents of a memory area of at least a portion of the AI training tasks in the batch of training tasks are mounted into the visualization component.
14. The method of claim 13, wherein the updated content is presented in the visualization component in response to an update to the content of the storage area.
15. The method of any one of claims 1-14, wherein the method is performed in a Kubernetes distributed cluster.
16. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-15.
17. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any one of claims 1-15.
CN202110893221.6A 2021-08-04 2021-08-04 Method for managing AI training task and related product Pending CN115878250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110893221.6A CN115878250A (en) 2021-08-04 2021-08-04 Method for managing AI training task and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110893221.6A CN115878250A (en) 2021-08-04 2021-08-04 Method for managing AI training task and related product

Publications (1)

Publication Number Publication Date
CN115878250A true CN115878250A (en) 2023-03-31

Family

ID=85762128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110893221.6A Pending CN115878250A (en) 2021-08-04 2021-08-04 Method for managing AI training task and related product

Country Status (1)

Country Link
CN (1) CN115878250A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination