CN115878250A - Method for managing AI training task and related product - Google Patents
- Publication number
- CN115878250A CN115878250A CN202110893221.6A CN202110893221A CN115878250A CN 115878250 A CN115878250 A CN 115878250A CN 202110893221 A CN202110893221 A CN 202110893221A CN 115878250 A CN115878250 A CN 115878250A
- Authority
- CN
- China
- Prior art keywords
- training
- task
- tasks
- response
- training task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present disclosure provides a method for managing AI training tasks, and a related product. The method may be implemented in a computing processing device, which may be included in a combined processing device that may also include a universal interconnect interface and other processing devices. The computing processing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further include a storage device connected to the computing processing device and the other processing devices, respectively, for storing data of the computing processing device and the other processing devices.
Description
Technical Field
The present disclosure relates to the field of computers, and more particularly, to management of Artificial Intelligence (AI) training tasks.
Background
Training an AI model usually demands strong expertise and is complex: an algorithm engineer must repeatedly train with different algorithms and parameters and compare the training results to finally obtain the model with the best effect. This is a continuously repeated, iterative process that comprises multiple model training tasks. The algorithm engineer needs to compare together the training tasks that solve the same problem and select the optimal model; an isolated training task has limited value for the final model. At present, no compute cluster provides an automated, scientific training-experiment method for the algorithm engineer.
Existing compute clusters offer no scientific means to help an algorithm engineer manage training tasks. The engineer usually has to repeatedly create training tasks with different algorithm-parameter combinations on a data set describing one problem, so a batch of training tasks that should belong to a single model training process is split into independent tasks, and the relationships among them are ignored. The engineer must repeat similar task-creation operations many times while building the training model, and must likewise repeat similar result-processing and visualization operations in the result-screening stage. During training, the engineer cannot visually compare the training effect of each task.
Existing privatized clusters also lack a notion of priority: high-priority training tasks cannot be guaranteed to be scheduled or started first, and an algorithm engineer cannot know when a group of training tasks solving the same problem will be scheduled.
Disclosure of Invention
One object of the present disclosure is to solve the drawback of the prior art that AI training tasks are not well managed.
According to a first aspect of the present disclosure, there is provided a method for managing an AI training task, comprising: receiving a request to create a batch of training tasks, the request including at least one algorithm-parameter combination as a variable and a data set identification for the algorithm-parameter combination; in response to receiving the request, creating a storage area for each algorithm-parameter combination so as to store training results; and, in response to the creation of the storage areas, creating an AI training task for each algorithm-parameter combination.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
At least one beneficial effect of the technical scheme of the present disclosure is that a plurality of AI training tasks can be managed in batches.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates a method for managing AI training tasks in accordance with one embodiment of the disclosure;
FIG. 2 shows a schematic diagram of a batch training task constructed in accordance with one embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of a method of cloning a training task to facilitate modification of an AI training task, according to one embodiment of the present disclosure;
FIG. 4 illustrates a combined processing device; and
FIG. 5 provides an exemplary board card.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination", or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
The present disclosure may be based on Kubernetes distributed clusters. Kubernetes is an open-source system for managing containerized applications across multiple hosts in a cloud platform; it aims to make deploying containerized applications simple and efficient, and provides mechanisms for application deployment, planning, updating, and maintenance.
Fig. 1 illustrates a method for managing AI training tasks according to one embodiment of the present disclosure, including: receiving a request to create a batch of training tasks, the request including at least one algorithm-parameter combination as a variable and a dataset identification for the algorithm-parameter combination, in operation S110; in operation S120, in response to receiving the request, creating a storage area for each algorithm-parameter combination so as to store the training result; and creating an AI training task for each algorithm-parameter combination in response to the creation of the memory area, at operation S130.
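Operations S110-S130 can be sketched as a minimal illustration in Python; the request fields, task names, and storage paths below are hypothetical stand-ins, not part of the disclosed implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BatchRequest:
    """A request to create a batch of training tasks (operation S110)."""
    combos: List[Tuple[str, dict]]  # (algorithm, parameter set) variables
    dataset_id: str                 # identification of the shared data set

def create_batch(request: BatchRequest) -> List[dict]:
    """Create one storage area (S120) and one AI training task (S130)
    per algorithm-parameter combination."""
    tasks = []
    for i, (algorithm, params) in enumerate(request.combos, start=1):
        storage = f"volume/{request.dataset_id}/task-{i}"  # placeholder path
        tasks.append({"name": f"task-{i}", "algorithm": algorithm,
                      "params": params, "dataset": request.dataset_id,
                      "storage": storage})
    return tasks

# Three combinations trained on the same data set
req = BatchRequest(combos=[("alg1", {"lr": 0.1}),
                           ("alg1", {"lr": 0.01}),
                           ("alg2", {"lr": 0.1})],
                   dataset_id="flowers-v1")
batch = create_batch(req)
```

Each task carries its own storage path while all tasks point at the same data set identification, matching the variable/invariant split described below.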
According to an embodiment of the present disclosure, a plurality of independent training tasks may be managed uniformly within one training process as a batch training task; in other words, each batch training task includes a plurality of training tasks, and these training tasks are combined through a specific relationship.
Each training task may include a combination of "algorithm-parameters" as variables, while the invariants for the training task are the operating environment of the system, resource configuration, and so on. These algorithm-parameter combinations may share or point to the same data set and be trained through the same data set.
More specifically, the data sets for training may be stored in the server and may be classified according to task. For example, for flower identification, the data set may contain pictures of various flowers at various angles and in various time periods; for vehicle identification, the data set may contain pictures of vehicles of various colors, types, angles, and sizes.
The data set may also be continuously updated, thereby providing richer data for training. The data set identification may be the name, memory address, etc. of the data set that a task needs to employ in order for the task to be able to accurately use the specified data set.
When the server receives the request, a storage area may be created for each algorithm-parameter combination to store the training results. The storage area described herein may be a Persistent Volume (PV). In the present disclosure, a data set is stored in storage called a "volume", which may exist on a local physical machine (computer) or in remote network storage. When a "volume" is used, it can be mounted to a predetermined position based on its absolute path. In a Kubernetes cluster, each path of each "volume" has a corresponding name, and the "volume" can be mounted directly, by name, into the environment that runs an AI training task.
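As an illustration only: a persistent storage area of the kind described here is commonly requested in Kubernetes through a PersistentVolumeClaim manifest. The sketch below builds such a manifest as a plain dictionary; the claim name and requested size are assumptions for the example, not values from the disclosure.

```python
def pvc_manifest(task_name: str, size: str = "10Gi") -> dict:
    """Build a Kubernetes PersistentVolumeClaim manifest for one task's
    training-result storage area (names and size are illustrative)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{task_name}-results"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }

manifest = pvc_manifest("task-a1")
```

Such a manifest would normally be submitted to the cluster API server, and the resulting claim mounted by name into the training task's container, as the paragraph above describes.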
Further, the server may create a respective task for each algorithm-parameter combination, either upon receiving the request or in response to the creation of the storage areas. A plurality of such tasks combined together is called a training task set.
It should be understood that the creation of the task and the creation of the storage area may be performed in parallel, or the corresponding task may be created after the storage area is created.
FIG. 2 shows a schematic diagram of a batch training task constructed according to one embodiment of the present disclosure.
As shown in fig. 2, for example, if a user needs to create a task set a, the task set may include a plurality of tasks, such as a task A1, a task A2, and a task A3, where the task A1 includes an algorithm 1 and a parameter set 1, the task A2 includes an algorithm 1 and a parameter set 2, and the task A3 includes an algorithm 2 and a parameter set 3. The three tasks A1-A3 share the same data set.
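The task set of FIG. 2 can be modeled, for illustration, as three tasks sharing one data set identification (all names below are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingTask:
    name: str
    algorithm: str
    param_set: str
    dataset: str  # the shared data set identification

SHARED_DATASET = "dataset-A"  # hypothetical identifier
task_set_a = [
    TrainingTask("A1", "algorithm-1", "params-1", SHARED_DATASET),  # algorithm 1, parameter set 1
    TrainingTask("A2", "algorithm-1", "params-2", SHARED_DATASET),  # algorithm 1, parameter set 2
    TrainingTask("A3", "algorithm-2", "params-3", SHARED_DATASET),  # algorithm 2, parameter set 3
]
```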
According to one embodiment of the present disclosure, the batch training tasks are a set of AI training tasks for solving the same technical problem; and/or a set of AI training tasks that share the same data set.
The same technical problem as described herein may be, for example, flower recognition, vehicle recognition, automatic driving, automatic translation, etc., and the same data set is essentially set for the same or similar technical problem.
According to an embodiment of the present disclosure, the method of the present disclosure further comprises: triggering a task monitor in response to creating the AI training task, the task monitor being used to monitor the running state of the AI training task.
In this embodiment, the task monitor may be triggered after the AI training task is created. The task monitor may be one or more visual interfaces. The visualization interface may display the results generated when each task is performed individually or may display a plurality of results generated when a plurality of tasks are performed in a batch. Displaying multiple results in batches facilitates a user's timely comparison of results produced by multiple tasks.
According to an embodiment of the present disclosure, further comprising: in response to a request to create the batch of training tasks, a priority is assigned to each AI training task.
In the existing scheme there is no priority difference between tasks, which may leave some tasks unable to obtain computing resources, and therefore unable to execute, for a long time.
In actual operation, a company may run multiple businesses at the same time, each with multiple tasks, and each task may need a compute cluster to solve a practical problem. Even tasks belonging to the same business can be ordered according to their urgency, so the corresponding AI training tasks need different priorities. A high-priority training task can then obtain the computing resources it needs more promptly: when resources are sufficient, high-priority training tasks are allocated computing resources first; when resources are insufficient, a high-priority training task obtains computing resources as soon as they are released; and during distributed training, the task expects every computing node to satisfy its computational-resource requirements simultaneously.
The granularity of the priority can be various, for example, the priority can be divided into three levels, namely, high, medium and low levels, or can be divided into ten levels by the numbers 1-10.
According to an embodiment of the present disclosure, further comprising: in response to creating the AI training task, placing the created AI training task in a wait queue; and allocating resources for the AI training task according to the priority of the AI training task.
After the server creates AI training tasks as requested, these tasks may be placed in a wait queue, where they wait for still-executing AI training tasks to end or for sufficient computing resources to be released. It should be understood that AI training tasks with different priorities are in different queues, while tasks in the same queue share the same priority, so tasks within one queue are executed sequentially, i.e. a first-in-first-out strategy is adopted for tasks of equal priority. The algorithm engineer can thus form an expectation of when a training task will be executed based on its priority and creation time.
For example, when the server does not have enough computational resources to support the AI training tasks in the queue, then the training tasks in the queue remain in a wait state; when the computing resources are released, resources can be preferentially allocated to the AI training task with the high priority, and when the AI training task with the high priority is executed, the AI training task with the lower priority is executed next. Thus, in actual running operation, the queue of each AI training task will be executed from high to low in priority.
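A minimal sketch of this queueing behavior, assuming one FIFO queue per priority level and a scheduler invoked whenever resources are released (the class and level names are illustrative, not from the disclosure):

```python
from collections import deque

class PriorityScheduler:
    """One FIFO wait queue per priority level; higher levels drain first."""
    def __init__(self, levels=("high", "medium", "low")):
        self.queues = {level: deque() for level in levels}
        self.order = list(levels)

    def submit(self, task_name, priority):
        """Place a newly created AI training task in its wait queue."""
        self.queues[priority].append(task_name)

    def next_task(self):
        """Pick the next task to run when computing resources are released."""
        for level in self.order:
            if self.queues[level]:
                return self.queues[level].popleft()
        return None  # nothing waiting

sched = PriorityScheduler()
sched.submit("low-1", "low")
sched.submit("high-1", "high")
sched.submit("high-2", "high")  # same priority: first in, first out
```

Draining the queues in priority order realizes the high-to-low execution described above, with first-in-first-out within each priority.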
According to one embodiment of the present disclosure, all AI training tasks of the same batch of training tasks may be set to have the same priority, which makes it possible to set priorities for a plurality of AI training tasks at once without setting priorities for all AI training tasks one by one. In addition, the situation that key training tasks in the same batch of training tasks cannot be started for a long time is avoided.
Further, when the priority of one AI training task is adjusted, the priorities of other non-running AI training tasks in the same batch of training tasks are adjusted identically.
In this embodiment, the priority of each AI training task may be adjusted at any time, and when the priority of one AI training task is adjusted, the priorities of other AI training tasks in the same batch of training tasks as the AI training task are adjusted accordingly. Therefore, when the priority of the AI training tasks in a certain batch of training tasks is adjusted, the queue where the AI training tasks are located also changes, that is, the AI training tasks with the adjusted priority are added to the waiting queue with the corresponding priority and are removed from the waiting queue with the current priority.
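The re-queueing described here can be sketched as follows, assuming the wait queues are plain FIFO deques keyed by priority; running tasks are deliberately left untouched:

```python
from collections import deque

def adjust_batch_priority(queues, batch, running, new_priority):
    """Move all non-running tasks of a batch into the wait queue of the new
    priority, removing them from their current queues. Running tasks keep
    their resources and are not moved (hypothetical queue layout)."""
    for level, q in queues.items():
        if level == new_priority:
            continue
        for task in list(q):  # copy: we mutate q while scanning
            if task in batch and task not in running:
                q.remove(task)
                queues[new_priority].append(task)

queues = {"high": deque(), "low": deque(["A1", "A2", "B1"])}
adjust_batch_priority(queues, batch={"A1", "A2", "A3"},
                      running={"A3"}, new_priority="high")
```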
It should also be understood that, according to embodiments of the present disclosure, the priority of a running AI training task need not be adjusted; in other words, a running AI training task is unaffected by the priority adjustment and keeps executing until it completes.
From the perspective of resource occupation, the resources occupied by a running AI training task are retained until that task finishes. Not interrupting a running AI training task helps maintain its continuity and avoids the waste of resources that rerunning an interrupted task would cause.
According to one embodiment of the present disclosure, a copy of an AI training task is generated in response to receiving a clone request for that AI training task.
Cloning refers to replicating one or more AI training tasks to produce tasks identical to the cloned ones. For example, when a user needs to run a training task again to confirm that the training result is stable and reproducible, the task need not be recreated; the existing AI training task can simply be cloned and rerun. According to one embodiment of the present disclosure, a single AI training task may be cloned, and batch training tasks may be cloned in batch.
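A clone operation of this kind can be illustrated as a deep copy under a new name (the "-clone" naming suffix is an assumption for the example):

```python
import copy

def clone_task(task, suffix="-clone"):
    """Produce an identical copy of an AI training task under a new name,
    so the training can be rerun without re-entering its configuration."""
    cloned = copy.deepcopy(task)  # deep copy so parameter dicts are independent
    cloned["name"] = task["name"] + suffix
    return cloned

original = {"name": "A1", "algorithm": "alg1",
            "params": {"lr": 0.1}, "dataset": "ds"}
clone = clone_task(original)

# Batch cloning is then just cloning every task in the batch:
batch_clones = [clone_task(t) for t in [original]]
```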
The generated clone of an AI training task can be stored directly in the server without being displayed, or it can be sent to the user for viewing.
Further, according to one embodiment of the present disclosure, the AI training task may also be cloned to facilitate modifications to the original AI training task.
FIG. 3 illustrates a flow diagram of a method of cloning a training task to facilitate modification of an AI training task, according to one embodiment of the disclosure.
As shown in FIG. 3, the method of the present disclosure further comprises: in operation S310, in response to receiving a clone request for a batch of training tasks, sending copies of the batch of training tasks for modification; in operation S320, receiving a new request to create a modified batch of training tasks, the new request including at least one modified algorithm-parameter combination as a new variable and a data set identification for the modified algorithm-parameter combination; in operation S330, in response to receiving the new request, creating a new storage area for each modified algorithm-parameter combination so as to store new training results; and in operation S340, in response to the creation of the new storage areas, creating a new AI training task for each modified algorithm-parameter combination.
After the server receives the clone request, the generated AI training task may be returned to the user for the user to view the cloned AI training task as needed, as shown in operation S310.
Next, the user may send a request to modify the AI training task, along with the variables (e.g., algorithms, parameters) involved in the modification. A new data set identification for the cloned AI training task may also be sent to the server. It should be understood that by default the data set identification may be the same as that of the original AI training task (e.g., the clone may reuse the original's name and storage path), but a different name and path may also be used at the user's request.
Next, after the new AI training task is created, a new storage area may be allocated for the new algorithm-parameter combination, or the storage area of the original AI training task may be used by default. The storage area may also be a persistent volume as described above.
When all the preparation work is ready, the cloned AI training task can be created according to the request, and new algorithm-parameters and the like are adopted as variables of the cloned AI training task.
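Operations S310-S340 can be sketched together as a clone-then-modify helper; reusing the original storage area when the data set is unchanged mirrors the default described above, while the field names and path scheme are hypothetical:

```python
import copy
from typing import Optional

def clone_and_modify(task, new_params, new_dataset: Optional[str] = None):
    """Clone a task (S310), apply modified algorithm-parameters (S320),
    and pick a storage area (S330); the returned dict is the new AI
    training task (S340). The original task's storage area is reused
    when the data set identification is unchanged."""
    new_task = copy.deepcopy(task)
    new_task["params"].update(new_params)
    if new_dataset is not None:
        new_task["dataset"] = new_dataset
        new_task["storage"] = f"volume/{new_dataset}/{task['name']}"
    return new_task

base = {"name": "A1", "params": {"lr": 0.1, "epochs": 10},
        "dataset": "ds1", "storage": "volume/ds1/A1"}
modified = clone_and_modify(base, {"lr": 0.01})  # only the variable changes
```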
In practice, an algorithm engineer may find during AI training that some combinations cannot obtain the expected results; such AI tasks should not keep occupying computing resources and should be stopped or deleted in time to release valuable computing resources. In the present disclosure, each AI training task in a batch may be stopped or deleted individually without affecting the other tasks in the same batch.
Through the cloning mode, a certain amount of modification can be carried out on certain AI training tasks without reestablishing the AI training tasks, so that the establishment time of the training tasks is saved, and the working efficiency of a user is improved.
The result of the AI training can also be visually presented. According to one embodiment of the disclosure, a visualization component is created in response to a received visualization request, and the contents of a storage area of at least a portion of AI training tasks in the batch of training tasks are mounted into the visualization component.
In embodiments of the present disclosure, the storage space for a batch training task may essentially be one "volume". When a batch of training tasks is created, the name of a "volume" can be specified and a directory can be created in it for each subtask; that is, the same batch of AI training tasks uses one "volume" together, but the true address where each AI training task stores its results is a subfolder created under the "volume", and these subfolders are independent and do not interfere with each other. By mounting these "volumes" into the corresponding visualization components, their content can be displayed visually.
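The one-volume-per-batch, one-subfolder-per-task layout can be illustrated like this (the root path and naming scheme are assumptions for the example):

```python
import os

def task_subdir(volume_root, batch_name, task_name):
    """Each batch shares one 'volume'; each task writes into its own
    subfolder so tasks do not interfere with one another."""
    path = os.path.join(volume_root, batch_name, task_name)
    os.makedirs(path, exist_ok=True)
    return path

root = "/tmp/demo-volume"  # stand-in for the mounted 'volume' path
p1 = task_subdir(root, "batch-A", "task-A1")
p2 = task_subdir(root, "batch-A", "task-A2")
```

A visualization component pointed at `root` would then see every task's intermediate and result files side by side, which is what enables the batch comparison described next.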
The intermediate files and result files of all training tasks in the same batch can be integrated, and the training effects of all tasks in the batch displayed in a visual chart by means of an AI-training visualization tool, helping the user visually compare the training effects of the tasks.
Further, in response to an update of the content of a storage area, the updated content is presented in the visualization component. In this embodiment, if the content in the "volume" changes, the content presented by the visualization component changes with it, so updated results can be presented to the user in real time.
In actual operation, a template can be provided at the user end for the user to select or extend the required content, so the user can devise an experimental scheme for screening the optimal model. In addition, cloning and multiplexing eliminate the user's repeated work in the model-screening process, and comparative experiments can be flexibly expanded.
Managing AI training tasks in the manner of a scientific experiment gives originally independent training tasks an association relationship, eliminates as far as possible the repeated labor of the algorithm engineer in selecting an optimal model, helps the user obtain training results promptly and effectively, and allows the training effects of different training tasks in an experiment to be compared visually.
The AI training tasks in the same batch have the same scheduling priority, so training tasks start in an orderly manner: high-priority training proceeds in time, and tasks in the same batch start in sequence within a short time range.
According to an embodiment of the present disclosure, there is also provided an electronic device including: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The technical scheme of the present disclosure can be applied in the field of artificial intelligence and implemented in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
FIG. 4 shows a combined processing device 400 that includes the computing device 402 described above, a universal interconnect interface 404, and other processing devices 406. The computing device according to the present disclosure interacts with the other processing devices to jointly complete operations specified by a user. FIG. 4 is a schematic view of the combined processing device.
The other processing devices include one or more general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors; their number is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfer and basic control such as starting and stopping the machine learning computing device; they may also cooperate with the machine learning computing device to complete computing tasks.
The universal interconnect interface transfers data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device obtains required input data from the other processing devices and writes it into on-chip storage of the computing device; it can obtain control instructions from the other processing devices and write them into an on-chip control cache; and it can read data from the memory module of the computing device and transmit it to the other processing devices.
Optionally, the architecture may further comprise a storage device 408 connected to the computing device and the other processing devices, respectively. The storage device stores data of the computing device and the other processing devices, and is particularly suitable for data that cannot be held entirely in the internal storage of either.
The combined processing device can serve as the SoC (system on chip) of equipment such as mobile phones, robots, unmanned aerial vehicles, and video-monitoring devices, effectively reducing the core area of the control part, increasing processing speed, and lowering overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
Referring to fig. 5, an exemplary board card is provided that may include other components in addition to the chip 502, including but not limited to: a memory device 504, an interface apparatus 506, and a control device 508.
The memory device is connected with the chip in the chip packaging structure through a bus and is used for storing data. The memory device may include a plurality of groups of storage units 510. Each group of storage units is connected with the chip through a bus. It is understood that each group of storage units may be a DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency: it allows data to be read on both the rising and falling edges of the clock pulse, so it transfers data twice per clock cycle and is twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of storage units, and each group may include a plurality of DDR4 chips (dies). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, where 64 of the 72 bits are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes a plurality of double-data-rate synchronous dynamic random access memories arranged in parallel. A controller for the DDR is arranged in the chip and is used for controlling the data transmission and data storage of each storage unit.
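As a rough illustration of the figures above (not taken from the patent — the clock frequency is an assumed example), the peak bandwidth of one 72-bit DDR4 controller can be computed from the 64 usable data bits and the two transfers per clock cycle that "double data rate" implies:

```python
# Peak bandwidth of one 72-bit DDR4 controller as described in the text:
# 64 of the 72 bits carry data (the other 8 are ECC), and DDR transfers
# data on both clock edges, i.e. twice per clock cycle.
DATA_BITS = 64              # usable data width, ECC bits excluded
CLOCK_HZ = 1_600_000_000    # example 1600 MHz I/O clock (assumed, not from the patent)
TRANSFERS_PER_CYCLE = 2     # double data rate: rising + falling edge

bytes_per_transfer = DATA_BITS // 8
peak_bw = bytes_per_transfer * TRANSFERS_PER_CYCLE * CLOCK_HZ
print(f"peak bandwidth: {peak_bw / 1e9:.1f} GB/s")  # prints "peak bandwidth: 25.6 GB/s"
```

At the assumed 1600 MHz clock this yields 25.6 GB/s per controller, which is why removing the clock-doubling factor (standard SDRAM) would halve the throughput at the same frequency.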
The interface device is electrically connected with the chip in the chip packaging structure. The interface device is used to enable data transfer between the chip and an external device 512, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIe interface: the data to be processed is transmitted from the server to the chip through the standard PCIe interface, thereby implementing the data transfer. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of that other interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., the server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include a plurality of processing chips, processing cores, or processing circuits and may carry a plurality of loads, so it can be in different working states such as heavy load and light load. Through the control device, regulation of the working states of the processing chips, processing cores, and/or processing circuits in the chip can be realized.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, cell phones, tachographs, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicles include airplanes, ships, and/or automobiles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance apparatuses, B-mode ultrasound machines, and/or electrocardiographs.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts, but those skilled in the art should understand that the present disclosure is not limited by the order of the acts described, since according to the present disclosure some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments, and that the acts and modules involved are not necessarily required by the present disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division into units is only one kind of logical-function division, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections of devices or units through interfaces, and may be electrical, optical, acoustic, magnetic, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. With this understanding, the technical solution of the present disclosure may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is exemplary only and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Those skilled in the art may, based on the ideas of the present disclosure, make variations in the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.
Claims (17)
1. A method for managing AI training tasks, comprising:
receiving a request to create a batch of training tasks, the request including at least one algorithm-parameter combination as a variable and a dataset identification for the algorithm-parameter combination;
in response to receiving the request, creating a memory area for each algorithm-parameter combination to facilitate storing training results; and
an AI training task is created for each algorithm-parameter combination in response to the creation of the memory area.
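As a purely illustrative sketch of the flow recited in claim 1 — receive a batch request carrying algorithm-parameter combinations and a data-set identification, create a storage area per combination, then create a training task per combination — the following assumes hypothetical names (`BatchRequest`, `create_batch`, the `results/...` path scheme) that are not taken from the patent:

```python
import uuid
from dataclasses import dataclass

@dataclass
class AlgoParamCombo:
    algorithm: str
    params: dict          # the varying parameters of this combination

@dataclass
class BatchRequest:
    combos: list          # algorithm-parameter combinations (the variables)
    dataset_id: str       # identification of the data set used by the combinations

def create_batch(request: BatchRequest) -> list:
    """For each combination: first create a storage area, then an AI training task."""
    tasks = []
    for combo in request.combos:
        # step 1: a storage area dedicated to this combination's training results
        storage_area = f"results/{request.dataset_id}/{uuid.uuid4().hex}"
        # step 2: create the training task once its storage area exists
        tasks.append({
            "algorithm": combo.algorithm,
            "params": combo.params,
            "dataset": request.dataset_id,
            "storage": storage_area,
            "state": "waiting",
        })
    return tasks
```

For example, a request with two learning-rate variants of one algorithm would yield two tasks, each bound to its own storage area but sharing the same data-set identification.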
2. The method of claim 1, wherein the batch of training tasks is a set of AI training tasks for solving the same technical problem and/or a set of AI training tasks sharing the same data set.
3. The method of claim 1 or 2, further comprising: triggering a task monitor in response to creating the AI training task, wherein the task monitor is used for monitoring the running state of the AI training task.
4. The method of any of claims 1-3, further comprising:
in response to a request to create the batch of training tasks, a priority is assigned to each AI training task.
5. The method as recited in claim 4, further comprising:
in response to creating the AI training task, placing the created AI training task in a wait queue; and
allocating resources to the AI training task according to the priority of the AI training task.
6. The method of claim 4 or 5, wherein all AI training tasks of the same batch of training tasks have the same priority.
7. The method according to any of claims 4-6, wherein resources are allocated preferentially for AI training tasks with high priority.
8. The method of any of claims 4-7, wherein when the priority of one AI training task is adjusted, the priorities of other non-running AI training tasks in the same batch of training tasks are adjusted identically.
9. The method according to any of claims 4-8, wherein the resources occupied by a running AI training task remain until the end of the running AI training task.
10. The method according to any one of claims 4-9, wherein a first-in-first-out strategy is employed for AI training tasks with the same priority.
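The scheduling behavior of claims 4-10 — higher-priority tasks receive resources first, and tasks of equal priority are served first-in-first-out — can be sketched with a priority heap keyed on (negated priority, arrival sequence number). This is an illustrative sketch with assumed names, not the patent's implementation:

```python
import heapq
import itertools

class WaitQueue:
    """Waiting queue sketch: higher priority is served first; equal
    priorities are served first-in-first-out (claims 5, 7, and 10)."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # arrival counter: FIFO tie-breaker

    def put(self, task, priority: int):
        # negate priority so a larger number pops earlier from the min-heap
        heapq.heappush(self._heap, (-priority, next(self._seq), task))

    def get(self):
        _, _, task = heapq.heappop(self._heap)
        return task

q = WaitQueue()
q.put("task-A", priority=1)
q.put("task-B", priority=5)
q.put("task-C", priority=5)
print([q.get() for _ in range(3)])  # ['task-B', 'task-C', 'task-A']
```

Note that `task-B` precedes `task-C` despite equal priority because it arrived first, and the low-priority `task-A` only receives resources last, matching the first-in-first-out strategy of claim 10.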
11. The method of any of claims 1-10, wherein a copy of an AI training task is generated in response to receiving a clone request for the AI training task.
12. The method of any of claims 1-10, further comprising:
in response to receiving a clone request for an AI training task, sending a copy of the AI training task for modification;
receiving a new request to create a modified AI training task, the new request including at least one modified algorithm-parameter combination as a new variable and a dataset identification for the modified algorithm-parameter combination;
in response to receiving the new request, creating a new memory area for each modified algorithm-parameter combination to facilitate storing new training results; and
in response to the creation of the new memory area, a new AI training task is created for each modified algorithm-parameter combination.
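The first step of the cloning flow in claims 11-12 — handing the user an editable copy of a task whose modified algorithm-parameter combination is later resubmitted — might look like the following sketch (hypothetical field names; in particular, dropping the `storage` field reflects that the resubmitted task gets a *new* storage area per claim 12):

```python
import copy

def clone_task(task: dict) -> dict:
    """Return an editable copy of an AI training task for modification.
    The copy carries no storage area; a new one is created on resubmission."""
    clone = copy.deepcopy(task)
    clone.pop("storage", None)   # new training results go to a new storage area
    clone["state"] = "draft"
    return clone

original = {"algorithm": "resnet", "params": {"lr": 0.1},
            "dataset": "cifar10", "storage": "results/abc", "state": "done"}
draft = clone_task(original)
draft["params"]["lr"] = 0.01     # user modifies the algorithm-parameter combination
```

The deep copy matters: editing the draft's parameters must not alter the original task, whose results and parameter record stay intact for comparison after the modified task finishes.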
13. The method of any of claims 1-12, wherein, in response to the received visualization request, a visualization component is created and contents of a memory area of at least a portion of the AI training tasks in the batch of training tasks are mounted into the visualization component.
14. The method of claim 13, wherein the updated content is presented in the visualization component in response to an update to the content of the storage area.
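Claims 13-14 mount the storage areas of selected tasks into a visualization component and present updates as the stored content changes. One way to sketch this (illustrative only; all names are assumptions) is a component that keeps live references to the mounted storage areas and re-reads them on every render, so later writes show through without remounting:

```python
class VisualizationComponent:
    """Sketch of claims 13-14: storage areas of selected tasks are mounted
    into the component; re-reading on each render reflects later updates."""
    def __init__(self):
        self._mounts = {}          # task id -> storage area (here: a dict of results)

    def mount(self, task_id: str, storage: dict):
        # keep a live reference, not a snapshot, so updates show through
        self._mounts[task_id] = storage

    def render(self) -> dict:
        # read the current content of every mounted storage area
        return {tid: dict(store) for tid, store in self._mounts.items()}

store = {"loss": 0.9}
viz = VisualizationComponent()
viz.mount("task-1", store)
print(viz.render())        # {'task-1': {'loss': 0.9}}
store["loss"] = 0.5        # training writes an updated result
print(viz.render())        # {'task-1': {'loss': 0.5}} -- the update is presented
```

Mounting several tasks of the same batch into one component in this way is what allows their training results to be compared side by side.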
15. The method of any one of claims 1-14, wherein the method is performed in a Kubernetes distributed cluster.
16. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-15.
17. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any one of claims 1-15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110893221.6A CN115878250A (en) | 2021-08-04 | 2021-08-04 | Method for managing AI training task and related product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115878250A true CN115878250A (en) | 2023-03-31 |
Family
ID=85762128
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110893221.6A Pending CN115878250A (en) | 2021-08-04 | 2021-08-04 | Method for managing AI training task and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115878250A (en) |
- 2021-08-04: application CN202110893221.6A filed in China (status: Pending)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||