CN117909907A - High-throughput computing platform, and anomaly removal method, device and storage medium thereof - Google Patents

High-throughput computing platform, and anomaly removal method, device and storage medium thereof

Info

Publication number
CN117909907A
Authority
CN
China
Prior art keywords
abnormality
information
computing platform
calculation
anomaly
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410290480.3A
Other languages
Chinese (zh)
Inventor
陈建辉
赵旭山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Contemporary Amperex Technology Co Ltd
Original Assignee
Contemporary Amperex Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Contemporary Amperex Technology Co Ltd filed Critical Contemporary Amperex Technology Co Ltd
Priority to CN202410290480.3A priority Critical patent/CN117909907A/en
Publication of CN117909907A publication Critical patent/CN117909907A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The present application relates to the field of high-throughput computing, and in particular to a high-throughput computing platform and an anomaly removal method, apparatus and storage medium thereof. The method comprises: obtaining an output file of a high-throughput computing platform; parsing the output file to obtain anomaly information of the high-throughput computing platform; and invoking, according to the correspondence between anomaly information and anomaly removal flows, two or more anomaly removal steps corresponding to the parsed anomaly information. Because the method parses the output file and removes anomalies based on the correspondence between anomaly information and anomaly removal flows, the anomaly removal method can be extended flexibly without updating and repackaging the underlying layer of the high-throughput computing platform, which makes the platform easier to maintain.

Description

High-throughput computing platform, and anomaly removal method, device and storage medium thereof
Technical Field
The present application relates to the field of high-throughput computing, and in particular, to a high-throughput computing platform, and an anomaly removal method, apparatus and storage medium thereof.
Background
High-throughput computing (HTC) splits a large-scale, computation-intensive job into a plurality of subtasks and hands the subtasks to a computer cluster for execution, so that high-performance computing is achieved while long-term stable operation is maintained.
In a high-throughput computing process, various anomalies may be encountered due to algorithm problems, software problems, or equipment problems, such as high-throughput computing platform anomalies, message queue anomalies, scheduler anomalies, computing unit anomalies, and so forth. Writing the handling code for anomalous conditions into the underlying layer can effectively resolve anomalies that occur during operation; however, because a high-throughput computing platform has many anomaly types, maintaining the anomaly handling flows is cumbersome and extending them is inconvenient.
Disclosure of Invention
In view of the above, embodiments of the present application provide an anomaly removal method, apparatus, device and storage medium for a high-throughput computing platform, so as to solve the problem in the prior art that the maintenance of the processing flow of the anomaly problem is troublesome and the expansion of the anomaly problem processing flow is inconvenient.
A first aspect of an embodiment of the present application provides an anomaly removal method for a high-throughput computing platform, the method including: obtaining an output file of the high-throughput computing platform; parsing the output file to obtain anomaly information of the high-throughput computing platform; and invoking, according to the correspondence between anomaly information and anomaly removal flows, two or more anomaly removal steps corresponding to the parsed anomaly information, determining the order of the two or more anomaly removal steps, and performing anomaly removal on the anomaly information.
The anomaly information of the high-throughput computing platform is determined by parsing the output file of the platform; the two or more anomaly removal steps included in the anomaly removal flow corresponding to the anomaly information are determined according to the correspondence between anomaly information and anomaly removal flows; and anomaly removal is performed on the anomaly information according to the order of the two or more anomaly removal steps. Because the method parses the output file and removes anomalies based on the correspondence between the anomaly information and the two or more anomaly removal steps in the anomaly removal flow, in combination with the order of those steps, the anomaly removal steps can be effectively decoupled, the anomaly removal method can be extended more flexibly, and the high-throughput computing platform becomes easier to maintain.
With reference to the first aspect, in a first possible implementation manner of the first aspect, according to a correspondence between anomaly information and an anomaly removal procedure, invoking two or more anomaly removal steps corresponding to the anomaly information obtained by parsing, where the steps include: determining an abnormality type corresponding to the abnormality information; and calling more than two abnormality elimination steps corresponding to the abnormality type according to the corresponding relation between the abnormality type and the abnormality elimination flow.
In order to improve the processing efficiency of exception removal, the exception information can be classified, and the exception type to which the exception information belongs is determined. Based on the corresponding relation between the set exception type and the exception excluding flow, the exception excluding flow corresponding to the exception type is called to exclude the exception, so that the processing efficiency of the exception of the same type can be effectively improved.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, resolving, according to the output file, to obtain exception information of the high-throughput computing platform includes: resolving to obtain a calculation completion identifier of the output file; when the calculation completion identifier indicates that the calculation is completed, determining that no abnormality exists in the calculation of the high-throughput computing platform; and when the calculation completion identifier indicates that the calculation is not completed, determining that the abnormality exists in the calculation of the high-throughput computing platform.
By analyzing the calculation completion identifier of the output file, whether the calculation is abnormal or not can be determined. If no abnormality exists, the analysis of the abnormality information of the output file is not needed, and the file analysis efficiency is improved. If the calculation completion identifier indicates that the calculation is not completed, the specific information of the abnormality can be further analyzed, and an abnormality removal flow corresponding to the abnormality is determined.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, after the calculation completion identifier indicates that the calculation is completed, the method further includes: recording the anomaly removal record produced during the high-throughput computing process of the task and the version information corresponding to that anomaly removal record, wherein the anomaly removal record and its corresponding version information are to be invoked when an anomaly occurs in a high-throughput computing task.
When the calculation completion identifier indicates that the calculation is completed, the calculation has no anomaly. At this point, the anomaly removal records generated whenever an anomaly occurred during the task's calculation can be recorded, including the anomaly removal flow adopted for each removal, and the versions of the strategies of the different anomaly removal flows can be updated. For example, for anomaly information of the same type, the latest version or versions may be selected for anomaly removal according to the version information corresponding to the recorded anomaly removal records.
With reference to the first aspect or any one of the first to third possible implementation manners of the first aspect, in a fourth possible implementation manner of the first aspect, after parsing the output file to obtain the anomaly information of the high-throughput computing platform, the method further includes: when the anomaly information has no corresponding anomaly removal flow, or the number of anomaly removal attempts for the anomaly information exceeds a preset count threshold, marking the anomaly information as a fault-tolerant calculation result, and passing the fault-tolerant calculation result to the computing unit that performs the next calculation.
During operation of the high-throughput computing platform, a new anomaly may occur for which no anomaly removal flow exists yet, or the number of removal attempts for an anomaly may reach the preset count threshold. In either case, the anomaly information may be marked as a fault-tolerant calculation result, and the fault-tolerant calculation result is passed to the unit that performs the next calculation. This marking makes it convenient for subsequent strongly associated computing units to start their fault-tolerance mechanisms synchronously, for a later supplementary calculation flow to make up the calculation, and for the anomaly removal correspondence to be completed.
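The fault-tolerance rule described above can be sketched as follows. This is only an illustrative sketch: the `CalcResult` record, the `mark_fault_tolerant` function, and the default threshold of 3 are hypothetical names and values that do not appear in the application.

```python
from dataclasses import dataclass
from typing import Collection


@dataclass
class CalcResult:
    """Hypothetical record of one calculation and its parsed anomaly."""
    anomaly: str                 # parsed anomaly information
    removal_attempts: int = 0    # removal attempts already made
    fault_tolerant: bool = False


def mark_fault_tolerant(result: CalcResult,
                        known_flows: Collection[str],
                        max_attempts: int = 3) -> CalcResult:
    """Mark the result as fault-tolerant when no removal flow exists for the
    anomaly, or when the attempt count exceeds the preset threshold."""
    no_flow = result.anomaly not in known_flows
    too_many = result.removal_attempts > max_attempts
    if no_flow or too_many:
        result.fault_tolerant = True
    return result
```

A result marked in this way would then be handed to the next computing unit together with the flag, so that downstream units can decide whether to start their own fault-tolerance handling.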
With reference to the first aspect or any one of the first to third possible implementation manners of the first aspect, in a fifth possible implementation manner of the first aspect, parsing the output file to obtain the anomaly information of the high-throughput computing platform includes: parsing the output file to obtain the ion energy, ion position and/or ion velocity recorded in the output file of the high-throughput computing platform; determining ion energy change information according to the ion energies at different times, determining ion position change information according to the ion positions at different times, and/or determining ion velocity change information according to the ion velocities at different times; and, in the case that the ion energy change information, the ion position change information and/or the ion velocity change information meet the preset stability requirement, determining that the high-throughput computing platform has anomaly information of structural energy convergence.
When determining the anomaly information of structural energy convergence, the ion energy, ion position and/or ion velocity are obtained from the output file; ion energy change information is determined from the ion energies at different times, ion position change information from the ion positions at different times, and/or ion velocity change information from the ion velocities at different times. When the change amplitude of the ion energy is greater than an energy convergence threshold, the change amplitude of the ion position is greater than a position convergence threshold, and/or the change amplitude of the ion velocity is greater than a velocity convergence threshold, a structural energy convergence anomaly is determined, and the anomaly can be removed according to the anomaly removal flow corresponding to the anomaly information.
With reference to the first aspect or any one of the first to fifth possible implementation manners of the first aspect, in a sixth possible implementation manner of the first aspect, invoking, according to the correspondence between anomaly information and anomaly removal flows, two or more anomaly removal steps corresponding to the parsed anomaly information includes: when the anomaly information indicates that the relaxation state of the ions is abnormal or the ion energy is abnormal, adjusting the first structure file output by the high-throughput computing platform into a second structure file for error correction; and inputting the second structure file into the high-throughput computing platform for error correction calculation.
When the relaxation state of the ions is abnormal or the ion energy is abnormal, the first structure file output by the high-throughput computing platform can be adjusted into a second structure file used as error correction input, and the second structure file is input into the high-throughput computing platform for error correction calculation, so that anomalies of the relaxation state or ion energy type can be handled effectively.
A second aspect of an embodiment of the present application provides an anomaly removal device for a high-throughput computing platform, the device comprising: a file information acquisition unit, configured to acquire an output file of the high-throughput computing platform; a parsing unit, configured to parse the output file to obtain anomaly information of the high-throughput computing platform; a step determining unit, configured to invoke, according to the correspondence between anomaly information and anomaly removal flows, two or more anomaly removal steps corresponding to the parsed anomaly information; and an anomaly removal unit, configured to determine the order of the two or more anomaly removal steps and to perform anomaly removal on the anomaly information.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the anomaly removal unit includes: an anomaly type determining subunit, configured to determine an anomaly type corresponding to the anomaly information; and the calling subunit is used for calling more than two abnormality elimination steps corresponding to the abnormality type according to the corresponding relation between the abnormality type and the abnormality elimination flow.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the parsing unit includes: an identifier parsing subunit, configured to parse to obtain a calculation completion identifier of the output file; a first determining subunit, configured to determine that no abnormality exists in the present calculation of the high-throughput computing platform when the calculation completion identifier indicates that the calculation is completed; and the second determining subunit is used for determining that the calculation of the high-throughput computing platform is abnormal when the calculation completion identifier indicates that the calculation is not completed.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the apparatus further includes: the recording subunit is used for recording an abnormality rejection record in the high-throughput computing process of the task and version information corresponding to the abnormality rejection record of the task, wherein the abnormality rejection record and the version information corresponding to the abnormality rejection record are used for being called when the high-throughput computing task is abnormal.
With reference to the second aspect or any one of the first to third possible implementation manners of the second aspect, in a fourth possible implementation manner of the second aspect, the anomaly removal unit includes: an adjusting subunit, configured to adjust the first structure file output by the high-throughput computing platform into a second structure file for error correction when the anomaly information indicates that the relaxation state of the ions is abnormal or the ion energy is abnormal; and an input subunit, configured to input the second structure file into the high-throughput computing platform for error correction calculation.
With reference to the second aspect or any one of the first to third possible implementation manners of the second aspect, in a fifth possible implementation manner of the second aspect, the apparatus further includes: a fault-tolerant unit, configured to mark the anomaly information as a fault-tolerant calculation result when the anomaly information has no corresponding anomaly removal flow or the number of anomaly removal attempts for the anomaly information exceeds a preset count threshold, and to pass the fault-tolerant calculation result to the computing unit that performs the next calculation.
With reference to the second aspect or any one of the first to third possible implementation manners of the second aspect, in a sixth possible implementation manner of the second aspect, the parsing unit includes: an ion energy parsing subunit, configured to parse the output file to obtain the ion energy recorded in the output file of the high-throughput computing platform; an ion energy change information determining subunit, configured to determine ion energy change information according to the ion energies at different times; and an ion energy anomaly determining subunit, configured to determine that the high-throughput computing platform has anomaly information of structural energy convergence when the ion energy change information meets a preset stability requirement.
A third aspect of an embodiment of the present application provides an exception removal apparatus for a high throughput computing platform, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to any one of the first aspects when the computer program is executed.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any of the first aspects.
It will be appreciated that the advantages of the second to fourth aspects may be found in the relevant description of the first aspect and are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an embodiment of a method for exception removal for a high-throughput computing platform according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an embodiment of an anomaly removal method for a high-throughput computing platform;
FIG. 3 is a schematic diagram of an implementation flow for exception removal based on exception types according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of exception removal according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of exception removal according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an anomaly removal device for a high-throughput computing platform according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an anomaly removal apparatus for a high-throughput computing platform provided by an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to illustrate the technical scheme of the application, the following description is made by specific examples.
In fields such as bioinformatics or materials science, a high-throughput computing platform is used to explore the properties of materials: it employs a large amount of computing resources to process large-scale data and execute repeated computing tasks, so that the system can obtain calculation results stably and reliably.
In a high-throughput computing process, various anomalies may be encountered due to algorithm problems, software problems, or equipment problems, such as high-throughput computing platform anomalies, message queue anomalies, scheduler anomalies, computing unit anomalies, and so forth. Writing the handling code for anomalous conditions into the underlying layer can effectively resolve anomalies that occur during operation, but once a new anomaly appears, the high-throughput computing platform, that is, the equipment that performs the high-throughput computing, must be repackaged and maintained again, which does not help reduce the operation and maintenance difficulty of the platform.
Based on this, an embodiment of the present application provides an anomaly removal method for a high-throughput computing platform, and FIG. 1 is a schematic diagram of an implementation scenario of the anomaly removal method for the high-throughput computing platform. The implementation scenario of the method includes a high-throughput computing platform 101, an anomaly removal device 102, and computing nodes 103. The high-throughput computing platform 101 is configured to distribute computing tasks to the plurality of computing nodes 103 according to factors such as load balancing, task priority, and task characteristics. The high-throughput computing platform 101 obtains output files describing the computing state, computing results and similar content according to the data returned by the computing nodes, wherein the output files include OUTCAR files, OSZICAR files, vasprun.xml files, Slurm files and the like. The anomaly removal device 102 is configured to parse the output file to obtain anomaly information of the high-throughput computing platform, and to invoke an anomaly removal flow to remove the anomaly according to the anomaly information.
FIG. 2 is a schematic implementation flow chart of an anomaly removal method for a high-throughput computing platform according to an embodiment of the present application, which is described in detail below:
in S201, an output file of the high-throughput computing platform is acquired.
In the embodiment of the present application, acquisition of the output file of the high-throughput computing platform for anomaly detection may be triggered by a timer, by a user, or by the output file itself, so that the anomaly removal device can acquire the output file of the high-throughput computing platform. When anomaly detection and removal is triggered by the output file, the output file of the computing task can be monitored in real time, so that anomalies can be removed efficiently and accurately.
Output files of the high-throughput computing platform comprise a computing result file, a log file, a statistical report file, a parameter file and the like. The calculation result file comprises information such as molecular structure, energy, charge, density and the like. The log file may include information such as input parameters, run time, error messages, etc. The statistics report file may include information such as computing time, memory usage, disk space, etc. The parameter file may include parameters generated during the calculation process, including intermediate state or configuration information, etc.
For example, the high-throughput computing platform may include a computing platform based on VASP (Vienna Ab initio Simulation Package) software. The output files may include OUTCAR files, OSZICAR files, vasprun.xml files, Slurm files, and the like.
The OUTCAR file is an output file of the VASP software and may include basic information about the calculation process, such as the total energy of the system, the residual charge density, the electron density, and the like. It may also provide symmetry information about the system, such as the symmetry operation matrices and the space group symbol.
The OSZICAR file is a file generated by the VASP software in the optimization process, and can contain the energy of each self-consistent cycle and related calculation information, and the information is used for determining state information such as energy change and convergence process.
The vasprun.xml file provides detailed information about the VASP calculation process, including the energy, charge density, wave function, and the like during the iterations. The file is a comprehensive log file used for analyzing the calculation process and the calculation results.
The Slurm file is the output file of the job scheduling system in the high-throughput computing platform. The file contains information about job scheduling and resource allocation. By the file, information such as the running state of the job, the use condition of resources, the completion condition of the job and the like can be analyzed and obtained.
In a possible implementation manner, after the output file is obtained, whether the calculation is completed can be judged from the output file. If the output file contains the calculation completion identifier, the computing task of the high-throughput computing platform has no anomaly and the next task can proceed directly, which effectively improves the efficiency of anomaly handling. If the output file does not contain the calculation completion identifier, the computing task of the high-throughput computing platform is abnormal and further anomaly judgment is needed: the detailed information of the relevant output files can be further acquired, and the anomaly information is determined when the expected result is not reached or when error or warning messages appear.
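The completion check described above might look like the following minimal sketch. The marker string `CALCULATION_FINISHED` and both function names are hypothetical; the application does not state which identifier the platform writes into its output files.

```python
from pathlib import Path

# Hypothetical completion identifier; the actual string written by the
# platform is not specified in the text and must be configured.
DONE_MARKER = "CALCULATION_FINISHED"


def calculation_completed(text: str, marker: str = DONE_MARKER) -> bool:
    """Return True if the completion identifier appears in the output text."""
    return marker in text


def needs_anomaly_analysis(output_file: Path, marker: str = DONE_MARKER) -> bool:
    """Return True when the output file lacks the completion identifier,
    meaning the file should be parsed further for anomaly information."""
    text = output_file.read_text(encoding="utf-8", errors="replace")
    return not calculation_completed(text, marker)
```

Checking the identifier first keeps the common case cheap: only files without the identifier go on to the more detailed parsing of error messages and state information.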
In S202, according to the output file, the anomaly information of the high-throughput computing platform is obtained through analysis.
The abnormal information in the embodiment of the application can be determined according to the error reporting information recorded in the output file or according to the state information recorded in the output file.
For example, information such as the ion energy at the current step may be determined from the information describing the ionic step in the OUTCAR file output by the VASP software. From the ion energies over multiple steps, the ion energy change information can be obtained. The ion energy change information can then be checked against a preset convergence condition to determine whether the ion energy computed on the high-throughput computing platform has converged.
If the ion energy change information meets a preset convergence condition, for example a preset stability requirement such as the ion energy varying steadily with a certain period, the energy can be considered converged at the current step. If the ion energy change information does not meet the preset stability requirement, for example because the amplitude of the ion energy change is greater than a preset energy amplitude threshold, an energy convergence anomaly is determined.
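One simple way to express the amplitude-based check is sketched below. It assumes the per-step energies have already been extracted from the output file; the function names, the default threshold, and the choice to look only at the last few steps are illustrative assumptions rather than details given in the application.

```python
from __future__ import annotations

from typing import Sequence


def energy_changes(energies: Sequence[float]) -> list[float]:
    """Absolute energy change between each pair of consecutive ionic steps."""
    return [abs(b - a) for a, b in zip(energies, energies[1:])]


def energy_convergence_abnormal(energies: Sequence[float],
                                amplitude_threshold: float = 1e-3,
                                window: int = 3) -> bool:
    """Report an energy convergence anomaly when any of the last `window`
    changes is larger than the preset amplitude threshold (assumed values)."""
    changes = energy_changes(energies)
    if not changes:
        return False  # fewer than two steps: nothing to compare yet
    return max(changes[-window:]) > amplitude_threshold
```

Restricting the check to a trailing window reflects the idea that large changes early in a relaxation are expected, while large changes in the most recent steps suggest the energy has not settled.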
In a possible implementation, whether the ion energy converges may also be determined in combination with the force information of the ions. For example, when the force on the ions gradually approaches 0, or the maximum force is smaller than a set force threshold, structural energy convergence may be determined. Conversely, when the force becomes progressively greater, or the maximum force is greater than the force threshold, the energy convergence may be considered abnormal.
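The force-based criterion can be illustrated with the sketch below, which assumes the force vectors on each ion have already been parsed for every ionic step. The function names and the default threshold value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, float, float]


def max_force(forces: Sequence[Vector]) -> float:
    """Largest force magnitude over all ions in one ionic step."""
    return max(math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces)


def force_converged(forces_per_step: Sequence[Sequence[Vector]],
                    force_threshold: float = 0.02) -> bool:
    """Converged when the maximum force of the latest step is below the
    set force threshold (the default value is an assumption)."""
    return max_force(forces_per_step[-1]) < force_threshold


def force_growing(forces_per_step: Sequence[Sequence[Vector]]) -> bool:
    """True when the maximum force increases at every step, which the text
    treats as a sign of abnormal convergence."""
    maxima: List[float] = [max_force(step) for step in forces_per_step]
    return len(maxima) >= 2 and all(b > a for a, b in zip(maxima, maxima[1:]))
```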
Or when determining whether the relaxation state of the ions is abnormal according to the output file of the VASP software, the information such as the position and/or the speed of the ions can be determined according to the information describing the ion step in the OUTCAR file output by the VASP.
Based on information such as the ion positions and/or velocities acquired at different steps, the position change information and/or the velocity change information of the ions can be determined. The position change information of an ion is expressed as the jump amplitude of the ion. If the jump amplitude of the ion is greater than a predetermined jump amplitude threshold, an anomaly in the relaxation state of the ion is determined. And/or, an anomaly in the relaxation state of the ion is determined if the velocity change information of the ion does not conform to the preset velocity change characteristics of the relaxation state, for example if the velocity varies irregularly or non-periodically, including the case where the amplitude of a velocity jump is greater than a predetermined velocity change threshold.
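A minimal sketch of the jump-amplitude check follows. It assumes Cartesian ion positions per step are already available and, as a simplification, ignores periodic boundary conditions; the function names and the default threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]


def jump_amplitudes(prev: Sequence[Position],
                    curr: Sequence[Position]) -> List[float]:
    """Displacement of each ion between two consecutive steps.
    Simplification: periodic boundary conditions are not handled."""
    return [math.dist(p, c) for p, c in zip(prev, curr)]


def relaxation_abnormal(positions_per_step: Sequence[Sequence[Position]],
                        jump_threshold: float = 0.5) -> bool:
    """Flag a relaxation-state anomaly when any ion jumps farther than the
    predetermined threshold between consecutive steps (assumed value)."""
    for prev, curr in zip(positions_per_step, positions_per_step[1:]):
        if max(jump_amplitudes(prev, curr)) > jump_threshold:
            return True
    return False
```

The same pattern could be applied to velocities by replacing the position tuples with velocity tuples and using a velocity change threshold.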
In S203, according to the correspondence between the anomaly information and the anomaly removal process, two or more anomaly removal steps corresponding to the anomaly information obtained by analysis are called.
The embodiment of the present application can set a correspondence between anomaly information and two or more anomaly removal steps, and, based on that correspondence, invoke the two or more anomaly removal steps corresponding to the anomaly information to remove the anomaly. For example, for the ion energy convergence anomaly or the ion relaxation state anomaly described above, the two or more anomaly removal steps may include: an adjustment step of adjusting the first structure file output by the scientific calculation into a second structure file used as input; and a recalculation step of re-inputting the second structure file into the high-throughput computing platform for calculation while the calculation parameters remain unchanged. Through the recalculation, the calculated structural energy converges and the relaxation state of the ions returns to normal.
To improve the efficiency and robustness of anomaly removal, the embodiment of the present application can further classify the anomaly information. As shown in FIG. 3, this includes the following steps:
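The correspondence between anomaly information and an ordered list of removal steps can be sketched as a lookup table of small functions. Everything named here (`REMOVAL_FLOWS`, the step functions, the anomaly keys, and the dictionary-based context) is a hypothetical illustration, not the data model used by the application.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Step = Callable[[Context], Context]


def adjust_structure_file(ctx: Context) -> Context:
    """Adjustment step: reuse the output (first) structure file as the
    new input (second) structure file."""
    new_ctx = dict(ctx)
    new_ctx["input_structure"] = ctx["output_structure"]
    return new_ctx


def resubmit_same_parameters(ctx: Context) -> Context:
    """Recalculation step: mark the job for resubmission; the calculation
    parameters are left untouched."""
    new_ctx = dict(ctx)
    new_ctx["resubmitted"] = True
    return new_ctx


# Hypothetical correspondence: anomaly information -> ordered removal steps.
REMOVAL_FLOWS: Dict[str, List[Step]] = {
    "energy_convergence_abnormal": [adjust_structure_file, resubmit_same_parameters],
    "relaxation_state_abnormal": [adjust_structure_file, resubmit_same_parameters],
}


def remove_anomaly(anomaly: str, ctx: Context) -> Context:
    """Look up the flow for the anomaly and run its steps in order."""
    steps = REMOVAL_FLOWS.get(anomaly)
    if steps is None:
        raise KeyError(f"no removal flow registered for {anomaly!r}")
    for step in steps:
        ctx = step(ctx)
    return ctx
```

Because the flows live in a table rather than in the platform's underlying code, supporting a new anomaly in this sketch only means registering another entry, which mirrors the decoupling the method aims for.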
in S301, an anomaly type corresponding to the anomaly information is determined.
The exception types in the embodiments of the present application may include, for example, one or more of platform exceptions, scheduler exceptions, message queue exceptions, hardware and system exceptions, general computing unit exceptions, and scientific computing unit exceptions.
A platform anomaly is an anomaly of the high-throughput computing platform itself. Platform anomalies typically include anomalies that occur when the platform code runs on K8s (Kubernetes, an open source system for automatically deploying, scaling, and managing containerized applications), such as failure to start, an in-window application crash, service inaccessibility, insufficient storage, resource leakage, and the like.
Due to the specificity of the scheduling function of the high-throughput integrated computing platform, the function of the scheduler is relatively simple, namely consuming messages from the message queue. Scheduler anomalies may be encountered when tasks are dispatched to heterogeneous multi-source servers for execution, and include anomalies in which scheduler messages are blocked; they may also include anomalies arising from the use of open source schedulers such as Slurm and PBS.
A message queue anomaly is an anomaly in which the high-throughput computing platform experiences message blocking under concurrent operation.
Hardware and system anomalies include one or more anomalies of memory, CPU, load, I/O, and storage on the various types of servers that carry the high-throughput computing platform.
General computing unit anomalies include code-level and software-level error-reporting anomalies that occur on general servers (servers other than the supercomputing cluster).
Scientific computing unit anomalies include software-level error-reporting anomalies that occur on the supercomputing cluster during computation and during multi-core, multi-node parallel tasks. A supercomputing cluster (Super Computing Cluster) is a high-performance computing system consisting of a group of interconnected computers that cooperate to perform massively parallel computing and data processing.
For example, the detected ion energy convergence anomaly and relaxation state anomaly are anomaly information determined from an output file of a scientific calculation process, and may be classified as scientific computing unit anomalies.
In S302, according to the correspondence between the anomaly type and the anomaly removal flow, two or more anomaly removal steps corresponding to the anomaly type are called.
Based on the type to which the anomaly information belongs, that is, the anomaly type, the two or more anomaly removal steps corresponding to that type may be called to remove the anomaly.
As shown in fig. 4, the obtained output file is parsed by the abnormality determiner, and the obtained abnormality types include abnormality types of type a, type B, type C, and the like. According to the corresponding relation between the anomaly type and the anomaly removal flow, the corresponding anomaly removal flow A, anomaly removal flow B and anomaly removal flow C can be found, wherein each anomaly removal flow comprises more than two anomaly removal steps. Based on the corresponding relation between the anomaly type and the anomaly removal flow, the method can effectively adapt to anomaly removal of different anomaly information of the same type, so that the robustness of anomaly removal can be effectively improved, and the efficiency of anomaly removal is improved.
In the embodiment of the application, the abnormality removal flow of different abnormality types can be split in advance to obtain two or more abnormality removal steps corresponding to the abnormality types. Based on the corresponding relation between the anomaly type and the anomaly removal flow, the corresponding relation between the anomaly type and the anomaly removal step and the sequence of the anomaly removal step can be obtained.
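As a rough illustration of the correspondence between anomaly types and ordered removal steps, the sketch below stores each flow as a list of step names keyed by anomaly type. The enum, the dictionary, and the step identifiers are hypothetical names chosen for this example; only the type categories and step descriptions come from the text.

```python
from enum import Enum

class AnomalyType(Enum):
    PLATFORM = "platform"
    SCHEDULER = "scheduler"
    MESSAGE_QUEUE = "message_queue"
    HARDWARE_SYSTEM = "hardware_system"
    GENERAL_COMPUTE = "general_compute"
    SCIENTIFIC_COMPUTE = "scientific_compute"

# Each flow is an ordered list of two or more step names (hypothetical identifiers).
# Types without an entry have no removal flow configured yet.
REMOVAL_FLOWS = {
    AnomalyType.SCIENTIFIC_COMPUTE: [
        "adjust_structure_file",
        "recalculate",
    ],
    AnomalyType.GENERAL_COMPUTE: [
        "locate_problem",
        "isolate_problem",
        "check_computing_resources",
        "repair_and_debug",
        "monitor_optimization_direction",
    ],
}

def lookup_flow(anomaly_type):
    """Return the ordered removal steps for a type, or None if no flow is configured."""
    return REMOVAL_FLOWS.get(anomaly_type)

if __name__ == "__main__":
    for anomaly_type in AnomalyType:
        print(anomaly_type.value, "->", lookup_flow(anomaly_type))
```

Keeping the table as plain data means a new flow can be added by editing the mapping, without touching the platform code that produced the output file.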
In S204, the order of the two or more abnormality removal steps is determined, and the abnormality information is subjected to abnormality removal.
For different exception types, the same exception removal step may be used, so that the use efficiency of the exception removal step can be improved.
For example, for an energy convergence anomaly or an ion relaxation anomaly among the scientific computing unit anomalies, a possible implementation may include an adjustment step in which the output first structure file is adjusted to obtain a second structure file to be input to the high-throughput computing platform. After the adjustment, the flow further includes a recalculation step, in which the second structure file is input to the high-throughput computing platform for recalculation while the calculation parameters remain unchanged.
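The adjustment and recalculation steps might look like the following sketch for a VASP working directory. It assumes the first structure file is CONTCAR (the structure VASP writes out) and the second structure file is POSCAR (the structure VASP reads in); the text itself does not name these files. The `run_calculation` callable is a hypothetical stand-in for whatever resubmits the job, and the parameter file is deliberately left untouched.

```python
import shutil
from pathlib import Path

def adjust_structure_file(workdir, output_name="CONTCAR", input_name="POSCAR"):
    """Adjustment step: reuse the output structure file as the next input structure file."""
    workdir = Path(workdir)
    source = workdir / output_name
    if not source.is_file() or source.stat().st_size == 0:
        raise FileNotFoundError(f"no usable structure file at {source}")
    target = workdir / input_name
    shutil.copyfile(source, target)
    return target

def recalculate(workdir, run_calculation):
    """Recalculation step: rerun with the adjusted structure and unchanged parameters.

    run_calculation is a caller-supplied function (hypothetical) that launches the
    calculation in workdir and returns its result.
    """
    adjust_structure_file(workdir)
    return run_calculation(workdir)
```

Because only the structure file is replaced, the rerun starts from the last geometry reached rather than from the original one, which is the point of the adjustment step.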
Alternatively, in a possible implementation, the removal flow for a scientific computing unit anomaly may include, in order, a calculation parameter checking step, a calculation parameter adjusting step, an initial structure correcting step, a computing resource checking step, an iterative calculation step, an optimization direction monitoring step, and the like.
The calculation parameter checking step checks whether the calculation parameters used for the scientific calculation are reasonable; when energy convergence is abnormal, this includes checking whether the cutoff energy, the K-point sampling density, the energy convergence criterion, and the number of iterations are reasonable. The calculation parameter adjusting step adjusts the parameters found to be unreasonable in the calculation parameter checking step. The initial structure correcting step is used, when the initial structure is in an unstable state, to take the output structure file as the input structure file of the next calculation and continue the calculation. The computing resource checking step checks whether the computing resources, including memory, the number of CPU cores, and the like, are sufficient. The iterative calculation step is used, when the energy still does not converge or the ion relaxation is still abnormal after the adjustment, to continue iterating until a suitable combination of parameters is found that makes the energy converge or the ion relaxation normal. The optimization direction monitoring step monitors whether the energy or the ion relaxation state is changing toward the target during the iterative calculation.
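The calculation parameter checking and adjusting steps can be sketched as a range check followed by a clamp. The tag names ENCUT, EDIFF, and NELM are VASP INCAR tags used here as examples of cutoff energy, convergence criterion, and iteration limit; the numeric ranges are hypothetical placeholders, since reasonable values depend on the material and the calculation.

```python
# Hypothetical acceptable ranges; real limits depend on the system being calculated.
PARAMETER_RANGES = {
    "ENCUT": (200.0, 1000.0),  # plane-wave cutoff energy, eV
    "EDIFF": (1e-8, 1e-3),     # electronic energy convergence criterion, eV
    "NELM": (20, 500),         # maximum number of electronic iterations
}

def check_parameters(params, ranges=PARAMETER_RANGES):
    """Checking step: return {name: reason} for parameters that are missing or out of range."""
    problems = {}
    for name, (low, high) in ranges.items():
        if name not in params:
            problems[name] = "missing"
        elif not low <= params[name] <= high:
            problems[name] = f"{params[name]} outside [{low}, {high}]"
    return problems

def adjust_parameters(params, ranges=PARAMETER_RANGES):
    """Adjusting step: return a copy with out-of-range values clamped to the nearest bound."""
    adjusted = dict(params)
    for name, (low, high) in ranges.items():
        if name in adjusted:
            adjusted[name] = min(max(adjusted[name], low), high)
    return adjusted

if __name__ == "__main__":
    params = {"ENCUT": 150.0, "EDIFF": 1e-5, "NELM": 60}
    print(check_parameters(params))
    print(adjust_parameters(params))
```

Clamping is only one possible adjustment policy; a real flow might instead step a parameter gradually and rerun, as the iterative calculation step suggests.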
For a general computing unit anomaly, the flow may include, in order, an anomaly problem locating step, a problem isolation step, a computing resource checking step, a repair and debugging step, an optimization direction monitoring step, and the like. The steps it shares with the scientific computing unit anomaly flow may include the computing resource checking step, the optimization direction monitoring step, and the like.
For the platform abnormality of the high-throughput computing platform, the method can comprise the steps of problem resource isolation, platform abnormality positioning, platform problem repairing, data recovery, task rescheduling and the like.
For a message queue anomaly of the high-throughput computing platform, the flow can include the steps of problem resource isolation, message queue problem diagnosis, message queue problem repair, data consistency confirmation, and the like.
For the scheduler abnormality of the high-throughput computing platform, the method can comprise the steps of problem resource isolation, scheduler abnormality diagnosis, task recovery, optimization direction monitoring and the like.
For hardware/system anomalies of a high-throughput computing platform, steps such as problem resource isolation, hardware/system anomaly diagnosis, hardware/system problem repair, data recovery and the like can be included.
Thus, for different types of anomalies, some of the two or more steps in the corresponding anomaly removal flows may be the same. For example, the flows for platform anomalies and for hardware/system anomalies may both include the steps of problem resource isolation and data recovery. By decoupling the anomaly handling steps, the steps can be reused across different anomaly types, which makes anomaly handling easier to extend and to maintain.
In addition, in a possible implementation, for different high-throughput computing platforms, including, for example, the VASP, Gaussian, and LAMMPS platforms, the steps corresponding to an anomaly type can be selected from a predetermined step library according to the anomaly type, and the order of those steps can be determined, so that the method adapts efficiently to the anomaly handling requirements of multiple platforms.
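To make the reuse concrete, the sketch below encodes the four platform-side flows listed above as ordered step names and derives a step library from them. The step identifiers and function names are hypothetical; the grouping of steps per anomaly type follows the text.

```python
# Ordered step names per anomaly type (identifiers are illustrative).
FLOWS = {
    "platform": [
        "problem_resource_isolation", "platform_anomaly_location",
        "platform_problem_repair", "data_recovery", "task_rescheduling",
    ],
    "message_queue": [
        "problem_resource_isolation", "message_queue_diagnosis",
        "message_queue_repair", "data_consistency_confirmation",
    ],
    "scheduler": [
        "problem_resource_isolation", "scheduler_diagnosis",
        "task_recovery", "optimization_direction_monitoring",
    ],
    "hardware_system": [
        "problem_resource_isolation", "hardware_system_diagnosis",
        "hardware_system_repair", "data_recovery",
    ],
}

def shared_steps(flows, type_a, type_b):
    """Steps that appear in both flows, in the order used by type_a."""
    in_b = set(flows[type_b])
    return [step for step in flows[type_a] if step in in_b]

def build_step_library(flows):
    """Map each distinct step to the anomaly types whose flows use it."""
    library = {}
    for anomaly_type, steps in flows.items():
        for step in steps:
            library.setdefault(step, []).append(anomaly_type)
    return library

if __name__ == "__main__":
    print(shared_steps(FLOWS, "platform", "hardware_system"))
    print(len(build_step_library(FLOWS)), "distinct steps serve",
          sum(len(s) for s in FLOWS.values()), "flow positions")
```

In this encoding each distinct step only needs one implementation, and a flow for a new platform or anomaly type is assembled by listing step names from the library in the required order.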
In a possible implementation, after the recalculation is completed, it may be determined again whether the calculation completion identifier is present. If it is present, the current high-throughput calculation contains no error: information indicating that the anomaly has been removed is returned, the sub-flow information of the current anomaly removal and its version number can be recorded, and the next calculation task can proceed. Based on the recorded version numbers, the anomaly removal flow can be iteratively optimized and updated. According to the recorded anomaly removal records and version information, the latest version can be adopted for anomaly removal when an anomaly of the same type is encountered later. When recording the version of an anomaly removal record, whether the anomaly removal was effective can also be determined from the removal result, that is, the response after the current anomaly removal. If the output file after the anomaly removal processing includes the calculation completion identifier, the current anomaly removal record is valid. If, in addition, the previous anomaly removal records for this type of anomaly do not include the current removal strategy or flow, the flow corresponding to the current record can be set as the anomaly removal flow for that anomaly type, and new version information can be assigned. For example, if the previous version is 3.0, it may be updated to 4.0. Alternatively, if the new flow is a local optimization of the previous flow, the version information can be updated within the previous major version, for example to 3.1.
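The versioning rule in the paragraph above (3.0 to 4.0 for a new flow, 3.0 to 3.1 for a local optimization) can be sketched as follows. The record layout, the function names, and the choice of "1.0" as the first version are assumptions made for this example.

```python
def next_version(current, local_optimization):
    """Bump 'major.minor': the minor part for a local optimization, otherwise the major part."""
    major, minor = (int(part) for part in current.split("."))
    if local_optimization:
        return f"{major}.{minor + 1}"
    return f"{major + 1}.0"

def record_removal(history, anomaly_type, flow, succeeded, local_optimization=False):
    """Record a removal flow that proved effective and return its version.

    history maps an anomaly type to a list of {"version", "flow"} records, oldest first.
    Returns None when the removal did not succeed, because nothing is recorded then.
    """
    if not succeeded:
        return None
    records = history.setdefault(anomaly_type, [])
    for record in records:
        if record["flow"] == flow:
            return record["version"]  # flow already known, no new version needed
    if records:
        version = next_version(records[-1]["version"], local_optimization)
    else:
        version = "1.0"  # assumed starting version
    records.append({"version": version, "flow": list(flow)})
    return version

if __name__ == "__main__":
    history = {"scientific_compute": [{"version": "3.0", "flow": ["adjust", "recalculate"]}]}
    print(record_removal(history, "scientific_compute",
                         ["check_parameters", "adjust", "recalculate"], succeeded=True))
```

When the same anomaly type appears again, the last entry in its history is the latest version and can be tried first.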
If, after the recalculation is completed, it is determined from the output file (such as the OUTCAR file) that the calculation completion identifier is absent, the calculation of the high-throughput computing platform (such as the VASP software platform) has gone wrong. The anomaly information in the output file then needs to be analyzed further, and anomaly removal is performed according to the correspondence between the anomaly information and the anomaly removal flow, or according to the correspondence between the anomaly type and the anomaly removal flow.
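A minimal sketch of the completion check follows. It assumes the identifier is a fixed piece of text that appears in the output file only when the run finishes; the marker string is supplied by the caller because this passage does not specify it, and the function name is hypothetical.

```python
from pathlib import Path

def calculation_completed(output_file, marker):
    """Return True if the completion marker appears anywhere in the output file.

    A missing file is treated as an incomplete calculation.
    """
    path = Path(output_file)
    if not path.is_file():
        return False
    # errors="replace" keeps the scan going if the file contains undecodable bytes.
    with path.open("r", errors="replace") as handle:
        return any(marker in line for line in handle)

if __name__ == "__main__":
    # "JOB FINISHED" is a placeholder marker, not the text any particular code writes.
    print(calculation_completed("OUTCAR", "JOB FINISHED"))
```

If the function returns False, the same output file would then be handed to the anomaly parser to decide which removal flow applies.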
In a possible implementation, the embodiment of the application can also count the number of anomaly removals. If the number of anomaly removals for the same task exceeds a predetermined count threshold, the number of removals, the anomaly removal flow, and the fault-tolerance information can be annotated. When the anomaly removal flow is annotated, its version information can be annotated as well. Annotating the fault-tolerance information includes marking that the computing unit of this step did not output a valid result and that the result is a fault-tolerant calculation result. After the annotate sub-flow, processing can pass to the next computing unit for calculation.
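The counting rule can be sketched as a bounded retry loop that falls back to a fault-tolerant annotation. The function name, the result dictionary, and the `attempt_removal` callable are hypothetical; in this sketch the task is given up once the attempt count reaches the limit.

```python
def remove_with_retry_limit(task_id, attempt_removal, max_attempts, flow_version):
    """Try anomaly removal up to max_attempts times, then mark the task fault-tolerant.

    attempt_removal(task_id, attempt_number) is a caller-supplied function (hypothetical)
    that returns True when the recalculation finished without an anomaly.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if attempt_removal(task_id, attempts):
            return {
                "task_id": task_id,
                "attempts": attempts,
                "flow_version": flow_version,
                "fault_tolerant": False,
            }
    # Limit reached: annotate the count, the flow version, and the fault-tolerance info.
    return {
        "task_id": task_id,
        "attempts": attempts,
        "flow_version": flow_version,
        "fault_tolerant": True,
        "note": "no valid result from this computing unit; fault-tolerant result",
    }

if __name__ == "__main__":
    print(remove_with_retry_limit("task-1", lambda task, n: n == 2, 3, "3.1"))
```

The returned record carries everything the next computing unit needs to know about whether it is working from a valid or a fault-tolerant result.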
As shown in fig. 5, the anomaly detected by the anomaly determiner may be of a new type for which no corresponding anomaly removal flow has yet been set (no type), that is, the anomaly falls outside the range of conditions covered by the existing anomaly removal flows. In this case, the anomaly may be set as an anomaly of a pending type, and a fault-tolerant flow may be started to process it. The fault-tolerant flow may include a pass sub-flow and an annotate sub-flow. The pass sub-flow transmits the fault-tolerance information to the next calculation task, and the annotate sub-flow marks that the calculation did not obtain a reasonable result. Annotating and transmitting the fault-tolerance information allows subsequent strongly associated computing units to start the fault-tolerance mechanism synchronously, and allows a subsequent supplementary calculation process to recompute the affected results, so that the anomaly removal function is refined and optimized.
Thus, different anomaly types may reuse some of the same sub-flows, including, for example, the pass sub-flow. This effectively improves the efficiency of anomaly removal.
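The fault-tolerant flow for a pending-type anomaly can be sketched with two small functions, one per sub-flow. All names and the dictionary layout are hypothetical; the sketch only shows how an unknown type is routed to annotation and passing instead of to a removal flow.

```python
def annotate(result, reason):
    """Annotate sub-flow: mark that the calculation did not produce a reasonable result."""
    return {**result, "valid": False, "fault_tolerant": True, "reason": reason}

def pass_on(annotated_result, next_task):
    """Pass sub-flow: attach the fault-tolerance information to the next calculation task."""
    return {**next_task, "upstream_fault_tolerance": annotated_result}

def handle_anomaly(anomaly_type, known_flows, result, next_task):
    """Use the configured removal flow if one exists, otherwise start the fault-tolerant flow."""
    if anomaly_type in known_flows:
        return {"action": "remove", "steps": known_flows[anomaly_type]}
    annotated = annotate(result, f"no removal flow for anomaly type '{anomaly_type}'")
    return {"action": "fault_tolerant", "next_task": pass_on(annotated, next_task)}

if __name__ == "__main__":
    flows = {"scientific_compute": ["adjust_structure_file", "recalculate"]}
    outcome = handle_anomaly("pending", flows, {"energy": None}, {"name": "next-task"})
    print(outcome)
```

Because the annotation travels with the next task's input, a downstream unit can check `upstream_fault_tolerance` and decide whether to start its own fault-tolerance handling or queue a supplementary calculation.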
Because the anomaly information is determined from the acquired output file and the anomaly removal flow is called based on that anomaly information, the difficulty of packaging and maintaining the high-throughput computing platform can be greatly reduced compared with removing anomalies in the underlying code, and the convenience and efficiency of using the platform for calculation are improved.
In addition, the embodiment of the application determines the corresponding abnormality removal flow based on the abnormality type, can effectively adapt to the abnormality removal of different abnormality information of the same abnormality type, can effectively improve the robustness of the abnormality removal, improves the efficiency of the abnormality removal, and reduces the difficulty of system maintenance.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application.
Fig. 6 is a schematic diagram of an abnormality removing apparatus of a high-throughput computing platform according to an embodiment of the present application. The device comprises:
A file information acquiring unit 601, configured to acquire an output file of the high-throughput computing platform.
A parsing unit 602, configured to parse the output file to obtain the anomaly information of the high-throughput computing platform.
A step determining unit 603, configured to call, according to the correspondence between the anomaly information and the anomaly removal flow, two or more anomaly removal steps corresponding to the anomaly information obtained by parsing.
An anomaly removal unit, configured to determine the order of the two or more anomaly removal steps and to perform anomaly removal on the anomaly information.
The abnormality removing apparatus shown in fig. 6 corresponds to the abnormality removing method shown in fig. 2.
Fig. 7 is a schematic diagram of an abnormality removal apparatus of a high-throughput computing platform according to an embodiment of the present application. As shown in fig. 7, the abnormality removal apparatus 7 of the high-throughput computing platform of this embodiment includes: a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70, such as an anomaly removal program for a high-throughput computing platform. When executing the computer program 72, the processor 70 implements the steps in each of the above embodiments of the anomaly removal method of the high-throughput computing platform. Alternatively, when executing the computer program 72, the processor 70 performs the functions of the modules/units in the device embodiments described above.
By way of example, the computer program 72 may be partitioned into one or more modules/units that are stored in the memory 71 and executed by the processor 70 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 72 in the anomaly removal device 7 of the high-throughput computing platform.
The abnormality removing device 7 of the high-throughput computing platform may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The exception removal devices of the high-throughput computing platform may include, but are not limited to, a processor 70, a memory 71. It will be appreciated by those skilled in the art that fig. 7 is merely an example of the anomaly removal device 7 of the high-throughput computing platform and is not meant to be limiting of the anomaly removal device 7 of the high-throughput computing platform, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the anomaly removal device of the high-throughput computing platform may also include input-output devices, network access devices, buses, etc.
The processor 70 may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the abnormality removal device 7 of the high-throughput computing platform, for example, a hard disk or a memory of the abnormality removal device 7 of the high-throughput computing platform. The memory 71 may also be an external storage device of the abnormality removing device 7 of the high-throughput computing platform, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card) or the like provided on the abnormality removing device 7 of the high-throughput computing platform. Further, the memory 71 may also include both internal and external storage units of the anomaly removal device 7 of the high-throughput computing platform. The memory 71 is used to store the computer program and other programs and data required by the anomaly removal device of the high-throughput computing platform. The memory 71 may also be used for temporarily storing data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, all or part of the procedures in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and which, when executed by a processor, implements the steps of the respective method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method for exception removal for a high-throughput computing platform, the method comprising:
obtaining an output file of the high-throughput computing platform;
parsing the output file to obtain anomaly information of the high-throughput computing platform;
calling, according to a correspondence between the anomaly information and an anomaly removal flow, two or more anomaly removal steps corresponding to the anomaly information obtained by parsing;
determining an order of the two or more anomaly removal steps, and performing anomaly removal on the anomaly information.
2. The method according to claim 1, wherein calling, according to the correspondence between the anomaly information and the anomaly removal flow, two or more anomaly removal steps corresponding to the anomaly information obtained by parsing comprises:
determining an anomaly type corresponding to the anomaly information;
and calling, according to a correspondence between the anomaly type and the anomaly removal flow, two or more anomaly removal steps corresponding to the anomaly type.
3. The method of claim 2, wherein parsing the exception information for the high-throughput computing platform from the output file comprises:
parsing the output file to obtain a calculation completion identifier;
when the calculation completion identifier indicates that the calculation is completed, determining that no abnormality exists in the calculation of the high-throughput computing platform;
and when the calculation completion identifier indicates that the calculation is not completed, determining that the abnormality exists in the calculation of the high-throughput computing platform.
4. A method according to claim 3, wherein after the calculation completion identifier indicates that the calculation is complete, the method further comprises:
recording an anomaly removal record produced during the high-throughput calculation of a task and version information corresponding to the anomaly removal record, wherein the anomaly removal record and its corresponding version information are used to be called when a high-throughput calculation task is abnormal.
5. The method of any of claims 1-4, wherein after parsing the anomaly information for the high-throughput computing platform from the output file, the method further comprises:
when the anomaly information has no corresponding anomaly removal flow, or the number of anomaly removals for the anomaly information exceeds a preset count threshold, marking the anomaly information as a fault-tolerant calculation result, and transmitting the fault-tolerant calculation result to the computing unit of the next calculation.
6. The method according to any one of claims 1-4, wherein parsing the exception information of the high-throughput computing platform from the output file comprises:
parsing the output file to obtain the ion energy, ion position and/or ion velocity in the output file of the high-throughput computing platform;
determining ion energy change information according to the ion energy at different moments, determining ion position change information according to the ion positions at different moments, and/or determining ion velocity change information according to the ion velocities at different moments;
and under the condition that the ion energy change information, the ion position change information and/or the ion velocity change information meet the preset stability requirement, determining that the high-throughput computing platform has anomaly information of structure energy convergence.
7. The method according to claim 6, wherein calling, according to the correspondence between the anomaly information and the anomaly removal flow, two or more anomaly removal steps corresponding to the anomaly information obtained by parsing comprises:
when the anomaly information is that the relaxation state of the ions is abnormal or the ion energy is abnormal, adjusting the first structure file output by the high-throughput computing platform into a second structure file for error correction;
and inputting the second structure file into the high-throughput computing platform for error correction calculation.
8. An anomaly removal device for a high-throughput computing platform, the device comprising:
a file information acquisition unit, configured to acquire an output file of the high-throughput computing platform;
a parsing unit, configured to parse the output file to obtain anomaly information of the high-throughput computing platform;
a step determining unit, configured to call, according to a correspondence between the anomaly information and an anomaly removal flow, two or more anomaly removal steps corresponding to the anomaly information obtained by parsing;
and an anomaly removal unit, configured to determine an order of the two or more anomaly removal steps and to perform anomaly removal on the anomaly information.
9. An exception removal device for a high-throughput computing platform, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202410290480.3A 2024-03-14 2024-03-14 High-throughput computing platform, and anomaly removal method, device and storage medium thereof Pending CN117909907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410290480.3A CN117909907A (en) 2024-03-14 2024-03-14 High-throughput computing platform, and anomaly removal method, device and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410290480.3A CN117909907A (en) 2024-03-14 2024-03-14 High-throughput computing platform, and anomaly removal method, device and storage medium thereof

Publications (1)

Publication Number Publication Date
CN117909907A (en) 2024-04-19

Family

ID=90685433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410290480.3A Pending CN117909907A (en) 2024-03-14 2024-03-14 High-throughput computing platform, and anomaly removal method, device and storage medium thereof

Country Status (1)

Country Link
CN (1) CN117909907A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815294A (en) * 2019-02-14 2019-05-28 北京谷数科技有限公司 A kind of dereliction Node distribution parallel data storage method and system
CN111800336A (en) * 2020-08-06 2020-10-20 通维数码科技(上海)有限公司 Routing transmission implementation method based on multi-channel network link aggregation
CN113742125A (en) * 2021-09-06 2021-12-03 中国工程物理研究院计算机应用研究所 Lightweight high-throughput computing mode and fault-tolerant method thereof
CN115630107A (en) * 2022-10-31 2023-01-20 平安银行股份有限公司 Abnormal data processing method, electronic device and computer readable storage medium
CN116016278A (en) * 2022-12-22 2023-04-25 四川九州电子科技股份有限公司 Dynamic adjustment method for EDCA parameters
EP4211553A1 (en) * 2020-09-12 2023-07-19 Kinzinger Automation GmbH Method of interleaved processing on a general-purpose computing core
CN116996930A (en) * 2023-09-28 2023-11-03 深圳市鲸视科技有限公司 Wireless device testing method, system, computer device and storage medium
CN117135091A (en) * 2023-09-11 2023-11-28 广东云下汇金科技有限公司 DCI monitoring alarm method, system and computing device for data center
CN117194398A (en) * 2023-09-05 2023-12-08 中国工商银行股份有限公司 Abnormal file processing method and device, storage medium and electronic equipment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815294A (en) * 2019-02-14 2019-05-28 北京谷数科技有限公司 A kind of dereliction Node distribution parallel data storage method and system
CN111800336A (en) * 2020-08-06 2020-10-20 通维数码科技(上海)有限公司 Routing transmission implementation method based on multi-channel network link aggregation
EP4211553A1 (en) * 2020-09-12 2023-07-19 Kinzinger Automation GmbH Method of interleaved processing on a general-purpose computing core
CN113742125A (en) * 2021-09-06 2021-12-03 中国工程物理研究院计算机应用研究所 Lightweight high-throughput computing mode and fault-tolerant method thereof
CN115630107A (en) * 2022-10-31 2023-01-20 平安银行股份有限公司 Abnormal data processing method, electronic device and computer readable storage medium
CN116016278A (en) * 2022-12-22 2023-04-25 四川九州电子科技股份有限公司 Dynamic adjustment method for EDCA parameters
CN117194398A (en) * 2023-09-05 2023-12-08 中国工商银行股份有限公司 Abnormal file processing method and device, storage medium and electronic equipment
CN117135091A (en) * 2023-09-11 2023-11-28 广东云下汇金科技有限公司 DCI monitoring alarm method, system and computing device for data center
CN116996930A (en) * 2023-09-28 2023-11-03 深圳市鲸视科技有限公司 Wireless device testing method, system, computer device and storage medium

Similar Documents

Publication Publication Date Title
US10693711B1 (en) Real-time event correlation in information networks
US6836750B2 (en) Systems and methods for providing an automated diagnostic audit for cluster computer systems
US9015006B2 (en) Automated enablement of performance data collection
US9317393B2 (en) Memory leak detection using transient workload detection and clustering
US10489232B1 (en) Data center diagnostic information
CN105205003A (en) Automated testing method and device based on clustering system
CN110062926B (en) Device driver telemetry
US9400731B1 (en) Forecasting server behavior
Trivedi et al. Combining performance and availability analysis in practice
CN110543427B (en) Test case storage method and device, electronic equipment and storage medium
US20200364595A1 (en) Configuration assessment based on inventory
CN112527484A (en) Workflow breakpoint continuous running method and device, computer equipment and readable storage medium
CN112559285A (en) Distributed service architecture-based micro-service monitoring method and related device
US9563719B2 (en) Self-monitoring object-oriented applications
CN111913824A (en) Method for determining data link fault reason and related equipment
CN114519006A (en) Test method, device, equipment and storage medium
CN112598529B (en) Data processing method and device, computer readable storage medium and electronic equipment
CN111082964B (en) Distribution method and device of configuration information
CN117909907A (en) High-throughput computing platform, and anomaly removal method, device and storage medium thereof
GB2504496A (en) Removing code instrumentation based on the comparison between collected performance data and a threshold
US20170024745A1 (en) Network management event escalation
CN113259878A (en) Call bill settlement method, system, electronic device and computer readable storage medium
US10467082B2 (en) Device driver verification
CN117453376B (en) Control method, device, equipment and storage medium for high-throughput calculation
CN112306831A (en) Computing cluster error prediction method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination