CN116723085A - Service conflict processing method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN116723085A
Authority
CN
China
Prior art keywords
target
service
node
processing operation
service conflict
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310868738.9A
Other languages
Chinese (zh)
Inventor
刘晓健
苏宝珠
董建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310868738.9A
Publication of CN116723085A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application provide a method and a device for processing service conflicts, a storage medium and an electronic device, wherein the method includes: acquiring target state information of a node to be detected, wherein the target state information is used for representing state information of service operation of the node to be detected; in a case where it is determined according to the target state information that a service conflict of a target type exists on the node to be detected, determining, according to a pre-training model, a target processing operation corresponding to the service conflict of the target type, wherein the pre-training model includes a group of fault models, and each fault model in the group of fault models is a model obtained by training with a data packet set generated when the corresponding type of service conflict occurs; and in a case where the target processing operation is determined, executing the target processing operation to repair the service conflict of the target type. The embodiments of the present application solve the technical problem in the related art of low processing efficiency for service conflicts.

Description

Service conflict processing method and device, storage medium and electronic device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for processing service conflicts, a storage medium and an electronic device.
Background
With the development of technology, the demand for AI platforms for deep learning and for resource scheduling and management keeps growing, and platform deployments are increasing by orders of magnitude, so stable operation and maintenance of such platforms is of great importance. However, misoperation by platform users, deployment of service components that conflict with the cluster, and other conditions frequently cause abnormalities in the underlying services of the AI cluster; in such cases the normal state cannot be restored automatically, and the overall impact is large. In the related art, when a conflicting service component exists on an underlying server of the cluster and the platform becomes abnormal, manual localization and processing are required for recovery, and localization is mainly performed by consulting a platform problem repair manual. The conflicting component cannot be analyzed, located, processed and repaired automatically. Handling service conflicts by manual localization consumes considerable manpower, the localization time cannot be controlled, the efficiency is low, and accurate localization and repair may be impossible due to the experience and technical limitations of operation and maintenance personnel. It can be seen that the related art suffers from low processing efficiency for service conflicts.
Aiming at the technical problem of low processing efficiency of service conflict in the related technology, no effective solution is proposed at present.
Disclosure of Invention
The embodiment of the application provides a method and a device for processing service conflicts, a storage medium and an electronic device, which are used for at least solving the technical problem of low processing efficiency of the service conflicts in the related technology.
According to an embodiment of the present application, there is provided a method for processing a service conflict, including: acquiring target state information of a node to be detected, wherein the target state information is used for representing state information of service operation of the node to be detected; determining a target processing operation corresponding to the target type of service conflict according to a pre-training model under the condition that the target type of service conflict exists in the node to be detected according to the target state information, wherein the target processing operation is used for repairing the target type of service conflict, the pre-training model comprises a group of fault models, and each fault model in the group of fault models is a model obtained by training a data packet set generated by occurrence of a corresponding type of service conflict; and executing the target processing operation under the condition that the target processing operation is determined so as to repair the service conflict of the target type.
In an exemplary embodiment, after obtaining the target state information of the node to be detected, the method further comprises: judging whether the node to be detected has service conflict or not according to the target state information; judging whether the service conflict is generated by pre-deployed service or not under the condition that the service conflict exists in the node to be detected; and under the condition that the service conflict is not the service conflict generated by the pre-deployed service, determining that the target type of service conflict exists in the node to be detected.
In an exemplary embodiment, the obtaining the target state information of the node to be detected includes: and acquiring first state information of the operation of a group of micro service components of the node to be detected and second state information of the bottom container service of the node to be detected, wherein the target state information comprises the first state information and the second state information.
In one exemplary embodiment, before determining the target processing operation corresponding to the service conflict of the target type according to the pre-training model, the method further comprises: determining that the node to be detected has a first type of service conflict under the condition that the first state information indicates abnormal interaction of the group of micro service components; determining that the node to be detected has a second type of service conflict under the condition that the second state information indicates that the container corresponding to the bottom container service is not in an operation state; wherein the target type of service conflict comprises the first type of service conflict and the second type of service conflict.
In an exemplary embodiment, the determining, according to a pre-training model, a target processing operation corresponding to a service conflict of the target type includes: acquiring target log information of the node to be detected, wherein the target log information comprises the log information generated when service corresponding to a group of micro service components of the node to be detected and/or service conflict of the target type occurs to bottom container service of the node to be detected; and detecting the target log information by using the pre-training model to obtain an identifier of the target processing operation, wherein the identifier of the target processing operation is used for indicating the node to be detected to execute the target processing operation indicated by the identifier of the target processing operation so as to repair the service conflict of the target type.
In an exemplary embodiment, the detecting the target log information by using the pre-training model to obtain the identification of the target processing operation includes: extracting a set of target feature values of the target log information by using the pre-training model, wherein the set of target feature values comprise feature values of a set of parameters in the target log information; comparing the set of target characteristic values with a set of preset characteristic values in the pre-training model to obtain a comparison result, wherein the set of preset characteristic values represent preset characteristic values corresponding to the set of parameters in the pre-training model; and determining the identification of the target processing operation based on the comparison result.
In an exemplary embodiment, the comparing the set of target feature values with a set of preset feature values in the pre-training model to obtain a comparison result includes: determining variance values between the set of target feature values and N sets of preset feature values respectively to obtain N variance values, wherein the N sets of preset feature values are preset feature values corresponding to N fault models in the pre-training model, each fault model in the N fault models corresponds to a set of preset feature values respectively, each variance value in the N variance values is a variance value between the set of target feature values and a set of preset feature values in the N sets of preset feature values, the comparison result comprises the N variance values, and N is a positive integer greater than or equal to 1; the determining, based on the comparison result, an identification of the target processing operation includes: and under the condition that an ith variance value in the N variance values is smaller than or equal to an ith preset repair threshold value, determining the identification of the processing operation corresponding to the ith preset repair threshold value as the identification of the target processing operation, wherein the ith preset repair threshold value is a repair threshold value corresponding to an ith fault model in the N fault models, and i is a positive integer smaller than or equal to N.
In an exemplary embodiment, the method further comprises: and training the pre-training model by using the target log information under the condition that each variance value in the N variance values is larger than a preset restoration threshold value corresponding to each variance value.
In one exemplary embodiment, before determining the target processing operation corresponding to the service conflict of the target type according to the pre-training model, the method further comprises: obtaining an ith fault model in the group of fault models, wherein the ith fault model corresponds to an ith type of service conflict, and the ith fault model is used for determining a processing operation corresponding to the ith type of service conflict, and i is a positive integer greater than or equal to 1: acquiring an ith data packet set and an identification of an actual processing operation corresponding to the ith data packet set, wherein the ith data packet set comprises a sample log information set generated when the ith type of service conflict occurs; training an ith initial fault model by using the ith data packet set until a loss value between an identification of a prediction processing operation output by the ith initial fault model and an identification of an actual processing operation meets a preset convergence condition, ending training, and determining the ith initial fault model at the end of training as the ith fault model, wherein parameters in the ith initial fault model are adjusted under the condition that the loss value does not meet the convergence condition.
According to still another embodiment of the present application, there is provided a service conflict processing apparatus, including: the system comprises an acquisition module, a detection module and a detection module, wherein the acquisition module is used for acquiring target state information of a node to be detected, and the target state information is used for representing the state information of service operation of the node to be detected; the first determining module is used for determining a target processing operation corresponding to the service conflict of the target type according to a pre-training model under the condition that the service conflict of the target type exists in the node to be detected according to the target state information, wherein the target processing operation is used for repairing the service conflict of the target type, the pre-training model comprises a group of fault models, and each fault model in the group of fault models is a model obtained by training a data packet set generated by occurrence of the corresponding service conflict of one type; and the processing module is used for executing the target processing operation under the condition that the target processing operation is determined so as to repair the service conflict of the target type.
According to a further embodiment of the application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the embodiment of the application, the target processing operation corresponding to the service conflict of the target type is determined according to the pre-training model by acquiring the target state information of the node to be detected under the condition that the service conflict of the target type exists in the node to be detected according to the target state information, wherein the target processing operation is used for repairing the service conflict of the target type, the pre-training model comprises a group of fault models, and each fault model in the group of fault models is a model obtained by training a data packet set generated by the occurrence of the corresponding service conflict of the type. When detecting that the node to be detected has the service conflict of the target type, the method can automatically determine the target processing operation corresponding to the service conflict of the target type according to the pre-training model, and then repair the service conflict of the target type according to the target processing operation. The problem that the processing time is long because the service conflict is needed to be positioned and repaired manually in the related technology is avoided. Therefore, the technical problem of low processing efficiency of the service conflict in the related technology can be solved, and the effect of improving the processing efficiency of the service conflict is achieved.
Drawings
FIG. 1 is a schematic diagram of a hardware environment of a server of a method for handling service conflicts according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of handling service conflicts in accordance with an embodiment of the present application;
FIG. 3 is a flow chart of the handling of a service exception according to an embodiment of the application;
FIG. 4 is a flow chart of a service conflict detection and recovery method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a pre-training model evaluation process according to an embodiment of the present application;
fig. 6 is a block diagram of a processing apparatus for service conflict according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiments provided by the embodiments of the present application may be performed in a server, a computer terminal, a device terminal, or a similar computing apparatus. Taking operation on a server as an example, fig. 1 is a schematic diagram of a hardware environment of a server for a service conflict processing method according to an embodiment of the present application. As shown in fig. 1, the server may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and in one exemplary embodiment, the server may further include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the server described above. For example, the server may also include more or fewer components than shown in fig. 1, or have a configuration that is functionally equivalent to, or provides more functions than, the configuration shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for handling service conflicts in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located with respect to the processor 102, which may be connected to a server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a server. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a method for processing a service conflict is provided, and fig. 2 is a flowchart of a method for processing a service conflict according to an embodiment of the present application, as shown in fig. 2, where the flowchart includes the following steps:
step S202, obtaining target state information of a node to be detected, wherein the target state information is used for representing state information of service operation of the node to be detected;
step S204, under the condition that the target type service conflict exists in the node to be detected according to the target state information, determining a target processing operation corresponding to the target type service conflict according to a pre-training model, wherein the target processing operation is used for repairing the target type service conflict, the pre-training model comprises a group of fault models, and each fault model in the group of fault models is a model obtained by training a data packet set generated by occurrence of a corresponding type of service conflict;
And step S206, executing the target processing operation to repair the service conflict of the target type under the condition that the target processing operation is determined.
Through the steps, the target processing operation corresponding to the service conflict of the target type is determined according to the pre-training model under the condition that the service conflict of the target type exists in the node to be detected according to the target state information by acquiring the target state information of the node to be detected, wherein the pre-training model comprises a group of fault models, and each fault model in the group of fault models is a model obtained by training a data packet set generated by the occurrence of the corresponding service conflict of the type. When detecting that the node to be detected has the service conflict of the target type, the method can automatically determine the target processing operation corresponding to the service conflict of the target type according to the pre-training model, and then repair the service conflict of the target type according to the target processing operation. The problem that the processing time is long because the service conflict is needed to be positioned and repaired manually in the related technology is avoided. Therefore, the technical problem of low processing efficiency of the service conflict in the related technology can be solved, and the effect of improving the processing efficiency of the service conflict is achieved.
The main execution body of the steps may be a server, such as a server or a node in a cluster, or a processor, or a device end, or a controller, or an application control program in a device, or a processor with man-machine interaction capability configured on a storage device, or a processing device or a processing unit with similar processing capability, but is not limited thereto.
In the technical solution provided in step S202, taking an AI platform cluster as an example, a detection and repair service may be deployed on each master node of the AI platform, and when a checkservice timing task is enabled, the target state information of each node of the AI platform may be detected. The target state information is used for indicating state information of service operation of the node to be detected; for example, the target state information may indicate the running state information of a group of micro service components of the node to be detected, and/or the state information of the underlying container service (which may also be called the underlying POD service) of the node to be detected. For example, the interaction state between each service of the AI platform, such as ibase/iresource, and the underlying service components (or micro service components) may be checked, and the state and "up" duration of each service container on the cluster master node may be queried through a docker command. For the check of service abnormality, in practical application, the log information of the underlying service components of the AI platform may be checked, and the abnormal state is obtained from the ERROR log information. The target state information may be fault log information of the node to be detected and/or error reporting information of underlying service abnormalities. As an optional embodiment, the node to be detected may be a slave node in the AI platform cluster; as another optional embodiment, the node to be detected may also be a master node in the AI platform cluster.
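As an illustrative aid only (not part of the claimed method), the following minimal Python sketch shows how such a checkservice-style check might gather container states and ERROR log lines on a node; the command strings, log paths and the collect_target_state helper are assumptions introduced here for illustration.

    import subprocess

    def collect_target_state(log_paths):
        """Hypothetical sketch: query container states via docker and scan logs for ERROR lines."""
        # Query the state and "up" duration of each service container on the node.
        ps = subprocess.run(
            ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
            capture_output=True, text=True, check=False,
        )
        container_states = dict(
            line.split("\t", 1) for line in ps.stdout.splitlines() if "\t" in line
        )

        # Collect ERROR lines from the underlying service component logs.
        error_lines = []
        for path in log_paths:
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    error_lines.extend(l.rstrip() for l in f if "ERROR" in l)
            except FileNotFoundError:
                continue

        # The "target state information": container states plus error reporting information.
        return {"containers": container_states, "errors": error_lines}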
In the technical solution provided in step S204, when it is determined according to the target state information that the node to be detected has a service conflict of the target type, for example, when it is determined according to the fault log of the node to be detected and the error reporting information of the underlying services that a service conflict of the target type exists, the service conflict of the target type may be a service conflict generated by a service that is not pre-deployed by the AI platform, that is, an off-platform service conflict. For example, in practical application, misoperation by a platform user or deployment of service components that conflict with the cluster may cause abnormalities in the underlying services of the AI cluster, and such service abnormalities generally cannot be recovered automatically. The service conflict of the target type may be a mariadb service conflict, which may cause abnormal synchronization and frequent restarting of the galera cluster; or a docker service conflict, which destroys the /var/lib data of the cluster's original docker service and causes large-scale error reporting in the cluster; or an influxdb service conflict, which makes the writing of cluster monitoring data abnormal; or a telegraf service conflict, which makes the execution of monitoring scripts abnormal; and the like. When it is detected that the node to be detected has a service conflict of the target type (which may also be called an off-platform service conflict), a target processing operation (or target processing scheme, or target repair scheme) corresponding to the service conflict of the target type is determined according to a pre-training model, wherein the pre-training model includes a group of fault models, and each fault model in the group of fault models is a model obtained by training with a data packet set generated by a corresponding type of service conflict; for example, the data packet set may include a set of log information generated when that type of service conflict occurs. In this way, when it is determined that a service conflict of the target type exists, the target processing operation (or target processing scheme) corresponding to the service conflict of the target type can be determined according to the pre-training model. Through this step, the purpose of determining, by using the pre-training model, the target processing operation (or target processing scheme) corresponding to the service conflict of the target type is achieved.
In the technical solution provided in step S206, when the target processing operation (or the target processing solution) is determined, the target processing operation is executed to repair the service conflict of the target type, for example, when the master node detects that the master node or the slave node has the service conflict of the target type, a repair task (corresponding to the target processing operation or the target processing solution) may be issued to enable the relevant master node or slave node to execute the repair operation. Through the step, the self-repairing operation of the service conflict of the target type can be completed.
In the above embodiment, when detecting that the node to be detected has the service conflict of the target type, the target processing operation corresponding to the service conflict of the target type may be automatically determined according to the pre-training model, and then the service conflict of the target type may be repaired according to the target processing operation. The problem that the processing time is long because the service conflict is needed to be positioned and repaired manually in the related technology is avoided. Therefore, the technical problem of low processing efficiency of the service conflict in the related technology can be solved, and the effect of improving the processing efficiency of the service conflict is achieved.
In an alternative embodiment, after obtaining the target state information of the node to be detected, the method further comprises: judging whether the node to be detected has service conflict or not according to the target state information; judging whether the service conflict is generated by pre-deployed service or not under the condition that the service conflict exists in the node to be detected; and under the condition that the service conflict is not the service conflict generated by the pre-deployed service, determining that the target type of service conflict exists in the node to be detected.
In the above embodiment, after the target state information of the node to be detected is acquired, for example, after the state information of a group of micro service components of the node to be detected and/or the state information of the underlying container service (which may also be called the underlying POD service) of the node to be detected is acquired, whether a service conflict exists on the node to be detected may be determined according to the target state information. When it is determined that a service conflict exists, it is judged whether the service conflict is generated by a pre-deployed service, for example, whether it is a service conflict generated by a service pre-deployed by the AI platform (which may also be called an in-platform service abnormality or conflict); if it is determined that the service conflict is not generated by a pre-deployed service, it may be determined that the node to be detected has a service conflict of the target type, which may also be called a non-platform service conflict. Through this embodiment, after the target state information of the node to be detected is acquired, whether the node to be detected has a non-platform service conflict is judged according to the target state information.
In an optional embodiment, the obtaining the target state information of the node to be detected includes: and acquiring first state information of the operation of a group of micro service components of the node to be detected and second state information of the bottom container service of the node to be detected, wherein the target state information comprises the first state information and the second state information.
In the above embodiment, the first state information of the operation of a group of micro service components of the node to be detected and the second state information of the underlying container service (which may also be called the underlying POD service) of the node to be detected may be acquired. For example, the interaction state between each service of the AI platform, such as ibase/iresource, and the underlying service components (or micro service components) may be checked, and the state and "up" duration of each service container on the cluster master node may be queried through a docker command. For the check of service abnormality, the log information of the underlying service components of the AI platform may be checked, and the abnormal state is obtained from the ERROR log information; the target state information may be fault log information of the node to be detected and/or error reporting information of underlying service abnormalities. In practical application, a detection and repair service may be deployed on each master node of the AI platform, and a checkservice timing task is started to detect whether a service conflict problem occurs. According to this embodiment, the target state information of the node to be detected can be determined from the log information of each underlying service component in the AI platform cluster and the ERROR log information of the underlying POD services.
In an alternative embodiment, before determining the target processing operation corresponding to the service conflict of the target type according to the pre-trained model, the method further comprises: determining that the node to be detected has a first type of service conflict under the condition that the first state information indicates abnormal interaction of the group of micro service components; determining that the node to be detected has a second type of service conflict under the condition that the second state information indicates that the container corresponding to the bottom container service is not in an operation state; wherein the target type of service conflict comprises the first type of service conflict and the second type of service conflict.
In the above embodiment, when the first state information indicates that the interaction of the group of micro service components of the node to be detected is abnormal, it may be determined that the node to be detected has a service conflict of the first type. For example, when the checkservice timing task is enabled, the interaction state between each service of the AI platform, such as ibase/iresource, and the underlying service components (or micro service components) may be checked, and when the first state information indicates that the interaction of the group of micro service components is abnormal, that is, when the micro service components (or underlying service components) of the node have an interaction abnormality, it may be determined that the node to be detected has a service conflict of the first type. When the second state information indicates that the underlying container service (or underlying POD service) of the node to be detected is not in a running state, it may be determined that the node to be detected has a service conflict of the second type. For example, when the checkservice timing task is enabled, the state and "up" duration of each service container on the cluster master node may be queried through a docker command, and when the second state information indicates that the underlying POD service is not in a running state, it may be determined that the node to be detected has a service conflict of the second type; for example, the service conflict of the second type may be a docker service conflict, which may destroy the /var/lib data of the cluster's original docker service and cause large-scale error reporting in the cluster. The first state information may be the log information of the underlying service components of the AI platform, and the second state information may be the error reporting log information of underlying POD service abnormalities. Optionally, in practical application, the target state information may also be used to represent other types of service conflicts. This embodiment achieves the purpose of determining, according to the target state information, whether the node to be detected has a service conflict of the first type or of the second type.
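To illustrate the two conflict types described above, the sketch below shows one way the first and second state information might be mapped to conflict types; the type names and the structure of the state dictionaries are hypothetical.

    FIRST_TYPE = "microservice_interaction_conflict"   # a group of micro service components interacts abnormally
    SECOND_TYPE = "underlying_container_conflict"      # an underlying container/POD service is not running

    def classify_conflict(first_state, second_state):
        """Hypothetical mapping from the two kinds of state information to target conflict types."""
        conflicts = []
        # First type: the first state information indicates abnormal interaction
        # between the platform services and the micro service components.
        if any(status != "ok" for status in first_state.values()):
            conflicts.append(FIRST_TYPE)
        # Second type: the second state information indicates that a container
        # corresponding to the underlying container service is not in a running state.
        if any(not status.startswith("Up") for status in second_state.values()):
            conflicts.append(SECOND_TYPE)
        return conflicts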
In an alternative embodiment, the determining, according to a pre-training model, a target processing operation corresponding to a service conflict of the target type includes: acquiring target log information of the node to be detected, wherein the target log information comprises the log information generated when service corresponding to a group of micro service components of the node to be detected and/or service conflict of the target type occurs to bottom container service of the node to be detected; and detecting the target log information by using the pre-training model to obtain an identifier of the target processing operation, wherein the identifier of the target processing operation is used for indicating the node to be detected to execute the target processing operation indicated by the identifier of the target processing operation so as to repair the service conflict of the target type.
In the above embodiment, the target log information of the node to be detected may be obtained, where the target log information may include the log information generated when the service corresponding to a group of micro service components of the node to be detected and/or the underlying container service of the node to be detected collides with the service of the target type, and for example, the target log information may be the log information of the underlying service component of the AI platform and/or the underlying POD service exception reporting log information; and then detecting the target log information by using a pre-training model to obtain the identification of target processing operation, wherein the identification of target processing operation is used for indicating to execute corresponding target processing operation so as to repair the service conflict of the target type, and each identification of target processing operation corresponds to one target processing scheme (or target repairing scheme). The pre-training models may include a set of fault models, for example, the pre-training models include 5 (or 10, or other number) fault models, each fault model in the set of fault models is a model obtained by training a set of data packets generated by a corresponding type of service conflict, for example, the 5 fault models are each obtained by training a set of data packets generated by 5 different types of service conflicts, and the set of data packets may include a set of log information generated when a certain type of service conflict occurs; thus, when the determined service conflict of the target type exists, a target processing operation (or target processing scheme) corresponding to the service conflict of the target type can be determined according to the pre-training model. According to the embodiment, when the service conflict occurs in the nodes to be detected of the AI platform cluster, a repairing scheme can be determined according to the target log information so as to achieve the purpose of automatically repairing the service conflict, and further the effect of maintaining the stability of the platform is achieved.
Taking a telegraf service conflict as an example, in practical application, the directory files under the path where the platform's own telegraf service is located are first protected and backed up; "systemctl stop telegraf" stops the telegraf service installed by the user, and the underlying configuration file /etc/telegraf.conf is deleted (for some services, the node needs to be restarted before the configuration information is completely cleared). The backup data of the platform's own service is then restored to the path designated at installation and deployment, and the telegraf service is restarted; after the service is started, the platform cluster performs self-checking and reconnection to restore normal use.
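The telegraf repair flow just described could be scripted roughly as follows; this is a hedged sketch that mirrors the steps in the text (back up, stop, delete configuration, restore, restart), and the backup and restore paths are placeholders rather than paths defined by the application.

    import shutil
    import subprocess

    def repair_telegraf_conflict(platform_telegraf_dir, backup_dir, deploy_path):
        """Sketch of the repair steps for a telegraf service conflict (paths are illustrative)."""
        # 1. Protect and back up the directory of the platform's own telegraf service.
        shutil.copytree(platform_telegraf_dir, backup_dir, dirs_exist_ok=True)

        # 2. Stop the telegraf service installed by the user.
        subprocess.run(["systemctl", "stop", "telegraf"], check=False)

        # 3. Delete the underlying configuration file left by the conflicting service.
        subprocess.run(["rm", "-f", "/etc/telegraf.conf"], check=False)

        # 4. Restore the platform's own backup data to the path designated at deployment,
        #    then restart the telegraf service used by the platform.
        shutil.copytree(backup_dir, deploy_path, dirs_exist_ok=True)
        subprocess.run(["systemctl", "restart", "telegraf"], check=False)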
In an optional embodiment, the detecting the target log information by using the pre-training model to obtain the identifier of the target processing operation includes: extracting a set of target feature values of the target log information by using the pre-training model, wherein the set of target feature values comprise feature values of a set of parameters in the target log information; comparing the set of target characteristic values with a set of preset characteristic values in the pre-training model to obtain a comparison result, wherein the set of preset characteristic values represent preset characteristic values corresponding to the set of parameters in the pre-training model; and determining the identification of the target processing operation based on the comparison result.
In the above embodiment, the pre-training model may be used to extract a set of target feature values from the target log information, where the set of target feature values may include feature values of a set of parameters in the target log information, and the target log information may be the container logs and service fault information of the service conflict. For example, the set of parameters may be parameters related to or generated during the operation of the micro service components of the node to be detected, and/or parameters related to or generated during the operation of the underlying container service of the node to be detected. The set of parameters may include one or more feature points, for example feature points of error codes contained in the log information, such as "jdbc ... try restarting transaction", "level=error msg=", or "failed to open gcomm backend connection". According to the different feature points in the data set, when operation and maintenance personnel obtain a new error log during later development of the platform project, they process it into feature points and supplement them into the data set model for comparison and identification. The set of target feature values is compared with a set of preset feature values in the pre-training model to obtain a comparison result, where the pre-training model includes a group of fault models and each fault model in the group of fault models corresponds to a set of preset feature values; the identification of the target processing operation is then determined according to the comparison result. For example, the variance between the set of target feature values and a set of preset feature values is calculated, and when the variance meets the requirement of the repair threshold corresponding to a target fault model (for example, one fault model in the group of fault models), the target processing operation (or target processing scheme) corresponding to that target fault model is taken as the repair scheme for the abnormal conflict that generated the target log information. According to this embodiment, the service conflict problem can be quickly located through pre-training model evaluation of the target log information, and the repair scheme corresponding to the service conflict can be determined, thereby avoiding the manpower consumption and low efficiency caused in the related art by manually locating and processing the service conflict, and achieving the effect of improving the processing efficiency of service conflicts.
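As a small illustration of turning log text into feature values over such feature points, a sketch follows; the feature-point list reuses the error patterns quoted above, while the counting scheme itself is an assumption.

    import re

    # Feature points quoted above; the list can be extended by operation and
    # maintenance personnel as new error logs are encountered.
    FEATURE_POINTS = [
        r"try restarting transaction",
        r"level=error msg=",
        r"failed to open gcomm backend connection",
    ]

    def extract_feature_values(log_text):
        """Map target log information to one feature value per feature point (here: occurrence counts)."""
        return [len(re.findall(pattern, log_text)) for pattern in FEATURE_POINTS]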
In an alternative embodiment, the comparing the set of target feature values with a set of preset feature values in the pre-training model to obtain a comparison result includes: determining variance values between the set of target feature values and N sets of preset feature values respectively to obtain N variance values, wherein the N sets of preset feature values are preset feature values corresponding to N fault models in the pre-training model, each fault model in the N fault models corresponds to a set of preset feature values respectively, each variance value in the N variance values is a variance value between the set of target feature values and a set of preset feature values in the N sets of preset feature values, the comparison result comprises the N variance values, and N is a positive integer greater than or equal to 1; the determining, based on the comparison result, an identification of the target processing operation includes: and under the condition that an ith variance value in the N variance values is smaller than or equal to an ith preset repair threshold value, determining the identification of the processing operation corresponding to the ith preset repair threshold value as the identification of the target processing operation, wherein the ith preset repair threshold value is a repair threshold value corresponding to an ith fault model in the N fault models, and i is a positive integer smaller than or equal to N.
In the above embodiment, optionally, the pre-training model includes N fault models, and each of the N fault models has preset feature values corresponding to the set of parameters, that is, there are N sets of preset feature values in total; the variance values between the set of target feature values of the target log information and the N sets of preset feature values can be determined, so as to obtain N variance values. Each of the N fault models corresponds to a preset repair threshold, and when the i-th variance value among the N variance values is smaller than or equal to the i-th preset repair threshold, the identification of the processing operation corresponding to the i-th preset repair threshold can be determined as the identification of the target processing operation, that is, the identification of the repair scheme corresponding to the i-th fault model is determined as the identification of the target processing operation. Optionally, in practical application, the optimal variance value may first be determined from the N variance values, that is, the smallest variance value is selected from the N variance values, and it is then determined whether the optimal variance value meets the requirement of the corresponding preset repair threshold. In this embodiment, by comparing the variance values with the preset repair threshold corresponding to each fault model's repair scheme, the identification of the target processing operation (i.e., the identification of the target repair scheme) can be obtained when a variance value meets the requirement, so that the corresponding repair operation is executed. Optionally, in practical application, the preset repair threshold corresponding to each fault model may be adjusted according to the state of the AI cluster.
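A minimal numeric sketch of this selection rule follows, assuming each fault model is reduced to a vector of preset feature values plus a repair threshold; the data layout and function name are assumptions, not the application's implementation.

    from statistics import mean

    def select_repair_operation(target_features, fault_models):
        """Pick the repair-operation identifier whose fault model best matches the target features.

        fault_models: list of dicts such as
            {"repair_id": "scheme-3", "preset_features": [...], "repair_threshold": 0.05}
        """
        best = None
        for model in fault_models:
            preset = model["preset_features"]
            # Variance value between the target feature values and this model's preset
            # values (here taken as the mean squared deviation over the set of parameters).
            variance = mean((t - p) ** 2 for t, p in zip(target_features, preset))
            if variance <= model["repair_threshold"]:
                if best is None or variance < best[0]:
                    best = (variance, model["repair_id"])
        # If no variance value meets its repair threshold, no identifier is returned; the
        # caller may store the log data set and retrain the model library, as described below.
        return best[1] if best else None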
In an alternative embodiment, the method further comprises: and training the pre-training model by using the target log information under the condition that each variance value in the N variance values is larger than a preset restoration threshold value corresponding to each variance value.
In the foregoing embodiment, when each variance value of the N variance values is greater than a preset repair threshold corresponding to the N fault models, the target log information may be used as an alternative data packet set to be used for training the pre-training model, or the pre-training model may be trained using the target log information. If all the N variance values can not meet the repair threshold, the identification of the target processing operation can not be obtained, namely, the identification of the repair operation can not be obtained, the current error information data set storage operation is executed, after the operation and maintenance personnel find a manual solution, the model information is trained again according to the error reporting information to enter a model library, and the corresponding repair scheme is updated.
In an alternative embodiment, before determining the target processing operation corresponding to the service conflict of the target type according to the pre-trained model, the method further comprises: obtaining an ith fault model in the group of fault models, wherein the ith fault model corresponds to an ith type of service conflict, and the ith fault model is used for determining a processing operation corresponding to the ith type of service conflict, and i is a positive integer greater than or equal to 1: acquiring an ith data packet set and an identification of an actual processing operation corresponding to the ith data packet set, wherein the ith data packet set comprises a sample log information set generated when the ith type of service conflict occurs; training an ith initial fault model by using the ith data packet set until a loss value between an identification of a prediction processing operation output by the ith initial fault model and an identification of an actual processing operation meets a preset convergence condition, ending training, and determining the ith initial fault model at the end of training as the ith fault model, wherein parameters in the ith initial fault model are adjusted under the condition that the loss value does not meet the convergence condition.
In the above embodiment, each fault model in the group of fault models is obtained by classifying and training various fault logs and the data set of the bottom layer service abnormality error reporting information in advance according to the difference of the service conflict fault automatic repairing methods; for example, for the ith fault model, by acquiring the ith data packet set and the identifier of the actual processing operation corresponding to the ith data packet set (which may be understood as the identifier of the processing scheme or the identifier of the repairing scheme), training the ith initial fault model by using the ith data packet set until the loss value between the identifier of the predicted processing operation output by the ith initial fault model and the identifier of the actual processing operation meets the preset convergence condition, and determining the ith initial fault model at the end of training as the ith fault model. Through the embodiment, the aim of obtaining each fault model by utilizing the respective training of various data packet sets is fulfilled.
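As a schematic of this training procedure (a loss between the predicted and actual repair-operation identifiers, parameter adjustment until a convergence condition is met), a generic sketch is given below; the model interface, loss function and optimizer are placeholders, since this passage does not fix a specific network architecture.

    def train_fault_model(initial_model, packet_set, actual_repair_id,
                          loss_fn, optimizer, epsilon=1e-3, max_steps=10_000):
        """Train the i-th initial fault model on the i-th data packet set (schematic only)."""
        model = initial_model
        for _ in range(max_steps):
            predicted_repair_id = model.predict(packet_set)        # identification of the predicted processing operation
            loss = loss_fn(predicted_repair_id, actual_repair_id)  # loss against the identification of the actual operation
            if loss <= epsilon:                                    # preset convergence condition satisfied
                break
            optimizer.step(model, loss)                            # adjust parameters and continue training
        return model                                               # the trained i-th fault model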
It will be apparent that the embodiments described above are merely some, but not all, embodiments of the application. The present application will be specifically described with reference to examples.
The embodiment of the application provides an AI (artificial intelligence) platform cluster fault recovery method based on underlying-service rejection training, which, for the fault data set collected when a service conflict occurs, selects the most suitable pre-trained fault model from a pre-training model library to execute the corresponding rejection and recovery operation. The method specifically includes the following steps:
(1) And deploying detection repair service at each main node of the AI platform.
(2) A checkservice timing task is started to detect whether the micro service components of each node of the AI platform and the K8S underlying pod service states are normal (key checkpoints: a. querying the state and "up" duration of each service container on the cluster master node through a docker command; b. checking the interaction state between each service of the AI platform, such as ibase/iresource, and the underlying service components; c. checking the log information of the underlying service components of the AI platform and acquiring abnormal states from ERROR log information).
(3) The timing task detects a service conflict problem that cannot be repaired (for example, a mariadb service conflict, which causes abnormal synchronization and frequent restarting of the galera cluster; a docker service conflict, which damages the /var/lib data of the cluster's original docker service and causes large-scale error reporting in the cluster; an influxdb service conflict, which makes the writing of cluster monitoring data abnormal; a telegraf service conflict, which makes the execution of monitoring scripts abnormal; and the like), and the repair stage is entered. The logs of the abnormal services in the cluster and the error reporting information of K8S underlying service abnormalities are sorted to form targetData. The AI platform bottom layer generates a container, evaluation is performed through the LogME pre-training model, the fault model with the highest score is selected, and conflicting-service recovery is performed by the fault model generated in step (4); the corresponding repair method is thus selected according to the comparison of the score with the threshold requirement.
(4) The fault models classify, in advance, the data sets of fault logs and of K8S underlying service exception error reporting information according to the different automatic repair methods for service conflict faults (for example, if a docker service conflict is identified, the docker service installed by the user is stopped, the docker process that the platform bottom layer depends on is restarted, and the backup data of docker is restored after the service returns to normal; if a mariadb service conflict is identified from the error reporting log information, the node with the latest data is determined according to the maximum value of seqno, the mysql container is stopped, the galera.cache and data.dat files are removed, and a new galera cluster is generated for the AI platform after the config.json file of the node is modified). The training model is used to generate the corresponding fault model (the training models used when generating the model library already exist, and training is carried out on the classified fault data sets); the trained result is stored into the pre-training model library, the loss function is DIoU, and when the pre-training model library is generated, a penalty value is added for the normalized difference between the center values of the prediction box and the detection box, so that the training convergence speed is accelerated.
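For the mariadb branch mentioned in this step, a rough sketch of the recovery actions described (determine the node with the largest seqno, stop the mysql container, remove the stale galera files, then regenerate the cluster) might look like the following; the container name, file paths and bootstrap step are illustrative assumptions.

    import subprocess

    def repair_mariadb_conflict(node_seqnos, galera_dir="/var/lib/mysql"):
        """Sketch of the mariadb/galera recovery described in step (4); names and paths are illustrative."""
        # Determine the node holding the latest data from the maximum seqno value.
        bootstrap_node = max(node_seqnos, key=node_seqnos.get)

        # Stop the mysql container and remove the stale galera cache/state files.
        subprocess.run(["docker", "stop", "mysql"], check=False)
        subprocess.run(["rm", "-f", f"{galera_dir}/galera.cache", f"{galera_dir}/data.dat"], check=False)

        # After the node's config.json has been adjusted, a new galera cluster is
        # generated for the AI platform starting from the bootstrap node.
        return bootstrap_node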
Embodiments of the present application will be described below with reference to the accompanying drawings. FIG. 3 is a flow chart of the handling of a service exception according to an embodiment of the application, the flow comprising:
S302, detecting whether the service is abnormal; namely, by starting the checkservice service, whether the micro-service components of each node of the AI platform and the K8S bottom-layer pod service state are normal or not is detected, and when abnormality is detected, step S304 is entered.
S304, judging whether non-platform services exist; after detecting that the service has an abnormality in step S302, firstly, acquiring service information of each node and whether the number of services meets the cluster requirement, comparing the service information with a platform service list, and judging whether non-platform service exists.
S306, when the judgment result of the step S304 is yes, that is, when a non-platform service abnormality (corresponding to the service conflict of the target type) exists, performing service conflict detection; i.e. when there is a non-platform service abnormality, step S310 is entered, i.e. a service detection and rejection training processing scheme is entered.
S308, when the judgment result of step S304 is no, that is, when no non-platform service abnormality exists and the service abnormality is an intra-platform service abnormality, the intra-platform service repair scheme is entered, followed by step S312. The embodiment of the application mainly addresses the solution of service conflicts, i.e. the dotted-line box part in fig. 3.
S310, pre-training rejection repair: for example, through evaluation by the LogME pre-training model (corresponding to the pre-training model), the fault model with the highest score is selected, and the repair method (corresponding to the target processing operation) is selected according to the comparison between the score and the threshold requirement, so as to execute rejection training and platform recovery after the service conflict (a minimal sketch of this S302–S310 decision flow is given after step S314).
S312, returning a cluster repair result.
S314, ending.
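The S302–S310 decision flow above (check the service states of each node, compare against the platform service list, then branch between the in-platform repair scheme and the service-conflict rejection-training scheme) can be pictured with the minimal sketch below. The service names, states and function names are assumptions for illustration, not the platform's actual checkservice implementation.

# Minimal sketch of the FIG. 3 decision flow: node service check, platform-list
# comparison, and branching. Service names and states are made-up examples.
PLATFORM_SERVICES = {"kubelet", "etcd", "mariadb", "influxdb", "telegraf"}  # assumed platform service list

def collect_node_services(node: str) -> dict:
    # Placeholder standing in for the checkservice component: {service_name: state}.
    return {"kubelet": "running", "mariadb": "running", "user-docker": "failed"}

def classify_anomaly(node: str) -> str:
    services = collect_node_services(node)
    abnormal = [name for name, state in services.items() if state != "running"]
    if not abnormal:
        return "healthy"                      # S302: no abnormality detected
    non_platform = [s for s in abnormal if s not in PLATFORM_SERVICES]
    if non_platform:
        return "service_conflict"             # S304 yes -> S306/S310: rejection-training scheme
    return "in_platform_anomaly"              # S304 no -> S308: in-platform repair scheme

print(classify_anomaly("node-1"))             # -> service_conflict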
FIG. 4 is a flowchart of a service conflict detection and recovery method according to an embodiment of the present application, the flowchart comprising:
S402, starting the checkservice service, and checking the micro-service components of each node of the AI platform and the K8S bottom-layer pod service states (corresponding to the target state information).
S404, judging whether the cluster has service conflict.
S406, when the judgment result of the step S404 is yes, collecting abnormal information (corresponding to the target log information) to form a data set;
that is, when the timing task detects errors caused by a platform service conflict, the logs of the abnormal services and the K8S bottom-layer service error information form targetData (corresponding to the data packet set).
If the determination result in step S404 is no, the routine returns to step S402 to continue checking the service status.
S408, pre-training model selection.
FIG. 5 is a schematic diagram of a pre-training model evaluation process according to an embodiment of the present application, in which a pre-training model, such as Model3 in FIG. 5 (corresponding to repair scheme identification 3), is selected based on the abnormal-information data set.
The AI platform bottom layer generates a container pod; a pre-training model evaluation entry exists in the container yaml configuration file. When abnormal information triggers model evaluation, the current node automatically executes a kubectl create command to create a pre-training pod based on the LogME-applicable image, the targetData file information is brought into the container, and model evaluation is performed (with a command such as python -u /logme/train_sample/logme-cnn_lm_v1.01/scripts/lm.py -model vgg16 -feature 20 -step 1000 -batch_size 256 -data_dir=/targetData) to obtain model_check_res.
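The pod-based evaluation step just described can be sketched as follows: write a pod spec, create it with kubectl, and let the container run the evaluation command. The image name, pod spec fields, flag names and host path are assumptions for illustration, not the platform's actual configuration.

# Minimal sketch of launching the LogME evaluation pod with kubectl. All names marked
# "assumed" are illustrative, not real platform values.
import subprocess, tempfile, textwrap

POD_YAML = textwrap.dedent("""\
    apiVersion: v1
    kind: Pod
    metadata:
      name: logme-eval
    spec:
      restartPolicy: Never
      containers:
      - name: logme
        image: logme-eval:latest            # assumed LogME-applicable image
        command: ["python", "-u", "/logme/train_sample/logme-cnn_lm_v1.01/scripts/lm.py",
                  "-model", "vgg16", "-feature", "20", "-step", "1000",
                  "-batch_size", "256", "-data_dir=/targetData"]
        volumeMounts:
        - {name: target-data, mountPath: /targetData}
      volumes:
      - name: target-data
        hostPath: {path: /var/log/targetData}   # assumed host path of the collected data set
    """)

def launch_evaluation_pod() -> None:
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(POD_YAML)
        spec_path = f.name
    # Corresponds to the "kubectl create" step executed by the node when evaluation is triggered.
    subprocess.run(["kubectl", "create", "-f", spec_path], check=True)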
S410, judging whether the fault model scores meet the threshold requirement.
S412, if the determination result in step S410 is yes, the service conflict is repaired and the cluster is restored.
For example, through LogME pre-training model evaluation, the fault model with the highest score is selected, and the corresponding repair method is selected according to the comparison between the score and the threshold requirement, so as to execute rejection training and platform recovery after the service conflict.
S414, if the determination result in step S410 is no, the targetData set is stored, the repair method is determined, and a new model is then trained.
In the above steps S410-S414, during model training, the feature points in the targetData set are extracted (the number of feature points can be customized; the larger the value, the more accurate the result) and compared with the values in the model library to obtain the feature-point variance:
S² = (1/n) · Σ_{i=1}^{n} (x_i − μ_i)²
where S² is the feature-point variance, μ_i is the value in the model library identified by model X for feature point i, x_i is the corresponding value generated from the abnormal-information data set, and n is the number of feature points taken.
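A worked example of this variance, assuming three feature-point values extracted from targetData and the stored values of one library model (the numbers are made up for illustration):

def feature_point_variance(target_values, model_values):
    """S^2 = (1/n) * sum((x_i - mu_i)^2) over the n selected feature points."""
    assert len(target_values) == len(model_values)
    n = len(target_values)
    return sum((x - mu) ** 2 for x, mu in zip(target_values, model_values)) / n

# e.g. three feature points from the anomaly data set vs. the library model's values
print(feature_point_variance([0.92, 0.10, 0.33], [0.90, 0.05, 0.40]))  # ≈ 0.0026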
Note that: a. Command parameter interpretation: after the pre-training container is started, the python command is executed in the container to run the training task. The LM model framework is imported, with the number of feature points 20, the number of running steps 1000 (the greater the number of steps, the longer the running time and the more fully the model is trained), the batch size 256 (customized based on node memory and machine performance), and the data set path /targetData.
b. The feature points are feature-point sequences summarized and sorted in advance from the abnormal data sets produced when services in the cluster conflict; the sequences are defined and prioritized according to the generality and the defect severity of the abnormal information. According to the number n of feature points defined by the user, the feature-point parameters in the first n positions of the sequence are selected for data-set training.
For example, a feature point may be a string appearing in the log, such as "jdbc.admission.jdb4", "try restarting transaction" or "level=error msg='failed to open gcomm backend connection'". Based on the different feature points in the data set, when operation and maintenance personnel later obtain a new error log, they process it into feature points during the development of the platform project and supplement them to the data-set model for comparison and identification.
The variance is compared with the repair threshold of the repair scheme identified by the current model; when the variance meets the requirement, the corresponding repair scheme is obtained and the repair operation is executed. The repair threshold can be adjusted freely according to the cluster state, so as to tune the automatic-triggering frequency and the accuracy.
If even the optimal variance score does not meet the repair threshold, a matching repair operation cannot be obtained; in that case the current error-information data set is stored, and after the operation and maintenance personnel find a manual solution, model information is trained from the error information again and entered into the model library, and the corresponding repair scheme is updated. This makes it convenient to automatically detect and recover from service-conflict abnormalities in more scenarios later.
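The threshold comparison and fallback just described can be sketched as below: score each library fault model by feature-point variance, return the best model's repair scheme if it satisfies that model's repair threshold, and otherwise persist the data set for manual analysis. The library field names and the storage path are assumptions for illustration.

import json

def feature_variance(target_values, model_values):
    # S^2 = (1/n) * sum((x_i - mu_i)^2), as in the formula above
    n = len(target_values)
    return sum((x - mu) ** 2 for x, mu in zip(target_values, model_values)) / n

def select_repair_scheme(target_values, model_library, fallback_path="/tmp/unmatched_targetData.json"):
    # model_library: list of dicts with "id", "values", "threshold" and "repair_scheme" (assumed layout)
    best = min(model_library, key=lambda m: feature_variance(target_values, m["values"]))
    if feature_variance(target_values, best["values"]) <= best["threshold"]:
        return best["repair_scheme"]                     # S412: repair and restore the cluster
    with open(fallback_path, "w") as f:                  # S414: store the data set for manual handling
        json.dump({"target_values": target_values}, f)
    return None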
Optionally, when the check result (such as the model_check_res described above) is obtained in the algorithm, weights may be assigned to the customized feature points according to the log level. Assuming there are 20 feature points in total, the matrix is [c_1, c_2, ..., c_n], and the weight values of the 20 feature points add up to 1. When model training is performed, a command such as python -u /logme/train_sample/logme-cnn_lm_v1.01/scripts/lm.py -model vgg16 -feature 20 -c c_x -step 1000 -batch_size 256 -data_dir=/targetData is used: the feature-point matrix value is added as a new parameter, the configured weight information is acquired automatically, and after weight analysis, model_check_res compares all feature-point return values with each model's standard value model_standard_value to obtain the optimal model match and thus the platform repair mode. For example: model_check_res = c_1_check_res×10% + c_2_check_res×5% + c_3_check_res×3% + ... + c_n_check_res×2%.
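A minimal sketch of this weighted aggregation, assuming made-up per-feature-point check values, weights that sum to 1, and a small set of model standard values:

def weighted_check_res(check_values, weights):
    assert abs(sum(weights) - 1.0) < 1e-9, "feature-point weights must sum to 1"
    return sum(v * w for v, w in zip(check_values, weights))

def best_model(check_values, weights, model_standard_values):
    """model_standard_values: {model_id: standard value}; returns the closest model id."""
    score = weighted_check_res(check_values, weights)
    return min(model_standard_values, key=lambda m: abs(model_standard_values[m] - score))

# e.g. three feature points with weights 0.5 / 0.3 / 0.2
print(best_model([0.9, 0.4, 0.1], [0.5, 0.3, 0.2], {"Model1": 0.2, "Model3": 0.6}))  # -> Model3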
In the above embodiment, an AI platform cluster fault recovery method based on bottom-layer service rejection training is provided: the container logs and service fault information of AI platform clusters under various service-conflict conditions are classified by repair scheme and used to train fault models; when a service conflict problem occurs in a cluster, the fault logs are quickly located and classified by the LogME pre-training model evaluation method, and a repair scheme is designated to recover the conflicting service. This saves cluster recovery time and improves accuracy.
In the related art, cluster processing schemes mainly rely on manual repair of specific scenarios; the applicability range is small and the coverage is incomplete, so they cannot be used across multiple scenarios and cannot handle the generality of service-conflict problems. Moreover, because there are many platform services, the repair methods for the fault logs caused by the various conflicting services (such as docker abnormality, failure to synchronize data, and the like) all differ, so the schemes in the related art locate and repair problems slowly. According to the embodiment of the application, for the fault data set obtained when a service conflict occurs, the most suitable pre-trained fault model is selected from the pre-training model library to execute the corresponding rejection-recovery operation. After pre-training model evaluation, more scenarios of AI platform service-conflict problems can be covered by automatic detection and repair. Compared with an ordinary training-model method, the evaluation method occupies fewer node resources (after the abnormality is identified, a pod that occupies little memory is started to run the training evaluation, and the pod is deleted after the repair is completed) and greatly shortens the training time. This helps the AI platform resume normal work and avoids large losses. The effect of classifying the service-conflict abnormality and designating the repair mode for repair can thus be achieved.
The relevant terms appearing in the embodiments of the application are described as follows:
MySQL Galera: cluster software in which galera is installed on a specific version of MySQL; multiple MySQL instances form a cluster, every MySQL instance is a master node, and data is synchronized among them;
influxdb: an open-source distributed time-series, event and metric database written in the Go language, with no external dependencies. It is mainly used to store large amounts of timestamped data, and is used herein to illustrate abnormal service conditions;
telegraf: an open-source, plugin-based metric collection tool, mainly used for performance monitoring;
K8S: (Kubernetes) used for managing containerized applications on multiple hosts in a cloud platform;
pod: the minimum unit managed by K8S, consisting of a combination of one or more containers;
LogME: a general, fast and accurate pre-trained model evaluation method.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the embodiments of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
In this embodiment, there is further provided a device for processing service conflict, and fig. 6 is a block diagram of a device for processing service conflict according to an embodiment of the present application, as shown in fig. 6, where the device includes:
an obtaining module 602, configured to obtain target state information of a node to be detected, where the target state information is used to represent state information of service operation of the node to be detected;
a first determining module 604, configured to determine, according to a pre-training model, a target processing operation corresponding to a service conflict of a target type when it is determined that the node to be detected has the service conflict of the target type according to the target state information, where the target processing operation is used to repair the service conflict of the target type, and the pre-training model includes a set of fault models, where each fault model in the set of fault models is a model obtained by training a set of data packets generated by occurrence of a corresponding service conflict of one type;
and the processing module 606 is configured to execute the target processing operation to repair the service conflict of the target type if the target processing operation is determined.
In an alternative embodiment, the apparatus further comprises: the first judging module is used for judging whether the node to be detected has service conflict or not according to the target state information after the target state information of the node to be detected is acquired; the second judging module is used for judging whether the service conflict is generated by the pre-deployed service or not under the condition that the service conflict exists in the node to be detected; and the second determining module is used for determining that the node to be detected has the service conflict of the target type under the condition that the service conflict is not the service conflict generated by the pre-deployed service.
In an alternative embodiment, the acquiring module 602 includes: the acquisition sub-module is used for acquiring first state information of the operation of the group of micro service components of the node to be detected and second state information of the bottom container service of the node to be detected, wherein the target state information comprises the first state information and the second state information.
In an alternative embodiment, the apparatus further comprises: a third determining module, configured to determine, before determining, according to a pre-training model, that a target processing operation corresponding to a service conflict of the target type, in a case where the first state information indicates that interaction of the set of micro service components is abnormal, that the node to be detected has a service conflict of a first type; a fourth determining module, configured to determine that, when the second state information indicates that the container corresponding to the underlying container service is not in an operating state, the node to be detected has a second type of service conflict; wherein the target type of service conflict comprises the first type of service conflict and the second type of service conflict.
In an alternative embodiment, the first determining module 604 includes: the acquisition sub-module is used for acquiring target log information of the node to be detected, wherein the target log information comprises service corresponding to a group of micro service components of the node to be detected and/or log information generated when the underlying container service of the node to be detected generates service conflict of the target type; and the detection sub-module is used for detecting the target log information by utilizing the pre-training model to obtain the identification of the target processing operation, wherein the identification of the target processing operation is used for indicating the node to be detected to execute the target processing operation indicated by the identification of the target processing operation so as to repair the service conflict of the target type.
In an alternative embodiment, the detection submodule includes: the extraction unit is used for extracting a group of target characteristic values of the target log information by utilizing the pre-training model, wherein the group of target characteristic values comprise characteristic values of a group of parameters in the target log information; the comparison unit is used for comparing the set of target characteristic values with a set of preset characteristic values in the pre-training model to obtain a comparison result, wherein the set of preset characteristic values represent preset characteristic values corresponding to the set of parameters in the pre-training model; and the determining unit is used for determining the identification of the target processing operation based on the comparison result.
In an alternative embodiment, the comparing unit includes: a first determining subunit, configured to determine variance values between the set of target feature values and N sets of preset feature values, to obtain N variance values, where the N sets of preset feature values are preset feature values corresponding to N fault models in the pre-training model, each fault model in the N fault models corresponds to a set of preset feature values, each variance value in the N variance values is a variance value between the set of target feature values and a set of preset feature values in the N sets of preset feature values, and the comparison result includes the N variance values, where N is a positive integer greater than or equal to 1; the above-mentioned determination unit includes: and the second determining subunit is configured to determine, as the target processing operation identifier, a processing operation identifier corresponding to an i-th preset repair threshold value, where the i-th preset repair threshold value is a repair threshold value corresponding to an i-th fault model in the N fault models, and i is a positive integer less than or equal to N, where the i-th variance value is less than or equal to the i-th preset repair threshold value in the N variance values.
In an alternative embodiment, the apparatus further comprises: and the training module is used for training the pre-training model by using the target log information under the condition that each variance value in the N variance values is larger than a preset restoration threshold value corresponding to each variance value.
In an alternative embodiment, the apparatus further comprises: the obtaining module is used for obtaining an ith fault model in the group of fault models before determining a target processing operation corresponding to the service conflict of the target type according to a pre-training model, wherein the ith fault model corresponds to the service conflict of the ith type, the ith fault model is used for determining the processing operation corresponding to the service conflict of the ith type, and i is a positive integer greater than or equal to 1: acquiring an ith data packet set and an identification of an actual processing operation corresponding to the ith data packet set, wherein the ith data packet set comprises a sample log information set generated when the ith type of service conflict occurs; training an ith initial fault model by using the ith data packet set until a loss value between an identification of a prediction processing operation output by the ith initial fault model and an identification of an actual processing operation meets a preset convergence condition, ending training, and determining the ith initial fault model at the end of training as the ith fault model, wherein parameters in the ith initial fault model are adjusted under the condition that the loss value does not meet the convergence condition.
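The fault-model training loop described by the obtaining module (train the i-th initial fault model on the i-th data packet set until the loss between the predicted and actual processing-operation identifications meets the convergence condition, adjusting the parameters otherwise) can be pictured with the minimal sketch below. The network, optimizer and cross-entropy loss here are assumptions for illustration only; the embodiment itself describes a DIoU-style loss, which is not reproduced here.

# Minimal sketch of training one fault model until a convergence condition is met.
import torch
import torch.nn as nn

def train_fault_model(features: torch.Tensor, operation_ids: torch.Tensor,
                      num_operations: int, loss_eps: float = 1e-3, max_epochs: int = 1000):
    model = nn.Sequential(nn.Linear(features.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, num_operations))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = criterion(model(features), operation_ids)
        if loss.item() < loss_eps:          # convergence condition met: end training
            break
        loss.backward()                     # otherwise adjust the model parameters
        optimizer.step()
    return model                            # becomes the i-th fault model in the library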
It should be noted that each of the above units or modules may be implemented by software or hardware, and for the latter, may be implemented by, but not limited to: the units or modules are all located in the same processor; alternatively, each of the units or modules described above may be located in a different processor in any combination.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the embodiments of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than what is shown or described, or they may be separately fabricated into individual integrated circuit modules, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the embodiments of the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the embodiments of the present application should be included in the protection scope of the embodiments of the present application.

Claims (12)

1. A method for handling service conflicts, comprising:
acquiring target state information of a node to be detected, wherein the target state information is used for representing state information of service operation of the node to be detected;
determining a target processing operation corresponding to the target type of service conflict according to a pre-training model under the condition that the target type of service conflict exists in the node to be detected according to the target state information, wherein the target processing operation is used for repairing the target type of service conflict, the pre-training model comprises a group of fault models, and each fault model in the group of fault models is a model obtained by training a data packet set generated by occurrence of a corresponding type of service conflict;
And executing the target processing operation under the condition that the target processing operation is determined so as to repair the service conflict of the target type.
2. The method according to claim 1, wherein after obtaining the target state information of the node to be detected, the method further comprises:
judging whether the node to be detected has service conflict or not according to the target state information;
judging whether the service conflict is generated by pre-deployed service or not under the condition that the service conflict exists in the node to be detected;
and under the condition that the service conflict is not the service conflict generated by the pre-deployed service, determining that the target type of service conflict exists in the node to be detected.
3. The method according to claim 1, wherein the obtaining the target state information of the node to be detected includes:
and acquiring first state information of the operation of a group of micro service components of the node to be detected and second state information of the bottom container service of the node to be detected, wherein the target state information comprises the first state information and the second state information.
4. A method according to claim 3, wherein prior to determining a target processing operation corresponding to a service conflict of the target type according to a pre-trained model, the method further comprises:
determining that the node to be detected has a first type of service conflict under the condition that the first state information indicates abnormal interaction of the group of micro service components;
determining that the node to be detected has a second type of service conflict under the condition that the second state information indicates that the container corresponding to the bottom container service is not in an operation state; wherein the target type of service conflict comprises the first type of service conflict and the second type of service conflict.
5. The method of claim 1, wherein the determining, according to a pre-training model, a target processing operation corresponding to a service conflict of the target type comprises:
acquiring target log information of the node to be detected, wherein the target log information comprises the log information generated when service corresponding to a group of micro service components of the node to be detected and/or service conflict of the target type occurs to bottom container service of the node to be detected;
And detecting the target log information by using the pre-training model to obtain an identifier of the target processing operation, wherein the identifier of the target processing operation is used for indicating the node to be detected to execute the target processing operation indicated by the identifier of the target processing operation so as to repair the service conflict of the target type.
6. The method of claim 5, wherein detecting the target log information using the pre-training model to obtain the identification of the target processing operation comprises:
extracting a set of target feature values of the target log information by using the pre-training model, wherein the set of target feature values comprise feature values of a set of parameters in the target log information;
comparing the set of target characteristic values with a set of preset characteristic values in the pre-training model to obtain a comparison result, wherein the set of preset characteristic values represent preset characteristic values corresponding to the set of parameters in the pre-training model;
and determining the identification of the target processing operation based on the comparison result.
7. The method according to claim 6, wherein
Comparing the set of target feature values with a set of preset feature values in the pre-training model to obtain a comparison result, wherein the comparing comprises: determining variance values between the set of target feature values and N sets of preset feature values respectively to obtain N variance values, wherein the N sets of preset feature values are preset feature values corresponding to N fault models in the pre-training model, each fault model in the N fault models corresponds to a set of preset feature values respectively, each variance value in the N variance values is a variance value between the set of target feature values and a set of preset feature values in the N sets of preset feature values, the comparison result comprises the N variance values, and N is a positive integer greater than or equal to 1;
the determining, based on the comparison result, an identification of the target processing operation includes: and under the condition that an ith variance value in the N variance values is smaller than or equal to an ith preset repair threshold value, determining the identification of the processing operation corresponding to the ith preset repair threshold value as the identification of the target processing operation, wherein the ith preset repair threshold value is a repair threshold value corresponding to an ith fault model in the N fault models, and i is a positive integer smaller than or equal to N.
8. The method of claim 7, wherein the method further comprises:
and training the pre-training model by using the target log information under the condition that each variance value in the N variance values is larger than a preset restoration threshold value corresponding to each variance value.
9. The method according to any one of claims 1 to 8, wherein prior to determining a target processing operation corresponding to a service conflict of the target type according to a pre-trained model, the method further comprises:
obtaining an ith fault model in the group of fault models, wherein the ith fault model corresponds to an ith type of service conflict, and the ith fault model is used for determining a processing operation corresponding to the ith type of service conflict, and i is a positive integer greater than or equal to 1:
acquiring an ith data packet set and an identification of an actual processing operation corresponding to the ith data packet set, wherein the ith data packet set comprises a sample log information set generated when the ith type of service conflict occurs;
training an ith initial fault model by using the ith data packet set until a loss value between an identification of a prediction processing operation output by the ith initial fault model and an identification of an actual processing operation meets a preset convergence condition, ending training, and determining the ith initial fault model at the end of training as the ith fault model, wherein parameters in the ith initial fault model are adjusted under the condition that the loss value does not meet the convergence condition.
10. A service conflict handling apparatus, comprising:
the system comprises an acquisition module, a detection module and a detection module, wherein the acquisition module is used for acquiring target state information of a node to be detected, and the target state information is used for representing the state information of service operation of the node to be detected;
the first determining module is used for determining a target processing operation corresponding to the service conflict of the target type according to a pre-training model under the condition that the service conflict of the target type exists in the node to be detected according to the target state information, wherein the target processing operation is used for repairing the service conflict of the target type, the pre-training model comprises a group of fault models, and each fault model in the group of fault models is a model obtained by training a data packet set generated by occurrence of the corresponding service conflict of one type;
and the processing module is used for executing the target processing operation under the condition that the target processing operation is determined so as to repair the service conflict of the target type.
11. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when being executed by a processor, implements the steps of the method according to any of the claims 1 to 9.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method as claimed in any one of claims 1 to 9 when the computer program is executed.
CN202310868738.9A 2023-07-14 2023-07-14 Service conflict processing method and device, storage medium and electronic device Pending CN116723085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310868738.9A CN116723085A (en) 2023-07-14 2023-07-14 Service conflict processing method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310868738.9A CN116723085A (en) 2023-07-14 2023-07-14 Service conflict processing method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN116723085A true CN116723085A (en) 2023-09-08

Family

ID=87864616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310868738.9A Pending CN116723085A (en) 2023-07-14 2023-07-14 Service conflict processing method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN116723085A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116980279A (en) * 2023-09-25 2023-10-31 之江实验室 Fault diagnosis system and fault diagnosis method for programmable network element equipment
CN116980279B (en) * 2023-09-25 2023-12-12 之江实验室 Fault diagnosis system and fault diagnosis method for programmable network element equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination