CN116361104A - Big data-based application fault prediction method, device, equipment and storage medium - Google Patents

Big data-based application fault prediction method, device, equipment and storage medium

Info

Publication number
CN116361104A
Authority
CN
China
Prior art keywords
data
application
historical
fault
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310130883.7A
Other languages
Chinese (zh)
Inventor
汤文鹏
朱桂林
王青召
翟钧
苏琳珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Changan New Energy Automobile Technology Co Ltd
Original Assignee
Chongqing Changan New Energy Automobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Changan New Energy Automobile Technology Co Ltd
Priority to CN202310130883.7A
Publication of CN116361104A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Environmental & Geological Engineering (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a big data-based application fault prediction method, device, equipment and storage medium. Historical log data of an application to be predicted are collected at a plurality of historical moments and labeled, and the labeled historical log data are used as a training set to train a preset basic prediction model, yielding an application fault prediction model that predicts fault prediction results of the application to be predicted in a future time period. Current log data of the application to be predicted are then input into the application fault prediction model to obtain the fault prediction result. In this way application faults can be predicted proactively: the scheme provides an active defense for application fault prediction, turns passive operation and maintenance into active operation and maintenance, prevents faults before they occur, and improves operation and maintenance efficiency.

Description

Big data-based application fault prediction method, device, equipment and storage medium
Technical Field
The application relates to the technical field of intelligent operation and maintenance, in particular to an application fault prediction method, device and equipment based on big data and a storage medium.
Background
The development of operation and maintenance can be divided into four stages: manual operation and maintenance, operation and maintenance with scripts and open-source tools, automated operation and maintenance, and intelligent operation and maintenance. Most companies are currently in the second or third stage, both of which are passive operation and maintenance.
In the prior art, Document 1 (CN 108123834 A) describes a log analysis system based on a big data platform. It proposes analyzing network data packets in real time, matching data features against a network data protocol feature library, sending the network log data confirmed as abnormal by the matching to the big data platform for storage, performing cluster analysis and classification training, and dynamically updating the network data protocol feature library. However, Document 1 only provides a scheme for collecting and analyzing abnormal logs to dynamically update the network data protocol feature library, and only judges whether the current log is abnormal by comparison with individual abnormal cases. In the field of application fault prediction, taking precautions before a fault occurs is far more important than judging whether a fault has occurred.
At present, the related art can only raise an alarm once an application fault has occurred. It remains at the level of passive operation and maintenance and lacks an intelligent operation and maintenance scheme for predicting application faults.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present application provides an application fault prediction method, apparatus, device and storage medium based on big data, so as to solve the above-mentioned technical problem that the related art lacks an intelligent operation and maintenance scheme for predicting an application fault.
The application provides an application fault prediction method based on big data, which comprises the following steps: acquiring current log data of an application to be predicted; inputting the current log data into an application fault prediction model to obtain a fault prediction result of the application to be predicted; the training mode of the application fault prediction model comprises the steps of collecting historical log data of a plurality of historical moments of the application to be predicted, and marking to obtain marking results of the historical log data, wherein at least one marking result of the historical log data is a fault, the marked historical log data is used as a training data set, and the training data set is used for training a preset basic prediction model to obtain the application fault prediction model for predicting the fault prediction result of the application to be predicted in a future time period.
In an embodiment of the present application, collecting and labeling the historical log data of the application to be predicted at the plurality of historical moments to obtain a labeling result of each piece of historical log data includes: collecting historical operation data of the application to be predicted at a plurality of historical moments, and labeling the abnormal state of the historical operation data according to a preset labeling standard to obtain labeling results of the historical operation data; and collecting a plurality of pieces of historical fault data at a plurality of historical moments within a fault period of the application to be predicted, wherein the fault period comprises the fault moment, a first preset time period before the fault moment and a second preset time period after the fault moment, the historical fault data of the fault moment are labeled as a fault, and the historical fault data of the remaining historical moments are labeled as normal, so as to obtain labeling results of the historical fault data, wherein the historical log data comprise the historical operation data and the historical fault data.
In an embodiment of the present application, collecting the historical operation data of the application to be predicted at the plurality of historical moments includes: reading, at preset time intervals, process-related data of a plurality of processes of the application to be predicted, memory operation data of a host, disk operation data of a server and usage data of preset components in the server, so as to obtain the historical operation data of the plurality of historical moments.
In an embodiment of the present application, labeling the abnormal state of the historical operation data according to a preset labeling standard to obtain a labeling result of each piece of historical operation data includes: if the preset labeling standard is met, labeling the abnormal state of the historical operation data as a fault, and if the preset labeling standard is not met, labeling the abnormal state of the historical operation data as normal, so as to obtain labeling results of the historical operation data; the preset labeling standard comprises at least one of: a central processing unit utilization rate greater than a first preset threshold, a memory occupation greater than a second preset threshold, a single garbage collection frequency greater than a third preset threshold, a single garbage collection time greater than a first preset duration, a garbage collection cycle frequency greater than a fourth preset threshold, and a disk input/output utilization rate greater than a preset utilization rate; and the historical operation data comprise at least one of the central processing unit utilization rate, the memory occupation, the single garbage collection frequency, the single garbage collection time, the garbage collection cycle frequency and the disk input/output utilization rate.
In an embodiment of the present application, collecting the plurality of pieces of historical fault data at the plurality of historical moments within the fault period of the application to be predicted includes: collecting historical fault data at the fault moment when the application to be predicted fails, collecting historical fault data at a plurality of historical moments in the first preset time period before the fault moment, and collecting historical fault data at a plurality of historical moments in the second preset time period after the fault moment, wherein the fault comprises at least one of application running, process blocking and process absence, and the historical fault data comprise at least one of central processing unit utilization rate, memory occupation, disk input/output utilization rate, single garbage collection data and garbage collection cycle data.
In an embodiment of the present application, training the preset basic prediction model through the training data set includes: predicting each history log data in the training data set through the preset basic prediction model to obtain an initial prediction result, wherein the initial prediction result comprises at least one initial fault result at a prediction moment, and the prediction moment is later than the history moment of the history log data; determining a target loss function according to the labeling result of each history log data and the initial prediction result; and training the preset basic prediction model according to the target loss function.
In an embodiment of the present application, the initial prediction result includes at least two initial failure results at prediction moments, and the application failure prediction model is used for predicting failure prediction results of at least two future moments of the application to be predicted in a future time period.
The application also provides an application fault prediction device based on big data, which comprises: the acquisition module is used for acquiring current log data of the application to be predicted; the result prediction module is used for inputting the current log data into an application fault prediction model to obtain a fault prediction result of the application to be predicted; the model training module is used for collecting the historical log data of the plurality of historical moments of the application to be predicted, and labeling the historical log data to obtain labeling results of the historical log data, wherein the labeling result of at least one of the historical log data is a fault, the labeled historical log data is used as a training data set, and the training data set is used for training a preset basic prediction model to obtain an application fault prediction model for predicting the fault prediction result of the application to be predicted in a future time period.
The application also provides an application fault prediction device based on big data, the device comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the apparatus to implement the big data based application failure prediction method as claimed in any of the preceding claims.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the big data based application failure prediction method as set forth in any of the above.
As described above, the application fault prediction method, apparatus, device and storage medium based on big data have the following beneficial effects:
Historical log data of the application to be predicted are collected at a plurality of historical moments and labeled, and the labeled historical log data are used as a training set to train a preset basic prediction model, yielding an application fault prediction model for predicting fault prediction results of the application to be predicted in a future time period. The acquired current log data of the application to be predicted are input into the application fault prediction model to obtain a fault prediction result, so that application faults can be predicted proactively.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 is a flow chart of a big data based application failure prediction method according to an embodiment of the present application;
FIG. 2 is a schematic image of a sigmoid function according to an embodiment of the present application;
FIG. 3 is a graphical representation of a log (x) function provided by an embodiment of the present application;
FIG. 4 is a graphical representation of a log (1-x) function provided by an embodiment of the present application;
FIG. 5 is a flowchart of a big data based application failure prediction method according to another embodiment of the present application;
FIG. 6 is a schematic hardware structure of a big data based application failure prediction device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a hardware architecture of a big data based application failure prediction device suitable for implementing one or more embodiments of the present application.
Detailed Description
Further advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure in the present specification, by describing embodiments of the present application with reference to the accompanying drawings and preferred examples. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation to the scope of the present application.
It should be noted that, the illustrations provided in the following embodiments merely illustrate the basic concepts of the application by way of illustration, and only the components related to the application are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complex.
In the present application, "and/or" describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
The term "plurality" as used herein refers to two or more.
In the description of this application, the words "first," "second," and the like are used solely for the purpose of distinguishing between descriptions and not necessarily for the purpose of indicating or implying a relative importance or order.
In addition, in the embodiments of the present application, the term "exemplary" is used to mean serving as an example, instance, or illustration. Any embodiment or implementation described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or implementations. Rather, use of the term "exemplary" is intended to present concepts in a concrete fashion.
In the following description, numerous details are set forth to provide a more thorough explanation of embodiments of the present application. However, it will be apparent to one skilled in the art that embodiments of the present application may be practiced without these specific details. In other embodiments, well-known structures and devices are shown in block diagram form rather than in detail, in order to avoid obscuring the embodiments of the present application.
Fig. 1 shows a flowchart of an application failure prediction method based on big data according to an embodiment of the present application. Specifically, in an exemplary embodiment, as shown in fig. 1, the present embodiment provides an application failure prediction method based on big data, the method including the steps of:
Step S110: acquire current log data of an application to be predicted.
The application to be predicted may be an application preset by a person skilled in the art, and the current log data may be data generated by running the application to be predicted on one or more hardware devices.
It should be noted that the current log data are the data for the moment at which a prediction is needed for the application to be predicted, and are not necessarily data collected in real time.
Step S120: input the current log data into an application fault prediction model to obtain a fault prediction result of the application to be predicted.
The application fault prediction model is trained as follows:
collecting historical log data of a plurality of historical moments of an application to be predicted, and marking to obtain marking results of the historical log data, wherein the marking result of at least one historical log data is a fault;
and training the preset basic prediction model through the training data set by taking the marked historical log data as the training data set to obtain an application fault prediction model for predicting a fault prediction result of the application to be predicted in a future time period.
It can be seen that this embodiment provides a big data-based scheme for predicting application faults, which predicts whether the application will fail at some future point in time based on analysis of server and application logs. Historical log data collected through a big data platform, such as server logs (CPU, memory and disk usage), application log information (GC frequency and time consumption) and historical fault information, are used to train a model. After training, the log information for the period of interest (a certain day or a certain time period), namely the current log data, is input into the application fault prediction model to predict whether the application to be predicted is likely to fail. The method predicts application faults proactively, provides an active defense scheme for application fault prediction, turns passive operation and maintenance into active operation and maintenance, prevents faults before they occur, and improves operation and maintenance efficiency. By combining big data analysis with machine learning, fault points are located quickly, the state of the system and possible faults are predicted, and the traditional operation and maintenance approach is changed.
In an embodiment of the present application, collecting and annotating history log data of a plurality of history moments of an application to be predicted, to obtain an annotation result of each history log data, including:
collecting historical operation data of a plurality of historical moments of the application to be predicted, and marking the abnormal state of the historical operation data according to a preset marking standard to obtain marking results of the historical operation data;
collecting a plurality of historical fault data of a plurality of historical moments in a to-be-predicted application fault period, wherein the fault period comprises a fault moment, a first preset time period before the fault moment and a second preset time period after the fault moment, marking the historical fault data of the fault moment as a fault, marking the historical fault data of the rest historical moments as normal, and obtaining marking results of the historical fault data, wherein the historical log data comprises historical operation data and historical fault data.
For example, the CPU (central processing unit) utilization rate, the memory occupation, the JVM GC (garbage collection on the Java virtual machine) frequency and the like of each node of the application system can be collected as the historical operation data. In this embodiment, collecting historical operation data of the application to be predicted at a plurality of historical moments includes:
and reading process related data of a plurality of processes of the application to be predicted, memory operation data of a host, disk operation data of a server and use condition data of preset components in the server at intervals of preset time to obtain historical operation data of a plurality of historical moments.
In one example, the process-related data of the plurality of processes are collected by reading, once every 30 seconds, the process-related information provided by the /proc interface of the Linux system (30 seconds is only an example; a person skilled in the art may set another interval, which is not detailed here). The relevant fields of the extracted information (the process-related data) are listed in Table 1 and mainly reflect the user of each process, the memory and video memory it occupies, the start and end times of the process, and the like. After the data are acquired, they are classified by process number and stored in a preset storage space, such as Elasticsearch. Specific examples of the process-related data (one or more groups of which may be selected as the process-related data) are as follows:
TABLE 1
Figure BDA0004083769470000081
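As a minimal sketch of the process-data collection described above, the following Python parses the text of a Linux /proc/<pid>/status file into a record. The choice of the status file and of the fields Name, Uid and VmRSS is an assumption made for illustration, since the fields of Table 1 are only available as an image; the function names and the record layout are likewise hypothetical.

```python
def parse_proc_status(text):
    """Parse the text of a /proc/<pid>/status file into a dict of raw strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not have the "Key: value" shape
            fields[key.strip()] = value.strip()
    return fields


def extract_process_record(pid, status_text, collected_at):
    """Build one record for a process (assumed fields, not those of Table 1)."""
    fields = parse_proc_status(status_text)
    uid_parts = fields.get("Uid", "").split()
    rss_parts = fields.get("VmRSS", "0 kB").split()
    return {
        "pid": pid,
        "name": fields.get("Name", ""),
        "uid": uid_parts[0] if uid_parts else "",  # real UID is the first column
        "rss_kb": int(rss_parts[0]),               # resident memory in kB
        "collected_at": collected_at,              # acquisition time, e.g. epoch seconds
    }


def read_process_record(pid, collected_at):
    """Read the live status file; only works on a Linux host."""
    with open(f"/proc/{pid}/status", encoding="utf-8") as handle:
        return extract_process_record(pid, handle.read(), collected_at)
```

A collector would call read_process_record for each process of the application once per sampling interval (30 seconds in the example above) and store the records grouped by process number.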
Elasticsearch supports two ways of inserting data: with a specified ID and with an automatically generated ID. A specified ID uses a PUT operation, and an automatically generated ID uses a POST operation. A mapping is generated automatically on POST: if the mapping for the posted parameters has not been strictly defined, a corresponding mapping is established automatically from the posted content. Data can be queried by specifying index information, by specifying document information, by retrieving all data under an index, by query-string search, or by a structured query combining one or more conditions. The match may be fuzzy, similar to LIKE in SQL, or exact. For example, a query for all people aged 22 whose name contains "ston" outputs the 10 items closest to the query, ordered from high to low by degree of match.
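The insertion and query modes just described can be sketched as plain request descriptions for the Elasticsearch REST API. This is an illustrative sketch only: the index name app-metrics and the helper names are hypothetical, nothing is sent over the network, and a full-text `match` query is used here as a stand-in for the fuzzy match (a `wildcard` query would be closer to SQL LIKE).

```python
import json

INDEX = "app-metrics"  # hypothetical index name, not taken from the source


def build_index_request(doc, doc_id=None):
    """Return (method, path, body) for indexing one document.

    PUT is used when the caller specifies the ID; POST lets Elasticsearch
    generate the ID automatically.
    """
    body = json.dumps(doc)
    if doc_id is not None:
        return "PUT", f"/{INDEX}/_doc/{doc_id}", body
    return "POST", f"/{INDEX}/_doc", body


def build_search_request(name_text, age, size=10):
    """Return (method, path, body) for a combined fuzzy + exact query.

    The bool query combines a full-text match on name with an exact term
    filter on age. Hits come back ordered by relevance score, highest first,
    which is the default when no explicit sort is given.
    """
    query = {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"name": name_text}}],
                "filter": [{"term": {"age": age}}],
            }
        },
    }
    return "POST", f"/{INDEX}/_search", json.dumps(query)
```

The returned tuples can be handed to any HTTP client; keeping request construction separate from transport makes the PUT versus POST distinction easy to check in isolation.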
In one example, the memory operation data of the host are collected by code that reads the content of the host's /proc/meminfo file once every 30 seconds; these data mainly reflect the memory operation of the host. After the data are acquired, they are stored by acquisition time in a preset storage space such as Elasticsearch. The specific parameters of the memory operation data of the host are listed in Table 2 below, and one or more groups of them may be selected as the memory operation data:
TABLE 2
Figure BDA0004083769470000082
Figure BDA0004083769470000091
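A minimal sketch of reading host memory data from /proc/meminfo might look as follows. The derived usage percentage, computed from the MemTotal and MemAvailable fields, is an assumption about how a memory occupation figure could be obtained; the parameters actually listed in Table 2 are only available as an image.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integers.

    Most values are in kB; a few (such as HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_usage_percent(info):
    """Share of memory in use, from MemTotal and MemAvailable (assumed fields)."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file; only works on a Linux host."""
    with open(path, encoding="utf-8") as handle:
        return parse_meminfo(handle.read())
```

Calling read_meminfo once per sampling interval and storing the result with its acquisition time gives one memory record per historical moment.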
In one example, the disk operation data of the server are collected by code that reads the content of the host's /proc/disks file once every 30 seconds. The fields of the extracted information differ according to the number of disks in the server and mainly reflect the disk operation of the server. After the data are acquired, they are stored by acquisition time in Elasticsearch (a real-time data storage engine).
In one example, the usage data of the preset components in the server are collected by code that obtains the JVM GC status of the application every 30 seconds; these data mainly reflect the usage of the JVM (a Java-related component) in the server. After the data are acquired, they are classified by process number and stored by acquisition time in Elasticsearch. The extracted fields are listed in Table 3, and one or more of them may be selected as the preset component usage data:
TABLE 3
Figure BDA0004083769470000101
In an embodiment, marking the abnormal state of the historical operation data according to a preset marking standard to obtain a marking result of each historical operation data, including:
if the preset labeling standard is met, labeling the abnormal state of the historical operation data as a fault, and if the preset labeling standard is not met, labeling the abnormal state of the historical operation data as normal, so as to obtain labeling results of the historical operation data;
the preset labeling standard comprises at least one of: a central processing unit utilization rate greater than a first preset threshold, a memory occupation greater than a second preset threshold, a single garbage collection frequency greater than a third preset threshold, a single garbage collection time greater than a first preset duration, a garbage collection cycle frequency greater than a fourth preset threshold, and a disk input/output utilization rate greater than a preset utilization rate; and the historical operation data comprise at least one of the central processing unit utilization rate, the memory occupation, the single garbage collection frequency, the single garbage collection time, the garbage collection cycle frequency and the disk input/output utilization rate.
For example, in Elasticsearch, data are labeled when the CPU occupation (central processing unit utilization rate) exceeds 80%, when the memory utilization (memory occupation) exceeds 80%, when the disk input/output utilization rate exceeds 80%, or when the Young GC frequency (single garbage collection frequency) exceeds once per minute, the GC time (single garbage collection time) exceeds 200 ms, or the Full GC frequency (garbage collection cycle frequency) exceeds once per day.
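The threshold-based labeling can be sketched as a small rule function. The threshold values follow the example figures in the text (80% for CPU, memory and disk I/O, once per minute for Young GC, 200 ms for GC time, once per day for Full GC); the field names and the dictionary layout of a sample are hypothetical.

```python
# Example thresholds from the text; the field names are assumptions.
THRESHOLDS = {
    "cpu_percent": 80.0,
    "mem_percent": 80.0,
    "disk_io_percent": 80.0,
    "young_gc_per_minute": 1.0,
    "gc_pause_ms": 200.0,
    "full_gc_per_day": 1.0,
}


def label_sample(sample, thresholds=THRESHOLDS):
    """Label one sample "fault" if any present metric exceeds its threshold.

    Metrics missing from the sample are ignored, which mirrors the
    "at least one of" wording of the labeling standard.
    """
    for field, limit in thresholds.items():
        value = sample.get(field)
        if value is not None and value > limit:
            return "fault"
    return "normal"
```

For instance, label_sample({"cpu_percent": 85.0}) yields "fault", while a sample sitting exactly at a threshold stays "normal" because the comparison is strictly greater than.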
In an embodiment, collecting a plurality of historical fault data at a plurality of historical moments within a fault time period of the application to be predicted includes:
collecting historical fault data at the fault moment when the application to be predicted fails, collecting historical fault data at a plurality of historical moments within a first preset time period before the fault moment, and collecting historical fault data at a plurality of historical moments within a second preset time period after the fault moment, wherein the fault comprises at least one of application running, process blocking, and process absence, and the historical fault data comprises at least one of central processing unit utilization rate, memory occupation, disk input/output utilization rate, single garbage collection data, and garbage collection cycle data.
The fault may arise naturally while the application is running or may be simulated manually; this is not limited here. For example, the CPU, memory, and disk IO usage data and the Young GC and Full GC log data from the five minutes before and after an application exception (application running, process stuck, JVM process not present, etc.) may be labeled.
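As a rough sketch of the window selection described above, the following Python function keeps only the samples that fall inside the fault time period. The labeling rule (fault moment labeled as fault, remaining moments labeled as normal) follows claim 2; the function name, its parameters, and the five-minute defaults are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def label_fault_window(timestamps, fault_time,
                       before=timedelta(minutes=5),
                       after=timedelta(minutes=5)):
    """Return (timestamp, label) pairs for samples inside the fault window.

    The sample taken at the fault moment is labeled 1 (fault); every other
    sample inside the window is labeled 0 (normal).
    """
    start, end = fault_time - before, fault_time + after
    return [
        (t, 1 if t == fault_time else 0)
        for t in timestamps
        if start <= t <= end
    ]
```

For instance, with samples taken every five minutes from ten minutes before to ten minutes after a fault, only the three samples within five minutes of the fault are kept, and only the middle one is labeled as a fault.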
In one embodiment, training the pre-set base prediction model with the training dataset includes:
predicting each history log data in the training data set through a preset basic prediction model to obtain an initial prediction result, wherein the initial prediction result comprises at least one initial fault result at a prediction moment, and the prediction moment is later than the history moment of the history log data;
determining a target loss function according to the labeling result and the initial prediction result of each history log data;
and training a preset basic prediction model according to the target loss function.
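One way to realize the requirement that the prediction moment be later than the historical moment is to pair the features observed at step t with the labels observed some steps later. The sketch below assumes equally spaced samples; `make_training_pairs` and its `horizons` parameter are hypothetical names, not terms from the application.

```python
def make_training_pairs(features, labels, horizons=(1,)):
    """Pair the features at step t with the labels at steps t + h.

    Each horizon h in `horizons` yields one target, so passing two or more
    horizons gives fault targets for several future moments.
    """
    max_h = max(horizons)
    pairs = []
    for t in range(len(features) - max_h):
        targets = tuple(labels[t + h] for h in horizons)
        pairs.append((features[t], targets))
    return pairs
```

With four samples and horizons `(1, 2)`, only the first two samples have both future labels available, so two training pairs result.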
In an embodiment, the initial prediction result comprises initial fault results for at least two prediction moments, and the application fault prediction model is used to predict fault prediction results for at least two future moments of the application to be predicted within the future time period. For example, logistic regression may be used: it is a generalized linear model that assumes the data obey a Bernoulli distribution and solves for its parameters by gradient descent in order to perform binary classification.
The Bernoulli distribution, also known as the two-point distribution or 0-1 distribution, is the simplest discrete probability distribution. Denoting the success probability as p (0 ≤ p ≤ 1) and the failure probability as q = 1 - p, the following holds:
$$P(x) = p^{x}(1-p)^{1-x} = \begin{cases} p, & x = 1 \\ q, & x = 0 \end{cases}$$
where P(x) is the probability, the positive class is 1, and the negative class is 0, so the class label clearly obeys a 0-1 distribution.
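The two-point distribution can be written directly as a small function; the name `bernoulli_pmf` is chosen here only for illustration.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """Probability of outcome x (1 = positive class, 0 = negative class)."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return p if x == 1 else 1.0 - p
```

The two probabilities always sum to 1, which is what allows a single model output to describe both the fault class and the normal class.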
An exemplary model training and prediction process is as follows:
Training: the model is trained on the training data; this is the learning process, in which the parameters of the model are determined.
Prediction: once training has fixed the model parameters, inputting the data to be predicted yields a result.
Consider ordinary linear regression, y = wx + b. Training the application fault prediction model on the training set yields the model parameters w and b, which determine a straight line or, when x is multidimensional, a hyperplane. Then, for a data point x in the test set, w and b have already been learned, so substituting x into y = wx + b gives a value of y, namely the predicted value.
The linear regression above yields y = wx + b, a real number whose range is (-∞, +∞). Such an unbounded value is not wanted here, so it is compressed into the interval (0, 1). Researchers have found that a sigmoid function achieves this, so y is passed through a sigmoid function.
An exemplary sigmoid function is
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$ formula (1)
The graph of the sigmoid function, shown in Fig. 2, illustrates this compression. Substituting y = wx + b into sigmoid(x) and again denoting the output of the function as y gives:
$$y = \frac{1}{1 + e^{-(wx + b)}}$$ formula (2)
Thus y takes a value in (0, 1), and formula (2) can be transformed as follows:
$$\ln\frac{y}{1 - y} = wx + b$$ formula (3)
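The compression step can be checked numerically. The sketch below applies the sigmoid to wx + b and then inverts it: since y = 1/(1 + e^{-(wx+b)}), the log-odds ln(y/(1-y)) recover wx + b. The function names and the sample weights are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(w, b, x) -> float:
    """Compress the linear output w.x + b into the interval (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def logit(y: float) -> float:
    """Inverse of the sigmoid: the log-odds ln(y / (1 - y))."""
    return math.log(y / (1.0 - y))
```

For example, with w = [0.5, -0.2], b = 0.1, and x = [2.0, 1.0], the linear output is 0.9, and applying `logit` to the predicted probability returns 0.9 up to rounding.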
The loss function measures the difference between the true value and the predicted value, so the smaller it is, the better; its minimum is 0. Take binary classification (0, 1) as an example. When the true value is 1, the loss should be 0 if the model predicts 1 and as large as possible if the model predicts 0. Similarly, when the true value is 0, the loss should be 0 if the model predicts 0 and largest if the model predicts 1. Minimizing the loss function therefore makes the model more accurate: the smaller the loss, the more accurate the prediction. An example loss function, in which x denotes the predicted output and y the true value, is:
$$\mathrm{cost}(x, y) = \begin{cases} -\log(x), & y = 1 \\ -\log(1 - x), & y = 0 \end{cases}$$ formula (4)
For the graph of -log(x), see Fig. 3; for the graph of -log(1-x), see Fig. 4. After compression by the sigmoid function, the predicted value lies between 0 and 1. Reducing this loss as far as possible therefore gives a good result.
These two losses are combined:
$$-\left[\, y\log(x) + (1 - y)\log(1 - x) \,\right]$$ formula (5),
where y is the label, taking the value 0 or 1.
Total loss for m samples:
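$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\log f\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - f\!\left(x^{(i)}\right)\right) \right]$$ formula (6)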
In this formula, m is the number of samples, y is the label, taking the value 0 or 1, i denotes the i-th sample, and f(x) denotes the predicted output, which indicates the probability that a fault occurs. J(θ) is the final loss value of the model, and its minimum is 0.
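A minimal sketch of how such a loss could be minimized by gradient descent is given below. It assumes the loss is averaged over the m samples and that f(x) is the sigmoid of a linear function of the features. The learning rate, epoch count, and the four sample points are hypothetical illustration values, not data from the application.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(w, b, x) -> float:
    """f(x): predicted fault probability for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def loss(w, b, xs, ys, eps=1e-12) -> float:
    """Mean cross-entropy J(theta) over all samples."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict(w, b, x), eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(xs)


def train(xs, ys, lr=1.0, epochs=5000):
    """Batch gradient descent on J(theta), starting from zero parameters."""
    n, m = len(xs[0]), len(xs)
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y  # derivative of the loss w.r.t. w.x + b
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


# Hypothetical samples: (CPU fraction, memory fraction); label 1 = fault.
XS = [[0.20, 0.30], [0.40, 0.50], [0.85, 0.90], [0.95, 0.70]]
YS = [0, 0, 1, 1]
W, B = train(XS, YS)
```

After training, `predict(W, B, x)` returns a value in (0, 1) for new index data, and comparing it against a cut-off such as 0.5 turns that probability into a fault or normal decision.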
In the model testing stage, the data to be predicted are substituted into the model whose parameters were obtained by solving the loss function of formula (6), so as to obtain a predicted value.
Referring to Fig. 5, which is a flow chart of a specific big data-based fault prediction method according to the present invention, the specific method includes the following.
System logs and application logs of each node of the application system are collected continuously, together with data in three dimensions: CPU utilization, memory occupation, and JVM GC frequency. First, the CPU utilization, memory occupation, and JVM GC frequency of each node of the application system are collected. Second, the data are labeled: data are marked when CPU utilization exceeds 80%, memory occupation exceeds 80%, disk IO utilization exceeds 80%, the Young GC frequency exceeds once per minute or its duration exceeds 200 ms, or the Full GC frequency exceeds once per day or its duration exceeds 300 ms. Next, the CPU, memory, and disk IO utilization data at the time the application fails are collected and labeled, with particular attention to the JVM GC frequency in the 5 minutes before and after the fault. The model is then trained by inputting the collected data into it. Finally, the model is tested: index data of an application node at a given time point are input to predict whether the system will fail within a certain future time period.
In summary, the big data-based application fault prediction method collects and labels historical log data of the application to be predicted at a plurality of historical moments, and uses the labeled historical log data as a training set to train a preset basic prediction model, obtaining an application fault prediction model for predicting the fault prediction result of the application to be predicted in a future time period. The acquired current log data of the application to be predicted are input into the application fault prediction model to obtain a fault prediction result, which comprises one or more predictions of whether the application to be predicted will fail at a future time. In this way, application faults can be predicted actively: the method provides an active defense scheme for application fault prediction, turns passive response into active prevention, guards against faults before they occur, and improves operation and maintenance efficiency.
As shown in fig. 6, the present application further provides an application failure prediction device based on big data, where the device includes:
an obtaining module 601, configured to obtain current log data of an application to be predicted;
the result prediction module 602 is configured to input current log data into an application fault prediction model to obtain a fault prediction result of an application to be predicted;
the model training module 603 is configured to collect historical log data of multiple historical moments of an application to be predicted, and annotate the historical log data to obtain annotation results of each historical log data, where the annotation result of at least one historical log data is a fault, and train a preset basic prediction model through the training data set with the annotated historical log data as a training data set, so as to obtain an application fault prediction model for predicting a fault prediction result of the application to be predicted in a future time period.
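The division into three modules could be sketched as follows. This is a structural illustration only: the class names mirror modules 601 to 603, while the method names, the callable-based interfaces, and the `toy_fit` stand-in model are assumptions introduced here and are not part of the described device.

```python
from typing import Callable, Sequence

Features = Sequence[float]
Model = Callable[[Features], float]  # maps features to a fault probability


class AcquisitionModule:
    """Module 601: obtains the current log data of the application."""

    def __init__(self, source: Callable[[], Features]):
        self._source = source

    def current_log_data(self) -> Features:
        return self._source()


class ResultPredictionModule:
    """Module 602: feeds current log data into the trained model."""

    def __init__(self, model: Model, threshold: float = 0.5):
        self._model = model
        self._threshold = threshold

    def predict(self, features: Features) -> bool:
        return self._model(features) >= self._threshold


class ModelTrainingModule:
    """Module 603: trains a base model on labeled historical log data."""

    def __init__(self, fit: Callable[[Sequence[Features], Sequence[int]], Model]):
        self._fit = fit

    def train(self, history: Sequence[Features], labels: Sequence[int]) -> Model:
        # The training set must contain at least one record labeled as a fault.
        if 1 not in labels:
            raise ValueError("at least one record must be labeled as a fault")
        return self._fit(history, labels)


def toy_fit(history, labels) -> Model:
    """Stand-in trainer: flags any value at or above the lowest faulty one."""
    cutoff = min(h[0] for h, y in zip(history, labels) if y == 1)
    return lambda features: 1.0 if features[0] >= cutoff else 0.0
```

Passing the trainer in as a callable keeps the module boundaries independent of any particular model; a logistic regression trainer could be substituted for `toy_fit` without changing the other two modules.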
Therefore, this embodiment provides a big data-based scheme for predicting application faults. It predicts application faults actively, provides an active defense scheme for application fault prediction, turns passive response into active prevention, guards against faults before they occur, and improves operation and maintenance efficiency.
In summary, the big data-based application fault prediction device collects and labels historical log data of the application to be predicted at a plurality of historical moments, and uses the labeled historical log data as a training set to train a preset basic prediction model, obtaining an application fault prediction model for predicting the fault prediction result of the application to be predicted in a future time period. The acquired current log data of the application to be predicted are input into the application fault prediction model to obtain the fault prediction result. In this way, application faults can be predicted actively: the device provides an active defense scheme for application fault prediction, turns passive response into active prevention, guards against faults before they occur, and improves operation and maintenance efficiency.
It should be noted that, the big data based application fault prediction device provided in the foregoing embodiment and the big data based application fault prediction method provided in the foregoing embodiment belong to the same concept, and the specific manner in which each module and unit perform the operation has been described in detail in the method embodiment, which is not repeated herein. In practical application, the application fault prediction device based on big data provided in the above embodiment may allocate the functions to different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above, which is not limited herein.
An embodiment of the present application also provides a big data-based application fault prediction device, which comprises: one or more processors; and a storage means for storing one or more programs that, when executed by the one or more processors, cause the big data-based application fault prediction device to implement the big data-based application fault prediction method provided in the above embodiments.
Fig. 7 shows a schematic structural diagram of a computer apparatus suitable for use in implementing the big data based application failure prediction device of the embodiments of the present application. It should be noted that, the computer system 1000 of the big data based application failure prediction device shown in fig. 7 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in Fig. 7, the computer system 1000 includes a central processing unit (Central Processing Unit, CPU) 1001, which can perform various appropriate actions and processes, for example the method described in the above embodiments, according to a program stored in a read-only memory (Read-Only Memory, ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (Random Access Memory, RAM) 1003. The RAM 1003 also stores various programs and data required for system operation. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (Input/Output, I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (Cathode Ray Tube, CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage section 1008 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When executed by the central processing unit (CPU) 1001, the computer program performs the various functions defined in the apparatus of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of the present application. Each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
Another aspect of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the big data based application failure prediction method as described above. The computer-readable storage medium may be contained in the big data based application failure prediction apparatus described in the above embodiment or may exist alone without being assembled into the big data based application failure prediction apparatus.
Another aspect of the present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the big data based application failure prediction method provided in the above embodiments.
The above embodiments are merely illustrative of the principles of the present application and its effectiveness and are not intended to limit the present application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. It is therefore contemplated that the appended claims will cover all such equivalent modifications and changes as fall within the true spirit and scope of the disclosure.

Claims (10)

1. An application fault prediction method based on big data, the method comprising:
acquiring current log data of an application to be predicted;
inputting the current log data into an application fault prediction model to obtain a fault prediction result of the application to be predicted;
the training mode of the application fault prediction model comprises the steps of collecting historical log data of a plurality of historical moments of the application to be predicted, and marking to obtain marking results of the historical log data, wherein at least one marking result of the historical log data is a fault, the marked historical log data is used as a training data set, and the training data set is used for training a preset basic prediction model to obtain the application fault prediction model for predicting the fault prediction result of the application to be predicted in a future time period.
2. The big data-based application fault prediction method of claim 1, wherein collecting and annotating historical log data of the plurality of historical moments of the application to be predicted to obtain an annotation result of each historical log data, and the method comprises the following steps:
collecting historical operation data of the application to be predicted at a plurality of historical moments, and marking the abnormal state of the historical operation data according to a preset marking standard to obtain marking results of the historical operation data;
and collecting a plurality of historical fault data at a plurality of historical moments within a fault time period of the application to be predicted, wherein the fault time period comprises the fault moment, a first preset time period before the fault moment, and a second preset time period after the fault moment, the historical fault data at the fault moment are labeled as faults, and the historical fault data at the remaining historical moments are labeled as normal, so as to obtain labeling results of the historical fault data, and wherein the historical log data comprise the historical operation data and the historical fault data.
3. The big data based application failure prediction method of claim 2, wherein collecting historical operating data for a plurality of historical moments of the application to be predicted comprises:
and reading process related data of a plurality of processes of the application to be predicted, memory operation data of a host, disk operation data of a server and use condition data of preset components in the server at intervals of preset time to obtain historical operation data of a plurality of historical moments.
4. The big data-based application fault prediction method as claimed in claim 2, wherein labeling the abnormal state of the historical operation data according to a preset labeling standard to obtain labeling results of the historical operation data, comprises:
if the preset labeling standard is met, labeling the abnormal state of the historical operation data as a fault, and if the preset labeling standard is not met, labeling the abnormal state of the historical operation data as normal, so as to obtain labeling results of the historical operation data;
the preset labeling standard comprises at least one of: a central processing unit utilization rate greater than a first preset threshold, a memory occupation greater than a second preset threshold, a single garbage collection frequency greater than a third preset threshold, a single garbage collection time greater than a first preset duration, a garbage collection cycle frequency greater than a fourth preset threshold, and a disk input/output utilization rate greater than a preset utilization rate; and the historical operation data comprises at least one of the central processing unit utilization rate, the memory occupation, the single garbage collection frequency, the single garbage collection time, the garbage collection cycle frequency, and the disk input/output utilization rate.
5. The big data based application failure prediction method according to claim 2, wherein collecting a plurality of historical failure data at a plurality of historical moments in the application failure period to be predicted includes:
the method comprises the steps of collecting historical fault data of fault time when an application to be predicted breaks down, collecting historical fault data of a plurality of historical time points of a first preset time period before the fault time, and collecting historical fault data of a plurality of historical time points of a second preset time period after the fault time, wherein the fault comprises at least one of application running, process blocking and process absence, and the historical fault data comprises at least one of central processing unit utilization rate, memory occupation, disk input and output utilization rate, garbage collection single-time data and garbage collection period data.
6. The big data based application failure prediction method of any of claims 1-5, wherein training a pre-set base prediction model with the training data set comprises:
predicting each history log data in the training data set through the preset basic prediction model to obtain an initial prediction result, wherein the initial prediction result comprises at least one initial fault result at a prediction moment, and the prediction moment is later than the history moment of the history log data;
determining a target loss function according to the labeling result of each history log data and the initial prediction result;
and training the preset basic prediction model according to the target loss function.
7. The big data based application failure prediction method of claim 6, wherein the initial prediction results include initial failure results at least two prediction moments, and the application failure prediction model is used to predict failure prediction results of at least two future moments of the application to be predicted in a future time period.
8. An application failure prediction apparatus based on big data, the apparatus comprising:
the acquisition module is used for acquiring current log data of the application to be predicted;
the result prediction module is used for inputting the current log data into an application fault prediction model to obtain a fault prediction result of the application to be predicted;
the model training module is used for collecting the historical log data of the plurality of historical moments of the application to be predicted, and labeling the historical log data to obtain labeling results of the historical log data, wherein the labeling result of at least one of the historical log data is a fault, the labeled historical log data is used as a training data set, and the training data set is used for training a preset basic prediction model to obtain an application fault prediction model for predicting the fault prediction result of the application to be predicted in a future time period.
9. An application failure prediction apparatus based on big data, the apparatus comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the apparatus to implement the method of any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the method of any of claims 1 to 7.
CN202310130883.7A 2023-02-17 2023-02-17 Big data-based application fault prediction method, device, equipment and storage medium Pending CN116361104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310130883.7A CN116361104A (en) 2023-02-17 2023-02-17 Big data-based application fault prediction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116361104A true CN116361104A (en) 2023-06-30

Family

ID=86912378

Country Status (1)

Country Link
CN (1) CN116361104A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116755910A (en) * 2023-08-16 2023-09-15 中移(苏州)软件技术有限公司 Host machine high availability prediction method and device based on cold start and electronic equipment
CN116755910B (en) * 2023-08-16 2023-11-03 中移(苏州)软件技术有限公司 Host machine high availability prediction method and device based on cold start and electronic equipment
CN117250942A (en) * 2023-11-15 2023-12-19 成都态坦测试科技有限公司 Fault prediction method, device, equipment and storage medium for determining model
CN117250942B (en) * 2023-11-15 2024-02-27 成都态坦测试科技有限公司 Fault prediction method, device, equipment and storage medium for determining model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination