WO2022037536A1 - Fault handling method and apparatus, network device and storage medium - Google Patents

Fault handling method and apparatus, network device and storage medium

Info

Publication number
WO2022037536A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
data
model
optical module
historical data
Prior art date
Application number
PCT/CN2021/112806
Other languages
English (en)
French (fr)
Inventor
杨玺坤
骆庆开
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Priority to EP21857630.4A priority Critical patent/EP4198803A4/en
Publication of WO2022037536A1 publication Critical patent/WO2022037536A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/219 Managing data history or versioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B10/00 Transmission systems employing electromagnetic waves other than radio-waves, e.g. infrared, visible or ultraviolet light, or employing corpuscular radiation, e.g. quantum communication
    • H04B10/07 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems
    • H04B10/075 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems using an in-service signal
    • H04B10/079 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems using an in-service signal using measurements of the data signal
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B10/00 Transmission systems employing electromagnetic waves other than radio-waves, e.g. infrared, visible or ultraviolet light, or employing corpuscular radiation, e.g. quantum communication
    • H04B10/07 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems
    • H04B10/075 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems using an in-service signal
    • H04B10/079 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems using an in-service signal using measurements of the data signal
    • H04B10/0791 Fault location on the transmission path
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B10/00 Transmission systems employing electromagnetic waves other than radio-waves, e.g. infrared, visible or ultraviolet light, or employing corpuscular radiation, e.g. quantum communication
    • H04B10/07 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems
    • H04B10/075 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems using an in-service signal
    • H04B10/079 Arrangements for monitoring or testing transmission systems; Arrangements for fault measurement of transmission systems using an in-service signal using measurements of the data signal
    • H04B10/0793 Network aspects, e.g. central monitoring of transmission parameters

Definitions

  • the present disclosure relates to, but is not limited to, the field of optical network communications, and in particular, but not limited to, a fault handling method and apparatus, a network device, and a storage medium.
  • optical modules are important devices for connecting optical networks, and their usage keeps increasing as optical networks are deployed. Once an optical module fails, service quality is directly affected and services may even be interrupted.
  • although the failure rate of optical module devices is very low, the installed base in live networks is often in the hundreds of thousands or even millions, so the absolute number of optical module failures remains high; moreover, failures are randomly distributed geographically, and operation and maintenance personnel are often unable to repair them immediately. In short, this imposes very high operation and maintenance costs on operators, so fault detection for optical modules becomes a necessary means.
  • optical modules from different manufacturers often differ in design, so the modules' failure phenomena differ, and a unified detection method cannot cover all models.
  • the main current detection means use only a single indicator, such as temperature or current, against fixed thresholds; these indicators are affected by the external environment, so the same set of thresholds performs very differently at different times. The related art therefore lacks a universally applicable optical module fault detection solution, and the operation and maintenance cost of optical modules remains persistently high.
  • the fault handling method and apparatus, network device and storage medium mainly solve the technical problem that, in the related art, optical module fault detection solutions have low universality and the operation and maintenance cost of optical modules is high.
  • the present disclosure provides a fault handling method, which includes: analyzing the historical data of each optical module according to the optical module's model, and determining the historical data that mutates when an optical module of the corresponding model fails; extracting the corresponding original fault data from the mutated historical data; and performing fault model training by optical module model on the basis of the original fault data.
  • An embodiment of the present disclosure also provides a fault handling apparatus, including: a data analysis module, configured to analyze the historical data of each optical module according to the optical module's model and determine the historical data that mutates when an optical module of the corresponding model fails; a fault extraction module, configured to extract the corresponding original fault data from the mutated historical data; and a model training module, configured to perform fault model training by optical module model on the basis of the original fault data.
  • An embodiment of the present disclosure also provides a network device, including a processor, a memory, and a communication bus; the communication bus connects the processor and the memory for communication; the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the fault handling method described above.
  • Embodiments of the present disclosure further provide a computer storage medium, where one or more programs are stored in the computer-readable storage medium, and the one or more programs can be executed by one or more processors to implement the steps of the fault handling method described above.
  • FIG. 1 is a flowchart of the fault handling method according to Embodiment 1 of the present disclosure;
  • FIG. 2 is a flowchart of the fault handling method according to Embodiment 2 of the present disclosure;
  • FIG. 3 is a deployment flowchart of the fault handling method according to Embodiment 3 of the present disclosure;
  • FIG. 4 is a flowchart of detecting an optical module fault with the fault handling method according to Embodiment 4 of the present disclosure;
  • FIG. 5 is a monitoring and update flowchart of the fault handling method according to Embodiment 5 of the present disclosure;
  • FIG. 6 is a schematic diagram of the composition of the fault handling apparatus according to Embodiment 6 of the present disclosure;
  • FIG. 7 is a schematic diagram of the composition of the fault handling system according to Embodiment 6 of the present disclosure;
  • FIG. 8 is a schematic diagram of the composition of the network device according to Embodiment 7 of the present disclosure.
  • This embodiment provides a fault handling method; referring to FIG. 1, the method includes the following steps S101 to S103.
  • to make the optical module fault handling method of this embodiment applicable to optical modules produced by various manufacturers, the historical data of the optical modules is analyzed according to the module model; that is, for optical modules of different models, the analysis results of the historical data differ. The causes and effects of optical module failures may be similar, but the data changes at failure time may differ between module models, which leads to differences, within the analysis results, in which historical data mutates. Taking this difference into account, the analysis is performed per module model, yielding, for each model, the historical data that mutates when a fault occurs.
  • mutated historical data means historical data that changes markedly within a short time when the optical module fails; the change may be a marked rise or a marked fall. For the optical module fault detection problem, this can be defined as follows: "in the historical data of an optical module, at some time point, a special event, namely a fault, causes certain data to change markedly before and after that time point"; the "certain data" here is the mutated historical data. What counts as a marked change can be determined from the actual data itself; for example, for a normal indicator, a rise or fall of more than 50% within a short time. This embodiment merely gives one possible mutation magnitude; for different data the magnitude cannot be generalized and can be determined from the actual operation of the system together with the type of historical data, which is not limited in this embodiment.
  • a mutation can also mean that, from the fault time point until the optical module stops working and is replaced, the characteristic trend of the data during this period differs markedly from the trend before the fault time point. For example, if the trend was stable before a certain time point and keeps climbing, falling, or jittering after it, that time point can be determined to be a mutation point, and the event causing it is, with high probability, a fault.
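  • As an illustration only (not part of the original disclosure), a window-based mutation test of this kind might be sketched in Python as follows; the window size and the 50% relative-change threshold are the illustrative values discussed above, not fixed parameters of the method:

        import numpy as np

        def find_mutation_points(series: np.ndarray, window: int = 5, rel_change: float = 0.5):
            # Flag time points where the mean of the next `window` samples differs
            # from the mean of the previous `window` samples by more than
            # `rel_change` (0.5 = 50%), a simple proxy for the abrupt change
            # described above.
            points = []
            for t in range(window, len(series) - window):
                before = series[t - window:t].mean()
                after = series[t:t + window].mean()
                if before != 0 and abs(after - before) / abs(before) > rel_change:
                    points.append(t)
            return points

        # Toy example: a bias-current-like series that jumps at t = 50.
        series = np.concatenate([np.full(50, 10.0), np.full(50, 18.0)])
        print(find_mutation_points(series))  # indices clustered around t = 50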
  • the types of optical module historical data in this embodiment may be performance data, inspection data, and the like. Before the historical data of each optical module is analyzed according to the module model, the optical modules can first be classified with the module model as the standard. The classification criterion for the various module models can be classification directly by manufacturer, by the module's working principle, or by the module's specification and applicable scope, and so on.
  • extracting the corresponding original fault data from the mutated historical data may include: analyzing the mutation time points on the time series according to the mutated historical data; determining, according to the mutation time points, the historical data whose mutation proportion exceeds a preset ratio; and taking the historical data whose mutation proportion exceeds the preset ratio as the original fault data.
  • This scheme for determining the original fault data largely excludes sporadically mutated historical data while retaining the historical data that is, with high probability, fault-related. The mutation time points are analyzed, where a mutation time point can be regarded as a time point at which the optical module fails; at these time points, mutated historical data is relatively concentrated. However, even the historical data that mutates at a mutation time point is not all caused by faults, so it can be screened again, and the historical data whose mutation proportion exceeds the preset ratio is taken as the original fault data. The mutation proportion means that, at these mutation time points, the proportion of cases in which a given item of historical data mutates exceeds a threshold. For example, for a given mutation time point, if the proportion of failed optical modules of the same model on which a given item of historical data mutates exceeds a certain ratio (for example 80%) of all failed modules, that historical data can be regarded as original fault data. Alternatively, a single optical module may have multiple mutation time points, and if the proportion of those time points at which a given item of historical data mutates exceeds a certain ratio (for example 80%), that historical data can likewise be regarded as original fault data. For this embodiment, since the processing classifies optical modules by model, the original fault data determined for different module models will also differ; on the one hand this confirms the differences between manufacturers' devices, and on the other hand it demonstrates the excellent universality of the fault handling method in the embodiments of the present disclosure.
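  • A minimal sketch of the proportion screen, under the assumption that the per-module mutated indicators have already been identified; the module IDs, indicator names and the 80% ratio are illustrative:

        from collections import defaultdict

        # mutated[module_id] = indicator names that mutated at that module's
        # fault time point; all modules listed here share the same model.
        mutated = {
            "mod-1": {"tx_power", "bias_current"},
            "mod-2": {"tx_power", "temperature"},
            "mod-3": {"tx_power", "bias_current"},
        }

        def original_fault_indicators(mutated, preset_ratio=0.8):
            counts = defaultdict(int)
            for indicators in mutated.values():
                for name in indicators:
                    counts[name] += 1
            total = len(mutated)
            # Keep only indicators that mutate on at least preset_ratio of the
            # failed modules; these become the original fault data.
            return [name for name, c in counts.items() if c / total >= preset_ratio]

        print(original_fault_indicators(mutated))  # ['tx_power']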
  • after the original fault data has been determined, fault model training can be performed by optical module model on the basis of the original fault data. Likewise, optical modules of different models have different fault models.
  • performing fault model training by optical module model on the basis of the original fault data may include: screening the original fault data to obtain target fault data; sampling the target fault data according to the normal-to-fault ratio of the optical modules to obtain sampled data; and performing model training according to the sampled data.
  • although the original fault data can be used directly to train the optical module fault model, it may still contain anomalies, such as gaps introduced at the historical data analysis stage, abnormal data, or fault-unrelated data that remains when the original fault data is obtained from the mutated historical data. Therefore, the original fault data can be screened at this point to obtain the target fault data. The target fault data correlates more strongly with optical module faults, reflects the fault condition of the module to a greater extent, and allows module faults to be detected and analyzed more accurately. Since in a real system the probability of an optical module failing is low and most modules have not failed, the sampled data can be obtained by sampling the target fault data according to the normal-to-fault ratio of the optical modules; model training is then performed on the sampled data.
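  • Since the disclosure only states that sampling follows the observed normal-to-fault proportion, the following sketch shows one common choice, undersampling the dominant normal class to a target ratio; the 3:1 ratio is purely illustrative:

        import random

        def sample_by_ratio(normal_rows, fault_rows, target_ratio=3.0, seed=0):
            # Undersample the normal class so that len(normal) / len(fault) is
            # approximately target_ratio, then keep all fault rows.
            random.seed(seed)
            k = min(len(normal_rows), int(target_ratio * len(fault_rows)))
            return random.sample(normal_rows, k) + fault_rows

        normal = [{"id": i, "label": 0} for i in range(1000)]
        fault = [{"id": i, "label": 1} for i in range(20)]
        print(len(sample_by_ratio(normal, fault)))  # 80 = 60 normal + 20 fault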
  • screening the original fault data includes: deleting abnormal data from the original fault data; and/or extracting, from the original fault data, data that characterizes the timing characteristics of the optical module. Abnormal data mainly refers to data missing from the collection process, or data with obvious errors; data characterizing the module's timing characteristics refers to data, extracted for the module's behaviour over time, that can represent temporally significant features.
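  • A sketch of both screening steps, assuming pandas is available; the column names, the validity bound on received power, and the rolling window are all illustrative:

        import pandas as pd

        def screen(raw: pd.DataFrame) -> pd.DataFrame:
            # Drop missing/clearly erroneous samples, then add rolling features
            # that capture the module's behaviour over time.
            df = raw.dropna()                     # remove collection gaps
            df = df[df["rx_power"] > -40]         # discard obviously bad readings
            df = df.sort_values("timestamp")
            df["rx_power_mean_6"] = df["rx_power"].rolling(6, min_periods=1).mean()
            df["rx_power_std_6"] = df["rx_power"].rolling(6, min_periods=1).std().fillna(0.0)
            return df

        raw = pd.DataFrame({
            "timestamp": pd.date_range("2021-01-01", periods=8, freq="15min"),
            "rx_power": [-5.0, -5.1, None, -5.2, -30.0, -31.0, -55.0, -32.0],
        })
        print(screen(raw))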
  • performing fault model training by optical module model may include: performing fault model training separately for optical modules of different models according to the sampled data corresponding to each model. Optical modules of different models have different fault data and, correspondingly, different fault models. The fault model may include at least one of a machine learning model, a statistical model, and a deep learning model.
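  • The per-model training loop might be sketched as follows; scikit-learn's RandomForestClassifier stands in for "a machine learning model" and is only one possible choice, and the module-model names and random features are placeholders:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        # sampled[model] -> (features X, labels y) for modules of that model.
        rng = np.random.default_rng(0)
        sampled = {
            "vendorA-10G": (rng.normal(size=(80, 4)), rng.integers(0, 2, 80)),
            "vendorB-25G": (rng.normal(size=(80, 4)), rng.integers(0, 2, 80)),
        }

        fault_models = {}
        for model, (X, y) in sampled.items():
            clf = RandomForestClassifier(n_estimators=50, random_state=0)
            fault_models[model] = clf.fit(X, y)   # one fault model per module model

        # At prediction time, route each module's features to the model trained
        # for its own module model.
        print(fault_models["vendorA-10G"].predict_proba(rng.normal(size=(1, 4))))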
  • the method may further include: outputting, according to the result of the fault model training, the fault risk prediction result of the corresponding optical module within a preset time. The result of fault model training can be used directly for prediction, that is, to predict whether certain optical modules have failed, when they will fail, what kind of fault will occur, and so on, providing convenience for the operation and maintenance of the system.
  • the method may further include: evaluating the fault risk prediction result through a preset evaluation index. Prediction based on the result of fault model training can always deviate, so the prediction results can be monitored by setting an evaluation index: the actual fault conditions are cross-compared with the prediction results to obtain the evaluation index corresponding to the fault model, and once the evaluation index triggers the condition, the fault model needs to be updated. The method may further include: collecting supplementary historical data corresponding to falsely detected or missed optical modules, and updating the fault model according to the supplementary historical data when the preset evaluation index reaches an update threshold. When the historical data of optical modules is analyzed, a normal optical module may be mistaken for a faulty one, that is, false detection, or the historical data of a faulty module may be missed, that is, missed detection; missed and false detections can also occur when the fault condition of optical modules is predicted. When they occur, the missed and falsely detected data, that is, the supplementary historical data, can first be collected, and when the evaluation index reaches the update threshold, the fault model can be updated according to it.
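  • One way to realize the cross-comparison and the update trigger is sketched below; precision/recall and the 0.9 threshold are illustrative stand-ins for the unspecified evaluation index:

        def evaluate(predicted: set, actual: set):
            # Cross-compare predicted faulty modules with actual faults; the
            # false-detection and missed-detection sets are exactly the
            # supplementary historical data to collect.
            false_det = predicted - actual   # normal modules flagged as faulty
            missed = actual - predicted      # faulty modules not flagged
            precision = len(predicted & actual) / len(predicted) if predicted else 1.0
            recall = len(predicted & actual) / len(actual) if actual else 1.0
            return precision, recall, false_det, missed

        precision, recall, false_det, missed = evaluate({"m1", "m2", "m3"}, {"m2", "m3", "m4"})
        UPDATE_THRESHOLD = 0.9  # illustrative
        if min(precision, recall) < UPDATE_THRESHOLD:
            print("update fault model with supplementary data:", false_det | missed)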
  • the fault handling method provided by this embodiment analyzes the historical data of each optical module according to the module model and determines the historical data that mutates when a module of the corresponding model fails; extracts the corresponding original fault data from the mutated historical data; and performs fault model training by module model on the basis of the original fault data. By partitioning according to the optical module's model, the flexibility of collecting historical data of faulty optical modules is improved, the applicable scope of fault detection is further broadened, and the operation and maintenance cost of optical modules is effectively reduced.
  • This embodiment provides a fault handling method; referring to FIG. 2, the method includes the following steps S201 to S214. S201: take the historical data generated during optical module operation (such as performance data and inspection data) as input into the processing system of the fault handling method.
  • S206: determine a sampling algorithm according to the proportion of normal and faulty optical modules in the current data, and generate sampled data.
  • the fault model uses a machine learning model and can also be replaced by a statistical model or a deep learning model according to requirements and computing power; in the prediction stage, the fault model can be used directly to output prediction results.
  • the output result is delivered directly to the user for viewing, or output to other downstream docking systems.
  • This embodiment provides a deployment flow of a fault handling method; referring to FIG. 3, the method includes the following steps S301 to S311.
  • S301 Read historical data (such as performance data or inspection data) of the optical module from the upstream docking system.
  • S310 analyze the ratio of normal and faulty samples in the data, and perform sampling according to a sampling algorithm.
  • This embodiment provides the optical module fault detection flow of a fault handling method; referring to FIG. 4, the method includes the following steps S401 to S406.
  • S401: run a fault detection task periodically, and read the optical modules' historical data for a recent period from the upstream docking system.
  • This embodiment provides a monitoring and updating process of a fault handling method. Please refer to FIG. 5 .
  • the method includes the following steps S501 to S511.
  • the system reaches the set task execution time, and runs the monitoring task.
  • the device includes a data analysis module 61 , a fault extraction module 62 and a model training module 63 .
  • the data analysis module 61 is configured to analyze the historical data of each optical module according to the optical module's model, and determine the historical data that mutates when an optical module of the corresponding model fails.
  • the fault extraction module 62 is configured to extract the corresponding original fault data from the mutated historical data.
  • the model training module 63 is configured to perform fault model training by optical module model on the basis of the original fault data.
  • for the optical module fault handling apparatus of this embodiment to be applicable to optical modules produced by various manufacturers, the historical data of the optical modules is likewise analyzed according to the module model; that is, for modules of different models, the analysis results of the historical data differ. The causes and effects of failure may be similar, but the data changes during a failure may differ between module models, which leads to differences, within the analysis results, in which historical data mutates. Taking this difference into account, the analysis is performed per module model, yielding, for each model, the historical data that mutates when a fault occurs.
  • mutated historical data means historical data that changes markedly within a short time when the optical module fails; the change may be a marked rise or a marked fall. For the optical module fault detection problem, this can be defined as follows: "in the historical data of an optical module, at some time point, a special event, namely a fault, causes certain data to change markedly before and after that time point"; the "certain data" here is the mutated historical data. What counts as a marked change can be determined from the actual data itself, for example a rise or fall of more than 50% within a short time for a normal indicator. This embodiment merely gives one possible mutation magnitude; for different data the magnitude cannot be generalized and can be determined from the actual operation of the system together with the type of historical data, which is not limited in this embodiment.
  • the types of optical module historical data in this embodiment may be performance data, inspection data, and the like; before the historical data of each optical module is analyzed according to the module model, the optical modules can first be classified with the module model as the standard. The classification criterion for the various module models can be classification directly by manufacturer, by the module's working principle, or by the module's specification and applicable scope, and so on.
  • extracting the corresponding original fault data from the mutated historical data may include: analyzing the mutation time points on the time series according to the mutated historical data; determining, according to the mutation time points, the historical data whose mutation proportion exceeds a preset ratio; and taking the historical data whose mutation proportion exceeds the preset ratio as the original fault data.
  • This scheme for determining the original fault data largely excludes sporadically mutated historical data while retaining the historical data that is, with high probability, fault-related. The mutation time points are analyzed, where a mutation time point can be regarded as a time point at which the optical module fails; at these time points, mutated historical data is relatively concentrated. However, even the historical data that mutates at a mutation time point is not all caused by faults, so it can be screened again, and the historical data whose mutation proportion exceeds the preset ratio is taken as the original fault data. The mutation proportion means that, at these mutation time points, the proportion of cases in which a given item of historical data mutates exceeds a threshold. For example, for a given mutation time point, if the proportion of failed optical modules of the same model on which a given item of historical data mutates exceeds a certain ratio (for example 80%) of all failed modules, that historical data can be regarded as original fault data. Alternatively, a single optical module may have multiple mutation time points, and if the proportion of those time points at which a given item of historical data mutates exceeds a certain ratio (for example 80%), that historical data can likewise be regarded as original fault data. For this embodiment, since the processing classifies optical modules by model, the original fault data determined for different module models will also differ; on the one hand this confirms the differences between manufacturers' devices, and on the other hand it demonstrates the excellent universality of the fault handling apparatus in the embodiments of the present disclosure.
  • after the original fault data has been determined, fault model training can be performed by optical module model on the basis of the original fault data. Likewise, optical modules of different models have different fault models.
  • performing fault model training by optical module model on the basis of the original fault data may include: screening the original fault data to obtain target fault data; sampling the target fault data according to the normal-to-fault ratio of the optical modules to obtain sampled data; and performing model training according to the sampled data. Although the original fault data can be used directly to train the optical module fault model, it may still contain anomalies, such as gaps introduced at the historical data analysis stage, abnormal data, or fault-unrelated data that remains when the original fault data is obtained from the mutated historical data; therefore, the original fault data can be screened again to obtain the target fault data. The target fault data correlates more strongly with optical module faults, reflects the fault condition of the module to a greater extent, and allows module faults to be detected and analyzed more accurately. Since in a real system the probability of an optical module failing is low and most modules have not failed, the sampled data can be obtained by sampling the target fault data according to the normal-to-fault ratio of the optical modules; model training is then performed on the sampled data.
  • screening the original fault data includes: deleting abnormal data from the original fault data; and/or extracting, from the original fault data, data that characterizes the timing characteristics of the optical module, that is, data extracted for the module's behaviour over time that can represent temporally significant features.
  • performing fault model training by optical module model may include: performing fault model training separately for optical modules of different models according to the sampled data corresponding to each model; optical modules of different models have different fault data and, correspondingly, different fault models. The fault model may include at least one of a machine learning model, a statistical model, and a deep learning model.
  • the method may further include: outputting, according to the result of the fault model training, the fault risk prediction result of the corresponding optical module within a preset time; the result of fault model training can be used directly for prediction, that is, to predict whether certain optical modules have failed, when they will fail, and what kind of fault will occur, providing convenience for the operation and maintenance of the system. The method may further include: evaluating the fault risk prediction result through a preset evaluation index; prediction based on the training result can always deviate, so the prediction results can be monitored by setting an evaluation index, the actual fault conditions are cross-compared with the prediction results to obtain the evaluation index corresponding to the fault model, and once the evaluation index triggers the condition, the fault model needs to be updated.
  • the method may further include: collecting supplementary historical data corresponding to falsely detected or missed optical modules, and updating the fault model according to the supplementary historical data when the preset evaluation index reaches an update threshold.
  • by partitioning according to the optical module's model, the flexibility of collecting historical data of faulty optical modules is improved, the applicable scope of fault detection is further broadened, and the operation and maintenance cost of optical modules is effectively reduced.
  • FIG. 7 shows a detailed schematic diagram of the composition of the fault handling apparatus within the fault handling system of this embodiment.
  • the data analysis module 61 is mainly responsible for classifying the data and for extracting the fault data indicators of the different module models through the mutation-point analysis algorithm. The data analysis module 61 includes a first device classification sub-module 611 and an indicator extraction sub-module 612, where the first device classification sub-module 611 is mainly responsible for classifying the data according to information such as the optical module's model, and the indicator extraction sub-module 612 analyzes the "mutation points" on the time series in the fault data of the different module models and obtains the historical data with a relatively high mutation proportion among the faulty modules, which is passed downstream as the original fault data.
  • the fault extraction module 62 is mainly responsible for streamlining the original fault data provided by the upstream subsystem, cleaning abnormal data, extracting new features, and sampling the data. The fault extraction module 62 includes an anomaly cleaning sub-module 621, a timing extraction sub-module 622 and a sampling sub-module 623, where the anomaly cleaning sub-module 621 takes the original fault data extracted upstream, removes unneeded data, and eliminates erroneously collected data; the timing extraction sub-module 622 extracts, for the module's timing characteristics, the data that can represent temporally significant features; and the sampling sub-module 623 formulates a sampling algorithm according to the proportion of normal and faulty optical modules in the current data and generates sampled data to send downstream.
  • the model training module 63 is mainly responsible for training the fault detection model and completing subsequent prediction tasks. The model training module 63 includes a second device classification sub-module 631 and a result processing sub-module 632, where the second device classification sub-module 631 classifies the sampled data input from upstream according to the optical module's model; different models can be trained according to the different module models and input data, using a machine learning model that can also be replaced by a statistical model or a deep learning model according to requirements and computing power, and in the prediction stage the model output can be used directly. The result processing sub-module 632 is configured to aggregate the results, output them as the system result, and transmit them downstream at the same time.
  • the model update module 64 is mainly responsible for monitoring the results output upstream and triggering a system model update when the evaluation index drops. The model update module 64 includes a result evaluation sub-module 641, a collection sub-module 644, an incremental update trigger 642 and a full update trigger 643, where the result evaluation sub-module 641 is configured to evaluate the model detection capability of the current output result through the evaluation index; the collection sub-module 644 is configured to collect the data of falsely detected (normal modules detected as faulty) and missed (faulty modules detected as normal) optical modules for use in subsequent model updates; the incremental update trigger 642, controlled by the result evaluation sub-module 641, triggers the system to perform an incremental update; and the full update trigger 643, likewise controlled by the result evaluation sub-module 641, triggers the system to perform a full update.
  • the model update module 64 can also output the results directly to the user.
  • This embodiment also provides a network device; referring to FIG. 8, it includes a processor 81, a memory 82 and a communication bus 83; the communication bus 83 connects the processor 81 and the memory 82 for communication; and the processor 81 is configured to execute one or more computer programs stored in the memory 82 to implement the steps of the fault handling methods in the foregoing embodiments, which are not repeated here.
  • the present embodiments also provide a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology configured to store information, such as computer-readable instructions, data structures, computer program modules, or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be configured to store the desired information and that can be accessed by a computer.
  • the computer-readable storage medium in this embodiment may be configured to store one or more computer programs, and the stored one or more computer programs may be executed by one or more processors to implement the steps of the fault handling methods in the foregoing embodiments.
  • This embodiment also provides a computer program (or computer software), which can be distributed on a computer-readable medium and executed by a computing device to implement the steps of the fault handling methods in the foregoing embodiments; and in some cases, at least one of the steps shown or described may be performed in an order different from that described in the foregoing embodiments.
  • This embodiment also provides a computer program product, including a computer-readable device, where the computer program as shown above is stored on the computer-readable device.
  • the computer-readable device may include the computer-readable storage medium as described above.
  • according to the embodiments of the present disclosure, the historical data of each optical module is analyzed according to the optical module's model, and the historical data that mutates when a module of the corresponding model fails is determined; the corresponding original fault data is extracted from the mutated historical data; and fault model training is performed by module model on the basis of the original fault data. By partitioning according to the module model, the flexibility of collecting historical data of faulty optical modules is improved and the applicable scope of fault detection is further broadened; in actual use, the system can update automatically, so that detection accuracy keeps improving as fault data accumulates, effectively reducing the operation and maintenance cost of optical modules.
  • those skilled in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and apparatuses, can be implemented as software (realizable as computer program code executable by a computing device), firmware, hardware, and appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components.
  • Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit.
  • Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium. Therefore, the present disclosure is not limited to any particular combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Electromagnetism (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Optical Communication System (AREA)

Abstract

The fault handling method and apparatus, network device and storage medium provided by the present disclosure analyze the historical data of each optical module according to the optical module's model, and determine the historical data that mutates when an optical module of the corresponding model fails; extract the corresponding original fault data from the mutated historical data; and, on the basis of the original fault data, perform fault model training by optical module model.

Description

Fault handling method and apparatus, network device and storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application CN202010826217.3, filed on August 17, 2020 and entitled "Fault handling method and apparatus, network device and storage medium", the entire contents of which are incorporated into this application by reference.
TECHNICAL FIELD
The present disclosure relates to, but is not limited to, the field of optical network communications, and in particular, but not limited to, a fault handling method and apparatus, a network device, and a storage medium.
BACKGROUND
With the continuous development of communication technology, optical communication networks have been applied ever more widely to meet users' demand for broadband services such as integrated communication services and multimedia communication services; their emergence has largely resolved the bandwidth bottleneck of the transmission systems in communication networks.
As the key component for optical-electrical signal conversion, the optical module is an important device for connecting optical networks, and as optical networks are continuously deployed, the number of optical modules in use keeps growing. Once an optical module fails, service quality is directly affected and services may even be interrupted. Although the failure rate of optical module devices is very low, the installed base in live networks is often in the hundreds of thousands or even millions, so the number of optical module failures remains high; moreover, failures are randomly distributed geographically, and when one occurs, operation and maintenance personnel are often unable to repair it immediately. In short, this imposes very high operation and maintenance costs on operators, and fault detection for optical modules therefore becomes a necessary means. In the related art, however, optical module fault detection still faces several difficulties. First, optical modules from different manufacturers often differ in design, so the modules' failure phenomena differ, and a unified detection method cannot cover all models. In addition, the main current detection means use only a single indicator, such as temperature or current, with fixed thresholds; these indicators are affected by the external environment, so the same set of thresholds performs very differently at different times. The related art therefore lacks a universally applicable optical module fault detection solution, and the operation and maintenance cost of optical modules remains persistently high.
SUMMARY
The fault handling method and apparatus, network device and storage medium provided by the present disclosure mainly solve the technical problem that, in the related art, optical module fault detection solutions have low universality and the operation and maintenance cost of optical modules is high.
To solve the above technical problem, the present disclosure provides a fault handling method, including: analyzing the historical data of each optical module according to the optical module's model, and determining the historical data that mutates when an optical module of the corresponding model fails; extracting the corresponding original fault data from the mutated historical data; and performing fault model training by optical module model on the basis of the original fault data.
An embodiment of the present disclosure further provides a fault handling apparatus, including: a data analysis module, configured to analyze the historical data of each optical module according to the optical module's model and determine the historical data that mutates when an optical module of the corresponding model fails; a fault extraction module, configured to extract the corresponding original fault data from the mutated historical data; and a model training module, configured to perform fault model training by optical module model on the basis of the original fault data.
An embodiment of the present disclosure further provides a network device, including a processor, a memory, and a communication bus; the communication bus connects the processor and the memory for communication; and the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the fault handling method described above.
An embodiment of the present disclosure further provides a computer storage medium, the computer-readable storage medium storing one or more programs, where the one or more programs are executable by one or more processors to implement the steps of the fault handling method described above.
Other features of the present disclosure and the corresponding beneficial effects are set forth in later parts of the specification, and it should be understood that at least some of the beneficial effects become apparent from the description in this specification.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of the fault handling method according to Embodiment One of the present disclosure;
FIG. 2 is a flowchart of the fault handling method according to Embodiment Two of the present disclosure;
FIG. 3 is a deployment flowchart of the fault handling method according to Embodiment Three of the present disclosure;
FIG. 4 is a flowchart of detecting an optical module fault with the fault handling method according to Embodiment Four of the present disclosure;
FIG. 5 is a monitoring and update flowchart of the fault handling method according to Embodiment Five of the present disclosure;
FIG. 6 is a schematic composition diagram of the fault handling apparatus according to Embodiment Six of the present disclosure;
FIG. 7 is a schematic composition diagram of the fault handling system according to Embodiment Six of the present disclosure;
FIG. 8 is a schematic composition diagram of the network device according to Embodiment Seven of the present disclosure.
DETAILED DESCRIPTION
To make the objectives, technical solutions and advantages of the present disclosure clearer, the embodiments of the present disclosure are described in further detail below through specific implementations in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are merely intended to explain the present disclosure and are not intended to limit it.
Embodiment One
This embodiment provides a fault handling method; referring to FIG. 1, the method includes the following steps S101 to S103.
S101: analyze the historical data of each optical module according to the optical module's model, and determine the historical data that mutates when an optical module of the corresponding model fails.
S102: extract the corresponding original fault data from the mutated historical data.
S103: on the basis of the original fault data, perform fault model training by optical module model.
To make the optical module fault handling method of this embodiment broadly applicable to optical modules produced by various manufacturers, this embodiment analyzes the historical data of optical modules according to the module model; that is, for optical modules of different models, the analysis results of the historical data differ. The causes and effects of optical module failures may be similar, but the data changes at failure time may well differ between module models, which leads to differences, in the analysis results, in which historical data mutates. It is precisely this difference that this embodiment takes into account by performing the analysis per module model, obtaining, for each model, the historical data that mutates when a fault occurs.
Here, mutated historical data means historical data whose indicator changes markedly within a short time when the optical module fails; the change may be a marked rise or a marked fall. For the optical module fault detection problem, this can be defined as: "in the historical data of an optical module, at some time point, a special event, namely a fault, causes certain data to change markedly before and after that time point"; the "certain data" here is the mutated historical data. What counts as a marked change can be determined from the actual data itself; for example, for a normal indicator, its rise or fall within a short time exceeds 50%. This embodiment merely gives one possible mutation magnitude; for different data the magnitude cannot be generalized and can be determined from the actual operation of the system together with the type of historical data, which is not limited in this embodiment.
In addition, a mutation can also mean that, from the fault time point until the optical module stops working and is replaced, the characteristic trend of the data during this period differs markedly from the trend before the fault time point. For example, if the trend was stable before a certain time point and keeps climbing, falling, or jittering after it, that time point can be determined to be a mutation point, and the event causing it is, with high probability, a fault.
When analyzing the mutations of the historical data, the indicators that mutate most often are counted; other historical data that does not mutate, or whose mutation is not obvious, can simply be disregarded. Of course, to improve the comprehensiveness of detection and avoid the adverse consequences of missed or erroneous judgments, such data can be retained, and the fault model can be updated in a subsequent update phase.
The types of optical module historical data in this embodiment may be performance data, inspection data, and so on. Before the historical data of each optical module is analyzed according to the module model, the optical modules can first be classified with the module model as the standard. The classification criterion for the various module models can be classification directly by manufacturer, by the module's working principle, by the module's specification and applicable scope, and so on.
Among the mutated historical data, not all historical data is related to the fault. For example, for the failed optical modules of one model, some mutated historical data is always sporadic and produced by only a few of the failed modules; such data generally should not be treated as fault data. In other words, fault data should be the historical data that, for the corresponding module model, mutates in a relatively high proportion of cases when a fault occurs; such historical data can serve as the original fault data. In this embodiment, extracting the corresponding original fault data from the mutated historical data may include: analyzing the mutation time points on the time series according to the mutated historical data; determining, according to the mutation time points, the historical data whose mutation proportion exceeds a preset ratio; and taking the historical data whose mutation proportion exceeds the preset ratio as the original fault data.
This scheme for determining the original fault data can largely exclude sporadically mutated historical data while retaining the historical data that is, with high probability, fault-related. The mutation time points are analyzed, where a mutation time point can be regarded as a time point at which the optical module fails; at these time points, mutated historical data is relatively concentrated. However, even the historical data that mutates at a mutation time point is not all caused by faults, so it can be screened again, and the historical data whose mutation proportion exceeds the preset ratio is taken as the original fault data. The mutation proportion means that, at these mutation time points, the proportion of historical data exhibiting a mutation exceeds a threshold. For example, for a given mutation time point, if, across multiple failed optical modules of the same model, the proportion of failed modules on which a given item of historical data mutates exceeds a certain ratio (for example 80%), then that historical data can be regarded as original fault data. Alternatively, a single optical module may have multiple mutation time points, and if the proportion of those time points at which a given item of historical data mutates exceeds a certain ratio (for example 80%), that historical data can likewise be regarded as original fault data. For this embodiment, since the processing classifies optical modules by model, the original fault data determined for different module models will also differ; on the one hand this confirms the differences between manufacturers' devices, and on the other hand it demonstrates the excellent universality of the fault handling method in the embodiments of the present disclosure.
After the original fault data has been determined, fault model training can be performed by optical module model on the basis of the original fault data. Likewise, optical modules of different models have different fault models.
In some embodiments, performing fault model training by optical module model on the basis of the original fault data may include: screening the original fault data to obtain target fault data; sampling the target fault data according to the normal-to-fault ratio of the optical modules to obtain sampled data; and performing model training according to the sampled data.
Although the original fault data can be used directly to train the optical module fault model, it may still contain anomalies, such as gaps introduced at the historical data analysis stage, abnormal data, or data unrelated to faults that remains when the original fault data is obtained from the mutated historical data. Therefore, the original fault data can be screened again at this point to obtain the target fault data. The target fault data correlates more strongly with optical module faults, reflects the fault condition of the module to a greater extent, and allows module faults to be detected and analyzed more accurately.
Since in a real system the probability of an optical module failing is low and most modules have not failed, the target fault data can be sampled according to the normal-to-fault ratio of the optical modules to obtain the sampled data; model training is then performed on the sampled data.
In some embodiments, screening the original fault data includes: deleting abnormal data from the original fault data; and/or extracting, from the original fault data, data that characterizes the timing characteristics of the optical module. Here, abnormal data mainly refers to data missing from the collection process or data with obvious errors; data characterizing the module's timing characteristics refers to data extracted for the module's behaviour over time that can represent temporally significant features.
In some embodiments, performing fault model training by optical module model may include: performing fault model training separately for optical modules of different models according to the sampled data corresponding to each model. Optical modules of different models have different fault data for their faults, and correspondingly different fault models.
In some embodiments, the fault model may include at least one of a machine learning model, a statistical model, and a deep learning model.
In some embodiments, after the fault model training by optical module model, the method may further include: outputting, according to the result of the fault model training, the fault risk prediction result of the corresponding optical module within a preset time. The result of fault model training can be used directly for prediction, that is, to predict whether certain optical modules have failed, when they will fail, what kind of fault will occur, and so on, providing convenience for the operation and maintenance of the system.
In some embodiments, the method may further include: evaluating the fault risk prediction result through a preset evaluation index. Prediction based on the result of fault model training can always deviate, so the prediction results can be monitored by setting an evaluation index; the actual fault conditions can be cross-compared with the prediction results to obtain the evaluation index corresponding to the fault model. Once the evaluation index triggers the condition, the fault model needs to be updated.
In some embodiments, the method may further include: collecting supplementary historical data corresponding to falsely detected or missed optical modules; and, when the preset evaluation index reaches an update threshold, updating the fault model according to the supplementary historical data.
When the historical data of optical modules is analyzed, a normal optical module may be mistaken for a faulty one, that is, false detection, or the historical data of a faulty module may be missed, that is, missed detection; missed detection and false detection can also occur when the fault condition of optical modules is predicted. When false or missed detections occur, the missed and falsely detected data, that is, the supplementary historical data, can first be collected, and when the evaluation index reaches the update threshold, the fault model can be updated according to the supplementary historical data.
The fault handling method provided by this embodiment analyzes the historical data of each optical module according to the optical module's model and determines the historical data that mutates when an optical module of the corresponding model fails; extracts the corresponding original fault data from the mutated historical data; and performs fault model training by optical module model on the basis of the original fault data. By partitioning according to the optical module's model, the flexibility of collecting historical data of faulty optical modules is improved, the applicable scope of fault detection is further broadened, and the operation and maintenance cost of optical modules is effectively reduced.
Embodiment Two
This embodiment provides a fault handling method; referring to FIG. 2, the method includes the following steps S201 to S214 (as an illustration, a skeleton of the whole flow is sketched after the step list). S201: take the historical data generated during optical module operation (such as performance data and inspection data) as input into the processing system of the fault handling method.
S202: classify the historical data according to the optical module's model, obtaining the historical data corresponding to each optical module, separated by module model.
S203: for the historical data of each module model, analyze each module's "mutation points" on the time series, that is, determine its fault occurrence time points, and obtain, as the original fault data, the historical data whose mutation proportion among the failed modules reaches a certain ratio.
S204: screen the original fault data, removing unneeded data and eliminating erroneously collected data.
S205: further screen the original fault data, extracting the data that can characterize the optical module's timing characteristics as the target fault data.
S206: determine a sampling algorithm according to the proportion of normal and faulty optical modules in the current data, and generate sampled data.
S207: classify the sampled data input from upstream according to the optical module's model, obtaining module sampled data divided by model.
S208: train different fault models according to the different module models and the corresponding sampled data; the fault model uses a machine learning model and can also be replaced by a statistical model or a deep learning model according to requirements and computing power; in the prediction stage, the fault model can be used directly to output prediction results.
S209: aggregate the prediction results and output them as the system output.
S210: evaluate the model detection capability of the current system output through the preset evaluation index.
S211: collect the falsely detected (normal modules detected as faulty) and missed (faulty modules detected as normal) optical modules for use in subsequent model updates.
S212: when the evaluation index satisfies the incremental update condition, trigger the system to perform an incremental update.
S213: when the evaluation index satisfies the full update condition, trigger the system to perform a full update.
S214: output the results directly to the user for viewing, or output them to other downstream docking systems.
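As an illustration only, the S201 to S214 flow can be skeletonized as below; every stage is passed in as a plain function, and all names are placeholders rather than APIs of the disclosed system:

    def run_pipeline(history, classify, analyze_mutations, clean, extract_features,
                     sample, train, predict, evaluate):
        # Skeleton of S201-S214: classify by module model, extract original
        # fault data, clean and feature-engineer it, sample, train one fault
        # model per module model, predict, then evaluate the outputs.
        results = {}
        for model, rows in classify(history).items():       # S202
            raw_fault = analyze_mutations(rows)              # S203
            target = extract_features(clean(raw_fault))      # S204-S205
            sampled = sample(target)                         # S206-S207
            fault_model = train(model, sampled)              # S208
            results[model] = predict(fault_model, rows)      # S208-S209
        return evaluate(results)                             # S210-S213

    # Trivial stubs just to show the wiring; real stages would implement the
    # algorithms described in this embodiment.
    print(run_pipeline(
        history=[{"model": "A", "v": 1}],
        classify=lambda h: {"A": h},
        analyze_mutations=lambda rows: rows,
        clean=lambda rows: rows,
        extract_features=lambda rows: rows,
        sample=lambda rows: rows,
        train=lambda model, rows: "trained-model",
        predict=lambda m, rows: {"faulty": []},
        evaluate=lambda results: results,
    ))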
Embodiment Three
This embodiment provides a deployment flow of the fault handling method; referring to FIG. 3, the method includes the following steps S301 to S311 (a small sketch of the counting in S304 to S307 follows the step list).
S301: read the optical modules' historical data (such as performance data or inspection data) from the upstream docking system.
S302: classify the historical data according to the optical modules' model information.
S303: for each module model, extract the historical data of the modules that have failed.
S304: analyze every item of historical data of each faulty module for the presence of a "trend mutation point".
S305: if present, add 1 to the statistics counter.
S306: if absent, continue traversing the other historical data until the analysis is complete.
S307: after the statistics are complete, use the top n items of historical data with the highest statistical weight as the original fault data.
S308: clean the data, including operations such as deleting abnormally collected data.
S309: for the timing characteristics of the optical module data, extract the data that can characterize its temporal features as the target fault data.
S310: analyze the proportion of normal and fault samples in the data, and sample according to the sampling algorithm.
S311: start the training flow and train fault models separately on the data of the different module models; once training finishes, deployment is complete.
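A minimal sketch of the counting in S304 to S307 follows; the trend-mutation test, the indicator names and the value of n are illustrative assumptions:

    from collections import Counter

    def top_n_fault_indicators(fault_modules, has_trend_mutation, n=5):
        # S304-S307: test every indicator series of every failed module for a
        # trend mutation point, accumulate hits, and keep the n most
        # frequently mutating indicators as the original fault data.
        counter = Counter()
        for module in fault_modules:                 # S304: iterate failed modules
            for name, series in module.items():
                if has_trend_mutation(series):       # S304: trend-mutation test
                    counter[name] += 1               # S305: counter +1
        return [name for name, _ in counter.most_common(n)]  # S307: top n

    faulty = [
        {"tx_power": [10, 10, 3], "temp": [40, 41, 40]},
        {"tx_power": [9, 9, 2], "temp": [40, 40, 55]},
    ]
    jump = lambda s: abs(s[-1] - s[0]) / max(abs(s[0]), 1e-9) > 0.5
    print(top_n_fault_indicators(faulty, jump, n=1))  # ['tx_power']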
Embodiment Four
This embodiment provides the optical module fault detection flow of the fault handling method; referring to FIG. 4, the method includes the following steps S401 to S406 (a minimal sketch follows the step list).
S401: run the fault detection task periodically, reading the optical modules' recent historical data from the upstream docking system.
S402: classify the historical data according to the optical module's model.
S403: for the timing characteristics of the optical module data, extract the data that can characterize its temporal features.
S404: pass the data of each module model into the corresponding trained fault model, and output the probability that the module may fail.
S405: monitor the prediction results.
S406: output the prediction results to the end user or to other docked downstream systems.
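The detection flow might be sketched as below; the stub classifier merely stands in for the per-model fault models trained in the deployment flow, and all names are placeholders:

    def detect(recent_history, classify, extract_features, fault_models):
        # S401-S406 sketch: classify recent data by module model, extract
        # time-series features, and score every module with the fault model
        # trained for its own module model.
        probabilities = {}
        for model, rows in classify(recent_history).items():   # S402
            feats = extract_features(rows)                      # S403
            clf = fault_models[model]                           # S404
            for module_id, x in feats.items():
                probabilities[module_id] = clf.predict_proba([x])[0][1]
        return probabilities                                    # S405-S406

    class StubModel:
        def predict_proba(self, X):
            return [[0.2, 0.8] for _ in X]

    print(detect(
        recent_history=[("mod-1", "vendorA-10G")],
        classify=lambda h: {"vendorA-10G": h},
        extract_features=lambda rows: {mid: [0.0] for mid, _ in rows},
        fault_models={"vendorA-10G": StubModel()},
    ))  # {'mod-1': 0.8}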
Embodiment Five
This embodiment provides the monitoring and update flow of the fault handling method; referring to FIG. 5, the method includes the following steps S501 to S511 (the update-threshold logic is sketched after the step list).
S501: the system reaches the set task execution time and runs the monitoring task.
S502: compile statistics on the prediction results collected so far.
S503: compile statistics on the current fault conditions of the optical modules, such as whether there are severe alarms, whether modules have been replaced, and whether there are repair work orders.
S504: cross-compare the actual fault conditions with the prediction results and compute the evaluation index.
S505: judge whether the evaluation index has reached an update threshold (possibly the full update or the incremental update threshold).
S506: if not, continue collecting the data of currently falsely detected and missed modules.
S507: if a threshold has been reached, prepare the current historical data and supplement it with the collected false-detection and missed-detection data.
S508: judge whether the full update threshold has been reached.
S509: if so, fully update the fault model using the data from S507.
S510: if not, judge whether the incremental update threshold has been reached.
S511: if so, perform an incremental update of the fault model using the data from S507.
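The threshold logic of S505 to S511 might look as follows; the two thresholds are illustrative, with a deeper drop of the (higher-is-better) evaluation index triggering the full rather than the incremental update:

    def monitoring_task(metric, data, supplements,
                        incr_threshold=0.90, full_threshold=0.75):
        # S505-S511 sketch: decide between continuing collection, an
        # incremental update, and a full update of the fault model.
        if metric >= incr_threshold:
            return "keep collecting mis-/missed-detection data"   # S506
        training_data = data + supplements                        # S507
        if metric < full_threshold:
            return f"full update on {len(training_data)} rows"    # S508-S509
        return f"incremental update on {len(supplements)} rows"   # S510-S511

    print(monitoring_task(0.95, data=[0] * 100, supplements=[1] * 5))
    print(monitoring_task(0.80, data=[0] * 100, supplements=[1] * 5))
    print(monitoring_task(0.60, data=[0] * 100, supplements=[1] * 5))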
Embodiment Six
This embodiment provides a fault handling apparatus; referring to FIG. 6, the apparatus includes a data analysis module 61, a fault extraction module 62 and a model training module 63. The data analysis module 61 is configured to analyze the historical data of each optical module according to the optical module's model and determine the historical data that mutates when an optical module of the corresponding model fails. The fault extraction module 62 is configured to extract the corresponding original fault data from the mutated historical data. The model training module 63 is configured to perform fault model training by optical module model on the basis of the original fault data.
To make the optical module fault handling apparatus of this embodiment broadly applicable to optical modules produced by various manufacturers, this embodiment analyzes the historical data of optical modules according to the module model; that is, for modules of different models, the analysis results of the historical data differ. The causes and effects of module failures may be similar, but the data changes at failure time may well differ between module models, which leads to differences, in the analysis results, in which historical data mutates. It is precisely this difference that this embodiment takes into account by performing the analysis per module model, obtaining, for each model, the historical data that mutates when a fault occurs.
Here, mutated historical data means historical data whose indicator changes markedly within a short time when the optical module fails; the change may be a marked rise or a marked fall. For the optical module fault detection problem, this can be defined as: "in the historical data of an optical module, at some time point, a special event, namely a fault, causes certain data to change markedly before and after that time point"; the "certain data" here is the mutated historical data. What counts as a marked change can be determined from the actual data itself; for example, for a normal indicator, its rise or fall within a short time exceeds 50%. This embodiment merely gives one possible mutation magnitude; for different data the magnitude cannot be generalized and can be determined from the actual operation of the system together with the type of historical data, which is not limited in this embodiment.
When analyzing the mutations of the historical data, the indicators that mutate most often are counted; other historical data that does not mutate, or whose mutation is not obvious, can simply be disregarded. Of course, to improve the comprehensiveness of detection and avoid the adverse consequences of missed or erroneous judgments, such data can be retained, and the fault model can be updated in a subsequent update phase.
The types of optical module historical data in this embodiment may be performance data, inspection data, and so on; before the historical data of each optical module is analyzed according to the module model, the optical modules can first be classified with the module model as the standard. The classification criterion for the various module models can be classification directly by manufacturer, by the module's working principle, by the module's specification and applicable scope, and so on.
Among the mutated historical data, not all historical data is related to the fault. For example, for the failed optical modules of one model, some mutated historical data is always sporadic and produced by only a few of the failed modules; such data generally should not be treated as fault data. In other words, fault data should be the historical data that, for the corresponding module model, mutates in a relatively high proportion of cases when a fault occurs, and such historical data can serve as the original fault data. In this embodiment, extracting the corresponding original fault data from the mutated historical data may include: analyzing the mutation time points on the time series according to the mutated historical data; determining, according to the mutation time points, the historical data whose mutation proportion exceeds a preset ratio; and taking the historical data whose mutation proportion exceeds the preset ratio as the original fault data. This scheme for determining the original fault data can largely exclude sporadically mutated historical data while retaining the historical data that is, with high probability, fault-related. The mutation time points are analyzed, where a mutation time point can be regarded as a time point at which the optical module fails; at these time points, mutated historical data is relatively concentrated. However, even the historical data that mutates at a mutation time point is not all caused by faults, so it can be screened again, and the historical data whose mutation proportion exceeds the preset ratio is taken as the original fault data. The mutation proportion means that, at these mutation time points, the proportion of historical data exhibiting a mutation exceeds a threshold; for example, for a given mutation time point, if, across multiple failed optical modules of the same model, the proportion of failed modules on which a given item of historical data mutates exceeds a certain ratio (for example 80%), then that historical data can be regarded as original fault data. Alternatively, a single optical module may have multiple mutation time points, and if the proportion of those time points at which a given item of historical data mutates exceeds a certain ratio (for example 80%), that historical data can likewise be regarded as original fault data. For this embodiment, since the processing classifies optical modules by model, the original fault data determined for different module models will also differ; on the one hand this confirms the differences between manufacturers' devices, and on the other hand it demonstrates the excellent universality of the fault handling apparatus in the embodiments of the present disclosure.
After the original fault data has been determined, fault model training can be performed by optical module model on the basis of the original fault data. Likewise, optical modules of different models have different fault models.
In some embodiments, performing fault model training by optical module model on the basis of the original fault data may include: screening the original fault data to obtain target fault data; sampling the target fault data according to the normal-to-fault ratio of the optical modules to obtain sampled data; and performing model training according to the sampled data. Although the original fault data can be used directly to train the optical module fault model, it may still contain anomalies, such as gaps introduced at the historical data analysis stage, abnormal data, or data unrelated to faults that remains when the original fault data is obtained from the mutated historical data. Therefore, the original fault data can be screened again at this point to obtain the target fault data. The target fault data correlates more strongly with optical module faults, reflects the fault condition of the module to a greater extent, and allows module faults to be detected and analyzed more accurately.
Since in a real system the probability of an optical module failing is low and most modules have not failed, the target fault data can be sampled according to the normal-to-fault ratio of the optical modules to obtain the sampled data; model training is then performed on the sampled data.
In some embodiments, screening the original fault data includes: deleting abnormal data from the original fault data; and/or extracting, from the original fault data, data that characterizes the timing characteristics of the optical module. Here, data characterizing the module's timing characteristics refers to data extracted for the module's behaviour over time that can represent temporally significant features.
In some embodiments, performing fault model training by optical module model may include: performing fault model training separately for optical modules of different models according to the sampled data corresponding to each model. Optical modules of different models have different fault data for their faults, and correspondingly different fault models.
In some embodiments, the fault model may include at least one of a machine learning model, a statistical model, and a deep learning model.
In some embodiments, after the fault model training by optical module model, the method may further include: outputting, according to the result of the fault model training, the fault risk prediction result of the corresponding optical module within a preset time. The result of fault model training can be used directly for prediction, that is, to predict whether certain optical modules have failed, when they will fail, what kind of fault will occur, and so on, providing convenience for the operation and maintenance of the system.
In some embodiments, the method may further include: evaluating the fault risk prediction result through a preset evaluation index. Prediction based on the result of fault model training can always deviate, so the prediction results can be monitored by setting an evaluation index; the actual fault conditions can be cross-compared with the prediction results to obtain the evaluation index corresponding to the fault model. Once the evaluation index triggers the condition, the fault model needs to be updated.
In some embodiments, the method may further include: collecting supplementary historical data corresponding to falsely detected or missed optical modules; and, when the preset evaluation index reaches an update threshold, updating the fault model according to the supplementary historical data. When the historical data of optical modules is analyzed, a normal optical module may be mistaken for a faulty one, that is, false detection, or the historical data of a faulty module may be missed, that is, missed detection; missed and false detections can also occur when the fault condition of optical modules is predicted. When they occur, the missed and falsely detected data, that is, the supplementary historical data, can first be collected, and when the evaluation index reaches the update threshold, the fault model can be updated according to the supplementary historical data.
With the data analysis module, fault extraction module and model training module included in the fault handling apparatus provided by this embodiment, partitioning according to the optical module's model improves the flexibility of collecting historical data of faulty optical modules, further broadens the applicable scope of fault detection, and effectively reduces the operation and maintenance cost of optical modules.
Referring to FIG. 7, FIG. 7 shows a detailed schematic composition diagram of the fault handling apparatus of this embodiment within the fault handling system.
The data analysis module 61 is mainly responsible for classifying the data and for extracting the fault data indicators of the different module models through the mutation-point analysis algorithm. The data analysis module 61 includes a first device classification sub-module 611 and an indicator extraction sub-module 612, where the first device classification sub-module 611 is mainly responsible for classifying the data according to information such as the optical module's model, and the indicator extraction sub-module 612 analyzes the "mutation points" on the time series in the fault data of the different module models and obtains the historical data with a relatively high mutation proportion among the faulty modules, which is passed downstream as the original fault data.
The fault extraction module 62 is mainly responsible for streamlining the original fault data provided by the upstream subsystem, cleaning abnormal data, extracting new features, and sampling the data. The fault extraction module 62 includes an anomaly cleaning sub-module 621, a timing extraction sub-module 622 and a sampling sub-module 623, where the anomaly cleaning sub-module 621 takes the original fault data extracted upstream, removes unneeded data, and eliminates erroneously collected data; the timing extraction sub-module 622 extracts, for the module's timing characteristics, the data that can represent temporally significant features; and the sampling sub-module 623 formulates a sampling algorithm according to the proportion of normal and faulty optical modules in the current data and generates sampled data to send downstream.
The model training module 63 is mainly responsible for training the fault detection model and completing subsequent prediction tasks. The model training module 63 includes a second device classification sub-module 631 and a result processing sub-module 632, where the second device classification sub-module 631 classifies the sampled data input from upstream according to the optical module's model; different models can be trained according to the different module models and input data, using a machine learning model that can also be replaced by a statistical model or a deep learning model according to requirements and computing power, and in the prediction stage the model output can be used directly; the result processing sub-module 632 is configured to aggregate the results, output them as the system result, and transmit them downstream at the same time.
The model update module 64 is mainly responsible for monitoring the results output upstream and triggering a system model update when the evaluation index drops. The model update module 64 includes a result evaluation sub-module 641, a collection sub-module 644, an incremental update trigger 642 and a full update trigger 643, where the result evaluation sub-module 641 is configured to evaluate the model detection capability of the current output result through the evaluation index; the collection sub-module 644 is configured to collect the data of falsely detected (normal modules detected as faulty) and missed (faulty modules detected as normal) optical modules for use in subsequent model updates; the incremental update trigger 642, controlled by the result evaluation sub-module 641, triggers the system to perform an incremental update; and the full update trigger 643, likewise controlled by the result evaluation sub-module 641, triggers the system to perform a full update.
The model update module 64 can also output the results directly to the user.
Embodiment Seven
This embodiment further provides a network device; referring to FIG. 8, it includes a processor 81, a memory 82 and a communication bus 83; the communication bus 83 connects the processor 81 and the memory 82 for communication; and the processor 81 is configured to execute one or more computer programs stored in the memory 82 to implement the steps of the fault handling methods of the foregoing embodiments, which are not repeated here.
This embodiment further provides a computer-readable storage medium, which includes volatile or non-volatile, removable or non-removable media implemented in any method or technology configured to store information (such as computer-readable instructions, data structures, computer program modules or other data). Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be configured to store the desired information and that can be accessed by a computer.
The computer-readable storage medium in this embodiment can be configured to store one or more computer programs, and the stored one or more computer programs can be executed by one or more processors to implement the steps of the fault handling methods of the foregoing embodiments.
This embodiment further provides a computer program (also called computer software), which can be distributed on a computer-readable medium and executed by a computing device to implement the steps of the fault handling methods of the foregoing embodiments; and in some cases, at least one of the steps shown or described can be performed in an order different from that described in the foregoing embodiments.
This embodiment further provides a computer program product, including a computer-readable device on which the computer program as shown above is stored. The computer-readable device in this embodiment may include the computer-readable storage medium as shown above.
According to the fault handling method and apparatus, network device and storage medium provided by the present disclosure, the historical data of each optical module is analyzed according to the optical module's model, and the historical data that mutates when a module of the corresponding model fails is determined; the corresponding original fault data is extracted from the mutated historical data; and fault model training is performed by optical module model on the basis of the original fault data. By partitioning according to the module model, the flexibility of collecting historical data of faulty optical modules is improved and the applicable scope of fault detection is further broadened; in actual use, automatic updates are possible, so that detection accuracy keeps improving as fault data accumulates, effectively reducing the operation and maintenance cost of optical modules.
It can thus be seen, as those skilled in the art will appreciate, that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and apparatuses, can be implemented as software (realizable as computer program code executable by a computing device), firmware, hardware, and appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit.
Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium. Therefore, the present disclosure is not limited to any particular combination of hardware and software.
The above is a further detailed description of the embodiments of the present disclosure in conjunction with the specific implementations, and the specific implementation of the present disclosure cannot be considered limited to these descriptions. For those of ordinary skill in the art to which the present disclosure belongs, several simple deductions or substitutions can also be made without departing from the concept of the present disclosure, and all of these shall be regarded as falling within the protection scope of the present disclosure.

Claims (13)

  1. A fault handling method, comprising:
    analyzing historical data of each optical module according to a model of the optical module, and determining historical data that mutates when the optical module of the corresponding model fails;
    extracting corresponding original fault data from the mutated historical data; and
    performing fault model training by optical module model according to the original fault data.
  2. The fault handling method according to claim 1, wherein before the analyzing the historical data of each optical module according to the model of the optical module, the method further comprises:
    classifying the optical modules with the model of the optical module as the standard.
  3. The fault handling method according to claim 1, wherein the extracting the corresponding original fault data from the mutated historical data comprises:
    analyzing mutation time points on a time series according to the mutated historical data;
    determining, according to the mutation time points, the historical data whose mutation proportion exceeds a preset ratio; and
    taking the historical data whose mutation proportion exceeds the preset ratio as the original fault data.
  4. The fault handling method according to any one of claims 1 to 3, wherein the performing fault model training by optical module model according to the original fault data comprises:
    screening the original fault data to obtain target fault data;
    sampling the target fault data according to a normal-to-fault ratio of the optical modules to obtain sampled data; and
    performing model training according to the sampled data.
  5. The fault handling method according to claim 4, wherein the screening the original fault data comprises:
    deleting abnormal data from the original fault data; and/or,
    extracting, from the original fault data, data characterizing timing characteristics of the optical module.
  6. The fault handling method according to claim 4, wherein the performing fault model training by optical module model comprises:
    performing fault model training separately for optical modules of different models according to the sampled data corresponding to the optical modules of the different models.
  7. The fault handling method according to claim 6, wherein the fault model comprises at least one of a machine learning model, a statistical model and a deep learning model.
  8. The fault handling method according to any one of claims 1 to 3, wherein after the performing fault model training by optical module model, the method further comprises:
    outputting, according to a result of the fault model training, a fault risk prediction result of the corresponding optical module within a preset time.
  9. The fault handling method according to claim 8, further comprising:
    evaluating the fault risk prediction result through a preset evaluation index.
  10. The fault handling method according to claim 9, further comprising:
    collecting supplementary historical data corresponding to falsely detected or missed optical modules; and
    updating the fault model according to the supplementary historical data when the preset evaluation index reaches an update threshold.
  11. A fault handling apparatus, comprising:
    a data analysis module, configured to analyze historical data of each optical module according to a model of the optical module, and determine historical data that mutates when the optical module of the corresponding model fails;
    a fault extraction module, configured to extract corresponding original fault data from the mutated historical data; and
    a model training module, configured to perform fault model training by optical module model according to the original fault data.
  12. A network device, comprising a processor, a memory and a communication bus, wherein
    the communication bus connects the processor and the memory for communication; and
    the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the fault handling method according to any one of claims 1 to 10.
  13. A computer-readable storage medium, wherein the computer-readable storage medium stores one or more computer programs, and the one or more computer programs are executable by one or more processors to implement the steps of the fault handling method according to any one of claims 1 to 10.
PCT/CN2021/112806 2020-08-17 2021-08-16 Fault handling method and apparatus, network device and storage medium WO2022037536A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21857630.4A EP4198803A4 (en) 2020-08-17 2021-08-16 FAULT PROCESSING METHOD AND APPARATUS, NETWORK DEVICE AND STORAGE MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010826217.3A 2020-08-17 Fault handling method and apparatus, network device and storage medium
CN202010826217.3 2020-08-17

Publications (1)

Publication Number Publication Date
WO2022037536A1 true WO2022037536A1 (zh) 2022-02-24

Family

ID=80323417

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/112806 WO2022037536A1 (zh) 2020-08-17 2021-08-16 Fault handling method and apparatus, network device and storage medium

Country Status (3)

Country Link
EP (1) EP4198803A4 (zh)
CN (1) CN114154385A (zh)
WO (1) WO2022037536A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104698343A (zh) * 2015-03-26 2015-06-10 Power Dispatching Control Center of Guangdong Power Grid Co., Ltd. Power grid fault determination method and system based on historical fault recording data
CN109204389A (zh) * 2018-09-12 2019-01-15 Jinan Rail Transit Group Co., Ltd. Metro equipment fault diagnosis and self-healing method and system
CN111507363A (zh) * 2019-01-30 2020-08-07 Huawei Technologies Co., Ltd. Method, apparatus and device for predicting optical module faults

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107528722B (zh) * 2017-07-06 2020-10-23 Advanced New Technologies Co., Ltd. Method and apparatus for detecting anomalous points in a time series
IT201800003363A1 (it) * 2018-03-08 2019-09-08 Milano Politecnico Method for monitoring an optical communications system
EP3829080A4 (en) * 2018-11-30 2021-08-04 Huawei Technologies Co., Ltd. PON FAULT LOCATION METHOD AND DEVICE

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104698343A (zh) * 2015-03-26 2015-06-10 Power Dispatching Control Center of Guangdong Power Grid Co., Ltd. Power grid fault determination method and system based on historical fault recording data
CN109204389A (zh) * 2018-09-12 2019-01-15 Jinan Rail Transit Group Co., Ltd. Metro equipment fault diagnosis and self-healing method and system
CN111507363A (zh) * 2019-01-30 2020-08-07 Huawei Technologies Co., Ltd. Method, apparatus and device for predicting optical module faults

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4198803A4 *

Also Published As

Publication number Publication date
EP4198803A4 (en) 2024-01-10
CN114154385A (zh) 2022-03-08
EP4198803A1 (en) 2023-06-21

Similar Documents

Publication Publication Date Title
CN109981328B (zh) Fault early-warning method and apparatus
US8352789B2 (en) Operation management apparatus and method thereof
US8560894B2 (en) Apparatus and method for status decision
CN112712113B (zh) Indicator-based alarm method and apparatus, and computer system
US20110276836A1 (en) Performance analysis of applications
US9524223B2 (en) Performance metrics of a computer system
US20110161048A1 (en) Method to Optimize Prediction of Threshold Violations Using Baselines
CN111010291A (zh) Business process anomaly alarm method and apparatus, electronic device and storage medium
WO2012160637A1 (ja) Message determination device and message determination program
JP2015028700A (ja) Fault detection device, fault detection method, fault detection program, and recording medium
CN111611146B (zh) Microservice fault prediction method and apparatus
CN103392176A (zh) Network event management
CN115454778A (zh) Intelligent monitoring system for time-series indicator anomalies in a large-scale cloud network environment
CN114095965A (zh) Indicator detection model acquisition and fault localization method, apparatus, device and storage medium
CN113590429A (zh) Server fault diagnosis method and apparatus, and electronic device
CN110489260B (zh) Fault identification method and apparatus, and BMC
CN115794588A (zh) Memory fault prediction method, apparatus and system, and monitoring server
WO2022037536A1 (zh) Fault handling method and apparatus, network device and storage medium
CN115495274B (zh) Anomaly handling method based on time-series data, network device and readable storage medium
CN111586129A (zh) Alarm method and apparatus for data synchronization, electronic device and storage medium
CN107370618B (zh) Fault troubleshooting method and apparatus, and electronic device
CN112152833A (zh) Network anomaly alarm method and apparatus, and electronic device
CN115543671A (zh) Data analysis method and apparatus, device, storage medium and program product
CN113254313A (zh) Monitoring indicator anomaly detection method and apparatus, electronic device and storage medium
CN112134760A (zh) Link state monitoring method, apparatus, device and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21857630

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021857630

Country of ref document: EP

Effective date: 20230317