WO2021147370A1 - Method, apparatus, and system for training a fault detection model - Google Patents

Method, apparatus, and system for training a fault detection model

Info

Publication number
WO2021147370A1
WO2021147370A1 PCT/CN2020/119031 CN2020119031W WO2021147370A1 WO 2021147370 A1 WO2021147370 A1 WO 2021147370A1 CN 2020119031 W CN2020119031 W CN 2020119031W WO 2021147370 A1 WO2021147370 A1 WO 2021147370A1
Authority
WO
WIPO (PCT)
Prior art keywords
service
message
service flow
kpi
network object
Prior art date
Application number
PCT/CN2020/119031
Other languages
English (en)
French (fr)
Inventor
薛莉
张亮
程剑
叶浩楠
司晓云
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20915373.3A (EP4084410A4)
Publication of WO2021147370A1
Priority to US17/871,498 (US20220368606A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0894 Policy-based network configuration management
    • H04L41/0895 Configuration of virtualised networks or elements, e.g. virtualised network function or OpenFlow elements
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5003 Managing SLA; Interaction between SLA and QoS
    • H04L41/5009 Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/026 Capturing of monitoring data using flow identification
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/12 Network monitoring probes

Definitions

  • This application relates to the field of communications, and in particular to a method, device and system for training a fault detection model.
  • A data communication network or a data center network includes a large number of network objects such as terminals or servers.
  • The network objects are connected to access devices, and the access devices are connected to the wide area network through forwarding devices, so that a network object can transmit service flows through the access device, the forwarding device, and the wide area network.
  • An analysis platform can be deployed. A fault detection model is first trained on the analysis platform, and the analysis platform then detects the health of any network object through the fault detection model.
  • When training the fault detection model, for the service flow of any network object, an access device or forwarding device in the data communication network or data center network that receives the service flow mirrors it and sends the mirrored service flow to the analysis platform.
  • In this way, the analysis platform receives the service flow of each network object and trains a fault detection model according to those service flows.
  • Since the fault detection model is trained on the service flows of the network objects, the access device or forwarding device must mirror each service flow and send the mirrored copy to the analysis platform, which consumes a large amount of network resources.
  • This application provides a method, device and system for training a fault detection model to reduce the consumption of network resources.
  • The technical solution is as follows:
  • In a first aspect, this application provides a method for training a fault detection model.
  • In the method, the forwarding device receives at least one service flow.
  • The forwarding device obtains the service information of the at least one service flow. The service information of a service flow includes the identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, where M is an integer greater than 0 and the network object includes one or more devices.
  • The forwarding device sends training information to the first device. The training information includes the service information of the at least one service flow, or a feature set obtained based on that service information, and is used for training a fault detection model.
  • The fault detection model is used to detect whether the network object is in a fault state.
  • The training information obtained by the forwarding device includes the identification information of the network object and the M KPIs, or the feature set obtained based on the M KPIs of the network object, so the data volume of the training information is much smaller than that of the service flow.
  • Sending the training information to the first device therefore requires far fewer network resources than sending the service flow, which reduces the consumption of network resources.
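  • The size argument above can be illustrated with a minimal sketch. The `ServiceInfo` class, its field names, the KPI names, and the JSON encoding are all assumptions made for illustration; the text does not specify how the training information is represented or serialized.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ServiceInfo:
    """Service information of one service flow (all field names are hypothetical)."""
    network_object_id: str                    # identification information, e.g. an address
    kpis: dict = field(default_factory=dict)  # the M KPIs, keyed by an assumed KPI name


def encode_training_info(info: ServiceInfo) -> bytes:
    """Serialize the service information; JSON is an arbitrary choice for this sketch."""
    return json.dumps({"id": info.network_object_id, "kpis": info.kpis}).encode("utf-8")


info = ServiceInfo(
    network_object_id="192.0.2.10",
    kpis={"delay_s": 0.004, "bytes_sent": 48_000, "bytes_received": 1_200_000, "status": "success"},
)
payload = encode_training_info(info)
# Mirroring would forward every byte of the flow; the summary is far smaller.
flow_bytes = info.kpis["bytes_sent"] + info.kpis["bytes_received"]
print(len(payload), "bytes of training information versus", flow_bytes, "bytes of mirrored traffic")
```

  • The comparison only holds for flows that carry more data than the summary itself, which is the case the text has in mind.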
  • The forwarding device obtains at least one target service message from the service flow according to configuration policy information, where the configuration policy information includes at least one preset message type.
  • The forwarding device obtains the M KPIs of the service flow according to the at least one target service message. Because the M KPIs are obtained from the target service messages selected from the service flow rather than from every message, fewer messages need to be analyzed and the KPIs are obtained more efficiently.
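  • A minimal sketch of selecting target service messages by preset message type follows. The `Packet` fields, the policy dictionary layout, and the TCP-style type names ("SYN", "SYN-ACK", "FIN") are assumptions for illustration; this passage does not name concrete message types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """One service message of a flow (hypothetical fields)."""
    flow_id: str
    msg_type: str     # assumed type names, e.g. "SYN", "DATA", "FIN"
    seq: int
    timestamp: float  # receive time at the forwarding device, in seconds


# Hypothetical configuration policy information listing the preset message types.
CONFIG_POLICY = {"preset_message_types": {"SYN", "SYN-ACK", "FIN"}}


def select_target_messages(packets, policy=CONFIG_POLICY):
    """Keep only the messages whose type is listed in the configuration policy."""
    wanted = policy["preset_message_types"]
    return [p for p in packets if p.msg_type in wanted]


packets = [
    Packet("f1", "SYN", 1000, 0.000),
    Packet("f1", "DATA", 1001, 0.010),
    Packet("f1", "DATA", 2461, 0.020),
    Packet("f1", "FIN", 3921, 0.030),
]
targets = select_target_messages(packets)
print([p.msg_type for p in targets])
```

  • Only the selected messages are passed to the KPI calculation, so the per-flow work does not grow with the number of data messages.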
  • The M KPIs include at least one of: the network delay between the forwarding device and the network object, the amount of data belonging to the service flow that is sent by the network object, and the amount of data belonging to the service flow that is received by the network object.
  • The at least one target service message includes a first target service message and a second target service message. The forwarding device obtains the network delay between the forwarding device and the network object based on the first time, at which the first target service message is received, and the second time, at which the second target service message is received.
  • The first target service message is a message sent to the network object, and the second target service message is the message, sent by the network object, that corresponds to the first target service message.
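  • As a sketch of the delay calculation, the example below assumes the delay is taken as the difference between the two receive times, which corresponds to a round trip between the forwarding device and the network object. The text only says the delay is obtained "based on" the two times, so the exact formula and the function name are assumptions.

```python
def network_delay(first_time: float, second_time: float) -> float:
    """Delay between forwarding device and network object, in the unit of the inputs.

    first_time:  when the forwarding device received the message sent to the network object
    second_time: when it received the corresponding message sent back by the network object
    Assumption: the delay is the plain difference (a round-trip value).
    """
    if second_time < first_time:
        raise ValueError("the corresponding message cannot arrive before the first message")
    return second_time - first_time


delay = network_delay(first_time=10.000, second_time=10.004)
print(f"{delay * 1000:.1f} ms")
```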
  • The at least one target service message includes a first start message and a first end message.
  • The forwarding device obtains the amount of data belonging to the service flow that is sent by the network object based on the sequence number of the first start message and the sequence number of the first end message. The first start message is the first message of the service flow sent by the network object, and the first end message is the last message of the service flow sent by the network object.
  • The at least one target service message includes a second start message and a second end message.
  • The forwarding device obtains the amount of data belonging to the service flow that is received by the network object based on the sequence number of the second start message and the sequence number of the second end message.
  • The second start message is the first message of the service flow received by the network object.
  • The second end message is the last message of the service flow received by the network object.
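  • The data-volume calculation can be sketched as follows, assuming TCP-style sequence numbers that count bytes and wrap around at 2^32. That assumption, the function name, and the decision to ignore any sequence number consumed by the start or end message itself are illustrative choices not stated in the text.

```python
SEQ_MODULUS = 2 ** 32  # assumed 32-bit, byte-counting sequence numbers (as in TCP)


def data_volume(start_seq: int, end_seq: int, modulus: int = SEQ_MODULUS) -> int:
    """Approximate bytes carried between the start message and the end message.

    The modulo handles a sequence number that wraps around during the flow.
    """
    return (end_seq - start_seq) % modulus


sent = data_volume(start_seq=1_000, end_seq=49_000)         # direction: from the network object
received = data_volume(start_seq=2 ** 32 - 100, end_seq=400)  # wrapped during the flow
print(sent, received)
```

  • The same function serves both directions: the first start and end messages give the amount sent by the network object, and the second start and end messages give the amount it received.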
  • The M KPIs include a status identifier, which identifies the status of the service flow.
  • The at least one target service message includes a first start message. If the forwarding device receives the first end message within a first duration after a third time, it sets the status identifier to the success state; if the first end message is not received within that duration, it sets the status identifier to the failure state. The third time is the time at which the first start message is received, the first start message is the first message of the service flow sent by the network object, and the first end message is the last message of the service flow sent by the network object.
  • In this way, the status identifier of the service flow can be obtained accurately.
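  • A minimal sketch of the status rule follows. The function name, the string values "success" and "failure", and the use of `None` for an end message that never arrived are assumptions for illustration.

```python
from typing import Optional


def flow_status(start_time: float, end_time: Optional[float], first_duration: float) -> str:
    """Return the status identifier of a service flow.

    start_time:     the third time, when the first start message was received
    end_time:       when the first end message was received, or None if it never arrived
    first_duration: the first duration, i.e. how long to wait for the end message
    """
    if end_time is not None and start_time <= end_time <= start_time + first_duration:
        return "success"
    return "failure"


print(flow_status(0.0, 2.0, 5.0))   # end message arrived in time
print(flow_status(0.0, None, 5.0))  # end message never arrived
print(flow_status(0.0, 9.0, 5.0))   # end message arrived too late
```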
  • The forwarding device obtains, from the at least one service flow, the KPIs of the N service flows that belong to the target network object in the first period.
  • The target network object is the network object to which any one of the at least one service flow belongs, and N is an integer greater than 0.
  • The forwarding device obtains a feature set based on the KPIs of the N service flows. Since the feature set includes features obtained from the KPIs of each service flow belonging to the target network object, the feature set better reflects the health of the target network object, and the fault detection model trained on the feature set is more accurate.
  • The feature set includes at least one statistical feature.
  • The forwarding device obtains M KPI sets. Each KPI set includes one KPI from each of the N service flows, and all KPIs in a KPI set are of the same type.
  • The forwarding device processes the KPIs in any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to that KPI set. The at least one first calculation method includes one or more of the following: collecting statistics on the KPIs in the KPI set, and calculating the mean, variance, dispersion, skewness, or kurtosis of the KPIs in the KPI set. Because the feature set is built from several different statistical features, it is richer and better reflects the health status of network objects.
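  • The grouping into M KPI sets and the statistical features can be sketched as below. The KPI names are hypothetical, population moments (dividing by N) are an assumed convention, and "dispersion" is interpreted here as the coefficient of variation (standard deviation divided by mean), which the text does not define.

```python
import math


def build_kpi_sets(flows):
    """Group KPIs by type: one list per KPI name, with one entry per service flow."""
    names = flows[0].keys()
    return {name: [flow[name] for flow in flows] for name in names}


def statistical_features(values):
    """Population statistics of one KPI set (the KPIs of one type over N flows)."""
    n = len(values)
    if n == 0:
        raise ValueError("a KPI set needs at least one KPI")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    std = math.sqrt(m2)
    return {
        "count": n,
        "mean": mean,
        "variance": m2,
        "dispersion": std / mean if mean else 0.0,  # assumed: coefficient of variation
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 else 0.0,
    }


# N = 5 service flows of one target network object in one period, M = 2 KPIs each.
flows = [{"delay_ms": float(d), "bytes_sent": 1000 * d} for d in (1, 2, 3, 4, 5)]
kpi_sets = build_kpi_sets(flows)
delay_features = statistical_features(kpi_sets["delay_ms"])
print(delay_features)
```

  • Running `statistical_features` once per KPI set yields the statistical part of the feature set for the period.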
  • The feature set further includes at least one time-domain feature.
  • The forwarding device processes the statistical features in a statistical feature set using at least one second calculation method to obtain at least one time-domain feature.
  • The statistical feature set includes K statistical features of the same type calculated in K periods, where the K periods include the first period and the K-1 periods before the first period.
  • The at least one second calculation method includes one or more of the following: calculating the period-over-period ratio or the difference between two adjacent statistical features in the statistical feature set, and performing feature fitting on the statistical features in the statistical feature set. Since the time-domain features are obtained from the statistical features of K periods, the feature set also captures how the features change over time.
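  • A sketch of the time-domain features follows. It assumes the period-over-period ratio is the current value divided by the previous one, and it uses a least-squares line over the period index as one possible form of feature fitting; the text does not fix either choice, and the function names are hypothetical.

```python
def adjacent_differences(series):
    """Difference between each pair of adjacent periods (oldest value first)."""
    return [cur - prev for prev, cur in zip(series, series[1:])]


def adjacent_ratios(series):
    """Assumed period-over-period ratio: current / previous (None if previous is 0)."""
    return [cur / prev if prev else None for prev, cur in zip(series, series[1:])]


def fitted_slope(series):
    """Slope of the least-squares line through (period index, value)."""
    k = len(series)
    if k < 2:
        raise ValueError("fitting needs statistical features from at least two periods")
    x_mean = (k - 1) / 2
    y_mean = sum(series) / k
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    denominator = sum((i - x_mean) ** 2 for i in range(k))
    return numerator / denominator


# Mean delay of one network object over K = 3 periods, oldest first.
mean_delay = [2.0, 4.0, 8.0]
diffs = adjacent_differences(mean_delay)
ratios = adjacent_ratios(mean_delay)
slope = fitted_slope(mean_delay)
print(diffs, ratios, slope)
```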
  • Any one of the KPI sets includes the status identifiers of the N service flows, and the status identifier of any one of the N service flows identifies the status of that service flow.
  • The statistical features of that KPI set include the number of status identifiers that identify the success state and the number of status identifiers that identify the failure state.
  • The feature set also includes the proportion of service flows in the success state and/or the proportion of service flows in the failure state.
  • The proportion of service flows in the success state is calculated from the number of status identifiers that identify the success state and the number of KPIs in the KPI set.
  • And/or, the proportion of service flows in the failure state is calculated from the number of status identifiers that identify the failure state and the number of KPIs in the KPI set. Since the feature set also includes features based on the status identifiers, the features in the feature set are further enriched.
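  • The counts and proportions described above can be sketched as follows. The status strings and the feature names are assumptions for illustration.

```python
from collections import Counter


def status_features(statuses):
    """Counts and proportions for a KPI set made of N status identifiers."""
    total = len(statuses)
    if total == 0:
        raise ValueError("the KPI set needs at least one status identifier")
    counts = Counter(statuses)
    return {
        "success_count": counts["success"],
        "failure_count": counts["failure"],
        "success_ratio": counts["success"] / total,
        "failure_ratio": counts["failure"] / total,
    }


features = status_features(["success", "success", "failure", "success"])
print(features)
```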
  • The first device is a cloud platform, an analyzer platform, or an upstream device of the forwarding device.
  • The network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a designated network segment.
  • The M KPIs are used to describe the characteristics of the service flow.
  • In a second aspect, this application provides a method for training a fault detection model.
  • In the method, the first device receives the service information of at least one service flow sent by the first forwarding device. The service information of a service flow includes the identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, where M is an integer greater than 0 and the network object includes one or more devices.
  • The first device trains a fault detection model according to the service information of the at least one service flow, or obtains, according to that service information, at least one feature set for training the fault detection model. The fault detection model is used to detect whether the network object is in a fault state. Since the service information sent by the forwarding device includes the identification information and KPIs of the network object, the data volume of the service information is much smaller than that of the service flow, which reduces the network resources consumed when the first device receives the service information.
  • The first device obtains at least one feature set. Any feature set includes at least one feature obtained based on the KPIs of each service flow belonging to the target network object, where the target network object is the network object to which any one of the at least one service flow belongs.
  • The first device trains a fault detection model according to the at least one feature set. Since the feature set includes features obtained from the KPIs of each service flow belonging to the target network object, the feature set better reflects the health of the target network object, and the fault detection model trained on the feature set is more accurate.
  • The first device obtains the KPIs of N service flows belonging to the target network object in a first period, where the first period is within the first time period and N is an integer greater than 0.
  • The first device obtains M KPI sets. Each KPI set includes one KPI from each of the N service flows, and all KPIs in a KPI set are of the same type.
  • The first device processes the KPIs in any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to that KPI set. The at least one first calculation method includes one or more of the following: collecting statistics on the KPIs in the KPI set, and calculating the mean, variance, dispersion, skewness, or kurtosis of the KPIs in the KPI set. Because the feature set is built from several different statistical features, it is richer and better reflects the health status of network objects.
  • Any one of the feature sets further includes at least one time-domain feature.
  • At least one second calculation method is used to process the statistical features in a statistical feature set to obtain at least one time-domain feature.
  • The statistical feature set includes K statistical features of the same type calculated in K periods, where the K periods include the first period and the K-1 periods before the first period. The at least one second calculation method includes one or more of the following: calculating the period-over-period ratio or the difference between two adjacent statistical features in the statistical feature set, and performing feature fitting on the statistical features in the statistical feature set. Since the time-domain features are obtained from the statistical features of K periods, the feature set also captures how the features change over time.
  • Any one of the KPI sets includes the status identifiers of the N service flows, and the status identifier of any one of the N service flows identifies the status of that service flow.
  • The statistical features of that KPI set include the number of status identifiers that identify the success state and the number of status identifiers that identify the failure state. The feature set also includes the proportion of service flows in the success state and/or the proportion of service flows in the failure state.
  • The proportion of service flows in the success state is calculated from the number of status identifiers that identify the success state and the number of KPIs in the KPI set.
  • And/or, the proportion of service flows in the failure state is calculated from the number of status identifiers that identify the failure state and the number of KPIs in the KPI set. Since the feature set also includes features based on the status identifiers, the features in the feature set are further enriched.
  • A training sample is generated that includes any one of the feature sets and a label. When the target network object is in a fault state, the label identifies the fault state; when the target network object is in a normal state, the label identifies the normal state. Because the training samples are labeled, the fault detection model can be trained in a supervised manner.
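  • Labeling can be sketched as below. The sample layout, the 1/0 label encoding, and the `faulty_objects` set (standing in for whatever source tells the first device which network objects are in a fault state) are assumptions for illustration.

```python
def make_training_sample(feature_set: dict, object_id: str, faulty_objects: set) -> dict:
    """Attach a label to a feature set: 1 identifies the fault state, 0 the normal state."""
    label = 1 if object_id in faulty_objects else 0
    return {"object_id": object_id, "features": dict(feature_set), "label": label}


# Hypothetical identifiers of network objects known to be in a fault state.
faulty_objects = {"192.0.2.10"}

samples = [
    make_training_sample({"mean_delay_ms": 40.0, "failure_ratio": 0.6}, "192.0.2.10", faulty_objects),
    make_training_sample({"mean_delay_ms": 3.0, "failure_ratio": 0.0}, "192.0.2.11", faulty_objects),
]
print([s["label"] for s in samples])
```

  • Samples in this shape can be fed to any supervised learner; the choice of learner is outside what this passage specifies.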
  • The first device sends the at least one feature set to a training device, and the at least one feature set is used by the training device to train a fault detection model.
  • The first device receives the fault detection model sent by the training device. In this way, a higher-performance training device can be used to train the fault detection model, which improves training efficiency.
  • The M KPIs of the service flow include at least one of: the network delay between the network object and the forwarding device, the amount of data belonging to the service flow that is sent by the network object, the amount of data belonging to the service flow that is received by the network object, or the status identifier of the service flow, where the status identifier identifies the status of the service flow.
  • The first device is a cloud platform, an analyzer platform, or an upstream device of the forwarding device.
  • The network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a designated network segment.
  • The M KPIs are used to describe the characteristics of the service flow.
  • In a third aspect, the present application provides a training device for a fault detection model, which is used to execute the method in the first aspect or in any possible implementation manner of the first aspect.
  • Specifically, the device includes units for executing the method in the first aspect or in any possible implementation manner of the first aspect.
  • In a fourth aspect, the present application provides a training device for a fault detection model, which is used to execute the method in the second aspect or in any possible implementation manner of the second aspect.
  • Specifically, the device includes units for executing the method in the second aspect or in any possible implementation manner of the second aspect.
  • In a fifth aspect, the present application provides a training device for a fault detection model.
  • The device includes a processor, a memory, and a transceiver.
  • The processor, the memory, and the transceiver may be connected through a bus system.
  • The memory is configured to store one or more programs.
  • The processor is configured to execute the one or more programs in the memory, so that the device completes the method in the first aspect or in any possible implementation manner of the first aspect.
  • In a sixth aspect, the present application provides a training device for a fault detection model.
  • The device includes a processor, a memory, and a transceiver.
  • The processor, the memory, and the transceiver may be connected through a bus system.
  • The memory is configured to store one or more programs, and the processor is configured to execute the one or more programs in the memory, so that the device completes the method in the second aspect or in any possible implementation manner of the second aspect.
  • In a seventh aspect, the present application provides a computer-readable storage medium storing program code which, when run on a computer, causes the computer to execute the method in the first aspect, the second aspect, any possible implementation manner of the first aspect, or any possible implementation manner of the second aspect.
  • In an eighth aspect, the present application provides a computer program product containing program code which, when run on a computer, causes the computer to execute the method in the first aspect, the second aspect, any possible implementation manner of the first aspect, or any possible implementation manner of the second aspect.
  • In a ninth aspect, the present application provides a training system for a fault detection model. The system includes the device described in the third aspect and the device described in the fourth aspect; or, the system includes the device described in the fifth aspect and the device described in the sixth aspect.
  • FIG. 1 is a schematic diagram of a network architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a data communication network provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a data center network provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a method for training a fault detection model provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of transmitting a service flow provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a fault detection method provided by an embodiment of the present application.
  • FIG. 7 is a flowchart of another method for training a fault detection model provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of another fault detection method provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a training device for a fault detection model provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another training device for a fault detection model provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of another training device for a fault detection model provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of another training device for a fault detection model provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a training system for a fault detection model provided by an embodiment of the present application.
  • An embodiment of the present application provides a network architecture. The network architecture includes:
  • a network object, a forwarding device, and a first device. A network connection is established between the forwarding device and the network object, and a network connection is also established between the forwarding device and the first device.
  • The service path used to transmit the service flow passes through the forwarding device.
  • The network connection between the network object and the forwarding device is part of the service path.
  • Any service message may be sent by the network object: after the network object sends the service message, the forwarding device receives it and forwards it to its upstream device. Alternatively, the service message may be destined for the network object: the forwarding device first receives the service message and then forwards it to the network object.
  • The network architecture may further include a training device, and a network connection may be established between the first device and the training device.
  • The network object may be a terminal, a server, a router, a switch, a client, a virtual machine, a device in a virtual local area network (VLAN), a device in a designated network segment, or the like.
  • A network segment is an address range that includes the addresses of multiple devices.
  • When the network object is a single device such as a terminal or a server, the identification information of the network object is the address of the network object.
  • When the network object is the devices in a VLAN or in a designated network segment, the identification information of the network object is the identification information of the VLAN or of the network segment.
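  • The choice of identification information can be sketched as below for the network-segment case. It assumes the designated network segments are written in CIDR notation and that the segment string itself serves as the identifier; both are illustrative assumptions, as is the function name. The sketch uses Python's standard `ipaddress` module.

```python
import ipaddress


def network_object_id(address: str, designated_segments) -> str:
    """Map a device address to the identification information of its network object.

    If the address falls inside a designated network segment, the segment is the
    network object and its CIDR string is used as the identifier (an assumption);
    otherwise the device itself is the network object and its address is used.
    """
    ip = ipaddress.ip_address(address)
    for segment in designated_segments:
        if ip in ipaddress.ip_network(segment):
            return segment
    return address


segments = ["10.1.0.0/16"]
print(network_object_id("10.1.2.3", segments))   # device inside the designated segment
print(network_object_id("192.0.2.5", segments))  # device treated as its own network object
```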
  • The first device is an upstream device of the forwarding device, or the first device is a cloud platform or an analyzer platform.
  • The service path used to transmit the service flow may pass through the first device; that is, the network connection between the forwarding device and the first device is also part of the service path.
  • In that case, the first device can be used to forward the messages included in the service flow of the network object.
  • The network architecture further includes a network management device. The network management device may establish a network connection with each network object in the network architecture and with the first device in the network architecture.
  • The foregoing network architecture can be applied to a data communication network, such as the data communication network shown in FIG. 2.
  • The data communication network includes at least one terminal, at least one optical network terminal (ONT), at least one optical line terminal (OLT), a broadband remote access server (BRAS), and a core router (CR).
  • Each terminal is connected to an ONT.
  • Each ONT is connected to an OLT.
  • Each of the at least one OLT is also connected to the BRAS, the BRAS is also connected to the CR, and the CR can be connected to the wide area network.
  • A cloud platform or an analyzer platform can also be set up in the data communication network. A network connection is established between the cloud platform or analyzer platform and each ONT in the data communication network, and/or each OLT in the data communication network, and/or the BRAS in the data communication network.
  • The forwarding device can be an ONT, an OLT, or the BRAS.
  • The network object can be a terminal.
  • The first device may be an upstream device of the forwarding device; for example, if the forwarding device is an ONT or an OLT, the first device may be the BRAS.
  • Alternatively, the first device may be a cloud platform or an analyzer platform that has a network connection with the forwarding device.
  • the foregoing network architecture may be applied to a data center network. See the data center network shown in FIG. 3.
  • The data center network includes at least one server, at least one leaf switch (Leaf), at least one backbone switch (Spine), and a gateway (GW).
  • For any server, the server is connected to a Leaf.
  • For any Leaf, the Leaf is connected to at least one Spine.
  • Each Spine in the at least one Spine is also connected to the GW, and the GW may also be connected to the wide area network.
  • A cloud platform may be set up in the data center network, and the cloud platform has a network connection with each Leaf in the data center network, and/or a network connection with each Spine.
  • an analyzer platform may also be set in the data center network, and the analyzer platform has a network connection with each leaf in the data center network, and/or a network connection with each spine.
  • the forwarding device can be Leaf, Spine, or GW.
  • the network object can be a server.
  • the first device may be an upstream device of the forwarding device, for example, the forwarding device is Leaf, and the first device may be Spine.
  • the first device may be a cloud platform or an analyzer platform that has a network connection with each forwarding device.
  • the network objects included in the network architecture provided by the embodiments of the present application may fail.
  • When a network object fails, service interruption may occur, causing serious losses. Therefore, failed network objects need to be detected in time.
  • The first device can train a fault detection model itself, or have the fault detection model trained by a training device. The fault detection model is used to detect whether a network object is in a fault state, so that the first device can promptly detect failed network objects through the fault detection model.
  • the network management device may configure some network objects in the network architecture to be in a certain fault state within the first time period.
  • When a forwarding device in the network architecture receives the service flow of a network object, it obtains the service information of the service flow and sends training information to the first device. The service information includes the identification information of the network object and M key performance indicators (KPIs), where M is an integer greater than 0.
  • the training information includes the service information of the service flow or the feature set obtained based on the service information.
  • the first device receives the training information sent by the forwarding device included in the network architecture, and trains the intelligent model according to the received training information to obtain a fault detection model.
  • the fault detection model can be used to detect whether a network object in the network architecture is in the fault state.
  • the detailed acquisition process for the forwarding device to acquire the detection information and the detailed training process for the first device to train the fault detection model will be described in detail in the subsequent embodiment shown in FIG. 4 or FIG. 7, and will not be introduced here.
  • the fault state may be a delay fault state or a link establishment fault state, etc.
  • After the first device trains the fault detection model, when the forwarding device receives the service flow of a network object, it obtains the service information of the service flow and sends detection information to the first device.
  • The detection information includes the service information of the service flow or the feature set obtained based on the service information.
  • the first device receives the detection information sent by the forwarding device included in the network architecture, and detects the network object in the fault state in the network architecture through the fault detection model according to the received detection information.
  • an embodiment of the present application provides a method for training a fault detection model, and the training method can be applied to the network architecture provided by any of the embodiments shown in FIGS. 1 to 3.
  • the forwarding device obtains the service information of the service flow, sends the service information of the service flow to the first device, and the first device receives the service information and trains the fault detection model.
  • the method includes:
  • Step 101 The forwarding device receives the service flow.
  • the forwarding device is a device through which the service path of the service flow passes, so the forwarding device will receive any service message belonging to the service flow.
  • the forwarding device can continue to perform the operation of step 102 as follows.
  • The network management device may also send fault configuration information to some network objects in the network architecture and to the first device. The fault configuration information includes the start time of the first time period and a fault state. In addition, the network management device sends configuration policy information to the forwarding device, where the configuration policy information includes at least one of a protocol type and at least one preset message type.
  • the fault configuration information further includes the end time of the first time period.
  • the configuration policy information also includes the fault status.
  • the first time period is the time for training the fault detection model, that is, the fault detection model is trained by the training method provided in the embodiment of the present application in the first time period.
  • the network management device also sends an object set to the first device, and the object set includes identification information of each network object that is in a fault state within the first time period.
  • The network object determines the first time period according to the start time and a duration threshold of the first time period. In the case that the fault configuration information further includes the end time of the first time period, the network object determines the first time period according to the start time and the end time of the first time period. The network object then works in the fault state during the first time period.
  • the first device determines the first time period according to the start time and the duration threshold of the first time period. In the case that the fault configuration information further includes the end time of the first time period, the first device determines the first time period according to the start time and the end time of the first time period. Then the first device starts to execute the process of training the fault detection model in the first time period.
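  • The way the first time period is derived can be sketched as follows. This is a minimal illustration, assuming timestamps are `datetime` values and the duration threshold is a `timedelta`; the function name `first_time_period` is hypothetical and not part of the embodiment.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def first_time_period(
    start: datetime,
    duration_threshold: timedelta,
    end: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the first time period.

    If the fault configuration information carries an end time, use it;
    otherwise fall back to start + duration threshold.
    """
    if end is None:
        end = start + duration_threshold
    if end <= start:
        raise ValueError("the first time period must end after it starts")
    return start, end
```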
  • When the first device receives the object set, it also saves the received object set.
  • the configuration policy information includes at least one preset message type.
  • The at least one preset message type may include at least one of a synchronize (SYN) message, a synchronize acknowledgement (SYN ACK) message, a finish (FIN) message, or a reset (RST) message.
  • When the protocol type included in the configuration policy information is the user datagram protocol (UDP), the configuration policy information may not include a preset message type.
  • the technician enters the start time and fault status of the first time period in the network management device.
  • the network management device receives the start time and the fault status of the first time period.
  • the technician can also input the end time of the first time period to the network management device, and the network management device can also receive the end time of the first time period.
  • the network management device generates fault configuration information based on the received information.
  • the technician enters the identification information of a part of the network object in the network architecture into the network management device, and then the network management device sends the fault configuration information to the part of the network object according to the identification information of the part of the network object.
  • the forwarding device may be an ONT or an OLT in the data communication network shown in FIG. 2, or an access device such as Leaf in the data center network shown in FIG. 3.
  • The first device is a cloud platform, an analyzer platform, a BRAS in a data communication network, a Spine in a data center network, or another third-party device. Alternatively,
  • the forwarding device may be a BRAS in the data communication network shown in FIG. 2, or a Spine or GW device in the data center network shown in FIG. 3.
  • the first device is a cloud platform, an analyzer platform, or other third-party devices, etc.
  • Step 102 The forwarding device obtains at least one target service message from the service flow according to the configuration policy information.
  • When the protocol type included in the configuration policy information is TCP and the forwarding device receives a service message, the forwarding device detects whether the protocol type included in the service message is TCP and whether the message type of the service message is one of the preset message types included in the configuration policy information. If the protocol type is TCP and the message type is one of the preset message types, the forwarding device takes the service message as a target service message and saves it.
  • the terminal first sends a SYN message to the server.
  • the SYN message is used to request the establishment of a TCP connection between the terminal and the server.
  • The TCP connection is the service path used to transmit the service flow between the terminal and the server.
  • After receiving the SYN message, the server sends a SYN ACK message to the terminal, and the terminal receives the SYN ACK message.
  • the establishment of the TCP connection between the terminal and the server is completed.
  • the TCP connection is used between the terminal and the server to transmit service messages.
  • the terminal sends a FIN message to the server.
  • the server receives the FIN message, sends a FIN message or RST message to the terminal, and the terminal receives the FIN message or RST message, and disconnects the TCP connection with the server.
  • the first service message of the service flow sent by the terminal to the server is a SYN message
  • the SYN message is also an initial message for establishing a service path.
  • the first service message of the service flow sent by the server to the terminal is a SYN ACK message
  • the SYN ACK message is also an end message used to establish a service path.
  • the last service message of the service flow sent by the terminal to the server is a FIN message
  • the last service message of the service flow sent by the server to the terminal is a FIN message or an RST message.
  • In this step, the target service messages that the forwarding device obtains from the service flow include at least one of the SYN message, the SYN ACK message, the FIN message, or the RST message.
  • When the forwarding device receives a service message, the forwarding device detects whether the protocol type included in the service message is UDP. If the protocol type is UDP, the forwarding device takes the service message as a target service message and saves it.
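  • The selection of target service messages in step 102 can be sketched as a simple predicate. This is an illustrative sketch only: the `Policy` and `Message` types and the `is_target` function are hypothetical names, and real messages would be parsed from packet headers rather than built by hand.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Policy:
    """Configuration policy information (hypothetical representation)."""
    protocol: str                               # "TCP" or "UDP"
    message_types: FrozenSet[str] = frozenset()  # preset types, used for TCP


@dataclass(frozen=True)
class Message:
    """A received service message (hypothetical representation)."""
    protocol: str
    message_type: Optional[str] = None  # e.g. "SYN"; None for UDP


def is_target(msg: Message, policy: Policy) -> bool:
    """Return True if msg should be kept as a target service message."""
    if msg.protocol != policy.protocol:
        return False
    if policy.protocol == "TCP":
        # TCP: the message type must also be one of the preset types.
        return msg.message_type in policy.message_types
    # UDP: matching the protocol type is sufficient.
    return True
```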
  • Step 103 The forwarding device obtains the service information of the service flow according to the at least one target service message.
  • The service information includes the identification information of at least one network object to which the service flow belongs and M KPIs of the service flow. The KPIs are used to describe the characteristics of the service flow.
  • the identification information of the network object may be the address of the network object or the like.
  • the address may be an internet protocol (IP) address or a media access control (MAC) address, etc.
  • The M KPIs include at least one of the network delay between the forwarding device and the network object, the amount of data belonging to the service flow sent by the network object, the amount of data belonging to the service flow received by the network object, or a status identifier. The status identifier is used to identify the status of the service flow.
  • the network object can be a terminal or a server.
  • the M KPIs include at least one of the amount of data belonging to the service flow sent by the network object or the amount of data belonging to the service flow received by the network object.
  • the forwarding device obtains a target service message belonging to the same service flow from the at least one target service message, and obtains M KPIs of the service flow according to the target service message belonging to the service flow.
  • Each target service message includes quintuple information, and the quintuple information is used to identify the service flow to which the service message belongs.
  • the quintuple information may include the address of the source device, the address of the destination device, the port number of the source device, the port number of the destination device, and the protocol type.
  • the forwarding device obtains a target service message that includes the same quintuple information from the at least one target service message as the target service message belonging to the same service flow.
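  • Grouping target service messages by quintuple can be sketched as below. The `FiveTuple` type and `group_by_flow` function are hypothetical names. The sketch groups on the exact quintuple, as described above; treating both directions of a connection as one flow would need a direction-independent key, which is an assumption beyond this text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FiveTuple(NamedTuple):
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str


def group_by_flow(
    messages: Iterable[Tuple[FiveTuple, str]],
) -> Dict[FiveTuple, List[str]]:
    """Group (quintuple, message) pairs so that messages carrying the same
    quintuple end up in the same service flow."""
    flows: Dict[FiveTuple, List[str]] = defaultdict(list)
    for key, message in messages:
        flows[key].append(message)
    return dict(flows)
```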
  • the network object can be a server or a terminal.
  • When the obtained target service messages include a first target service message and a second target service message, the forwarding device obtains the network delay between the forwarding device and the network object according to the first time when the first target service message is received and the second time when the second target service message is received.
  • The first target service message is a message sent to the network object, and the second target service message is the message corresponding to the first target service message that is sent by the network object.
  • The first target service message is a start message used to establish a service path, and the second target service message is an end message used to establish a service path.
  • For example, the first target service message may be a SYN message sent by the terminal, and the second target service message may be a SYN ACK message sent by the server.
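  • A minimal sketch of the delay calculation, assuming the network delay is taken as the difference between the two receive times observed at the forwarding device (the text only says the delay is obtained "according to" them). The function name `network_delay` and the use of seconds as the unit are assumptions.

```python
def network_delay(first_time: float, second_time: float) -> float:
    """Delay between the forwarding device and the network object.

    first_time:  when the forwarding device received the first target
                 service message (e.g. the SYN heading to the server).
    second_time: when it received the corresponding second target service
                 message (e.g. the SYN ACK coming back from the server).
    Both are timestamps in seconds taken on the forwarding device's clock.
    """
    if second_time < first_time:
        raise ValueError("the second message cannot precede the first")
    return second_time - first_time
```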
  • the network object can be a server or a terminal.
  • When the target service messages obtained by the forwarding device include a first start message and a first end message, the forwarding device obtains the amount of data belonging to the service flow sent by the network object according to the sequence number of the first start message and the sequence number of the first end message.
  • The first start message is the first message of the service flow sent by the network object.
  • The first end message is the last message of the service flow sent by the network object.
  • The sequence number of the first start message is subtracted from the sequence number of the first end message to obtain the amount of data belonging to the service flow sent by the network object.
  • the first start message is a SYN message sent by the terminal
  • the first end message is a FIN message sent by the terminal.
  • the forwarding device subtracts the sequence number of the SYN message from the sequence number of the FIN message to obtain the amount of data that belongs to the service flow sent by the terminal.
  • the first start message is a SYN ACK message sent by the server
  • the first end message is a FIN message or an RST message sent by the server.
  • the forwarding device subtracts the sequence number of the SYN ACK message from the sequence number of the FIN message to obtain the amount of data belonging to the service flow sent by the server.
  • the forwarding device subtracts the sequence number of the SYN ACK packet from the sequence number of the RST packet to obtain the amount of data belonging to the service flow sent by the server.
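  • The sequence-number subtraction described above can be sketched as follows. The modulo handling of 32-bit wraparound is an added assumption (TCP sequence numbers are 32-bit values), and the function name `sent_data_volume` is hypothetical. Note that a SYN itself consumes one sequence number, so the difference can exceed the payload byte count by one; the sketch follows the subtraction as described and does not correct for this.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers wrap around at 2^32


def sent_data_volume(start_seq: int, end_seq: int) -> int:
    """Amount of data sent in one direction of a TCP service flow.

    start_seq: sequence number of the first start message (SYN or SYN ACK).
    end_seq:   sequence number of the first end message (FIN or RST).
    The modulo keeps the result non-negative if the sequence number
    wrapped during the flow (assumption: at most one wrap).
    """
    return (end_seq - start_seq) % SEQ_SPACE
```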
  • The network object can be a server or a terminal. When the target service messages obtained by the forwarding device include a second start message and a second end message, the forwarding device obtains the amount of data belonging to the service flow received by the network object according to the sequence number of the second start message and the sequence number of the second end message.
  • The second start message is the first message of the service flow received by the network object.
  • the second end message is the last message of the service flow received by the network object.
  • the second start message is a SYN ACK message received by the terminal
  • the second end message is a FIN message or an RST message received by the terminal.
  • the forwarding device subtracts the sequence number of the SYN ACK message from the sequence number of the FIN message to obtain the amount of data that belongs to the service flow received by the terminal.
  • the forwarding device subtracts the sequence number of the SYN ACK packet from the sequence number of the RST packet to obtain the amount of data belonging to the service flow received by the terminal.
  • the second start message is a SYN message received by the server
  • the second end message is a FIN message received by the server.
  • the forwarding device subtracts the sequence number of the SYN message from the sequence number of the FIN message to obtain the amount of data belonging to the service flow received by the server.
  • When the target service messages obtained by the forwarding device include the first start message, if the forwarding device receives the first end message within a first time length after a third time, it sets the status of the service flow identified by the status identifier to the success state; if the first end message is not received, it sets the status of the service flow identified by the status identifier to the failure state. The third time is the time when the first start message is received.
  • the first start message is a SYN ACK message sent by the server
  • the first end message is a FIN message or an RST message sent by the server.
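  • Setting the status identifier can be sketched as a window check. The names `status_identifier`, `SUCCESS`, and `FAILURE` are hypothetical, and the sketch assumes the window after the third time is expressed as a length in seconds.

```python
from typing import Optional

SUCCESS = "success"
FAILURE = "failure"


def status_identifier(
    third_time: float,
    window: float,
    end_time: Optional[float],
) -> str:
    """Return the status of a service flow.

    third_time: when the first start message was received.
    window:     how long after third_time the first end message may arrive.
    end_time:   when the first end message was received, or None if it
                was never received.
    """
    if end_time is not None and third_time <= end_time <= third_time + window:
        return SUCCESS
    return FAILURE
```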
  • The forwarding device judges whether the length of time since it received the last service message belonging to the service flow has reached a second time length. If it has, the network object has finished transmitting the service flow.
  • The forwarding device obtains, from the target service messages belonging to the service flow, the target service messages whose source device address is the address of the network object. Each obtained target service message is a service message sent by the network object, and the data volumes of the obtained target service messages are accumulated to obtain the amount of data belonging to the service flow sent by the network object.
  • When the network object is a terminal, the forwarding device obtains, from the target service messages belonging to the service flow, the target service messages whose source device address is the address of the terminal, and accumulates the data volume of each obtained target service message to obtain the amount of data belonging to the service flow sent by the terminal.
  • When the network object is a server, the forwarding device obtains, from the target service messages belonging to the service flow, the target service messages whose source device address is the address of the server, and accumulates the data volume of each obtained target service message to obtain the amount of data belonging to the service flow sent by the server.
  • The forwarding device obtains, from the target service messages belonging to the service flow, the target service messages whose destination device address is the address of the network object. Each obtained target service message is a service message received by the network object, and the data volumes of the obtained target service messages are accumulated to obtain the amount of data belonging to the service flow received by the network object.
  • When the network object is a terminal, the forwarding device obtains, from the target service messages belonging to the service flow, the target service messages whose destination device address is the address of the terminal, and accumulates the data volume of each obtained target service message to obtain the amount of data belonging to the service flow received by the terminal.
  • When the network object is a server, the forwarding device obtains, from the target service messages belonging to the service flow, the target service messages whose destination device address is the address of the server, and accumulates the data volume of each obtained target service message to obtain the amount of data belonging to the service flow received by the server.
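  • The accumulation by source and destination address can be sketched as below. Each message is represented as a hypothetical `(src_addr, dst_addr, size_in_bytes)` tuple, and `data_volumes` is an illustrative name.

```python
from typing import Iterable, Tuple

# (source address, destination address, data volume in bytes)
FlowMessage = Tuple[str, str, int]


def data_volumes(
    messages: Iterable[FlowMessage], object_addr: str
) -> Tuple[int, int]:
    """Return (sent, received) data volumes of one network object for the
    target service messages of a single service flow."""
    sent = 0
    received = 0
    for src_addr, dst_addr, size in messages:
        if src_addr == object_addr:
            sent += size       # message was sent by the network object
        if dst_addr == object_addr:
            received += size   # message was received by the network object
    return sent, received
```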
  • The forwarding device may determine the type of KPI to be obtained based on the fault state included in the configuration policy information, and then obtain the KPIs of that type for the service flow through this step. The obtained KPIs are the KPIs related to the fault state.
  • the forwarding device may save the corresponding relationship between the fault status and the type of KPI, and the forwarding device may determine the type of KPI to be acquired based on the fault status included in the configuration policy information and the corresponding relationship.
  • Step 104 The forwarding device sends the service information of the service flow to the first device, where the service information includes identification information of at least one network object to which the service flow belongs and M KPIs of the service flow.
  • The service information may also include the collection time of each of the M KPIs, and the collection time of each KPI is used to determine the period to which the KPI belongs.
  • When the first device is an analyzer platform or a cloud platform, a network connection is established between the forwarding device and the analyzer platform or the cloud platform, and the forwarding device sends the service information of the service flow to the analyzer platform or the cloud platform.
  • When the first device is an upstream device connected to the forwarding device, the forwarding device sends the service information of the service flow to the first device.
  • For example, when the forwarding device is an ONT or an OLT, the upstream device connected to the forwarding device is the BRAS, and the forwarding device sends the service information of the service flow to the BRAS.
  • When the forwarding device is a Leaf, the upstream device connected to the forwarding device is the Spine, and the forwarding device sends the service information of the service flow to the Spine.
  • Any forwarding device performs steps 101 to 104 above on each received service flow to obtain and send the service information of the service flow.
  • Step 105 The first device receives service information of at least one service flow.
  • the first device continuously receives the service information of the service flow sent by different forwarding devices in the network architecture.
  • Step 106 The first device obtains the KPIs of N service flows belonging to a target network object in a first period, where N is an integer greater than 0, the first period is within the first time period, and the target network object is the network object to which any service flow in the first period belongs.
  • the KPI collection times of the N business flows are all located in the first cycle, and the first cycle may be any cycle.
  • the first period may be the current period.
  • N pieces of service information including the identification information of the target network object are obtained, and the KPIs of the N service flows are obtained from the N pieces of service information.
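  • Step 106 can be sketched as a filter over the received service information. The record layout (`object_ids`, `collected_at`, `kpis`) and the function name `kpis_in_period` are hypothetical; for brevity the sketch assumes one collection time per piece of service information rather than one per KPI.

```python
from typing import Any, Dict, Iterable, List


def kpis_in_period(
    records: Iterable[Dict[str, Any]],
    target_id: str,
    period_start: float,
    period_end: float,
) -> List[Dict[str, Any]]:
    """Return the KPIs of every service flow that belongs to the target
    network object and was collected in [period_start, period_end)."""
    selected = []
    for record in records:
        belongs = target_id in record["object_ids"]
        in_period = period_start <= record["collected_at"] < period_end
        if belongs and in_period:
            selected.append(record["kpis"])
    return selected
```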
  • Step 107 The first device generates a training sample according to the KPIs of the N business flows, and the training sample includes the feature set acquired based on the KPIs of the N business flows.
  • The first device may determine the type of KPI to be selected based on the fault state included in the fault configuration information, and then select the KPIs of that type from the KPIs of any one of the service flows, that is, select the KPIs related to the fault state.
  • For the KPIs of the other N-1 service flows, the same method is used, so that the KPIs related to the fault state are obtained for all N service flows.
  • the training samples corresponding to the fault state can be generated through the following operations 1071 to 1074.
  • a training sample can be generated through the following operations from 1071 to 1074.
  • the operations from 1071 to 1074 can be:
  • the first device obtains M KPI sets, any KPI set includes one KPI of each of the N business flows, and the KPI types included in any KPI set are the same.
  • the M KPIs of any service flow include the network delay between the forwarding device and the network object, the amount of data sent by the network object belonging to the service flow, and the amount of data belonging to the service flow received by the network object and the status identifier. Therefore, the M KPI sets acquired by the first device include a network delay set, a sent data volume set, a received data volume set, and a state identifier set.
  • the network delay set includes N network delays, and the N network delays belong to the N service flows respectively.
  • the sent data volume set includes N sent data volumes, and the N sent data volumes belong to the N service streams respectively.
  • the received data volume set includes N received data volumes, and the N received data volumes belong to the N service streams respectively.
  • the state identifier set includes the state identifiers of the N service flows.
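  • Building the M KPI sets amounts to regrouping the per-flow KPIs by KPI type. A minimal sketch, assuming each flow's KPIs arrive as a dictionary keyed by KPI type; the key names and the function name `build_kpi_sets` are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def build_kpi_sets(flow_kpis: Iterable[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn N per-flow KPI dictionaries into M KPI sets, one per KPI type,
    each holding one KPI from each of the N service flows."""
    kpi_sets: Dict[str, List[Any]] = {}
    for kpis in flow_kpis:
        for kpi_type, value in kpis.items():
            kpi_sets.setdefault(kpi_type, []).append(value)
    return kpi_sets
```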
  • The first device uses at least one first calculation method to calculate the KPIs included in any KPI set, to obtain at least one statistical feature corresponding to that KPI set.
  • the at least one first calculation method includes one or more of the following: performing statistics on the KPIs in any KPI set, and calculating the mean, variance, dispersion, skewness, or kurtosis of the KPIs included in any KPI set.
  • the dispersion is equal to the ratio between the variance of the KPIs included in the any KPI set and the mean value of the KPIs included in the any KPI set.
  • The skewness and the kurtosis of the KPIs included in any KPI set are each calculated over the N KPIs in that set, where the KPIs are indexed by i = 1, 2, ..., N.
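  • As a point of reference, the conventional moment-based definitions for a KPI set x_1, ..., x_N are sketched below. The dispersion follows the variance-to-mean ratio stated above; the skewness and kurtosis are the textbook population forms and are an assumption here, since the embodiment may use a different normalization (for example a sample-corrected form or excess kurtosis).

```latex
\begin{align}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i, &
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, &
\text{dispersion} &= \frac{\sigma^2}{\mu}, \\
\text{skewness} &= \frac{1}{N}\sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^{3}, &
\text{kurtosis} &= \frac{1}{N}\sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^{4}.
\end{align}
```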
  • The first device may calculate at least one of the mean, variance, dispersion, skewness, or kurtosis of the N network delays included in the network delay set, to obtain at least one statistical feature corresponding to the network delay set. The at least one statistical feature includes at least one of the network delay mean, network delay variance, network delay dispersion, network delay skewness, or network delay kurtosis.
  • The first device may calculate at least one of the mean, variance, dispersion, skewness, or kurtosis of the N sent data volumes included in the sent data volume set, to obtain at least one statistical feature corresponding to the sent data volume set. The at least one statistical feature includes at least one of the mean of the sent data volume, the variance of the sent data volume, the dispersion of the sent data volume, the skewness of the sent data volume, or the kurtosis of the sent data volume.
  • The first device may calculate at least one of the mean, variance, dispersion, skewness, or kurtosis of the N received data volumes included in the received data volume set, to obtain at least one statistical feature corresponding to the received data volume set. The at least one statistical feature includes at least one of the mean of the received data volume, the variance of the received data volume, the dispersion of the received data volume, the skewness of the received data volume, or the kurtosis of the received data volume.
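  • The statistical features of one numeric KPI set can be sketched as follows. The dispersion is the variance-to-mean ratio described above; the skewness and kurtosis use the conventional population moment definitions, which is an assumption about the exact formulas. The function name `statistical_features` is hypothetical, and the input must be non-empty.

```python
from statistics import fmean, pvariance
from typing import Dict, Sequence


def statistical_features(values: Sequence[float]) -> Dict[str, float]:
    """Mean, variance, dispersion, skewness and kurtosis of one KPI set."""
    n = len(values)
    mean = fmean(values)
    variance = pvariance(values, mu=mean)  # population variance
    std = variance ** 0.5
    nan = float("nan")
    features = {
        "mean": mean,
        "variance": variance,
        # Dispersion: ratio of variance to mean (undefined for mean 0).
        "dispersion": variance / mean if mean != 0 else nan,
        "skewness": nan,
        "kurtosis": nan,
    }
    if std > 0:
        # Standardized third and fourth moments; undefined when all
        # values are identical, in which case NaN is kept.
        features["skewness"] = sum(((x - mean) / std) ** 3 for x in values) / n
        features["kurtosis"] = sum(((x - mean) / std) ** 4 for x in values) / n
    return features
```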
  • The first device may count, in the state identifier set, the number of state identifiers used to identify the successful state and the number of state identifiers used to identify the failed state. The at least one statistical feature corresponding to the state identifier set includes the number of state identifiers used to identify the successful state and/or the number of state identifiers used to identify the failed state.
  • the first device calculates the proportion of service flows in the successful state according to the number of state identifiers used to identify the successful state and the number of state identifiers N included in the state identifier set. And/or, the first device calculates the proportion of service flows in the failed state according to the number of state identifiers used to identify the failed state and the number N of state identifiers included in the state identifier set.
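  • The counts and proportions for the state identifier set can be sketched as below. The string values `"success"` and `"failure"` and the function name `state_features` are hypothetical encodings of the status identifier; N is assumed to be greater than 0, as stated for step 106.

```python
from typing import Dict, Sequence


def state_features(states: Sequence[str]) -> Dict[str, float]:
    """Counts and proportions of successful and failed service flows."""
    n = len(states)  # N state identifiers, one per service flow
    success = sum(1 for s in states if s == "success")
    failure = sum(1 for s in states if s == "failure")
    return {
        "success_count": success,
        "failure_count": failure,
        "success_ratio": success / n,
        "failure_ratio": failure / n,
    }
```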
  • The first device further obtains at least one statistical feature set. Any statistical feature set includes K statistical features, and the K statistical features are statistical features of the same type calculated in K periods.
  • The first device uses at least one second calculation method to calculate the statistical features included in any statistical feature set, to obtain at least one time-domain feature corresponding to that statistical feature set.
  • The K periods include the first period and the K-1 periods before the first period. The at least one second calculation method includes one or more of the following: calculating the ring ratio (period-over-period ratio) or the difference between two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
  • The ring ratio of the statistical feature set includes the ratio between the statistical feature of the 2nd period and the statistical feature of the 1st period, the ratio between the statistical feature of the 3rd period and the statistical feature of the 2nd period, ..., and the ratio between the statistical feature of the K-th period and the statistical feature of the (K-1)-th period.
  • The difference of the statistical feature set includes the difference between the statistical feature of the 2nd period and the statistical feature of the 1st period, the difference between the statistical feature of the 3rd period and the statistical feature of the 2nd period, ..., and the difference between the statistical feature of the K-th period and the statistical feature of the (K-1)-th period.
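  • The ring ratio and difference over K periods can be sketched as pairwise operations on adjacent statistical features, ordered from the oldest period to the newest. The function names are hypothetical; a zero feature in an earlier period makes the ratio undefined and raises `ZeroDivisionError` in this sketch.

```python
from typing import List, Sequence


def ring_ratios(features: Sequence[float]) -> List[float]:
    """Ratio of each period's statistical feature to the previous one.
    Returns K-1 values for K features."""
    return [cur / prev for prev, cur in zip(features, features[1:])]


def differences(features: Sequence[float]) -> List[float]:
    """Difference between each period's statistical feature and the
    previous one. Returns K-1 values for K features."""
    return [cur - prev for prev, cur in zip(features, features[1:])]
```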
  • the first device may use the following first formula to perform feature fitting on the statistical features in the statistical feature set to obtain the time domain features;
  • where v is the time-domain feature obtained after feature fitting; w_1, w_2, ..., w_K are the weights corresponding to the 1st period, the 2nd period, ..., the K-th period, respectively, with some periods given a larger weight than others; and v_1, v_2, ..., v_K are the statistical feature of the 1st period, the statistical feature of the 2nd period, ..., the statistical feature of the K-th period, respectively.
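  • The first formula itself is not reproduced here. A minimal sketch of one plausible form, assuming the fitting is a weighted average of the K statistical features normalized by the sum of the weights; the function name `fit_feature`, the normalization, and the example weights are all assumptions.

```python
from typing import Sequence


def fit_feature(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of K per-period statistical features.

    features: v_1 ... v_K, one statistical feature per period.
    weights:  one weight per period, in the same order.
    Assumption: the result is normalized by the sum of the weights.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per statistical feature")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for w, v in zip(weights, features)) / total
```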
  • The statistical feature set obtained by the first device may be the network delay mean set, the network delay variance set, the network delay dispersion set, the network delay skewness set, or the network delay kurtosis set.
  • The network delay mean set includes the network delay means calculated by the forwarding device in K cycles.
  • the network delay variance set includes the network delay variance calculated by the forwarding device in K cycles.
  • the network delay skewness set includes the network delay skewness calculated by the forwarding device in K cycles.
  • the network delay dispersion set includes the network delay dispersion calculated by the forwarding device in K cycles.
  • the network delay kurtosis set includes the network delay kurtosis calculated by the forwarding device in K cycles.
  • To obtain the period-over-period ratios of the network delay mean set, the first device calculates the ratio between the network delay mean of the second period and that of the first period, the ratio between the network delay mean of the third period and that of the second period, ..., and the ratio between the network delay mean of the K-th period and that of the (K-1)-th period, thereby obtaining the ratio between every two adjacent network delay means in the set.
  • To obtain the difference values of the network delay mean set, the first device calculates the difference between the network delay mean of the second period and that of the first period, the difference between the network delay mean of the third period and that of the second period, ..., and the difference between the network delay mean of the K-th period and that of the (K-1)-th period, thereby obtaining the difference between every two adjacent network delay means in the set.
  • To perform feature fitting on the K network delay means in the network delay mean set, the first device substitutes the K network delay means for v_1, v_2, ..., v_K in the first formula and applies the formula to obtain a time-domain feature, which is a sliding average.
  • To obtain the period-over-period ratios of the network delay variance set, the first device calculates the ratio between the network delay variance of the second period and that of the first period, the ratio between the network delay variance of the third period and that of the second period, ..., and the ratio between the network delay variance of the K-th period and that of the (K-1)-th period, thereby obtaining the ratio between every two adjacent network delay variances in the set.
  • To obtain the difference values of the network delay variance set, the first device calculates the difference between the network delay variance of the second period and that of the first period, the difference between the network delay variance of the third period and that of the second period, ..., and the difference between the network delay variance of the K-th period and that of the (K-1)-th period, thereby obtaining the difference between every two adjacent network delay variances in the set.
  • To perform feature fitting on the K network delay variances in the network delay variance set, the first device substitutes the K network delay variances for v_1, v_2, ..., v_K in the first formula and applies the formula to obtain a time-domain feature, which is a sliding fluctuation value.
  • For the network delay dispersion set, the network delay skewness set, or the network delay kurtosis set, the first device performs the same operations as for the network delay mean set above to obtain at least one time-domain feature corresponding to each set.
  • For any other KPI set, the first device performs the same operations as for the network delay set above to obtain at least one statistical feature set corresponding to that KPI set, and then processes each statistical feature set using at least one second calculation method to obtain at least one time-domain feature corresponding to each statistical feature set.
  • the detailed implementation process will not be listed one by one.
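  • The period-over-period ratio and the difference value described above can be sketched as follows for any series of K per-period statistics. The function names are hypothetical, and the sketch does not handle a zero value in an earlier period (the division would raise `ZeroDivisionError`).

```python
def period_over_period_ratios(stats):
    """Ratio of each period's statistic to that of the previous period (K-1 values)."""
    return [cur / prev for prev, cur in zip(stats, stats[1:])]


def period_over_period_differences(stats):
    """Difference between each period's statistic and that of the previous period."""
    return [cur - prev for prev, cur in zip(stats, stats[1:])]


# Mean network delay (ms) in K = 3 consecutive periods.
delay_means = [10.0, 20.0, 30.0]
ratios = period_over_period_ratios(delay_means)
differences = period_over_period_differences(delay_means)
```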
  • the first device acquires a feature set, where the feature set includes at least one statistical feature corresponding to each KPI set in the M KPI sets.
  • the feature set further includes at least one of at least one time domain feature corresponding to each statistical feature set, a proportion of business flows in a successful state, or a proportion of business flows in a failed state, and the like.
  • the first device generates a training sample, the training sample includes the feature set, or the training sample includes the feature set and the label of the training sample.
  • When the state of the target network object in the first time period is a fault state, the label of the training sample is used to identify the fault state; when the state of the target network object in the first time period is a normal state, the label of the training sample is used to identify the normal state.
  • the first device determines whether the identification information of the target network object is included in the object set, and when the identification information of the target network object is included in the object set, determines that the state of the target network object in the first time period is a fault state; When the identification information of the target network object is not included in the object set, it is determined that the state of the target network object in the first time period is a normal state.
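  • The labeling rule above can be sketched as a set-membership test. The `TrainingSample` class, the label values, and the object identifiers are illustrative assumptions; passing no object set produces an unlabeled sample, as used in unsupervised training.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

FAULT, NORMAL = 1, 0  # hypothetical label encoding


@dataclass
class TrainingSample:
    features: List[float]
    label: Optional[int] = None  # None means the sample carries no label


def make_training_sample(object_id: str, features: List[float],
                         object_set: Optional[Set[str]] = None) -> TrainingSample:
    """Label the sample as FAULT when the target object is listed in the object set."""
    if object_set is None:
        return TrainingSample(features)
    label = FAULT if object_id in object_set else NORMAL
    return TrainingSample(features, label)


object_set = {"server-2"}  # objects placed in the fault state during the first time period
faulty = make_training_sample("server-2", [12.0, 3.5], object_set)
healthy = make_training_sample("server-1", [4.0, 0.8], object_set)
unlabeled = make_training_sample("server-1", [4.0, 0.8])
```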
  • the first device repeatedly executes the operations of steps 106 to 107 in the first time period, thereby obtaining a large number of training samples, and composing the obtained large number of training samples into a training sample set. Then perform the operation of step 108 as follows.
  • Step 108 The first device trains the intelligent model according to the training sample set to obtain the fault detection model.
  • the first device can train the intelligent model in a supervised training mode or an unsupervised training mode.
  • When the supervised training mode is adopted, each training sample in the training sample set has corresponding label information, and the training process can be as follows:
  • 1081: The first device inputs the training sample set to the intelligent model.
  • The first device may input the training samples in the training sample set to the intelligent model in multiple batches, inputting A training samples each time, where A is an integer greater than 0.
  • 1082: The intelligent model processes each training sample in the training sample set to obtain the processing result corresponding to each training sample.
  • That is, the intelligent model processes the A input training samples to obtain a processing result corresponding to each of the A training samples.
  • 1083: Based on the label information and processing result corresponding to each training sample, the intelligent model calculates a gradient matrix using the gradient descent function corresponding to each network parameter in the parameter set, and adjusts at least one network parameter of the intelligent model according to the gradient matrix.
  • The parameter set includes the at least one network parameter.
  • For any network parameter, the gradient value corresponding to each training sample is calculated through the gradient descent function corresponding to that network parameter, and the gradient values corresponding to the training samples form one row of the gradient matrix.
  • Specifically, based on the label information and processing result corresponding to each of the A training samples, the intelligent model calculates the gradient value corresponding to each training sample through the gradient descent function corresponding to any one of the network parameters.
  • The first device then inputs A training samples that have not yet been input to the intelligent model, and the intelligent model performs operations 1082 to 1083 again. If no uninput training sample remains in the training sample set, the following operation 1084 is performed.
  • 1084: The intelligent model uses the loss function to calculate the loss function value from the label information and processing result corresponding to each training sample in the training sample set, and determines whether to continue training according to the loss function value. When it determines to continue training, it returns to operation 1081; when it determines to stop training, the intelligent model at that time is used as the fault detection model and the process ends.
  • When the loss function value is less than the loss threshold, it is determined to stop training; otherwise, it is determined to continue training.
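  • The supervised procedure above (input A samples at a time, compute processing results, adjust parameters from gradients, then check the loss against a threshold) can be sketched with logistic regression, one of the model types named in this description. The learning rate, loss threshold, epoch limit, and sample values are illustrative assumptions, and the per-parameter gradients are averaged over the batch rather than stored as an explicit gradient matrix.

```python
import math


def predict(weights, bias, x):
    """Processing result for one sample: estimated probability of the fault state."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, samples):
    """Mean cross-entropy loss over the labeled samples."""
    eps = 1e-12
    total = 0.0
    for x, y in samples:
        p = min(max(predict(weights, bias, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(samples)


def train(samples, batch_size, lr=0.5, loss_threshold=0.2, max_epochs=5000):
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]          # input A samples
            errors = [predict(weights, bias, x) - y for x, y in batch]
            for j in range(dim):                               # one gradient per parameter
                grad = sum(e * x[j] for e, (x, _) in zip(errors, batch)) / len(batch)
                weights[j] -= lr * grad
            bias -= lr * sum(errors) / len(batch)
        if log_loss(weights, bias, samples) < loss_threshold:  # stop when loss is small
            break
    return weights, bias


# One normalized delay feature per sample; label 1 = fault state, 0 = normal state.
samples = [([0.1], 0), ([0.2], 0), ([0.3], 0), ([0.8], 1), ([0.9], 1), ([1.0], 1)]
weights, bias = train(samples, batch_size=2)
```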
  • the intelligent model adopted in the supervised training method is a support vector machine (SVM), a logistic regression algorithm, a random forest algorithm, or a neural network model.
  • The neural network model can be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory (LSTM) network, or the like.
  • When the unsupervised training mode is adopted, the intelligent model includes an encoder and a decoder, and the training samples in the training sample set have no corresponding label information.
  • The training process can be as follows:
  • 1181: The first device inputs the training sample set to the intelligent model.
  • The first device may input the training samples in the training sample set to the intelligent model in multiple batches, inputting A training samples each time.
  • 1182: The intelligent model processes each training sample in the training sample set to obtain the first processing result corresponding to each training sample.
  • The encoder included in the intelligent model encodes the A input training samples to obtain the second processing result corresponding to each of the A training samples.
  • The decoder included in the intelligent model performs recovery processing on the second processing result corresponding to each training sample to obtain the first processing result corresponding to each training sample.
  • The decoder tries to recover the training sample from its second processing result, but the recovered training sample may differ from the original training sample; that is, there may be a difference between the original training sample and its first processing result.
  • 1183: Based on each training sample and its first processing result, the intelligent model calculates a gradient matrix through the gradient descent function corresponding to each network parameter in the parameter set, and adjusts at least one network parameter of the intelligent model according to the gradient matrix; the parameter set includes the at least one network parameter.
  • For any network parameter, the gradient value corresponding to each training sample is calculated through the gradient descent function corresponding to that network parameter, and the gradient values corresponding to the training samples form one row of the gradient matrix.
  • Specifically, based on each training sample and the first processing result corresponding to it, the intelligent model calculates the gradient value corresponding to each training sample through the gradient descent function corresponding to any one of the network parameters.
  • The first device then inputs A training samples that have not yet been input to the intelligent model, and the intelligent model performs operations 1182 to 1183 again. If no uninput training sample remains in the training sample set, the following operation 1184 is performed.
  • 1184: The intelligent model uses the loss function to calculate the loss function value from each training sample in the training sample set and the first processing result corresponding to it, and determines whether to continue training according to the loss function value. When it determines to continue training, it returns to operation 1181; when it determines to stop training, the intelligent model at that time is used as the fault detection model and the process ends.
  • When the loss function value is less than the loss threshold, it is determined to stop training; otherwise, it is determined to continue training.
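  • The encoder/decoder procedure above can be sketched with a deliberately small linear autoencoder: the encoder maps a sample to a single code value (the second processing result), and the decoder maps it back to a reconstruction (the first processing result). This is a plain autoencoder rather than the VAE named in this description, and the initial weights, learning rate, loss threshold, and sample values are illustrative assumptions.

```python
def encode(w_enc, x):
    """Encoder: compress the sample to one code value (second processing result)."""
    return sum(w * xi for w, xi in zip(w_enc, x))


def decode(w_dec, z):
    """Decoder: recover a sample from the code (first processing result)."""
    return [w * z for w in w_dec]


def reconstruction_loss(w_enc, w_dec, samples):
    """Mean squared difference between each sample and its reconstruction."""
    total = 0.0
    for x in samples:
        x_hat = decode(w_dec, encode(w_enc, x))
        total += sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))
    return total / len(samples)


def train_autoencoder(samples, batch_size, lr=0.1, loss_threshold=1e-3, max_epochs=2000):
    dim = len(samples[0])
    w_enc, w_dec = [0.5] * dim, [0.5] * dim
    for _ in range(max_epochs):
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]          # input A samples
            g_enc, g_dec = [0.0] * dim, [0.0] * dim
            for x in batch:
                z = encode(w_enc, x)
                err = [hi - xi for xi, hi in zip(x, decode(w_dec, z))]
                back = sum(e * w for e, w in zip(err, w_dec))  # error propagated to the code
                for j in range(dim):
                    g_dec[j] += 2 * err[j] * z
                    g_enc[j] += 2 * back * x[j]
            for j in range(dim):                               # adjust the parameters
                w_enc[j] -= lr * g_enc[j] / len(batch)
                w_dec[j] -= lr * g_dec[j] / len(batch)
        if reconstruction_loss(w_enc, w_dec, samples) < loss_threshold:
            break
    return w_enc, w_dec


# Unlabeled two-feature samples that lie on one line, so one code value suffices.
samples = [[0.1, 0.2], [0.2, 0.4], [0.3, 0.6], [0.4, 0.8]]
initial_loss = reconstruction_loss([0.5, 0.5], [0.5, 0.5], samples)
w_enc, w_dec = train_autoencoder(samples, batch_size=2)
final_loss = reconstruction_loss(w_enc, w_dec, samples)
```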
  • The intelligent model used in the unsupervised training mode is a variational autoencoder (VAE) model, k-means, or the like.
  • the fault detection model trained by the first device is used to detect whether the network object is in a fault state, and the fault state is the fault state in the fault configuration information sent by the network management device.
  • For each of the other fault states, the network management device sends the corresponding fault configuration information to some network objects and to the first device in the network architecture.
  • The fault configuration information corresponding to any one of the fault states is the fault configuration information that includes that fault state.
  • the forwarding device and the first device train the intelligent model according to the above steps 101 to 108, and obtain a fault detection model for detecting any one of the fault states. In this way, the fault detection model corresponding to each fault state is obtained.
  • the first device can also train the same intelligent model according to the above steps 101 to 108 to obtain a fault detection model, so that the fault detection model can be used to detect different fault states.
  • The first device may send the obtained feature set to the training device; the training device receives the feature set and generates a training sample, where the training sample includes the feature set, or includes the feature set and the label of the training sample.
  • the training device can generate a large number of training samples, and use the generated training samples to train the fault detection model.
  • the training device like the first device, can collect the above two training methods to train a fault detection model, which will not be described in detail here.
  • the training device may also send the trained fault detection model to the first device.
  • the first device receives the fault detection model.
  • the training device may not send the fault detection model to the first device, so that when detecting the network object, the training device can act as a detection device to detect the network object.
  • In this embodiment, the service information obtained from a service flow includes the identification information of the network object and M KPIs, so the data volume of the service information is much smaller than the data volume of the service flow. Therefore, when the forwarding device sends the service information of the service flow to the first device, the consumption of network resources, and of bandwidth resources in particular, is greatly reduced.
  • The above-mentioned first device may be a cloud platform or an analyzer platform. All forwarding devices in the network architecture send the service information of the service flows to the cloud platform or analyzer platform, which can then train the fault detection model in a unified manner.
  • the above-mentioned first device may be an upstream device connected to the forwarding device, and each upstream device receives service information of the service flow sent by the forwarding device connected to it, so that different forwarding devices are trained separately, which can improve training efficiency.
  • An embodiment of the present application provides a fault detection method, which can be applied to the network architecture provided by any of the embodiments shown in FIGS. 1 to 3.
  • In this method, the forwarding device obtains the service information of the service flows it receives and sends the service information to the first device.
  • The first device receives the service information of the service flows sent by the forwarding device, generates a detection sample according to the received service information, and detects the network object through a fault detection model based on the detection sample.
  • The fault detection model can be obtained through training in the embodiment shown in FIG. 4 above.
  • the method includes:
  • Steps 201 to 205: the same as steps 101 to 105, respectively, and will not be described in detail here.
  • The forwarding device obtains the service information of the service flow, and the KPIs in the service information include the KPIs related to each fault state in at least one fault state.
  • The service information may also include the collection time of each KPI, which is used to determine the period to which the KPI belongs.
  • Step 206 The first device obtains KPIs of N service flows belonging to the target network object in the current period, where N is an integer greater than 0, and the target network object is a network object to which any service flow in the current period belongs.
  • The collection times of the KPIs of the N service flows are all within the current period.
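  • Step 206 can be sketched as filtering KPI records by collection time and grouping them by network object. The record layout (`object_id`, `collected_at`, `kpis`) and the half-open period boundary are illustrative assumptions.

```python
from collections import defaultdict


def kpis_in_period(records, period_start, period_end):
    """Group the KPIs collected in [period_start, period_end) by network object."""
    grouped = defaultdict(list)
    for rec in records:
        if period_start <= rec["collected_at"] < period_end:
            grouped[rec["object_id"]].append(rec["kpis"])
    return dict(grouped)


# Each record is the service information of one service flow (hypothetical values).
records = [
    {"object_id": "server-1", "collected_at": 105.0, "kpis": {"delay_ms": 12.0}},
    {"object_id": "server-1", "collected_at": 130.0, "kpis": {"delay_ms": 15.0}},
    {"object_id": "server-2", "collected_at": 110.0, "kpis": {"delay_ms": 8.0}},
    {"object_id": "server-1", "collected_at": 95.0, "kpis": {"delay_ms": 11.0}},
]
current = kpis_in_period(records, 100.0, 160.0)
```

  • Here `current["server-1"]` holds the KPIs of the N = 2 service flows of that object in the current period; the record collected at 95.0 belongs to an earlier period and is excluded.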
  • The forwarding device may be an ONT or an OLT in the data communication network shown in FIG. 2, or an access device such as a Leaf in the data center network shown in FIG. 3.
  • In this case, the first device is a cloud platform, an analyzer platform, a BRAS in the data communication network, a Spine in the data center network, or another third-party device; or,
  • The forwarding device may be a device such as the BRAS in the data communication network shown in FIG. 2 or the Spine in the data center network shown in FIG. 3.
  • In this case, the first device is a cloud platform, an analyzer platform, or another third-party device.
  • Step 207 The first device generates a detection sample according to the KPIs of the N service flows, and the detection sample includes the feature set acquired based on the KPIs of the N service flows.
  • the process of generating the detection sample by the first device is the same as the process of generating the training sample in step 107 of the embodiment shown in FIG. 4, and will not be described in detail here.
  • The first device may determine, based on a fault state, the type of KPI to be selected, and then select KPIs of that type from the KPIs of any service flow, that is, select the KPIs related to the fault state.
  • The KPIs related to the fault state are obtained for the N service flows in the same way, and a detection sample corresponding to the fault state is then generated from these KPIs.
  • For the process of generating a detection sample corresponding to the fault state, refer to the process of generating a training sample corresponding to the fault state in operations 1071 to 1074 shown in FIG. 4. In this way, the first device can generate detection samples corresponding to different fault states.
  • Alternatively, the first device may not distinguish between fault states; that is, it generates a detection sample according to the KPIs of the N service flows, and the detection sample includes the detection samples corresponding to the different fault states.
  • Step 208 The first device uses the fault detection model to detect whether the target network object is in a fault state according to the detection sample.
  • The first device includes a plurality of fault detection models corresponding to different fault states. For the fault detection model corresponding to any one of the fault states, the first device uses that model, based on the detection sample corresponding to that fault state, to detect whether the target network object is in that fault state. The target network object is thus checked separately by the fault detection model of each fault state, and may be found to be in one or more fault states.
  • In other words, when the first device includes multiple fault detection models for different fault states, it can use these models, together with the detection samples corresponding to the respective fault states, to detect one or more fault states that the target network object may be in.
  • Alternatively, the first device includes one fault detection model that can detect different fault states.
  • In that case, the first device generates a detection sample and, based on it, uses the fault detection model to detect whether the target network object is in one or more of the fault states.
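  • The per-fault-state variant of step 208 can be sketched as a loop over models, each applied to the detection sample built for its own fault state. The fault-state names, the threshold-based stand-in models, and the sample layout are illustrative assumptions; a trained fault detection model would take the place of each callable.

```python
def detect_fault_states(samples_by_state, models_by_state):
    """Return every fault state whose model flags the corresponding detection sample."""
    detected = []
    for state, model in models_by_state.items():
        sample = samples_by_state.get(state)
        if sample is not None and model(sample):
            detected.append(state)
    return detected


# Stand-in models: each is a callable that returns True when the sample looks faulty.
models_by_state = {
    "high_delay": lambda sample: sample[0] > 50.0,   # mean delay in ms
    "packet_loss": lambda sample: sample[0] > 0.05,  # proportion of failed flows
}
samples_by_state = {"high_delay": [72.0, 9.5], "packet_loss": [0.01]}
detected = detect_fault_states(samples_by_state, models_by_state)
```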
  • When detecting that the target network object is in a fault state, the first device obtains at least one KPI of the target network object and/or the service flow of the target network object, and performs fault location according to the at least one KPI of the target network object and/or the service flow of the target network object.
  • the at least one KPI of the target network object may include at least one of the CPU usage rate, memory usage rate, or throughput rate of the target network object.
  • the first device may determine a target forwarding device, which is a forwarding device that sends service information of a service flow belonging to the target network object, and sends a collection instruction to the target forwarding device.
  • the collection instruction includes identification information of the target network object.
  • the target forwarding device receives the collection instruction, mirrors the received service flow belonging to the target network object according to the identification information of the target network object included in the collection instruction, and sends the service flow obtained by the mirroring to the first device.
  • Only when the target network object is detected to be in a fault state is the forwarding device instructed to mirror the received service flow belonging to the target network object and send the mirrored service flow to the first device. This achieves on-demand collection: it avoids collecting the service flows of all network objects, saves bandwidth resources, and saves the computing resources that the first device would otherwise spend on unnecessary data analysis during fault location.
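  • The on-demand collection described above can be sketched as follows: collection instructions are produced only for network objects detected as faulty, and only for the forwarding devices that reported service information for them. The instruction layout and the device names are illustrative assumptions.

```python
def collection_instructions(detection_results, senders_by_object):
    """Build (forwarding_device, instruction) pairs for faulty network objects only."""
    instructions = []
    for object_id, is_faulty in detection_results.items():
        if not is_faulty:
            continue  # healthy objects are never mirrored
        for device in senders_by_object.get(object_id, []):
            instructions.append((device, {"type": "collect", "object_id": object_id}))
    return instructions


detection_results = {"server-1": True, "server-2": False}
senders_by_object = {"server-1": ["leaf-a", "leaf-b"], "server-2": ["leaf-a"]}
instructions = collection_instructions(detection_results, senders_by_object)
```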
  • the first device may send the obtained detection sample to the third-party device.
  • the third-party device receives the detection sample, and uses the fault detection model to detect whether the target network object is in a fault state based on the detection sample.
  • In this embodiment, the service information obtained from a service flow includes the identification information of the network object and M KPIs, so the data volume of the service information is much smaller than the data volume of the service flow. Therefore, when the forwarding device sends the service information of the service flow to the first device, the consumption of network resources, and of bandwidth resources in particular, is greatly reduced.
  • The above-mentioned first device may be a cloud platform or an analyzer platform. All forwarding devices in the network architecture send the service information of the service flows to the cloud platform or analyzer platform, which can then detect network objects in a unified manner.
  • the above-mentioned first device may be an upstream device connected to the forwarding device, and the upstream device receives the service information of the service flow sent by the forwarding device connected to the upstream device, so that the upstream device detects it, which can improve detection efficiency and achieve the purpose of real-time detection.
  • an embodiment of the present application provides a method for training a fault detection model, and the training method can be applied to the network architecture provided by any of the embodiments shown in FIGS. 1 to 3.
  • the forwarding device obtains the service information of the service flow, obtains the feature set based on the service information, sends the feature set to the first device, and the first device receives the feature set and trains the fault detection model.
  • the method includes:
  • Steps 301 to 303: the same as steps 101 to 103, respectively, and will not be described in detail here.
  • Step 304 The forwarding device obtains the KPIs of N service flows belonging to the target network object in the first period, where N is an integer greater than 0, the first period is within the first time period, and the target network object is the network object to which any service flow in the first period belongs.
  • the KPI collection times of the N business flows are all located in the first cycle, and the first cycle may be any cycle.
  • the first period may be the current period.
  • Step 305 The forwarding device generates a feature set corresponding to the target network object according to the KPIs of the N service flows.
  • For the process by which the forwarding device generates the feature set, refer to the process by which the first device generates the feature set in operations 1071 to 1073 of the embodiment shown in FIG. 4, which will not be described in detail here.
  • Step 306 The forwarding device sends the feature set corresponding to the target network object to the first device.
  • the forwarding device can repeat the operations of steps 301 to 306 above to obtain the feature sets of different network objects, and send the feature sets corresponding to the different network objects to the first device.
  • Step 307 The first device receives the feature set of the target network object, and generates a training sample, the training sample includes the feature set, or the training sample includes the feature set and the label of the training sample.
  • When the target network object is in a fault state, the label of the training sample is used to identify the fault state; when the target network object is in a normal state, the label of the training sample is used to identify the normal state.
  • the first device may receive the feature set of at least one network object sent by different forwarding devices, and generate a large number of training samples, and then train the fault detection model through the following operation in step 308.
  • Step 308: the same as step 108, and will not be described in detail here.
  • the forwarding device may be the BRAS in the data communication network shown in FIG. 2 or the Spine and other devices in the data center network shown in FIG. 3.
  • the first device is a cloud platform, an analyzer platform, or other third-party devices, etc.
  • In this embodiment, the service information obtained from a service flow includes the identification information of the network object and M KPIs, and the feature set is obtained based on the service information of the network object.
  • The data volume of the feature set is therefore much smaller than the data volume of the service flow, so that when the forwarding device sends the feature set to the first device, the consumption of network resources, and of bandwidth resources in particular, is greatly reduced.
  • An embodiment of the present application provides a fault detection method, which can be applied to the network architecture provided by any of the embodiments shown in FIGS. 1 to 3.
  • In this method, the forwarding device obtains the service information of the service flows it receives, generates a feature set based on the service information, and sends the feature set to the first device. The first device receives the feature set, generates a detection sample based on the feature set, and detects the network object through the fault detection model based on the detection sample.
  • The fault detection model can be obtained through training in the embodiment shown in FIG. 4 or FIG. 7.
  • the method includes:
  • Steps 401 to 403: the same as steps 301 to 303, respectively, and will not be described in detail here.
  • Step 404 The forwarding device obtains KPIs of N service flows belonging to the target network object in the current period, where N is an integer greater than 0, and the target network object is a network object to which any service flow in the current period belongs.
  • The collection times of the KPIs of the N service flows are all within the current period.
  • Steps 405 to 406: the same as steps 305 to 306, respectively, and will not be described in detail here.
  • Step 407 The first device receives the feature set of the target network object, and generates a detection sample, where the detection sample includes the feature set.
  • Step 408: the same as step 208, and will not be described in detail here.
  • the forwarding device may be the BRAS in the data communication network shown in FIG. 2 or the Spine and other devices in the data center network shown in FIG. 3.
  • the first device is a cloud platform, an analyzer platform, or other third-party devices, etc.
  • In this embodiment, the service information of a service flow includes the identification information of the network object and M KPIs, and the feature set is obtained based on the service information of the service flow. The data volume of the feature set is therefore much smaller than the data volume of the service flow, so that when the forwarding device sends the feature set to the first device, the consumption of network resources, and of bandwidth resources in particular, is greatly reduced.
  • an embodiment of the present application provides an apparatus 500 for training a fault detection model.
  • the apparatus 500 may be deployed on the forwarding device of any of the foregoing embodiments, and includes:
  • the receiving unit 501 is configured to receive at least one service flow
  • the processing unit 502 is configured to obtain the service information of the at least one service flow.
  • The service information of a service flow includes the identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, where M is an integer greater than 0, and the network object includes one or more devices;
  • the sending unit 503 is configured to send training information to the first device, where the training information includes service information of the at least one service flow or a feature set obtained based on the service information of the at least one service flow, and the training information is used for Training a fault detection model, which is used to detect whether the network object is in a fault state.
  • the protocol type of the service flow is Transmission Control Protocol TCP
  • the processing unit 502 is configured to:
  • M KPIs of the service flow are acquired.
  • the M KPIs include at least one of: the network delay between the apparatus and the network object, the amount of data belonging to the service flow sent by the network object, or the amount of data belonging to the service flow received by the network object;
  • the processing unit 502 is configured to:
  • when the at least one target service message includes a first target service message and a second target service message, acquire the network delay between the apparatus and the network object based on the first time at which the first target service message is received and the second time at which the second target service message is received, where the first target service message is a message sent to the network object, and the second target service message is a message sent by the network object that corresponds to the first target service message; and/or,
  • when the at least one target service message includes a first start message and a first end message, acquire the amount of data belonging to the service flow sent by the network object according to the sequence number of the first start message and the sequence number of the first end message, where the first start message is the first message of the service flow sent by the network object, and the first end message is the last message of the service flow sent by the network object; and/or,
  • when the at least one target service message includes a second start message and a second end message, acquire the amount of data belonging to the service flow received by the network object according to the sequence number of the second start message and the sequence number of the second end message, where the second start message is the first message of the service flow received by the network object, and the second end message is the last message of the service flow received by the network object.
  • the M KPIs include a status identifier, and the status identifier is used to identify the status of the service flow;
  • the processing unit 502 is configured to:
  • the at least one target service message includes a first start message; if a first end message is received within a first length of time after a third time, the state identified by the status identifier is set to a successful state; if the first end message is not received, the state identified by the status identifier is set to a failed state, where the third time is the time at which the first start message is received, the first start message is the first message of the service flow sent by the network object, and the first end message is the last message of the service flow sent by the network object.
  • processing unit 502 is further configured to:
  • acquire, from the at least one service flow, the KPIs of N service flows that belong to a target network object in a first cycle, where the target network object is the network object to which any one of the at least one service flow belongs, and N is an integer greater than 0;
  • acquire the feature set based on the KPIs of the N service flows.
  • the feature set includes at least one statistical feature
  • the processing unit 502 is configured to:
  • any KPI set includes one KPI for each of the N business flows, and the KPI types included in any KPI set are the same;
  • the at least one first calculation method includes one or more of the following: performing statistics on the KPIs in any KPI set, and calculating the mean, variance, dispersion, skewness, or kurtosis of the KPIs included in any KPI set.
  • the feature set further includes at least one temporal feature
  • the processing unit 502 is further configured to:
  • the statistical feature set includes K statistical features, which are statistical features of the same type calculated in K periods; the K periods include the first period and the K-1 periods before the first period; the at least one second calculation method includes one or more of the following: calculating the period-over-period ratio (ring ratio) or the difference between two adjacent statistical features in the statistical feature set, and performing feature fitting on the statistical features in the statistical feature set.
  • the any one KPI set includes the status identifiers of the N service flows, and the status identifier of any one of the N service flows is used to identify the status of that service flow;
  • the statistical features of the any one KPI set include the number of status identifiers that identify the successful state and the number of status identifiers that identify the failed state;
  • the feature set also includes the proportion of service flows in the successful state and/or the proportion of service flows in the failed state;
  • the processing unit 502 is further configured to:
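  • calculate the proportion of service flows in the successful state according to the number of status identifiers that identify the successful state and the number of KPIs included in the any one KPI set; and/or,
  • calculate the proportion of service flows in the failed state according to the number of status identifiers that identify the failed state and the number of KPIs included in the any one KPI set.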
  • the first device is a cloud platform, an analyzer platform, or an upstream device of the device.
  • the network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a designated network segment.
  • the M KPIs are used to describe the characteristics of the service flow.
  • the receiving unit receives at least one service flow.
  • the processing unit obtains the business information of the at least one business flow, and the business information of the business flow includes the identification information of the network object to which the business flow belongs and M key performance indicators KPIs of the business flow.
  • the sending unit sends training information to the first device. Because the training information obtained by the processing unit includes the identification information of the network object and the M KPIs, or the feature set obtained based on the M KPIs of the network object, the data volume of the training information is much smaller than that of the service flow. The network resources that the sending unit needs to send the training information to the first device are therefore far fewer than those needed to send the service flow, which reduces the consumption of network resources.
  • an embodiment of the present application provides an apparatus 600 for training a fault detection model.
  • the apparatus 600 is deployed on the first device described in any of the foregoing embodiments, and includes:
  • the receiving unit 601 is configured to receive service information of at least one service flow sent by the first forwarding device, where the service information of the service flow includes the identification information of the network object to which the service flow belongs and M key performance indicators KPIs of the service flow , M is an integer greater than 0, and the network object includes one or more devices;
  • the processing unit 602 is configured to train a fault detection model according to the service information of the at least one service flow, or obtain at least one feature set used to train the fault detection model according to the service information of the at least one service flow, and the fault detection The model is used to detect whether the network object is in a fault state.
  • processing unit 602 is configured to:
  • any feature set includes at least one feature obtained based on the KPIs of each service flow belonging to a target network object, where the target network object is the network object to which any one of the at least one service flow belongs
  • any one of the feature sets includes at least one statistical feature
  • the processing unit 602 is configured to:
  • any KPI set includes one KPI for each of the N business flows, and the KPI types included in any KPI set are the same;
  • the at least one first calculation method includes one or more of the following: performing statistics on the KPIs in any KPI set, and calculating the mean, variance, dispersion, skewness, or kurtosis of the KPIs included in any KPI set.
  • any one of the feature sets further includes at least one temporal feature
  • the processing unit 602 is further configured to:
  • the statistical feature set includes K statistical features, which are statistical features of the same type calculated in K periods; the K periods include the first period and the K-1 periods before the first period; the at least one second calculation method includes one or more of the following: calculating the period-over-period ratio (ring ratio) or the difference between two adjacent statistical features in the statistical feature set, and performing feature fitting on the statistical features in the statistical feature set.
  • the any one KPI set includes the status identifiers of the N service flows, and the status identifier of any one of the N service flows is used to identify the status of that service flow;
  • the statistical features of the any one KPI set include the number of status identifiers that identify the successful state and the number of status identifiers that identify the failed state; the any one feature set also includes the proportion of service flows in the successful state and/or the proportion of service flows in the failed state;
  • the processing unit 602 is further configured to:
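  • calculate the proportion of service flows in the successful state according to the number of status identifiers that identify the successful state and the number of KPIs included in the any one KPI set; and/or,
  • calculate the proportion of service flows in the failed state according to the number of status identifiers that identify the failed state and the number of KPIs included in the any one KPI set.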
  • processing unit 602 is further configured to:
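  • generate a training sample, where the training sample includes the any one feature set and a label of the training sample.
  • when the target network object is in the fault state, the label is used to identify the fault state.
  • when the target network object is in the normal state, the label is used to identify the normal state.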
  • the apparatus 600 further includes: a sending unit 603,
  • the sending unit 603 is configured to send the at least one feature set to a training device, and the at least one feature set is used for the training device to train a fault detection model;
  • the receiving unit 601 is configured to receive the fault detection model sent by the training device.
  • the receiving unit receives the service information of at least one service flow sent by the first forwarding device, and the service information of the service flow includes the identification information of the network object to which the service flow belongs and the M key performance indicators (KPIs) of the service flow.
  • the processing unit trains a fault detection model according to the business information of the at least one service flow, or obtains at least one feature set for training the fault detection model. Since the service information sent by the forwarding device includes the identification information and KPI of the network object, the data volume of the service information is much smaller than the data volume of the service flow, thereby reducing the network resources consumed by the receiving unit to receive the service information.
  • an embodiment of the present application provides a schematic diagram of a training device 700 for a fault detection model.
  • the apparatus 700 may be the forwarding device in any of the foregoing embodiments.
  • the device 700 includes at least one processor 701, a bus system 702, a memory 703, and at least one transceiver 704.
  • the device 700 is a device with a hardware structure, and can be used to implement the functional modules in the device 500 described in FIG. 9.
  • the processing unit 502 in the device 500 shown in FIG. 9 can be implemented by calling the code in the memory 703 by the at least one processor 701.
  • the sending unit 503 may be implemented by the transceiver 704.
  • the processor 701 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of this application.
  • the above-mentioned bus system 702 may include a path for transferring information between the above-mentioned components.
  • the aforementioned transceiver 704 is used to communicate with other devices or communication networks.
  • the aforementioned memory 703 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
  • the memory can exist independently and is connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 703 is used to store application program codes for executing the solutions of the present application, and the processor 701 controls the execution.
  • the processor 701 is configured to execute the application program code stored in the memory 703, so as to realize the functions in the method of the present patent.
  • the processor 701 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 11.
  • the apparatus 700 may include multiple processors, such as the processor 701 and the processor 707 in FIG. 11.
  • each of these processors can be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • the processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
  • an embodiment of the present application provides a schematic diagram of a training device 800 for a fault detection model.
  • the apparatus 800 may be the first device in any of the foregoing embodiments.
  • the device 800 includes at least one processor 801, a bus system 802, a memory 803, and at least one transceiver 804.
  • the device 800 is a device with a hardware structure, and can be used to implement the functional modules in the device 600 described in FIG. 10.
  • the processing unit 602 in the device 600 shown in FIG. 10 can be implemented by calling the code in the memory 803 by the at least one processor 801.
  • the sending unit 603 can be implemented by the transceiver 804.
  • the processor 801 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of this application.
  • the above-mentioned bus system 802 may include a path for transferring information between the above-mentioned components.
  • the above-mentioned transceiver 804 is used to communicate with other devices or a communication network.
  • the aforementioned memory 803 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
  • the memory can exist independently and is connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 803 is used to store application program codes for executing the solutions of the present application, and the processor 801 controls the execution.
  • the processor 801 is configured to execute the application program code stored in the memory 803, so as to realize the functions in the method of the present patent.
  • the processor 801 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 12.
  • the apparatus 800 may include multiple processors, such as the processor 801 and the processor 807 in FIG. 12. Each of these processors can be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • the processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
  • an embodiment of the present application provides a training system 900 for a fault detection model.
  • the system 900 includes the device of the embodiment shown in FIG. 9 and the device of the embodiment shown in FIG. 10, or the device of the embodiment shown in FIG. 11 and the device of the embodiment shown in FIG. 12.
  • the device in the embodiment described in FIG. 9 or FIG. 11 may be the forwarding device 901, and the device in the embodiment described in FIG. 10 or FIG. 12 may be the first device 902.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.


Abstract

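This application discloses a training method, apparatus, and system for a fault detection model, and belongs to the field of communications. The method includes: a forwarding device receives at least one service flow; the forwarding device obtains service information of the at least one service flow, where the service information of a service flow includes identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object includes one or more devices; the forwarding device sends training information to a first device, where the training information includes the service information of the at least one service flow or a feature set obtained based on that service information, the training information is used to train a fault detection model, and the fault detection model is used to detect whether the network object is in a fault state. This application can reduce the consumption of network resources.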

Description

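Training method, apparatus, and system for a fault detection model
This application claims priority to Chinese Patent Application No. 202010077206.X, filed on January 24, 2020 and entitled "Training method, apparatus, and system for a fault detection model", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communications, and in particular to a training method, apparatus, and system for a fault detection model.
Background
A data communication network or a data center network includes a large number of network objects such as terminals or servers. A network object is connected to an access device, and the access device is connected to a wide area network through a forwarding device, so that the network object can transmit service flows through the access device, the forwarding device, and the wide area network.
When a network object fails, services may be interrupted and serious losses may result. The health of network objects therefore needs to be checked, so that faults can be found in time and corresponding measures can be taken. Currently, an analysis platform can be deployed: a fault detection model is first trained on the analysis platform, and the analysis platform then uses the model to check the health of any network object.
When the fault detection model is trained, for a service flow of any network object, an access device or forwarding device in the data communication network or data center network mirrors the service flow upon receiving it and sends the mirrored service flow to the analysis platform. The analysis platform receives the service flows of the network objects and trains the fault detection model based on them.
In the process of implementing this application, the inventors found that the prior art has at least the following problem:
Because the fault detection model is trained on the service flows of network objects, the access device or forwarding device must mirror each service flow and send the mirrored flow to the analysis platform, which consumes a large amount of network resources.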
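Summary
This application provides a training method, apparatus, and system for a fault detection model, to reduce the consumption of network resources. The technical solutions are as follows:
According to a first aspect, this application provides a training method for a fault detection model. In the method, a forwarding device receives at least one service flow. The forwarding device obtains service information of the at least one service flow, where the service information of a service flow includes identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object includes one or more devices. The forwarding device sends training information to a first device, where the training information includes the service information of the at least one service flow or a feature set obtained based on that service information, the training information is used to train a fault detection model, and the fault detection model is used to detect whether the network object is in a fault state.
Because the training information obtained by the forwarding device includes the identification information of the network object and the M KPIs, or a feature set obtained based on the M KPIs of the network object, the data volume of the training information is much smaller than that of the service flow. Sending the service information to the first device therefore requires far fewer network resources than sending the service flow, which reduces the consumption of network resources.
In a possible implementation, the forwarding device obtains at least one target service message from the service flow according to configuration policy information, where the configuration policy information includes at least one preset message type. The forwarding device obtains the M KPIs of the service flow according to the at least one target service message. Because only the target service messages are extracted from the service flow and the M KPIs are obtained from them, fewer messages need to be analyzed and the KPIs are obtained more efficiently.
In another possible implementation, the M KPIs include at least one of the network delay between the forwarding device and the network object, the amount of data belonging to the service flow that the network object sends, and the amount of data belonging to the service flow that the network object receives. The at least one target service message includes a first target service message and a second target service message, and the forwarding device obtains the network delay between the forwarding device and the network object according to a first time at which the first target service message is received and a second time at which the second target service message is received, where the first target service message is a message sent to the network object, and the second target service message is a message sent by the network object that corresponds to the first target service message. And/or, the at least one target service message includes a first start message and a first end message, and the forwarding device obtains the amount of data belonging to the service flow that the network object sends according to the sequence number of the first start message and the sequence number of the first end message, where the first start message is the first message of the service flow sent by the network object, and the first end message is the last message of the service flow sent by the network object. And/or, the at least one target service message includes a second start message and a second end message, and the forwarding device obtains the amount of data belonging to the service flow that the network object receives according to the sequence number of the second start message and the sequence number of the second end message, where the second start message is the first message of the service flow received by the network object, and the second end message is the last message of the service flow received by the network object. In this way, the network delay and the amount of data sent or received by the network object can be obtained accurately.
In another possible implementation, the M KPIs include a status identifier, which is used to identify the status of the service flow. The at least one target service message includes a first start message. Within a first length of time after a third time, if the forwarding device receives a first end message, it sets the state identified by the status identifier to a successful state; if it does not receive the first end message, it sets the state identified by the status identifier to a failed state. The third time is the time at which the first start message is received, the first start message is the first message of the service flow sent by the network object, and the first end message is the last message of the service flow sent by the network object. In this way, the status identifier of the service flow can be obtained accurately.
In another possible implementation, the forwarding device obtains, from the at least one service flow, the KPIs of N service flows that belong to a target network object in a first period, where the target network object is the network object to which any one of the at least one service flow belongs, and N is an integer greater than 0. The forwarding device obtains a feature set based on the KPIs of the N service flows. Because the feature set includes features obtained from the KPIs of every service flow belonging to the target network object, the feature set better reflects the health of the network object, and a fault detection model trained on it is more accurate.
In another possible implementation, the feature set includes at least one statistical feature. The forwarding device obtains M KPI sets, where any one KPI set includes one KPI of each of the N service flows, and the KPIs in any one KPI set are of the same type. The forwarding device applies at least one first calculation method to the KPIs in the any one KPI set to obtain at least one statistical feature corresponding to that KPI set. The at least one first calculation method includes one or more of the following: performing statistics on the KPIs in the any one KPI set, and calculating the mean, variance, dispersion, skewness, or kurtosis of the KPIs in the any one KPI set. Because different statistical features are computed and combined into the feature set, the feature set is richer and better reflects the health of the network object.
In another possible implementation, the feature set further includes at least one temporal feature. The forwarding device applies at least one second calculation method to the statistical features in a statistical feature set to obtain at least one temporal feature. The statistical feature set includes K statistical features, which are statistical features of the same type calculated in K periods respectively; the K periods include the first period and the K-1 periods before the first period. The at least one second calculation method includes one or more of the following: calculating the period-over-period ratio or the difference between two adjacent statistical features in the statistical feature set, and performing feature fitting on the statistical features in the statistical feature set. Because the temporal feature is obtained from the statistical features of K periods and is included in the feature set, the feature set contains features that capture time ordering.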
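The temporal features described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `temporal_features` is hypothetical, the period-over-period ratio is taken to be the current value divided by the previous one, and "feature fitting" is interpreted as the slope of a least-squares line through the K values. The text does not fix any of these choices.

```python
def temporal_features(stats):
    """Derive temporal features from one statistical feature observed
    over K consecutive periods (oldest first).

    Assumptions (not stated in the source): ratio = current / previous,
    and "fitting" = slope of an ordinary least-squares line.
    """
    if len(stats) < 2:
        raise ValueError("at least two periods are needed")

    pairs = list(zip(stats, stats[1:]))
    # Period-over-period ratio; None where the previous value is zero.
    ratios = [cur / prev if prev != 0 else None for prev, cur in pairs]
    # Difference between adjacent periods.
    diffs = [cur - prev for prev, cur in pairs]

    # Least-squares slope with period indices 0..K-1 as x values.
    n = len(stats)
    x_mean = (n - 1) / 2
    y_mean = sum(stats) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(stats))
    den = sum((i - x_mean) ** 2 for i in range(n))
    slope = num / den

    return {"ratios": ratios, "diffs": diffs, "slope": slope}


features = temporal_features([10.0, 20.0, 40.0])
```

For example, the per-period mean network delay of one network object over three periods could be passed in as `stats`.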
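In another possible implementation, the any one KPI set includes the status identifiers of the N service flows, and the status identifier of any one of the N service flows is used to identify the status of that service flow. The statistical features of the any one KPI set include the number of status identifiers that identify the successful state and the number of status identifiers that identify the failed state. The feature set further includes the proportion of service flows in the successful state and/or the proportion of service flows in the failed state. The proportion of service flows in the successful state is calculated according to the number of status identifiers that identify the successful state and the number of KPIs in the any one KPI set; and/or, the proportion of service flows in the failed state is calculated according to the number of status identifiers that identify the failed state and the number of KPIs in the any one KPI set. Because the feature set also covers status identifiers, the features in the feature set are further enriched.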
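A minimal sketch of the counts and proportions described above, assuming the status identifiers are represented as the strings `"success"` and `"failure"` (a representation chosen here for illustration; the source does not specify one):

```python
def state_proportions(status_flags):
    """Count success/failure identifiers in one KPI set and derive the
    proportion of service flows in each state."""
    total = len(status_flags)
    if total == 0:
        raise ValueError("the KPI set is empty")
    success = sum(1 for s in status_flags if s == "success")
    failure = sum(1 for s in status_flags if s == "failure")
    return {
        "success_count": success,
        "failure_count": failure,
        "success_ratio": success / total,  # count / number of KPIs in the set
        "failure_ratio": failure / total,
    }


result = state_proportions(["success", "success", "failure", "success"])
```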
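In another possible implementation, the first device is a cloud platform, an analyzer platform, or an upstream device of the forwarding device.
In another possible implementation, the network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a designated network segment.
In another possible implementation, the M KPIs are used to describe the characteristics of the service flow.
According to a second aspect, this application provides a training method for a fault detection model. In the method, a first device receives service information of at least one service flow sent by a first forwarding device, where the service information of a service flow includes identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object includes one or more devices. The first device trains a fault detection model according to the service information of the at least one service flow, or obtains, according to that service information, at least one feature set used to train the fault detection model, where the fault detection model is used to detect whether the network object is in a fault state. Because the service information sent by the forwarding device includes the identification information of the network object and the KPIs, the data volume of the service information is much smaller than that of the service flow, which reduces the network resources the first device consumes to receive the service information.
In a possible implementation, the first device obtains at least one feature set, where any one feature set includes at least one feature obtained based on the KPIs of each service flow belonging to a target network object, and the target network object is the network object to which any one of the at least one service flow belongs. The first device trains the fault detection model according to the at least one feature set. Because the feature set includes features obtained from the KPIs of every service flow belonging to the target network object, the feature set better reflects the health of the network object, and a fault detection model trained on it is more accurate.
In another possible implementation, the first device obtains the KPIs of N service flows that belong to the target network object in a first period, where the first period falls within the first time period and N is an integer greater than 0. The first device obtains M KPI sets, where any one KPI set includes one KPI of each of the N service flows, and the KPIs in any one KPI set are of the same type. The first device applies at least one first calculation method to the KPIs in the any one KPI set to obtain at least one statistical feature corresponding to that KPI set. The at least one first calculation method includes one or more of the following: performing statistics on the KPIs in the any one KPI set, and calculating the mean, variance, dispersion, skewness, or kurtosis of the KPIs in the any one KPI set. Because different statistical features are computed and combined into the feature set, the feature set is richer and better reflects the health of the network object.
In another possible implementation, the any one feature set further includes at least one temporal feature. At least one second calculation method is applied to the statistical features in a statistical feature set to obtain at least one temporal feature. The statistical feature set includes K statistical features, which are statistical features of the same type calculated in K periods respectively; the K periods include the first period and the K-1 periods before the first period. The at least one second calculation method includes one or more of the following: calculating the period-over-period ratio or the difference between two adjacent statistical features in the statistical feature set, and performing feature fitting on the statistical features in the statistical feature set. Because the temporal feature is obtained from the statistical features of K periods and is included in the feature set, the feature set contains features that capture time ordering.
In another possible implementation, the any one KPI set includes the status identifiers of the N service flows, and the status identifier of any one of the N service flows is used to identify the status of that service flow. The statistical features of the any one KPI set include the number of status identifiers that identify the successful state and the number of status identifiers that identify the failed state. The any one feature set further includes the proportion of service flows in the successful state and/or the proportion of service flows in the failed state. The proportion of service flows in the successful state is calculated according to the number of status identifiers that identify the successful state and the number of KPIs in the any one KPI set; and/or, the proportion of service flows in the failed state is calculated according to the number of status identifiers that identify the failed state and the number of KPIs in the any one KPI set. Because the feature set also covers status identifiers, the features in the feature set are further enriched.
In another possible implementation, a training sample is generated, where the training sample includes the any one feature set and a label of the training sample. When the target network object is in the fault state, the label identifies the fault state; when the target network object is in the normal state, the label identifies the normal state. Because the training samples are labeled, the fault detection model can be trained in a supervised manner.
In another possible implementation, the first device sends the at least one feature set to a training device, where the at least one feature set is used by the training device to train the fault detection model. The first device receives the fault detection model sent by the training device. In this way, a higher-performance training device can be used to train the fault detection model, which improves training efficiency.
In another possible implementation, the M KPIs of the service flow include at least one of the network delay between the network object and the forwarding device, the amount of data belonging to the service flow that the network object sends, the amount of data belonging to the service flow that the network object receives, or the status identifier of the service flow, where the status identifier is used to identify the status of the service flow.
In another possible implementation, the first device is a cloud platform, an analyzer platform, or an upstream device of the forwarding device.
In another possible implementation, the network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a designated network segment.
In another possible implementation, the M KPIs are used to describe the characteristics of the service flow.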
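According to a third aspect, this application provides a training apparatus for a fault detection model, configured to perform the method in the first aspect or any possible implementation of the first aspect. Specifically, the apparatus includes units for performing the method in the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, this application provides a training apparatus for a fault detection model, configured to perform the method in the second aspect or any possible implementation of the second aspect. Specifically, the apparatus includes units for performing the method in the second aspect or any possible implementation of the second aspect.
According to a fifth aspect, this application provides a training apparatus for a fault detection model, where the apparatus includes a processor, a memory, and a transceiver. The processor, the memory, and the transceiver may be connected through a bus system. The memory is configured to store one or more programs, and the processor is configured to execute the one or more programs in the memory, so that the apparatus performs the method in the first aspect or any possible implementation of the first aspect.
According to a sixth aspect, this application provides a training apparatus for a fault detection model, where the apparatus includes a processor, a memory, and a transceiver. The processor, the memory, and the transceiver may be connected through a bus system. The memory is configured to store one or more programs, and the processor is configured to execute the one or more programs in the memory, so that the apparatus performs the method in the second aspect or any possible implementation of the second aspect.
According to a seventh aspect, this application provides a computer-readable storage medium that stores program code. When the program code runs on a computer, the computer performs the method in the first aspect, the second aspect, any possible implementation of the first aspect, or any possible implementation of the second aspect.
According to an eighth aspect, this application provides a computer program product containing program code. When the program code runs on a computer, the computer performs the method in the first aspect, the second aspect, any possible implementation of the first aspect, or any possible implementation of the second aspect.
According to a ninth aspect, this application provides a training system for a fault detection model, where the system includes the apparatus according to the third aspect and the apparatus according to the fourth aspect; or the system includes the apparatus according to the fifth aspect and the apparatus according to the sixth aspect.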
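Brief Description of the Drawings
FIG. 1 is a schematic diagram of a network architecture according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of a data communication network according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of a data center network according to an embodiment of this application;
FIG. 4 is a flowchart of a training method for a fault detection model according to an embodiment of this application;
FIG. 5 is a flowchart of transmitting a service flow according to an embodiment of this application;
FIG. 6 is a flowchart of a fault detection method according to an embodiment of this application;
FIG. 7 is a flowchart of another training method for a fault detection model according to an embodiment of this application;
FIG. 8 is a flowchart of another fault detection method according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a training apparatus for a fault detection model according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of another training apparatus for a fault detection model according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of another training apparatus for a fault detection model according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of another training apparatus for a fault detection model according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of a training system for a fault detection model according to an embodiment of this application.
Detailed Description
The implementations of this application are described in further detail below with reference to the accompanying drawings.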
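Referring to FIG. 1, an embodiment of this application provides a network architecture, which includes:
a network object, a forwarding device, and a first device, where a network connection is established between the forwarding device and the network object, and a network connection is also established between the forwarding device and the first device.
For a service flow belonging to the network object, the service path used to transmit the service flow passes through the forwarding device. In other words, the network connection between the network object and the forwarding device is part of the service path. Any service message in a service flow belonging to the network object may be sent by the network object; after the network object sends it, the forwarding device receives it and forwards it to its upstream device. Alternatively, the service message may be destined for the network object; the forwarding device first receives it and then forwards it to the network object.
Optionally, the network architecture may further include a training device, and a network connection may be established between the first device and the training device.
Optionally, the network object may be a terminal, a server, a router, a switch, a client, a virtual machine, a device in a virtual local area network (VLAN), a device in a designated network segment, or the like. A network segment is an address range that includes the addresses of multiple devices.
Optionally, when the network object is a terminal, a server, a router, a switch, a client, a virtual machine, or the like, the identification information of the network object is the address of the network object. When the network object is a VLAN or a network segment, the identification information of the network object is the identification information of the VLAN or of the network segment.
Optionally, the first device is an upstream device of the forwarding device, or the first device is a cloud platform or an analyzer platform. When the first device is an upstream device of the forwarding device, the service path used to transmit the service flow may pass through the first device. That is, the network connection between the forwarding device and the first device is also part of the service path, and the first device may be used to forward the messages in the service flows of the network object.
Optionally, the network architecture further includes a network management device, which may establish network connections with the network objects in the network architecture and with the first device in the network architecture.
Optionally, the network architecture may be applied to a data communication network. Referring to the data communication network shown in FIG. 2, the data communication network includes at least one terminal, at least one optical network terminal (ONT), at least one optical line terminal (OLT), a broadband remote access server (BRAS), and a core router (CR). Each terminal is connected to one ONT. Each ONT is connected to one OLT. Each of the at least one OLT is also connected to the BRAS, the BRAS is also connected to the CR, and the CR may be connected to a wide area network.
Optionally, a cloud platform or an analyzer platform may also be deployed in the data communication network. The cloud platform or analyzer platform has network connections to the ONTs in the data communication network, and/or to the OLTs in the data communication network, and/or to the BRAS in the data communication network.
In the data communication network, the forwarding device may be an ONT, an OLT, a BRAS, or the like. The network object may be a terminal. The first device may be an upstream device of the forwarding device; for example, if the forwarding device is an ONT or an OLT, the first device may be the BRAS. Alternatively, the first device may be a cloud platform or an analyzer platform that has a network connection to the forwarding device.
Optionally, the network architecture may be applied to a data center network. Referring to the data center network shown in FIG. 3, the data center network includes at least one server, at least one leaf switch (Leaf), at least one spine switch (Spine), and a gateway (GW). Each server is connected to one Leaf. Each Leaf is connected to at least one Spine. Each of the at least one Spine is also connected to the GW, and the GW may also be connected to a wide area network.
Optionally, a cloud platform may also be deployed in the data center network, with network connections to the Leafs in the data center network and/or to the Spines. Alternatively, an analyzer platform may be deployed in the data center network, with network connections to the Leafs in the data center network and/or to the Spines.
In the data center network, the forwarding device may be a Leaf, a Spine, a GW, or the like. The network object may be a server. The first device may be an upstream device of the forwarding device; for example, if the forwarding device is a Leaf, the first device may be a Spine. Alternatively, the first device may be a cloud platform or an analyzer platform that has network connections to the forwarding devices.
The network objects in the network architecture provided in the embodiments of this application may fail. When a network object fails, services may be interrupted and serious losses may result, so failed network objects need to be detected in time. To this end, the first device may train a fault detection model itself or have it trained by the training device. The fault detection model is used to detect whether a network object is in a fault state, so that the first device can use the model to detect failed network objects in time.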
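To make it possible to train the fault detection model, the network management device may configure some of the network objects in the network architecture to be in a certain fault state during a first time period. When a forwarding device in the network architecture receives a service flow of a network object, it obtains the service information of the service flow, which includes the identification information of the network object and M key performance indicators (KPIs), where M is an integer greater than 0. The forwarding device then sends training information to the first device; the training information includes the service information of the service flow or a feature set obtained based on the service information. The first device receives the training information sent by the forwarding devices in the network architecture and, based on it, trains an intelligent model to obtain the fault detection model. The fault detection model can be used to detect whether a network object in the network architecture is in that fault state.
The detailed process by which the forwarding device obtains the training information, and the detailed process by which the first device trains the fault detection model, are described in the embodiments shown in FIG. 4 or FIG. 7 below and are not covered here.
Optionally, the fault state may be a delay fault state, a link establishment fault state, or the like.
After the first device has trained the fault detection model, when the forwarding device receives a service flow of a network object, it obtains the service information of the service flow and sends detection information to the first device; the detection information includes the service information of the service flow or a feature set obtained based on the service information. The first device receives the detection information sent by the forwarding devices in the network architecture and, based on it, uses the fault detection model to detect the network objects in the network architecture that are in the fault state.
For the stage of detecting whether a network object is in the fault state, the detailed process by which the forwarding device obtains the detection information, and the detailed process by which the first device checks network objects, are described in the embodiments shown in FIG. 6 or FIG. 8 below and are not covered here.
Referring to FIG. 4, an embodiment of this application provides a training method for a fault detection model. The training method may be applied to the network architecture provided in any of the embodiments shown in FIG. 1 to FIG. 3. In the method, the forwarding device obtains the service information of a service flow and sends it to the first device, and the first device receives the service information and trains the fault detection model. The method includes:
Step 101: The forwarding device receives a service flow.
The forwarding device is a device on the service path used to transmit the service flow, so it receives every service message belonging to the service flow. When the forwarding device receives a service message of the service flow, it can proceed to step 102 below.
It should be noted that, before this embodiment is performed, the network management device may also send fault configuration information to some of the network objects in the network architecture and to the first device. The fault configuration information includes the start time of the first time period and a fault state. In addition, the network management device sends configuration policy information to the forwarding device; the configuration policy information includes at least one of at least one preset message type, a protocol type, and the like.
Optionally, the fault configuration information further includes the end time of the first time period, and the configuration policy information further includes the fault state.
The first time period is the time during which the fault detection model is trained; that is, the fault detection model is trained during the first time period using the training method provided in the embodiments of this application.
The network management device also sends an object set to the first device. The object set includes the identification information of each network object that is in the fault state during the first time period.
For a network object that receives the fault configuration information: if the fault configuration information includes the start time of the first time period and the fault state, the network object determines the first time period according to the start time and a duration threshold. If the fault configuration information also includes the end time of the first time period, the network object determines the first time period according to the start time and the end time. The network object then operates in the fault state during the first time period.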
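The two ways of determining the first time period can be sketched as follows. The function name and the default duration threshold of 30 minutes are hypothetical; the source only says that a duration threshold is used when no end time is configured.

```python
from datetime import datetime, timedelta


def first_time_period(start, end=None, duration_threshold=timedelta(minutes=30)):
    """Return (start, end) of the first time period.

    If the fault configuration carries no end time, fall back to
    start + duration_threshold (the 30-minute default is an assumption).
    """
    if end is None:
        end = start + duration_threshold
    if end <= start:
        raise ValueError("end time must be later than start time")
    return start, end


start = datetime(2020, 1, 24, 9, 0, 0)
period_by_threshold = first_time_period(start)
period_by_end_time = first_time_period(start, end=datetime(2020, 1, 24, 10, 0, 0))
```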
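For the first device: if the fault configuration information includes the start time of the first time period and the fault state, the first device determines the first time period according to the start time and a duration threshold. If the fault configuration information also includes the end time of the first time period, the first device determines the first time period according to the start time and the end time. The first device then starts the procedure for training the fault detection model within the first time period.
When the first device receives the object set, it also saves the received object set.
As for the configuration policy information, when the protocol type it includes is the Transmission Control Protocol (TCP), the configuration policy information includes at least one preset message type. The at least one preset message type may include at least one of a synchronous (SYN) message, a synchronous acknowledgement (SYN ACK) message, a finish (FIN) message, a reset (RST) message, and the like. When the protocol type it includes is the User Datagram Protocol (UDP), the configuration policy information need not include a preset message type.
Optionally, a technician enters the start time of the first time period and the fault state into the network management device, and the network management device receives them. The technician may also enter the end time of the first time period, which the network management device may also receive. The network management device generates the fault configuration information based on the received information.
Optionally, the technician enters the identification information of some of the network objects in the network architecture into the network management device, and the network management device then sends the fault configuration information to those network objects according to their identification information.
Optionally, the forwarding device may be an ONT or an OLT in the data communication network shown in FIG. 2, or an access device such as a Leaf in the data center network shown in FIG. 3. In that case, the first device is a cloud platform, an analyzer platform, the BRAS in the data communication network, a Spine in the data center network, another third-party device, or the like. Alternatively,
optionally, the forwarding device may be the BRAS in the data communication network shown in FIG. 2, or a device such as a Spine or the GW in the data center network shown in FIG. 3. In that case, the first device is a cloud platform, an analyzer platform, another third-party device, or the like.
Step 102: The forwarding device obtains at least one target service message from the service flow according to the configuration policy information.
In this step, when the protocol type in the configuration policy information is TCP and the forwarding device receives a service message, the forwarding device checks whether the protocol type of the service message is TCP and whether the message type of the service message is one of the preset message types in the configuration policy information. If the protocol type is TCP and the message type is a preset message type, the forwarding device treats the service message as a target service message and saves it.
Referring to FIG. 5, when a terminal and a server use TCP to transmit a service flow, the terminal first sends a SYN message to the server. The SYN message requests the establishment of a TCP connection between the terminal and the server; the TCP connection is the service path used to transmit the service flow between the terminal and the server. After receiving the SYN message, the server sends a SYN ACK message to the terminal. When the terminal receives the SYN ACK message, the TCP connection between the terminal and the server is established. The terminal and the server then use the TCP connection to transmit service messages. When the terminal has finished sending or receiving the service flow, it sends a FIN message to the server. The server receives the FIN message and sends a FIN message or an RST message to the terminal. The terminal receives the FIN message or RST message and closes the TCP connection to the server.
From the above process, the following can be concluded: the first service message of the service flow that the terminal sends to the server is the SYN message, which is also the start message for establishing the service path. The first service message of the service flow that the server sends to the terminal is the SYN ACK message, which is also the end message for establishing the service path. The last service message of the service flow that the terminal sends to the server is a FIN message, and the last service message of the service flow that the server sends to the terminal is a FIN message or an RST message.
When the protocol type in the configuration policy information is TCP, the target service messages that the forwarding device obtains from the service flow in this step include at least one of a SYN message, a SYN ACK message, a FIN message, an RST message, and the like.
When the protocol type in the configuration policy information is UDP and the forwarding device receives a service message, the forwarding device checks whether the protocol type of the service message is UDP. If it is, the forwarding device treats the service message as a target service message and saves it.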
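The selection rule in step 102 can be sketched as follows. The dictionary layout of `packet` and `policy`, and the helper names, are assumptions made for illustration; a real forwarding device would parse these fields from the message headers.

```python
def classify_tcp(flags):
    """Map a set of TCP flag names to one of the message types named in
    the text, or None if the message is none of them."""
    flags = set(flags)
    if "RST" in flags:
        return "RST"
    if "FIN" in flags:
        return "FIN"
    if "SYN" in flags:
        return "SYN ACK" if "ACK" in flags else "SYN"
    return None


def is_target_message(packet, policy):
    """Return True if the packet is a target service message under the
    configuration policy (hypothetical dict-based representation)."""
    if packet["protocol"] != policy["protocol"]:
        return False
    if policy["protocol"] == "UDP":
        # For UDP the policy need not list message types: keep every message.
        return True
    return classify_tcp(packet.get("flags", ())) in policy["message_types"]


policy = {"protocol": "TCP", "message_types": {"SYN", "SYN ACK", "FIN", "RST"}}
keep = is_target_message({"protocol": "TCP", "flags": {"SYN", "ACK"}}, policy)
```

Only the messages that pass this check are stored, which is what keeps the number of messages to analyze small in the TCP case.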
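Step 103: The forwarding device obtains the service information of the service flow according to the at least one target service message. The service information includes the identification information of at least one network object to which the service flow belongs and M KPIs of the service flow, and the M KPIs are used to describe the characteristics of the service flow.
The identification information of a network object may be the address of the network object or the like. The address may be an Internet Protocol (IP) address, a media access control (MAC) address, or the like.
When the protocol type in the configuration policy information is TCP, the M KPIs include at least one of the network delay between the forwarding device and the network object, the amount of data belonging to the service flow that the network object sends, the amount of data belonging to the service flow that the network object receives, a status identifier, and the like. The status identifier is used to identify the status of the service flow. The network object may be a terminal, a server, or the like.
When the protocol type in the configuration policy information is UDP, the M KPIs include at least one of the amount of data belonging to the service flow that the network object sends, the amount of data belonging to the service flow that the network object receives, and the like.
In this step, the forwarding device obtains, from the at least one target service message, the target service messages that belong to the same service flow, and obtains the M KPIs of the service flow according to the target service messages belonging to that service flow.
Each target service message includes five-tuple information, which identifies the service flow to which the service message belongs. The five-tuple information may include the address of the source device, the address of the destination device, the port number of the source device, the port number of the destination device, and the protocol type.
Optionally, the forwarding device obtains, from the at least one target service message, the target service messages that include the same five-tuple information and treats them as the target service messages belonging to the same service flow.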
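Grouping target service messages by five-tuple can be sketched as below. `FiveTuple` and the `five_tuple` key are illustrative names; the grouping follows the text literally and treats only identical five-tuples as the same service flow.

```python
from collections import defaultdict, namedtuple

# Hypothetical record type for the five-tuple described in the text.
FiveTuple = namedtuple(
    "FiveTuple", ["src_addr", "dst_addr", "src_port", "dst_port", "protocol"]
)


def group_by_flow(messages):
    """Group target service messages that carry the same five-tuple."""
    flows = defaultdict(list)
    for msg in messages:
        flows[msg["five_tuple"]].append(msg)
    return dict(flows)


a = FiveTuple("10.0.0.1", "10.0.0.9", 40000, 443, "TCP")
b = FiveTuple("10.0.0.2", "10.0.0.9", 40001, 443, "TCP")
flows = group_by_flow([
    {"five_tuple": a, "type": "SYN"},
    {"five_tuple": b, "type": "SYN"},
    {"five_tuple": a, "type": "FIN"},
])
```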
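When the protocol type of the target service messages belonging to the same service flow is TCP, the following describes, one by one, how the forwarding device obtains each KPI:
For the network delay between the forwarding device and the network object, where the network object may be a server or a terminal: when the target service messages obtained by the forwarding device include a first target service message and a second target service message, the forwarding device obtains the network delay between the forwarding device and the network object according to a first time at which the first target service message is received and a second time at which the second target service message is received. The first target service message is a message sent to the network object, and the second target service message is a message sent by the network object that corresponds to the first target service message.
Optionally, the first target service message is the start message for establishing the service path, and the second target service message is the end message for establishing the service path.
Optionally, the first time is subtracted from the second time to obtain the network delay between the forwarding device and the network object.
Optionally, the first target service message may be the SYN message sent by the terminal, and the second target service message may be the SYN ACK message sent by the server.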
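The delay calculation amounts to a single subtraction of the two receive timestamps recorded at the forwarding device. A minimal sketch, with the function name and the use of seconds as the unit being assumptions:

```python
def network_delay(first_time, second_time):
    """Delay between the forwarding device and the network object:
    time the reply (e.g. SYN ACK) was seen minus time the request
    (e.g. SYN) was seen, both measured at the forwarding device."""
    if second_time < first_time:
        raise ValueError("the reply cannot be received before the request")
    return second_time - first_time


# SYN observed at t = 10.0 s, matching SYN ACK observed at t = 10.25 s.
delay = network_delay(10.0, 10.25)
```

Because both timestamps are taken on the same device, no clock synchronization between the forwarding device and the network object is needed.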
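For the amount of data belonging to the service flow that the network object sends, where the network object may be a server or a terminal: when the target service messages obtained by the forwarding device include a first start message and a first end message, the forwarding device obtains the amount of data belonging to the service flow that the network object sends according to the sequence number of the first start message and the sequence number of the first end message. The first start message is the first message of the service flow sent by the network object, and the first end message is the last message of the service flow sent by the network object.
Optionally, the sequence number of the first start message is subtracted from the sequence number of the first end message to obtain the amount of data belonging to the service flow that the network object sends.
When the network object is a terminal, the first start message is the SYN message sent by the terminal, and the first end message is the FIN message sent by the terminal. The forwarding device subtracts the sequence number of the SYN message from the sequence number of the FIN message to obtain the amount of data belonging to the service flow that the terminal sends.
When the network object is a server, the first start message is the SYN ACK message sent by the server, and the first end message is the FIN message or RST message sent by the server. The forwarding device subtracts the sequence number of the SYN ACK message from the sequence number of the FIN message to obtain the amount of data belonging to the service flow that the server sends. Alternatively, the forwarding device subtracts the sequence number of the SYN ACK message from the sequence number of the RST message to obtain the amount of data belonging to the service flow that the server sends.
For the amount of data belonging to the service flow that the network object receives, where the network object may be a server or a terminal: when the target service messages obtained by the forwarding device include a second start message and a second end message, the forwarding device obtains the amount of data belonging to the service flow that the network object receives according to the sequence number of the second start message and the sequence number of the second end message. The second start message is the first message of the service flow received by the network object, and the second end message is the last message of the service flow received by the network object.
Optionally, the sequence number of the second start message is subtracted from the sequence number of the second end message to obtain the amount of data belonging to the service flow that the network object receives.
When the network object is a terminal, the second start message is the SYN ACK message received by the terminal, and the second end message is the FIN message or RST message received by the terminal. The forwarding device subtracts the sequence number of the SYN ACK message from the sequence number of the FIN message to obtain the amount of data belonging to the service flow that the terminal receives. Alternatively, the forwarding device subtracts the sequence number of the SYN ACK message from the sequence number of the RST message to obtain the amount of data belonging to the service flow that the terminal receives.
When the network object is a server, the second start message is the SYN message received by the server, and the second end message is the FIN message received by the server. The forwarding device subtracts the sequence number of the SYN message from the sequence number of the FIN message to obtain the amount of data belonging to the service flow that the server receives.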
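The sequence-number subtraction can be sketched as follows. The modulo step is an addition not mentioned in the text: TCP sequence numbers are 32-bit values that wrap around, so a plain subtraction would go negative on a long-lived flow. Note also that a SYN itself occupies one sequence number, so the result can differ from the exact payload byte count by a small constant.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def data_volume(start_seq, end_seq):
    """Amount of data between the start message (SYN or SYN ACK) and the
    end message (FIN or RST) of one direction of a flow.

    Wraparound handling is an assumption added here; the source simply
    subtracts the two sequence numbers.
    """
    return (end_seq - start_seq) % SEQ_SPACE


sent_by_terminal = data_volume(start_seq=1000, end_seq=6000)
```

The same function serves both directions: pass the sequence numbers of the messages the network object sent to get the sent volume, and those of the messages it received to get the received volume.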
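For the status identifier: when the target service messages obtained by the forwarding device include a first start message, then within a first length of time after a third time, if the forwarding device receives a first end message, it sets the state of the service flow identified by the status identifier to the successful state; if it does not receive the first end message, it sets the state of the service flow identified by the status identifier to the failed state. The third time is the time at which the first start message is received. The first start message is the SYN ACK message sent by the server, and the first end message is the FIN message or RST message sent by the server.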
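The status rule above can be sketched as a timeout check. The string values `"success"` and `"failure"`, and the use of `None` for "no end message seen", are representation choices made here for illustration.

```python
def flow_status(third_time, end_time, first_duration):
    """Return the state of a service flow.

    third_time:     when the first start message was received
    end_time:       when the first end message was received, or None
    first_duration: the first length of time (same unit as the timestamps)
    """
    if end_time is not None and third_time <= end_time <= third_time + first_duration:
        return "success"
    return "failure"


status = flow_status(third_time=100.0, end_time=130.0, first_duration=60.0)
```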
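When the protocol type of the target service packets belonging to the same service flow is UDP, the forwarding device obtains each KPI as follows:
For the amount of data belonging to the service flow that the network object sends, the forwarding device determines whether the time elapsed since it received the last service packet of the service flow has reached a second duration. If it has, the network object has finished transmitting the service flow. From the target service packets belonging to the service flow, the forwarding device selects those whose source address is the address of the network object; each selected packet is a service packet sent by the network object. The forwarding device sums the data amounts of the selected packets to obtain the amount of data belonging to the service flow that the network object sends.
For example, when the network object is a terminal, the forwarding device selects, from the target service packets belonging to the service flow, those whose source address is the address of the terminal and sums their data amounts to obtain the amount of data belonging to the service flow that the terminal sends. Likewise, when the network object is a server, the forwarding device selects the target service packets whose source address is the address of the server and sums their data amounts to obtain the amount of data belonging to the service flow that the server sends.
For the amount of data belonging to the service flow that the network object receives, once the forwarding device determines that the network object has finished transmitting the service flow, it selects, from the target service packets belonging to the service flow, those whose destination address is the address of the network object; each selected packet is a service packet received by the network object. The forwarding device sums the data amounts of the selected packets to obtain the amount of data belonging to the service flow that the network object receives.
For example, when the network object is a terminal, the forwarding device selects, from the target service packets belonging to the service flow, those whose destination address is the address of the terminal and sums their data amounts to obtain the amount of data belonging to the service flow that the terminal receives. Likewise, when the network object is a server, the forwarding device selects the target service packets whose destination address is the address of the server and sums their data amounts to obtain the amount of data belonging to the service flow that the server receives.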
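The UDP accounting described above can be sketched as follows. The `UdpPacket` record and the function name are hypothetical; the sketch assumes the forwarding device keeps the timestamp, addresses, and size of each target service packet of the flow.

```python
from typing import List, NamedTuple, Optional, Tuple

class UdpPacket(NamedTuple):
    timestamp: float  # when the forwarding device saw the packet
    src: str          # source address
    dst: str          # destination address
    size: int         # data amount carried by the packet

def udp_volumes(packets: List[UdpPacket], obj_addr: str, now: float,
                second_duration: float) -> Optional[Tuple[int, int]]:
    """Return (sent, received) for obj_addr once the flow has been idle long enough."""
    if not packets:
        return None
    last_seen = max(p.timestamp for p in packets)
    if now - last_seen < second_duration:
        return None  # the flow may still be in progress
    sent = sum(p.size for p in packets if p.src == obj_addr)
    received = sum(p.size for p in packets if p.dst == obj_addr)
    return sent, received

flow = [
    UdpPacket(0.0, "10.0.0.1", "10.0.0.2", 100),
    UdpPacket(0.5, "10.0.0.2", "10.0.0.1", 300),
    UdpPacket(1.0, "10.0.0.1", "10.0.0.2", 50),
]
volumes = udp_volumes(flow, "10.0.0.1", now=10.0, second_duration=5.0)
```

UDP has no closing handshake, which is why the idle timeout stands in for the FIN or RST packet used in the TCP case.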
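Optionally, when the configuration policy information further includes a fault state, the forwarding device may determine the types of KPI to obtain based on that fault state, and then obtain, through this step, the KPIs of the determined types for the service flow. The KPIs obtained are those related to the fault state.
Optionally, the forwarding device may store a correspondence between fault states and KPI types, and determine the types of KPI to obtain based on the fault state in the configuration policy information and the correspondence.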
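Step 104: The forwarding device sends the service information of the service flow to the first device. The service information includes the identification information of at least one network object to which the service flow belongs and the M KPIs of the service flow.
Optionally, the service information may further include the collection time of each of the M KPIs. The collection time of a KPI is used to determine the period to which the KPI belongs.
When the first device is an analyzer platform or a cloud platform, a network connection is established between the forwarding device and the analyzer platform or cloud platform, and the forwarding device sends the service information of the service flow to the analyzer platform or cloud platform.
When the first device is an upstream device connected to the forwarding device, the forwarding device sends the service information of the service flow to the first device. For example, when the forwarding device is an ONT or an OLT, the upstream device connected to it is a BRAS, and the forwarding device sends the service information of the service flow to the BRAS. When the forwarding device is a Leaf, the upstream device connected to it is a Spine, and the forwarding device sends the service information of the service flow to the Spine.
Any forwarding device in the network architecture performs steps 101 to 104 on the service flows it receives to obtain and send the service information of those service flows.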
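Step 105: The first device receives service information of at least one service flow.
The first device continuously receives the service information of service flows sent by different forwarding devices in the network architecture.
Step 106: The first device obtains the KPIs of N service flows that belong to a target network object within a first period, where N is an integer greater than 0, the first period lies within a first time segment, and the target network object is a network object to which any service flow in the first period belongs.
The collection times of the KPIs of the N service flows all fall within the first period, and the first period may be any period.
Optionally, the first period may be the current period.
In this step, from the service information received within the first period, the first device obtains the N pieces of service information that include the identification information of the target network object, and obtains the KPIs of the N service flows from those N pieces of service information.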
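Step 107: The first device generates one training sample based on the KPIs of the N service flows. The training sample includes a feature set obtained from the KPIs of the N service flows.
Optionally, for the KPIs of any service flow, if they include other KPIs in addition to those related to the fault state, the first device may determine the types of KPI to select based on the fault state in the fault configuration information, and then select the KPIs of the determined types from the KPIs of that service flow, that is, the KPIs related to the fault state. The KPIs of the other N-1 service flows are processed in the same way. This yields the KPIs of the N service flows that are related to the fault state, from which the training sample corresponding to the fault state is obtained.
In this step, one training sample may be generated through the following operations 1071 to 1074:
1071: The first device obtains M KPI sets. Each KPI set includes one KPI of each of the N service flows, and the KPIs in a given KPI set are of the same type.
For example, the M KPIs of a service flow include the network latency between the forwarding device and the network object, the amount of data belonging to the service flow that the network object sends, the amount of data belonging to the service flow that the network object receives, and the status identifier. The M KPI sets obtained by the first device therefore include a network latency set, a sent data amount set, a received data amount set, and a status identifier set.
The network latency set includes N network latencies, one for each of the N service flows. The sent data amount set includes N sent data amounts, one for each of the N service flows. The received data amount set includes N received data amounts, one for each of the N service flows. The status identifier set includes the status identifiers of the N service flows.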
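Step 106 and operation 1071 can be sketched together as below: filter the received service information by period and target network object, then regroup the KPIs by type. The `ServiceInfo` record, its field names, and the KPI type names are hypothetical stand-ins for the service information described in the text.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple

class ServiceInfo(NamedTuple):
    object_ids: tuple    # identification information of the network objects of the flow
    kpis: dict           # KPI type -> value (the M KPIs of one service flow)
    collected_at: float  # collection time, used to decide the period

def kpi_sets(records: List[ServiceInfo], target: str,
             period_start: float, period_len: float) -> Dict[str, list]:
    """Group the KPIs of the target object's flows in one period by KPI type."""
    sets = defaultdict(list)
    for rec in records:
        in_period = period_start <= rec.collected_at < period_start + period_len
        if in_period and target in rec.object_ids:
            for kpi_type, value in rec.kpis.items():
                sets[kpi_type].append(value)
    return dict(sets)

records = [
    ServiceInfo(("server-1", "term-a"), {"latency": 0.02, "sent": 500, "ok": True}, 3.0),
    ServiceInfo(("server-1", "term-b"), {"latency": 0.04, "sent": 700, "ok": False}, 8.0),
    ServiceInfo(("server-2", "term-a"), {"latency": 0.09, "sent": 100, "ok": True}, 4.0),
    ServiceInfo(("server-1", "term-c"), {"latency": 0.03, "sent": 900, "ok": True}, 65.0),
]
sets_for_server_1 = kpi_sets(records, "server-1", period_start=0.0, period_len=60.0)
```

Here N is 2 for `server-1` in the first 60-second period, and each of the resulting sets holds one value per flow.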
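1072: For any one of the M KPI sets, the first device computes over the KPIs in that KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to that KPI set.
The at least one first calculation method includes one or more of the following: counting the KPIs in the KPI set, or computing the mean, variance, dispersion, skewness, or kurtosis of the KPIs in the KPI set.
The dispersion of the KPIs in a KPI set is the ratio of the variance of the KPIs in the set to the mean of the KPIs in the set.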
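The skewness of the KPIs in a KPI set is
$$\mathrm{Skew} = E\left[\left(\frac{X_i-\mu}{\sigma}\right)^{3}\right]$$
where X_i is the i-th KPI in the KPI set, σ is the variance of the KPIs in the KPI set, μ is the mean of the KPIs in the KPI set, and E denotes the expectation.
The kurtosis of the KPIs in a KPI set is
$$\mathrm{Kurt} = E\left[\left(\frac{X_i-\mu}{\sigma}\right)^{4}\right]$$
where i = 1, 2, ..., N.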
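For the network latency set, the first device may compute at least one of the mean, variance, dispersion, skewness, and kurtosis of the N network latencies in the set. The statistical features corresponding to the network latency set thus include at least one of the network latency mean, network latency variance, network latency dispersion, network latency skewness, and network latency kurtosis.
For the sent data amount set, the first device may compute at least one of the mean, variance, dispersion, skewness, and kurtosis of the N sent data amounts in the set. The statistical features corresponding to the sent data amount set thus include at least one of the sent data amount mean, variance, dispersion, skewness, and kurtosis.
For the received data amount set, the first device may compute at least one of the mean, variance, dispersion, skewness, and kurtosis of the N received data amounts in the set. The statistical features corresponding to the received data amount set thus include at least one of the received data amount mean, variance, dispersion, skewness, and kurtosis.
For the status identifier set, the first device may count the status identifiers that identify the success state and those that identify the failure state. The statistical features corresponding to the status identifier set thus include the number of status identifiers that identify the success state and/or the number that identify the failure state.
Optionally, the first device computes the proportion of service flows in the success state from the number of status identifiers that identify the success state and the total number N of status identifiers in the set, and/or computes the proportion of service flows in the failure state from the number of status identifiers that identify the failure state and the total number N.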
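The first calculation methods can be sketched with the Python standard library as follows. One assumption is made loudly: the sketch divides by the standard deviation (the square root of the variance) when computing skewness and kurtosis, which is the conventional definition, even though the text describes σ as the variance. Function names and dictionary keys are illustrative.

```python
import math
from statistics import fmean, pvariance

def statistical_features(values):
    """Mean, variance, dispersion, skewness, and kurtosis of one KPI set."""
    mu = fmean(values)
    var = pvariance(values, mu)            # population variance over the N flows
    sigma = math.sqrt(var)                 # ASSUMPTION: sigma taken as standard deviation
    feats = {"mean": mu, "variance": var}
    feats["dispersion"] = var / mu if mu != 0 else None   # variance-to-mean ratio
    if sigma > 0:
        feats["skewness"] = fmean([((x - mu) / sigma) ** 3 for x in values])
        feats["kurtosis"] = fmean([((x - mu) / sigma) ** 4 for x in values])
    else:
        feats["skewness"] = feats["kurtosis"] = None      # undefined for constant data
    return feats

def state_ratios(status_flags):
    """Counts and proportions of flows in the success and failure states."""
    n = len(status_flags)
    success = sum(1 for ok in status_flags if ok)
    return {"success_count": success, "failure_count": n - success,
            "success_ratio": success / n, "failure_ratio": (n - success) / n}

latency_features = statistical_features([1.0, 2.0, 3.0, 4.0, 5.0])
ratios = state_ratios([True, True, False, True])
```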
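Optionally, the first device further obtains at least one statistical feature set. Each statistical feature set includes K statistical features of the same type, computed in K periods respectively. Using at least one second calculation method, the first device computes over the statistical features in a statistical feature set to obtain at least one time-domain feature corresponding to that set.
The K periods include the first period and the K-1 periods preceding it. The at least one second calculation method includes one or more of the following: computing the period-over-period ratio or the difference between each two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
Assuming that the K-th period is the first period, the period-over-period ratios of a statistical feature set are the ratio of the statistical feature of the 2nd period to that of the 1st period, the ratio of the statistical feature of the 3rd period to that of the 2nd period, ..., and the ratio of the statistical feature of the K-th period to that of the (K-1)-th period. Likewise, the differences of a statistical feature set are the difference between the statistical feature of the 2nd period and that of the 1st period, the difference between the statistical feature of the 3rd period and that of the 2nd period, ..., and the difference between the statistical feature of the K-th period and that of the (K-1)-th period.
Optionally, the first device may perform feature fitting on the statistical features in a statistical feature set using the following first formula to obtain a time-domain feature:
The first formula is: $v = \lambda_1 v_1 + \lambda_2 v_2 + \cdots + \lambda_K v_K$
In the first formula, v is the time-domain feature obtained by feature fitting; λ_1, λ_2, ..., λ_K are the weights of the 1st, 2nd, ..., K-th periods, and the closer a period is to the first period, the larger its weight; v_1, v_2, ..., v_K are the statistical features of the 1st, 2nd, ..., K-th periods.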
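For example, for the statistical features corresponding to the network latency set, the statistical feature set obtained by the first device may be a network latency mean set, a network latency variance set, a network latency dispersion set, a network latency skewness set, or a network latency kurtosis set. The network latency mean set includes the network latency means computed by the forwarding device in the K periods. The network latency variance set includes the network latency variances computed by the forwarding device in the K periods. The network latency skewness set includes the network latency skewness values computed by the forwarding device in the K periods. The network latency dispersion set includes the network latency dispersion values computed by the forwarding device in the K periods. The network latency kurtosis set includes the network latency kurtosis values computed by the forwarding device in the K periods.
Take the network latency mean set as an example. To obtain the period-over-period ratios between adjacent network latency means in the set, the first device computes the ratio of the network latency mean of the 2nd period to that of the 1st period, the ratio of the network latency mean of the 3rd period to that of the 2nd period, ..., and the ratio of the network latency mean of the K-th period to that of the (K-1)-th period.
To obtain the differences between adjacent network latency means in the set, the first device computes the difference between the network latency mean of the 2nd period and that of the 1st period, the difference between the network latency mean of the 3rd period and that of the 2nd period, ..., and the difference between the network latency mean of the K-th period and that of the (K-1)-th period.
To perform feature fitting on the K network latency means in the set, the first device substitutes the K network latency means for v_1, v_2, ..., v_K in the first formula and applies the first formula to obtain a time-domain feature, which is a sliding mean.
Take the network latency variance set as a further example. To obtain the period-over-period ratios between adjacent network latency variances in the set, the first device computes the ratio of the network latency variance of the 2nd period to that of the 1st period, the ratio of the network latency variance of the 3rd period to that of the 2nd period, ..., and the ratio of the network latency variance of the K-th period to that of the (K-1)-th period.
To obtain the differences between adjacent network latency variances in the set, the first device computes the difference between the network latency variance of the 2nd period and that of the 1st period, the difference between the network latency variance of the 3rd period and that of the 2nd period, ..., and the difference between the network latency variance of the K-th period and that of the (K-1)-th period.
To perform feature fitting on the K network latency variances in the set, the first device substitutes the K network latency variances for v_1, v_2, ..., v_K in the first formula and applies the first formula to obtain a time-domain feature, which is a sliding fluctuation value.
For the network latency dispersion set, the network latency skewness set, or the network latency kurtosis set, the first device performs the same operations as for the network latency mean set to obtain at least one time-domain feature corresponding to each set.
Similarly, for the statistical features corresponding to any other KPI set, the first device performs the same operations as for the network latency set: it obtains at least one statistical feature set corresponding to that KPI set and then processes each statistical feature set using at least one second calculation method to obtain at least one time-domain feature corresponding to each statistical feature set. The details are not enumerated here.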
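The second calculation methods can be sketched as below for one statistical feature set ordered from the 1st to the K-th period. The weight values in the usage example are made up for illustration; the text only requires that periods closer to the first period carry larger weights.

```python
def period_ratios(values):
    """Ratio of each period's statistical feature to that of the previous period."""
    return [values[i] / values[i - 1] for i in range(1, len(values))]

def period_differences(values):
    """Difference between each period's statistical feature and the previous one."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]

def fitted_feature(values, weights):
    """Weighted sum v = w_1*v_1 + ... + w_K*v_K (the first formula)."""
    if len(values) != len(weights):
        raise ValueError("one weight is needed per period")
    return sum(w * v for w, v in zip(weights, values))

# Network latency means of K = 3 periods; the last entry is the first (current) period.
latency_means = [2.0, 4.0, 8.0]
ratios = period_ratios(latency_means)
diffs = period_differences(latency_means)
sliding_mean = fitted_feature(latency_means, weights=[0.2, 0.3, 0.5])
```

A real implementation would also need a policy for a zero value in an earlier period, since the ratio is undefined there; the sketch leaves that case to raise `ZeroDivisionError`.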
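1073: The first device obtains a feature set. The feature set includes the at least one statistical feature corresponding to each of the M KPI sets.
Optionally, the feature set further includes at least one of the following: the at least one time-domain feature corresponding to each statistical feature set, the proportion of service flows in the success state, and the proportion of service flows in the failure state.
1074: The first device generates a training sample. The training sample includes the feature set, or the training sample includes the feature set and a label of the training sample.
Optionally, when the state of the target network object in the first time segment is the fault state, the label of the training sample identifies the fault state; when the state of the target network object in the first time segment is the normal state, the label of the training sample identifies the normal state.
Optionally, the first device determines whether an object set includes the identification information of the target network object. If it does, the first device determines that the target network object is in the fault state in the first time segment; if it does not, the first device determines that the target network object is in the normal state in the first time segment.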
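Operation 1074 can be sketched as follows, assuming the object set is available as a Python set of identifiers of faulty network objects. The dictionary layout of a sample and the label strings are hypothetical.

```python
from typing import Sequence, Set

def make_training_sample(features: Sequence[float], object_id: str,
                         faulty_objects: Set[str], fault_state: str,
                         labelled: bool = True) -> dict:
    """Build one training sample; the label is omitted for unsupervised training."""
    sample = {"features": list(features)}
    if labelled:
        # The object is in the fault state if its identifier is in the object set.
        sample["label"] = fault_state if object_id in faulty_objects else "normal"
    return sample

object_set = {"server-1"}
faulty = make_training_sample([0.9, 0.8], "server-1", object_set, "high_latency")
healthy = make_training_sample([0.1, 0.0], "server-2", object_set, "high_latency")
unlabelled = make_training_sample([0.1, 0.0], "server-2", object_set, "high_latency",
                                  labelled=False)
```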
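The first device repeats steps 106 and 107 within the first time segment to obtain a large number of training samples, which form a training sample set. The first device then performs step 108.
Step 108: The first device trains an intelligent model on the training sample set to obtain a fault detection model.
In this step, the first device may train the intelligent model in a supervised or an unsupervised manner.
In supervised training, each training sample in the training sample set has corresponding annotation information, and the training process may be as follows:
1081: The first device inputs the training sample set into the intelligent model.
Optionally, the first device may input the training samples in the training sample set into the intelligent model in multiple batches of A training samples each, where A is an integer greater than 0.
1082: The intelligent model processes each training sample in the training sample set to obtain a processing result corresponding to each training sample.
Optionally, the intelligent model processes the A input training samples to obtain a processing result corresponding to each of the A training samples.
1083: Based on the annotation information and the processing result of each training sample, the intelligent model computes a gradient matrix through the gradient descent function corresponding to each network parameter in a parameter set, and adjusts at least one network parameter of the intelligent model according to the gradient matrix. The parameter set includes the at least one network parameter.
For any network parameter in the parameter set, the gradient value corresponding to each training sample is computed through the gradient descent function corresponding to that network parameter, based on the annotation information and the processing result of each training sample. The gradient values corresponding to the training samples form one row of the gradient matrix.
Optionally, the intelligent model computes the gradient value corresponding to each training sample through the gradient descent function corresponding to that network parameter, based on the annotation information and the processing result of each of the A training samples.
If the training sample set still contains training samples that have not been input, the first device inputs another A such training samples into the intelligent model, and the intelligent model performs operations 1082 and 1083 again. If no such training samples remain, operation 1084 is performed.
1084: Based on the annotation information and the processing result of each training sample in the training sample set, the intelligent model computes a loss function value using a loss function and determines from that value whether to continue training. If training is to continue, the process returns to 1081. If training is to stop, the intelligent model at that point is taken as the fault detection model, and the process ends.
In this step, training stops if the loss function value is less than a loss threshold and continues otherwise.
Optionally, the intelligent model used in supervised training is a support vector machine (SVM), a logistic regression algorithm, a random forest algorithm, or a neural network model. The neural network model may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory (LSTM) network, or the like.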
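Operations 1081 to 1084 can be sketched with logistic regression, one of the model types named above, written in plain Python. This is an illustrative sketch, not the patented training procedure: the learning rate, the `max_epochs` safety cap, the two-feature samples, and all function names are assumptions, and the per-parameter gradient matrix is collapsed into per-batch gradient sums.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Processing result for one sample: estimated probability of the fault state."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss(weights, bias, samples):
    """Cross-entropy loss over the whole training sample set (operation 1084)."""
    eps = 1e-12
    total = 0.0
    for x, y in samples:
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(samples)

def train(samples, batch_size=2, lr=0.5, loss_threshold=0.1, max_epochs=5000):
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(max_epochs):                       # cap is an added safety assumption
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]  # 1081: input A samples
            grad_w, grad_b = [0.0] * dim, 0.0
            for x, y in batch:
                err = predict(weights, bias, x) - y    # 1082: result vs. annotation
                for j in range(dim):
                    grad_w[j] += err * x[j]            # 1083: accumulate gradients
                grad_b += err
            for j in range(dim):
                weights[j] -= lr * grad_w[j] / len(batch)
            bias -= lr * grad_b / len(batch)
        if mean_loss(weights, bias, samples) < loss_threshold:
            break                                      # 1084: stop below the threshold
    return weights, bias

# Features: (latency mean, failure ratio); label 1 = fault state, 0 = normal state.
training_set = [((0.1, 0.0), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1), ((0.8, 0.9), 1)]
weights, bias = train(training_set)
```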
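In unsupervised training, the intelligent model includes an encoder and a decoder, the training samples in the training sample set have no corresponding annotation information, and the training process may be as follows:
1181: The first device inputs the training sample set into the intelligent model.
Optionally, the first device may input the training samples in the training sample set into the intelligent model in multiple batches of A training samples each.
1182: The intelligent model processes each training sample in the training sample set to obtain a first processing result corresponding to each training sample.
Optionally, the encoder of the intelligent model encodes the A input training samples to obtain a second processing result corresponding to each of the A training samples. The decoder of the intelligent model performs recovery processing on the second processing result of each training sample to obtain the first processing result corresponding to that training sample.
The decoder performs recovery processing on the second processing result of a training sample in order to recover the training sample as closely as possible. The recovered training sample may still differ from the original, that is, the original training sample and its first processing result may differ.
1183: Based on each training sample and its first processing result, the intelligent model computes a gradient matrix through the gradient descent function corresponding to each network parameter in a parameter set, and adjusts at least one network parameter of the intelligent model according to the gradient matrix. The parameter set includes the at least one network parameter.
For any network parameter in the parameter set, the gradient value corresponding to each training sample is computed through the gradient descent function corresponding to that network parameter, based on each training sample and its first processing result. The gradient values corresponding to the training samples form one row of the gradient matrix.
Optionally, the intelligent model computes the gradient value corresponding to each training sample through the gradient descent function corresponding to that network parameter, based on each of the A training samples and its first processing result.
If the training sample set still contains training samples that have not been input, the first device inputs another A such training samples into the intelligent model, and the intelligent model performs operations 1182 and 1183 again. If no such training samples remain, operation 1184 is performed.
1184: Based on each training sample in the training sample set and its first processing result, the intelligent model computes a loss function value using a loss function and determines from that value whether to continue training. If training is to continue, the process returns to 1181. If training is to stop, the intelligent model at that point is taken as the fault detection model, and the process ends.
In this step, training stops if the loss function value is less than a loss threshold and continues otherwise.
Optionally, the intelligent model used in unsupervised training is a variational autoencoder (VAE) model, k-means, or the like.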
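Operations 1181 to 1184 can be sketched with a deliberately tiny stand-in for the encoder and decoder: a linear autoencoder with a one-dimensional code, trained by mini-batch gradient descent on the reconstruction error. It is not a VAE, and the initial parameters, learning rate, `max_epochs` cap, and sample values are all assumptions chosen for illustration.

```python
def encode(w, x):
    """Encoder: second processing result (a single number)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def decode(v, z):
    """Decoder: first processing result (the recovered sample)."""
    return [vi * z for vi in v]

def reconstruction_error(w, v, x):
    x_hat = decode(v, encode(w, x))
    return sum((a - b) ** 2 for a, b in zip(x_hat, x))

def mean_loss(w, v, samples):
    return sum(reconstruction_error(w, v, x) for x in samples) / len(samples)

def train_autoencoder(samples, batch_size=2, lr=0.1, loss_threshold=1e-4,
                      max_epochs=5000):
    dim = len(samples[0])
    w, v = [0.5] * dim, [0.5] * dim                    # arbitrary starting parameters
    for _ in range(max_epochs):                        # cap is an added safety assumption
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]  # 1181: input A samples
            grad_w, grad_v = [0.0] * dim, [0.0] * dim
            for x in batch:
                z = encode(w, x)                       # 1182: encode, then decode
                resid = [xh - xi for xh, xi in zip(decode(v, z), x)]
                back = sum(r * vi for r, vi in zip(resid, v))
                for j in range(dim):                   # 1183: gradients of squared error
                    grad_v[j] += 2 * resid[j] * z
                    grad_w[j] += 2 * back * x[j]
            for j in range(dim):
                v[j] -= lr * grad_v[j] / len(batch)
                w[j] -= lr * grad_w[j] / len(batch)
        if mean_loss(w, v, samples) < loss_threshold:
            break                                      # 1184: stop below the threshold
    return w, v

# Unlabelled feature sets in which the second feature is twice the first.
normal_samples = [(0.1, 0.2), (0.2, 0.4), (0.3, 0.6), (0.4, 0.8)]
w, v = train_autoencoder(normal_samples)
```

A model trained this way can flag a detection sample whose reconstruction error is much larger than the errors seen on the training samples, which is one common way to use an autoencoder for fault detection.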
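Optionally, the fault detection model trained by the first device is used to detect whether a network object is in a fault state, namely the fault state specified in the fault configuration information sent by the network management device.
A network object may have several fault states. For each of the other fault states, the network management device sends the corresponding fault configuration information to some of the network objects in the network architecture and to the first device. The fault configuration information corresponding to a fault state is the fault configuration information that includes that fault state. The forwarding device and the first device then train an intelligent model according to steps 101 to 108 to obtain a fault detection model for detecting that fault state. In this way, a fault detection model is obtained for each fault state.
For different fault states, the first device may also train one and the same intelligent model according to steps 101 to 108 to obtain a single fault detection model that can detect the different fault states.
Optionally, the first device may send the obtained feature set to a training device. The training device receives the feature set and generates a training sample, which includes the feature set, or includes the feature set and a label of the training sample.
The training device may generate a large number of training samples and use them to train the fault detection model. Like the first device, the training device may use either of the two training manners described above; details are not repeated here.
Optionally, the training device may further send the trained fault detection model to the first device, and the first device receives the fault detection model.
Optionally, the training device may instead keep the fault detection model, in which case the training device acts as the detection device when network objects are to be detected.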
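In this embodiment of this application, when the forwarding device receives a service flow, the service information it obtains for the flow includes only the identification information of the network object and M KPIs, so the service information is far smaller than the service flow itself. Sending the service information to the first device therefore consumes far fewer network resources, bandwidth in particular. In addition, the first device may be a cloud platform or an analyzer platform: all forwarding devices in the network architecture send the service information of service flows to it, and it trains the fault detection model centrally. However, because there are many forwarding devices and the bandwidth of the cloud platform or analyzer platform is limited, receiving the service information may take a long time, which lengthens training. Alternatively, the first device may be an upstream device connected to the forwarding devices: each upstream device receives the service information sent by the forwarding devices connected to it, so training is carried out separately by different upstream devices, which improves training efficiency.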
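Referring to FIG. 6, an embodiment of this application provides a fault detection method, which may be applied to the network architecture provided in any of the embodiments shown in FIG. 1 to FIG. 3. In this method, the forwarding device obtains the service information of the service flows it receives and sends the service information to the first device. The first device receives the service information sent by the forwarding device, generates a detection sample from it, and uses the detection sample and a fault detection model to detect the network object. The fault detection model may be trained as in the embodiment shown in FIG. 4. The method includes:
Steps 201 to 205: These are the same as steps 101 to 105, respectively, and are not described again here.
Note that the forwarding device obtains the service information of the service flow, and the KPIs in the service information include the KPIs related to each of at least one fault state.
Optionally, the service information may further include the collection time of each KPI, which is used to determine the period to which the KPI belongs.
Step 206: The first device obtains the KPIs of N service flows that belong to a target network object within the current period, where N is an integer greater than 0 and the target network object is a network object to which any service flow in the current period belongs.
The collection times of the KPIs of the N service flows all fall within the current period.
Optionally, the forwarding device may be an access device such as an ONT or OLT in the data communication network shown in FIG. 2, or a Leaf in the data center network shown in FIG. 3. In this case, the first device is a cloud platform, an analyzer platform, a BRAS in the data communication network, a Spine in the data center network, another third-party device, or the like.
Alternatively, the forwarding device may be a device such as a BRAS in the data communication network shown in FIG. 2 or a Spine in the data center network shown in FIG. 3. In this case, the first device is a cloud platform, an analyzer platform, another third-party device, or the like.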
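Step 207: The first device generates one detection sample based on the KPIs of the N service flows. The detection sample includes a feature set obtained from the KPIs of the N service flows.
The first device generates the detection sample in the same way as it generates a training sample in step 107 of the embodiment shown in FIG. 4; details are not repeated here.
Optionally, for the KPIs of any service flow, the first device may determine the types of KPI to select based on one fault state and then select the KPIs of the determined types from the KPIs of that service flow, that is, the KPIs related to that fault state. The KPIs of the other N-1 service flows are processed in the same way. This yields the KPIs of the N service flows that are related to that fault state, from which one detection sample corresponding to the fault state is generated. For the generation process, refer to the process of generating a training sample corresponding to a fault state through operations 1071 to 1074 shown in FIG. 4. In this way, the first device can generate detection samples corresponding to different fault states.
Optionally, the first device may not distinguish between fault states. That is, it generates a single detection sample from the KPIs of the N service flows, and that detection sample covers the different fault states.
Step 208: The first device uses the detection sample and the fault detection model to detect whether the target network object is in a fault state.
Optionally, the first device includes multiple fault detection models corresponding to different fault states. For the fault detection model corresponding to any fault state, the first device uses the detection sample corresponding to that fault state and that fault detection model to detect whether the target network object is in that fault state. The target network object is thus checked separately by the fault detection model of each fault state, and may be found to be in one or more fault states.
Optionally, the first device includes fault detection models for multiple different fault states. Using the detection samples corresponding to the different fault states and the fault detection models corresponding to those fault states, the first device can detect the one or more fault states that the target network object may be in.
Optionally, the first device includes one fault detection model that can detect different fault states. The first device generates one detection sample in step 207 and uses that detection sample and the fault detection model to detect whether the target network object is in one or more of the fault states.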
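The per-fault-state variant of step 208 can be sketched as below. Each fault detection model is represented by a plain callable that returns `True` when it judges the sample faulty; the threshold lambdas and the fault state names are hypothetical placeholders for trained models.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[Sequence[float]], bool]  # stand-in for a trained fault detection model

def detect_fault_states(samples: Dict[str, Sequence[float]],
                        models: Dict[str, Model]) -> List[str]:
    """Run each fault state's model on that fault state's detection sample."""
    detected = []
    for state, model in models.items():
        sample = samples.get(state)
        if sample is not None and model(sample):
            detected.append(state)
    return detected

# Placeholder models: simple thresholds on the first feature of the sample.
models = {
    "high_latency": lambda features: features[0] > 0.5,
    "connection_failure": lambda features: features[0] > 0.3,
}
detection_samples = {"high_latency": [0.8], "connection_failure": [0.1]}
faults = detect_fault_states(detection_samples, models)
```

An empty result means that none of the models placed the target network object in its fault state.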
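Optionally, when the target network object is detected to be in a fault state, the first device obtains at least one KPI of the target network object and/or the service flows of the target network object, and performs fault localization based on them.
The at least one KPI of the target network object may include at least one of the CPU usage, memory usage, and throughput of the target network object.
The first device may determine a target forwarding device, which is the forwarding device that sends the service information of the service flows belonging to the target network object, and send a collection instruction to the target forwarding device. The collection instruction includes the identification information of the target network object.
The target forwarding device receives the collection instruction, mirrors the received service flows that belong to the target network object according to the identification information in the instruction, and sends the mirrored service flows to the first device.
The forwarding device mirrors the service flows belonging to the target network object and sends them to the first device only when the target network object has been detected to be in a fault state. Collection is thus performed on demand rather than for the service flows of all network objects, which saves bandwidth and spares the first device the computing resources it would otherwise spend parsing unnecessary data during fault localization.
If the first device does not include the fault detection model and the model is located in a third-party device, the first device may send the detection sample to the third-party device. The third-party device receives the detection sample and uses it and the fault detection model to detect whether the target network object is in a fault state.
In this embodiment of this application, when the forwarding device receives a service flow, the service information it obtains for the flow includes only the identification information of the network object and M KPIs, so the service information is far smaller than the service flow itself. Sending the service information to the first device therefore consumes far fewer network resources, bandwidth in particular. In addition, the first device may be a cloud platform or an analyzer platform: all forwarding devices in the network architecture send the service information of service flows to it, and it detects the network objects centrally. However, because there are many forwarding devices and the bandwidth of the cloud platform or analyzer platform is limited, receiving the service information may take a long time, which lengthens detection. Alternatively, the first device may be an upstream device connected to the forwarding devices: the upstream device receives the service information sent by the forwarding devices connected to it and performs the detection itself, which improves detection efficiency and enables real-time detection.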
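Referring to FIG. 7, an embodiment of this application provides a method for training a fault detection model, which may be applied to the network architecture provided in any of the embodiments shown in FIG. 1 to FIG. 3. In this method, the forwarding device obtains the service information of a service flow, obtains a feature set based on the service information, and sends the feature set to the first device. The first device receives the feature set and trains the fault detection model. The method includes:
Steps 301 to 303: These are the same as steps 101 to 103, respectively, and are not described again here.
Step 304: The forwarding device obtains the KPIs of N service flows that belong to a target network object within a first period, where N is an integer greater than 0, the first period lies within a first time segment, and the target network object is a network object to which any service flow in the first period belongs.
The collection times of the KPIs of the N service flows all fall within the first period, and the first period may be any period. Optionally, the first period may be the current period.
Step 305: The forwarding device generates one feature set corresponding to the target network object based on the KPIs of the N service flows.
For this step, the forwarding device may follow the process by which the first device generates a feature set in operations 1071 to 1073 of the embodiment shown in FIG. 4; details are not repeated here.
Step 306: The forwarding device sends the feature set corresponding to the target network object to the first device.
The forwarding device may repeat steps 301 to 306 to obtain the feature sets of different network objects and send them to the first device.
Step 307: The first device receives the feature set of the target network object and generates a training sample. The training sample includes the feature set, or includes the feature set and a label of the training sample.
When the target network object is in a fault state, the label of the training sample identifies the fault state; when the target network object is in the normal state, the label identifies the normal state.
The first device may receive the feature sets of at least one network object sent by different forwarding devices, generate a large number of training samples, and then train the fault detection model through step 308.
Step 308: This is the same as step 108 and is not described again here.
Optionally, the forwarding device may be a device such as a BRAS in the data communication network shown in FIG. 2 or a Spine in the data center network shown in FIG. 3. The first device is a cloud platform, an analyzer platform, another third-party device, or the like.
In this embodiment of this application, when the forwarding device receives a service flow, it obtains service information that includes the identification information of the network object and M KPIs, and obtains a feature set based on the service information of the network object. The feature set is far smaller than the service flow, so sending the feature set to the first device consumes far fewer network resources, bandwidth in particular.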
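Referring to FIG. 8, an embodiment of this application provides a fault detection method, which may be applied to the network architecture provided in any of the embodiments shown in FIG. 1 to FIG. 3. In this method, the forwarding device obtains the service information of the service flows it receives, generates a feature set from the service information, and sends the feature set to the first device. The first device receives the feature set, generates a detection sample from it, and uses the detection sample and a fault detection model to detect the network object. The fault detection model may be trained as in the embodiment shown in FIG. 4 or FIG. 7. The method includes:
Steps 401 to 403: These are the same as steps 301 to 303, respectively, and are not described again here.
Step 404: The forwarding device obtains the KPIs of N service flows that belong to a target network object within the current period, where N is an integer greater than 0 and the target network object is a network object to which any service flow in the current period belongs.
The collection times of the KPIs of the N service flows all fall within the current period.
Steps 405 and 406: These are the same as steps 305 and 306, respectively, and are not described again here.
Step 407: The first device receives the feature set of the target network object and generates a detection sample, which includes the feature set.
Step 408: This is the same as step 208 and is not described again here.
Optionally, the forwarding device may be a device such as a BRAS in the data communication network shown in FIG. 2 or a Spine in the data center network shown in FIG. 3. The first device is a cloud platform, an analyzer platform, another third-party device, or the like.
In this embodiment of this application, when the forwarding device receives a service flow, it obtains service information that includes the identification information of the network object and M KPIs, and obtains a feature set based on the service information of the service flow. The feature set is far smaller than the service flow, so sending the feature set to the first device consumes far fewer network resources, bandwidth in particular.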
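Referring to FIG. 9, an embodiment of this application provides an apparatus 500 for training a fault detection model. The apparatus 500 may be deployed on the forwarding device of any of the foregoing embodiments and includes:
a receiving unit 501, configured to receive at least one service flow;
a processing unit 502, configured to obtain service information of the at least one service flow, where the service information of a service flow includes identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object includes one or more devices; and
a sending unit 503, configured to send training information to a first device, where the training information includes the service information of the at least one service flow or a feature set obtained based on the service information of the at least one service flow, the training information is used to train a fault detection model, and the fault detection model is used to detect whether the network object is in a fault state.
Optionally, the protocol type of the service flow is the Transmission Control Protocol (TCP), and the processing unit 502 is configured to:
obtain at least one target service packet from the service flow according to configuration policy information, where the configuration policy information includes at least one preset packet type; and
obtain the M KPIs of the service flow based on the at least one target service packet.
Optionally, the M KPIs include at least one of the network latency between the apparatus and the network object, the amount of data belonging to the service flow that the network object sends, and the amount of data belonging to the service flow that the network object receives;
the processing unit 502 is configured to:
when the at least one target service packet includes a first target service packet and a second target service packet, obtain the network latency between the apparatus and the network object based on a first time at which the first target service packet is received and a second time at which the second target service packet is received, where the first target service packet is a packet sent to the network object, and the second target service packet is a packet that is sent by the network object and corresponds to the first target service packet; and/or
when the at least one target service packet includes a first start packet and a first end packet, obtain the amount of data belonging to the service flow that the network object sends based on the sequence number of the first start packet and the sequence number of the first end packet, where the first start packet is the first packet of the service flow sent by the network object, and the first end packet is the last packet of the service flow sent by the network object; and/or
when the at least one target service packet includes a second start packet and a second end packet, obtain the amount of data belonging to the service flow that the network object receives based on the sequence number of the second start packet and the sequence number of the second end packet, where the second start packet is the first packet of the service flow received by the network object, and the second end packet is the last packet of the service flow received by the network object.
Optionally, the M KPIs include a status identifier, which identifies the state of the service flow;
the processing unit 502 is configured to:
when the at least one target service packet includes a first start packet, set the state identified by the status identifier to a success state if a first end packet is received within a first duration after a third time, and to a failure state if the first end packet is not received, where the third time is the time at which the first start packet is received, the first start packet is the first packet of the service flow sent by the network object, and the first end packet is the last packet of the service flow sent by the network object.
Optionally, the processing unit 502 is further configured to:
obtain, from the at least one service flow, the KPIs of N service flows that belong to a target network object within a first period, where the target network object is a network object to which any of the at least one service flow belongs, and N is an integer greater than 0; and
obtain a feature set based on the KPIs of the N service flows.
Optionally, the feature set includes at least one statistical feature, and the processing unit 502 is configured to:
obtain M KPI sets, where each KPI set includes one KPI of each of the N service flows, and the KPIs in a given KPI set are of the same type; and
compute over the KPIs in any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to that KPI set, where the at least one first calculation method includes one or more of the following: counting the KPIs in the KPI set, or computing the mean, variance, dispersion, skewness, or kurtosis of the KPIs in the KPI set.
Optionally, the feature set further includes at least one time-domain feature, and the processing unit 502 is further configured to:
compute over the statistical features in a statistical feature set using at least one second calculation method to obtain at least one time-domain feature;
where the statistical feature set includes K statistical features of the same type, computed in K periods respectively, the K periods include the first period and the K-1 periods preceding it, and the at least one second calculation method includes one or more of the following: computing the period-over-period ratio or the difference between each two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
Optionally, the KPI set includes the status identifiers of the N service flows, and the status identifier of any of the N service flows identifies the state of that service flow; the statistical features of the KPI set include the number of status identifiers that identify the success state and the number of status identifiers that identify the failure state; the feature set further includes the proportion of service flows in the success state and/or the proportion of service flows in the failure state;
the processing unit 502 is further configured to:
compute the proportion of service flows in the success state from the number of status identifiers that identify the success state and the number of KPIs in the KPI set; and/or
compute the proportion of service flows in the failure state from the number of status identifiers that identify the failure state and the number of KPIs in the KPI set.
Optionally, the first device is a cloud platform, an analyzer platform, or an upstream device of the apparatus.
Optionally, the network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a specified network segment.
Optionally, the M KPIs are used to describe characteristics of the service flow.
In this embodiment of this application, the receiving unit receives at least one service flow. The processing unit obtains the service information of the at least one service flow, where the service information of a service flow includes the identification information of the network object to which the service flow belongs and the M KPIs of the service flow. The sending unit sends training information to the first device. Because the training information includes the identification information of the network object and M KPIs, or a feature set obtained from the M KPIs of the network object, the training information is far smaller than the service flow. Sending the training information to the first device therefore requires far fewer network resources than sending the service flow, which reduces network resource consumption.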
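Referring to FIG. 10, an embodiment of this application provides an apparatus 600 for training a fault detection model. The apparatus 600 is deployed on the first device of any of the foregoing embodiments and includes:
a receiving unit 601, configured to receive service information of at least one service flow sent by a first forwarding device, where the service information of a service flow includes identification information of the network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object includes one or more devices; and
a processing unit 602, configured to train a fault detection model based on the service information of the at least one service flow, or to obtain, based on the service information of the at least one service flow, at least one feature set used to train a fault detection model, where the fault detection model is used to detect whether the network object is in a fault state.
Optionally, the processing unit 602 is configured to:
obtain at least one feature set, where each feature set includes at least one feature obtained based on the KPIs of each service flow that belongs to a target network object, and the target network object is a network object to which any of the at least one service flow belongs; and
train the fault detection model based on the at least one feature set.
Optionally, each feature set includes at least one statistical feature, and the processing unit 602 is configured to:
obtain the KPIs of N service flows that belong to the target network object within a first period, where the first period lies within the first time segment, and N is an integer greater than 0;
obtain M KPI sets, where each KPI set includes one KPI of each of the N service flows, and the KPIs in a given KPI set are of the same type; and
compute over the KPIs in any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to that KPI set, where the at least one first calculation method includes one or more of the following: counting the KPIs in the KPI set, or computing the mean, variance, dispersion, skewness, or kurtosis of the KPIs in the KPI set.
Optionally, each feature set further includes at least one time-domain feature, and the processing unit 602 is further configured to:
compute over the statistical features in a statistical feature set using at least one second calculation method to obtain at least one time-domain feature;
where the statistical feature set includes K statistical features of the same type, computed in K periods respectively, the K periods include the first period and the K-1 periods preceding it, and the at least one second calculation method includes one or more of the following: computing the period-over-period ratio or the difference between each two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
Optionally, the KPI set includes the status identifiers of the N service flows, and the status identifier of any of the N service flows identifies the state of that service flow; the statistical features of the KPI set include the number of status identifiers that identify the success state and the number of status identifiers that identify the failure state; the feature set further includes the proportion of service flows in the success state and/or the proportion of service flows in the failure state;
the processing unit 602 is further configured to:
compute the proportion of service flows in the success state from the number of status identifiers that identify the success state and the number of KPIs in the KPI set; and/or
compute the proportion of service flows in the failure state from the number of status identifiers that identify the failure state and the number of KPIs in the KPI set.
Optionally, the processing unit 602 is further configured to:
generate a training sample, where the training sample includes the feature set and a label of the training sample; when the target network object is in a fault state, the label identifies the fault state, and when the target network object is in the normal state, the label identifies the normal state.
Optionally, the apparatus 600 further includes a sending unit 603;
the sending unit 603 is configured to send the at least one feature set to a training device, where the at least one feature set is used by the training device to train a fault detection model; and
the receiving unit 601 is configured to receive the fault detection model sent by the training device.
In this embodiment of this application, the receiving unit receives the service information of at least one service flow sent by the first forwarding device, where the service information of a service flow includes the identification information of the network object to which the service flow belongs and the M KPIs of the service flow. The processing unit trains a fault detection model based on the service information of the at least one service flow, or obtains at least one feature set used to train a fault detection model. Because the service information sent by the forwarding device includes only the identification information of the network object and the KPIs, it is far smaller than the service flow, which reduces the network resources the receiving unit consumes in receiving it.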
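Referring to FIG. 11, an embodiment of this application provides a schematic diagram of an apparatus 700 for training a fault detection model. The apparatus 700 may be the forwarding device of any of the foregoing embodiments. The apparatus 700 includes at least one processor 701, a bus system 702, a memory 703, and at least one transceiver 704.
The apparatus 700 is a hardware apparatus that can implement the functional modules of the apparatus 500 shown in FIG. 9. For example, a person skilled in the art will appreciate that the processing unit 502 of the apparatus 500 shown in FIG. 9 may be implemented by the at least one processor 701 invoking code in the memory 703, and that the receiving unit 501 and the sending unit 503 of the apparatus 500 shown in FIG. 9 may be implemented by the transceiver 704.
Optionally, the processor 701 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the programs of the solutions of this application.
The bus system 702 may include a path for transferring information between the foregoing components.
The transceiver 704 is configured to communicate with other devices or a communication network.
The memory 703 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus, or may be integrated with the processor.
The memory 703 is configured to store the application program code for executing the solutions of this application, and execution is controlled by the processor 701. The processor 701 is configured to execute the application program code stored in the memory 703 to implement the functions of the methods of this application.
In a specific implementation, in one embodiment, the processor 701 may include one or more CPUs, for example CPU0 and CPU1 in FIG. 11.
In a specific implementation, in one embodiment, the apparatus 700 may include multiple processors, for example the processor 701 and the processor 707 in FIG. 11. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).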
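Referring to FIG. 12, an embodiment of this application provides a schematic diagram of an apparatus 800 for training a fault detection model. The apparatus 800 may be the first device of any of the foregoing embodiments. The apparatus 800 includes at least one processor 801, a bus system 802, a memory 803, and at least one transceiver 804.
The apparatus 800 is a hardware apparatus that can implement the functional modules of the apparatus 600 shown in FIG. 10. For example, a person skilled in the art will appreciate that the processing unit 602 of the apparatus 600 shown in FIG. 10 may be implemented by the at least one processor 801 invoking code in the memory 803, and that the receiving unit 601 and the sending unit 603 of the apparatus 600 shown in FIG. 10 may be implemented by the transceiver 804.
Optionally, the processor 801 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the programs of the solutions of this application.
The bus system 802 may include a path for transferring information between the foregoing components.
The transceiver 804 is configured to communicate with other devices or a communication network.
The memory 803 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus, or may be integrated with the processor.
The memory 803 is configured to store the application program code for executing the solutions of this application, and execution is controlled by the processor 801. The processor 801 is configured to execute the application program code stored in the memory 803 to implement the functions of the methods of this application.
In a specific implementation, in one embodiment, the processor 801 may include one or more CPUs, for example CPU0 and CPU1 in FIG. 12.
In a specific implementation, in one embodiment, the apparatus 800 may include multiple processors, for example the processor 801 and the processor 807 in FIG. 12. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).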
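Referring to FIG. 13, an embodiment of this application provides a system 900 for training a fault detection model. The system 900 includes the apparatus of the embodiment shown in FIG. 9 and the apparatus of the embodiment shown in FIG. 10, or the apparatus of the embodiment shown in FIG. 11 and the apparatus of the embodiment shown in FIG. 12.
The apparatus of the embodiment shown in FIG. 9 or FIG. 11 may be a forwarding device 901, and the apparatus of the embodiment shown in FIG. 10 or FIG. 12 may be a first device 902.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing are merely optional embodiments of this application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the principles of this application shall fall within its scope of protection.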

Claims (37)

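  1. A method for training a fault detection model, wherein the method comprises:
    receiving, by a forwarding device, at least one service flow;
    obtaining, by the forwarding device, service information of the at least one service flow, wherein the service information of a service flow comprises identification information of a network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object comprises one or more devices; and
    sending, by the forwarding device, training information to a first device, wherein the training information comprises the service information of the at least one service flow or a feature set obtained based on the service information of the at least one service flow, the training information is used to train a fault detection model, and the fault detection model is used to detect whether the network object is in a fault state.
  2. The method according to claim 1, wherein a protocol type of the service flow is the Transmission Control Protocol (TCP), and the obtaining, by the forwarding device, of at least one key performance indicator (KPI) of the service flow comprises:
    obtaining, by the forwarding device, at least one target service packet from the service flow according to configuration policy information, wherein the configuration policy information comprises at least one preset packet type; and
    obtaining, by the forwarding device, the M KPIs of the service flow based on the at least one target service packet.
  3. The method according to claim 2, wherein the M KPIs comprise at least one of a network latency between the forwarding device and the network object, an amount of data belonging to the service flow that the network object sends, and an amount of data belonging to the service flow that the network object receives;
    the obtaining, by the forwarding device, the M KPIs of the service flow based on the at least one target service packet comprises:
    the at least one target service packet comprises a first target service packet and a second target service packet, and the forwarding device obtains the network latency between the forwarding device and the network object based on a first time at which the first target service packet is received and a second time at which the second target service packet is received, wherein the first target service packet is a packet sent to the network object, and the second target service packet is a packet that is sent by the network object and corresponds to the first target service packet; and/or
    the at least one target service packet comprises a first start packet and a first end packet, and the forwarding device obtains the amount of data belonging to the service flow that the network object sends based on a sequence number of the first start packet and a sequence number of the first end packet, wherein the first start packet is the first packet of the service flow sent by the network object, and the first end packet is the last packet of the service flow sent by the network object; and/or
    the at least one target service packet comprises a second start packet and a second end packet, and the forwarding device obtains the amount of data belonging to the service flow that the network object receives based on a sequence number of the second start packet and a sequence number of the second end packet, wherein the second start packet is the first packet of the service flow received by the network object, and the second end packet is the last packet of the service flow received by the network object.
  4. The method according to claim 2 or 3, wherein the M KPIs comprise a status identifier, and the status identifier identifies a state of the service flow;
    the obtaining, by the forwarding device, at least one KPI of the service flow based on the at least one target service packet comprises:
    the at least one target service packet comprises a first start packet, and within a first duration after a third time, the forwarding device sets the state identified by the status identifier to a success state if a first end packet is received, and to a failure state if the first end packet is not received, wherein the third time is a time at which the first start packet is received, the first start packet is the first packet of the service flow sent by the network object, and the first end packet is the last packet of the service flow sent by the network object.
  5. The method according to any one of claims 1 to 4, wherein before the forwarding device sends the training information to the first device, the method further comprises:
    obtaining, by the forwarding device from the at least one service flow, KPIs of N service flows that belong to a target network object within a first period, wherein the target network object is a network object to which any of the at least one service flow belongs, and N is an integer greater than 0; and
    obtaining, by the forwarding device, a feature set based on the KPIs of the N service flows.
  6. The method according to claim 5, wherein the feature set comprises at least one statistical feature, and the obtaining, by the forwarding device, a feature set based on the KPIs of the N service flows comprises:
    obtaining, by the forwarding device, M KPI sets, wherein any KPI set comprises one KPI of each of the N service flows, and the KPIs in the any KPI set are of the same type; and
    computing, by the forwarding device, over the KPIs in the any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to the any KPI set, wherein the at least one first calculation method comprises one or more of the following: counting the KPIs in the any KPI set, or computing a mean, variance, dispersion, skewness, or kurtosis of the KPIs in the any KPI set.
  7. The method according to claim 6, wherein the feature set further comprises at least one time-domain feature, and after the computing over the KPIs in the any KPI set, the method further comprises:
    computing, by the forwarding device, over statistical features in a statistical feature set using at least one second calculation method to obtain at least one time-domain feature;
    wherein the statistical feature set comprises K statistical features of the same type, computed in K periods respectively, the K periods comprise the first period and K-1 periods preceding the first period, and the at least one second calculation method comprises one or more of the following: computing a period-over-period ratio or a difference between each two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
  8. The method according to claim 6 or 7, wherein the any KPI set comprises status identifiers of the N service flows, and the status identifier of any of the N service flows identifies a state of that service flow; statistical features of the any KPI set comprise a number of status identifiers that identify a success state and a number of status identifiers that identify a failure state; the feature set further comprises a proportion of service flows in the success state and/or a proportion of service flows in the failure state;
    after the computing over the KPIs in the any KPI set, the method further comprises:
    computing the proportion of service flows in the success state from the number of status identifiers that identify the success state and a number of KPIs in the any KPI set; and/or
    computing the proportion of service flows in the failure state from the number of status identifiers that identify the failure state and the number of KPIs in the any KPI set.
  9. The method according to any one of claims 1 to 8, wherein the first device is a cloud platform, an analyzer platform, or an upstream device of the forwarding device.
  10. The method according to any one of claims 1 to 9, wherein the network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a specified network segment.
  11. The method according to any one of claims 1 to 10, wherein the M KPIs are used to describe characteristics of the service flow.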
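  12. A method for training a fault detection model, wherein the method comprises:
    receiving, by a first device, service information of at least one service flow sent by a first forwarding device, wherein the service information of a service flow comprises identification information of a network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object comprises one or more devices; and
    training, by the first device, a fault detection model based on the service information of the at least one service flow, or obtaining, based on the service information of the at least one service flow, at least one feature set used to train a fault detection model, wherein the fault detection model is used to detect whether the network object is in a fault state.
  13. The method according to claim 12, wherein the training, by the first device, a fault detection model based on the service information of the at least one service flow comprises:
    obtaining, by the first device, at least one feature set, wherein any feature set comprises at least one feature obtained based on KPIs of each service flow that belongs to a target network object, and the target network object is a network object to which any of the at least one service flow belongs; and
    training, by the first device, the fault detection model based on the at least one feature set.
  14. The method according to claim 12 or 13, wherein any feature set comprises at least one statistical feature, and the obtaining, by the first device, of the any feature set comprises:
    obtaining, by the first device, KPIs of N service flows that belong to the target network object within a first period, wherein the first period lies within the first time segment, and N is an integer greater than 0;
    obtaining, by the first device, M KPI sets, wherein any KPI set comprises one KPI of each of the N service flows, and the KPIs in the any KPI set are of the same type; and
    computing, by the first device, over the KPIs in the any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to the any KPI set, wherein the at least one first calculation method comprises one or more of the following: counting the KPIs in the any KPI set, or computing a mean, variance, dispersion, skewness, or kurtosis of the KPIs in the any KPI set.
  15. The method according to claim 14, wherein the any feature set further comprises at least one time-domain feature, and after the computing over the KPIs in the any KPI set, the method further comprises:
    computing over statistical features in a statistical feature set using at least one second calculation method to obtain at least one time-domain feature;
    wherein the statistical feature set comprises K statistical features of the same type, computed in K periods respectively, the K periods comprise the first period and K-1 periods preceding the first period, and the at least one second calculation method comprises one or more of the following: computing a period-over-period ratio or a difference between each two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
  16. The method according to claim 14 or 15, wherein the any KPI set comprises status identifiers of the N service flows, and the status identifier of any of the N service flows identifies a state of that service flow; statistical features of the any KPI set comprise a number of status identifiers that identify a success state and a number of status identifiers that identify a failure state; the any feature set further comprises a proportion of service flows in the success state and/or a proportion of service flows in the failure state;
    after the computing over the KPIs in the any KPI set, the method further comprises:
    computing the proportion of service flows in the success state from the number of status identifiers that identify the success state and a number of KPIs in the any KPI set; and/or
    computing the proportion of service flows in the failure state from the number of status identifiers that identify the failure state and the number of KPIs in the any KPI set.
  17. The method according to any one of claims 14 to 16, wherein after the first device obtains the any feature set, the method further comprises:
    generating a training sample, wherein the training sample comprises the any feature set and a label of the training sample; when the target network object is in a fault state, the label identifies the fault state, and when the target network object is in a normal state, the label identifies the normal state.
  18. The method according to claim 13, wherein the training, by the first device, the fault detection model based on the at least one feature set comprises:
    sending, by the first device, the at least one feature set to a training device, wherein the at least one feature set is used by the training device to train the fault detection model; and
    receiving, by the first device, the fault detection model sent by the training device.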
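  19. An apparatus for training a fault detection model, wherein the apparatus comprises:
    a receiving unit, configured to receive at least one service flow;
    a processing unit, configured to obtain service information of the at least one service flow, wherein the service information of a service flow comprises identification information of a network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object comprises one or more devices; and
    a sending unit, configured to send training information to a first device, wherein the training information comprises the service information of the at least one service flow or a feature set obtained based on the service information of the at least one service flow, the training information is used to train a fault detection model, and the fault detection model is used to detect whether the network object is in a fault state.
  20. The apparatus according to claim 19, wherein a protocol type of the service flow is the Transmission Control Protocol (TCP), and the processing unit is configured to:
    obtain at least one target service packet from the service flow according to configuration policy information, wherein the configuration policy information comprises at least one preset packet type; and
    obtain the M KPIs of the service flow based on the at least one target service packet.
  21. The apparatus according to claim 20, wherein the M KPIs comprise at least one of a network latency between the apparatus and the network object, an amount of data belonging to the service flow that the network object sends, and an amount of data belonging to the service flow that the network object receives;
    the processing unit is configured to:
    when the at least one target service packet comprises a first target service packet and a second target service packet, obtain the network latency between the apparatus and the network object based on a first time at which the first target service packet is received and a second time at which the second target service packet is received, wherein the first target service packet is a packet sent to the network object, and the second target service packet is a packet that is sent by the network object and corresponds to the first target service packet; and/or
    when the at least one target service packet comprises a first start packet and a first end packet, obtain the amount of data belonging to the service flow that the network object sends based on a sequence number of the first start packet and a sequence number of the first end packet, wherein the first start packet is the first packet of the service flow sent by the network object, and the first end packet is the last packet of the service flow sent by the network object; and/or
    when the at least one target service packet comprises a second start packet and a second end packet, obtain the amount of data belonging to the service flow that the network object receives based on a sequence number of the second start packet and a sequence number of the second end packet, wherein the second start packet is the first packet of the service flow received by the network object, and the second end packet is the last packet of the service flow received by the network object.
  22. The apparatus according to claim 20 or 21, wherein the M KPIs comprise a status identifier, and the status identifier identifies a state of the service flow;
    the processing unit is configured to:
    when the at least one target service packet comprises a first start packet, set the state identified by the status identifier to a success state if a first end packet is received within a first duration after a third time, and to a failure state if the first end packet is not received, wherein the third time is a time at which the first start packet is received, the first start packet is the first packet of the service flow sent by the network object, and the first end packet is the last packet of the service flow sent by the network object.
  23. The apparatus according to any one of claims 19 to 22, wherein the processing unit is further configured to:
    obtain, from the at least one service flow, KPIs of N service flows that belong to a target network object within a first period, wherein the target network object is a network object to which any of the at least one service flow belongs, and N is an integer greater than 0; and
    obtain a feature set based on the KPIs of the N service flows.
  24. The apparatus according to claim 23, wherein the feature set comprises at least one statistical feature, and the processing unit is configured to:
    obtain M KPI sets, wherein any KPI set comprises one KPI of each of the N service flows, and the KPIs in the any KPI set are of the same type; and
    compute over the KPIs in the any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to the any KPI set, wherein the at least one first calculation method comprises one or more of the following: counting the KPIs in the any KPI set, or computing a mean, variance, dispersion, skewness, or kurtosis of the KPIs in the any KPI set.
  25. The apparatus according to claim 24, wherein the feature set further comprises at least one time-domain feature, and the processing unit is further configured to:
    compute over statistical features in a statistical feature set using at least one second calculation method to obtain at least one time-domain feature;
    wherein the statistical feature set comprises K statistical features of the same type, computed in K periods respectively, the K periods comprise the first period and K-1 periods preceding the first period, and the at least one second calculation method comprises one or more of the following: computing a period-over-period ratio or a difference between each two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
  26. The apparatus according to claim 24 or 25, wherein the any KPI set comprises status identifiers of the N service flows, and the status identifier of any of the N service flows identifies a state of that service flow; statistical features of the any KPI set comprise a number of status identifiers that identify a success state and a number of status identifiers that identify a failure state; the feature set further comprises a proportion of service flows in the success state and/or a proportion of service flows in the failure state;
    the processing unit is further configured to:
    compute the proportion of service flows in the success state from the number of status identifiers that identify the success state and a number of KPIs in the any KPI set; and/or
    compute the proportion of service flows in the failure state from the number of status identifiers that identify the failure state and the number of KPIs in the any KPI set.
  27. The apparatus according to any one of claims 19 to 26, wherein the first device is a cloud platform, an analyzer platform, or an upstream device of the apparatus.
  28. The apparatus according to any one of claims 19 to 27, wherein the network object is a terminal, a server, a client, a virtual machine, a router, a switch, a device in a virtual local area network (VLAN), or a device in a specified network segment.
  29. The apparatus according to any one of claims 19 to 28, wherein the M KPIs are used to describe characteristics of the service flow.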
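  30. An apparatus for training a fault detection model, wherein the apparatus comprises:
    a receiving unit, configured to receive service information of at least one service flow sent by a first forwarding device, wherein the service information of a service flow comprises identification information of a network object to which the service flow belongs and M key performance indicators (KPIs) of the service flow, M is an integer greater than 0, and the network object comprises one or more devices; and
    a processing unit, configured to train a fault detection model based on the service information of the at least one service flow, or to obtain, based on the service information of the at least one service flow, at least one feature set used to train a fault detection model, wherein the fault detection model is used to detect whether the network object is in a fault state.
  31. The apparatus according to claim 30, wherein the processing unit is configured to:
    obtain at least one feature set, wherein any feature set comprises at least one feature obtained based on KPIs of each service flow that belongs to a target network object, and the target network object is a network object to which any of the at least one service flow belongs; and
    train the fault detection model based on the at least one feature set.
  32. The apparatus according to claim 30 or 31, wherein the any feature set comprises at least one statistical feature, and the processing unit is configured to:
    obtain KPIs of N service flows that belong to the target network object within a first period, wherein the first period lies within the first time segment, and N is an integer greater than 0;
    obtain M KPI sets, wherein any KPI set comprises one KPI of each of the N service flows, and the KPIs in the any KPI set are of the same type; and
    compute over the KPIs in the any KPI set using at least one first calculation method to obtain at least one statistical feature corresponding to the any KPI set, wherein the at least one first calculation method comprises one or more of the following: counting the KPIs in the any KPI set, or computing a mean, variance, dispersion, skewness, or kurtosis of the KPIs in the any KPI set.
  33. The apparatus according to claim 32, wherein the any feature set further comprises at least one time-domain feature, and the processing unit is further configured to:
    compute over statistical features in a statistical feature set using at least one second calculation method to obtain at least one time-domain feature;
    wherein the statistical feature set comprises K statistical features of the same type, computed in K periods respectively, the K periods comprise the first period and K-1 periods preceding the first period, and the at least one second calculation method comprises one or more of the following: computing a period-over-period ratio or a difference between each two adjacent statistical features in the statistical feature set, or performing feature fitting on the statistical features in the statistical feature set.
  34. The apparatus according to claim 32 or 33, wherein the any KPI set comprises status identifiers of the N service flows, and the status identifier of any of the N service flows identifies a state of that service flow; statistical features of the any KPI set comprise a number of status identifiers that identify a success state and a number of status identifiers that identify a failure state; the any feature set further comprises a proportion of service flows in the success state and/or a proportion of service flows in the failure state;
    the processing unit is further configured to:
    compute the proportion of service flows in the success state from the number of status identifiers that identify the success state and a number of KPIs in the any KPI set; and/or
    compute the proportion of service flows in the failure state from the number of status identifiers that identify the failure state and the number of KPIs in the any KPI set.
  35. The apparatus according to any one of claims 32 to 34, wherein the processing unit is further configured to:
    generate a training sample, wherein the training sample comprises the any feature set and a label of the training sample; when the target network object is in a fault state, the label identifies the fault state, and when the target network object is in a normal state, the label identifies the normal state.
  36. The apparatus according to claim 31, wherein the apparatus further comprises a sending unit;
    the sending unit is configured to send the at least one feature set to a training device, wherein the at least one feature set is used by the training device to train the fault detection model; and
    the receiving unit is configured to receive the fault detection model sent by the training device.
  37. A system for training a fault detection model, wherein the system comprises the apparatus according to any one of claims 19 to 29 and the apparatus according to any one of claims 30 to 36.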
PCT/CN2020/119031 2020-01-24 2020-09-29 Method, apparatus and system for training fault detection model WO2021147370A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20915373.3A EP4084410A4 (en) 2020-01-24 2020-09-29 Method, apparatus and system for training fault detection model
US17/871,498 US20220368606A1 (en) 2020-01-24 2022-07-22 Fault Detection Model Training Method, Apparatus, and System

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010077206.XA CN113179172B (zh) 2020-01-24 2020-01-24 Method, apparatus and system for training fault detection model
CN202010077206.X 2020-01-24

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/871,498 Continuation US20220368606A1 (en) 2020-01-24 2022-07-22 Fault Detection Model Training Method, Apparatus, and System

Publications (1)

Publication Number Publication Date
WO2021147370A1 true WO2021147370A1 (zh) 2021-07-29

Family

ID=76921406

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119031 WO2021147370A1 (zh) 2020-01-24 2020-09-29 Method, apparatus and system for training fault detection model

Country Status (4)

Country Link
US (1) US20220368606A1 (zh)
EP (1) EP4084410A4 (zh)
CN (1) CN113179172B (zh)
WO (1) WO2021147370A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116418704A (zh) * 2021-12-31 2023-07-11 中兴通讯股份有限公司 业务质量的检测方法、装置、服务器和存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160088502A1 (en) * 2013-05-14 2016-03-24 Nokia Solutions And Networks Oy Method and network device for cell anomaly detection
CN105554782A (zh) * 2015-12-09 2016-05-04 中国联合网络通信集团有限公司 用户感知指标的预测方法和装置
CN107623924A (zh) * 2016-07-15 2018-01-23 中兴通讯股份有限公司 一种验证影响关键质量指标kqi相关的关键性能指标kpi的方法和装置
CN108737193A (zh) * 2018-06-05 2018-11-02 亚信科技(中国)有限公司 一种故障预测方法及装置
US20190044830A1 (en) * 2016-02-12 2019-02-07 Telefonaktiebolaget Lm Ericsson (Publ) Calculating Service Performance Indicators
CN109547251A (zh) * 2018-11-27 2019-03-29 广东电网有限责任公司 一种基于监控数据的业务系统故障与性能预测方法
CN110502398A (zh) * 2019-08-21 2019-11-26 吉林吉大通信设计院股份有限公司 一种基于人工智能的交换机故障预测系统及方法

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201322571D0 (en) * 2013-12-19 2014-02-05 Bae Systems Plc Network fault detection and location
EP3105886A4 (en) * 2014-02-10 2017-02-22 Telefonaktiebolaget LM Ericsson (publ) Management system and network element for handling performance monitoring in a wireless communications system
CN103973496B (zh) * 2014-05-21 2017-10-17 华为技术有限公司 故障诊断方法及装置
US20170373950A1 (en) * 2015-01-27 2017-12-28 Nokia Solutions And Networks Oy Traffic flow monitoring
US20170215094A1 (en) * 2016-01-22 2017-07-27 Hitachi, Ltd. Method for analyzing and inferring wireless network performance
EP3473034B1 (en) * 2016-06-16 2021-03-31 Telefonaktiebolaget LM Ericsson (publ) Method for volte voice quality fault localization
CN108462591B (zh) * 2017-02-20 2020-04-14 华为技术有限公司 一种分组网络中处理业务流的方法及装置
US11018958B2 (en) * 2017-03-14 2021-05-25 Tupl Inc Communication network quality of experience extrapolation and diagnosis
CN109063886B (zh) * 2018-06-12 2022-05-31 创新先进技术有限公司 一种异常检测方法、装置以及设备
CN109446049A (zh) * 2018-11-01 2019-03-08 郑州云海信息技术有限公司 一种基于监督学习的服务器错误诊断方法和装置
CN109617715A (zh) * 2018-11-27 2019-04-12 中盈优创资讯科技有限公司 网络故障诊断方法、系统
CN110650052B (zh) * 2019-09-26 2022-08-12 科大国创软件股份有限公司 一种基于智能算法的客户原因故障识别处理方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4084410A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114884883A (zh) * 2022-06-16 2022-08-09 深圳星云智联科技有限公司 Traffic forwarding method, apparatus, device, and storage medium
CN114884883B (zh) * 2022-06-16 2024-01-30 深圳星云智联科技有限公司 Traffic forwarding method, apparatus, device, and storage medium

Also Published As

Publication number Publication date
CN113179172B (zh) 2022-12-30
US20220368606A1 (en) 2022-11-17
CN113179172A (zh) 2021-07-27
EP4084410A4 (en) 2023-06-28
EP4084410A1 (en) 2022-11-02

Similar Documents

Publication Publication Date Title
CN111193666B (zh) Predicting application quality-of-experience metrics using adaptive machine learning probes
Ghasemi et al. Dapper: Data plane performance diagnosis of TCP
US20220368606A1 (en) Fault Detection Model Training Method, Apparatus, and System
CN108076019B (zh) Method and apparatus for detecting abnormal traffic based on traffic mirroring
US8229705B1 (en) Performance monitoring in computer networks
CN112787951B (zh) Congestion control method, apparatus, device, and computer-readable storage medium
EP3334117B1 (en) Method, apparatus and system for quantizing defence result
US11102273B2 (en) Uplink performance management
Chen et al. SDATP: An SDN-based traffic-adaptive and service-oriented transmission protocol
Sundaresan et al. TCP congestion signatures
Attar et al. E-health communication system with multiservice data traffic evaluation based on a G/G/1 analysis method
WO2021147371A1 (zh) Fault detection method, apparatus, and system
Hagos et al. A deep learning approach to dynamic passive RTT prediction model for TCP
Alhamed et al. P4 postcard telemetry collector in packet-optical networks
CN115914115A (zh) Network congestion control method, apparatus, and communication system
Qiao et al. Fine-Grained Active Queue Management in the Data Plane with P4
WO2017206499A1 (zh) Network attack detection method and attack detection apparatus
Younes Modelling and analysis of TCP congestion control mechanisms using stochastic reward nets
Silva et al. Enhancing traffic sampling scope and efficiency
JP4282556B2 (ja) Flow-level communication quality management apparatus, method, and program
Zhu et al. Introducing Additional Network Measurements into Active Queue Management
CN112825504B (zh) Data monitoring method, apparatus, device, and storage medium
US20230344741A1 (en) Systems and methods for activating fec processing per application probe class
US20240179082A1 (en) Systems and methods for activating fec processing per application probe class
JP2010124127A (ja) Network diagnosis apparatus, network diagnosis method, and program

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20915373

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020915373

Country of ref document: EP

Effective date: 20220726

NENP Non-entry into the national phase

Ref country code: DE