CN115511106B

CN115511106B - Method, device and readable storage medium for generating training data based on time sequence data

Info

Publication number: CN115511106B
Application number: CN202211426405.2A
Authority: CN
Inventors: 芮藤长; 史洋洋; 潘涌; 杨帅; 钮骏凯; 肖雄; 韩泽鋆; 吕彪; 祝顺民
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-11-15
Filing date: 2022-11-15
Publication date: 2023-04-07
Anticipated expiration: 2042-11-15
Also published as: CN115511106A

Abstract

The application discloses a method, a device and a readable storage medium for generating training data based on time series data. The method comprises: acquiring time series data; taking the time series data segments other than normal time series data segments in the time series data as time series data segments to be labeled; classifying the time series data segments to be labeled to obtain a plurality of types of time series data segments, and labeling the time series data segments of each of the plurality of types, wherein the labeling adds a label to the time series data segments of that type; and using the labeled time series data segments of the plurality of types as training data. The application addresses the low efficiency and high cost of manually labeling the time series data corresponding to network indexes: the data is preprocessed before it is labeled, which improves labeling efficiency to a certain extent compared with purely manual labeling and reduces labeling cost.

Description

Method, apparatus and readable storage medium for generating training data based on time series data

Technical Field

The present application relates to the field of machine learning, and more particularly, to a method, apparatus, and readable storage medium for generating training data based on time series data.

Background

As cloud networks continue to grow in scale, a large number of devices and services are deployed in them to serve more users. These devices and services need to be monitored so that abnormalities are found in time and their impact on cloud network users is reduced.

A cloud network has numerous network indexes to be monitored, and these network indexes continuously generate time series data. Time sequence data, also called time series data, is data collected and recorded in chronological order for a network index, and it can be used to judge whether the network index is abnormal.

When time series data is used to judge whether a network index is abnormal, the judgment can be made manually by maintenance personnel, but this approach is inefficient and depends on their experience. Alternatively, an anomaly detection algorithm can be used: multiple abnormal time series data segments are preconfigured in the algorithm, and the currently acquired time series data segment is compared with the preconfigured segments to determine whether an abnormality has occurred. This comparison depends on how many kinds of abnormal time series data segments have been preconfigured, so it can only find certain types of faults.

With the development of machine learning technology, a supervised machine learning algorithm (supervised learning algorithm for short) can be applied to anomaly monitoring of network indexes; that is, abnormal time series data segments can be discovered through the supervised machine learning algorithm. Supervised learning is a training mode in machine learning in which the parameters of a classifier are adjusted using a set of training data of known classes until the required performance is reached; it is the machine learning task of inferring a function from labeled training data. In supervised learning, a model is trained on an existing training data set, in which each set of training data comprises an input and an output result, according to the known relation between inputs and output results. All training data used in supervised learning is explicitly labeled. For example, a piece of training data may include a time series data segment and a label indicating whether the segment is abnormal, or its abnormal type. Through training, the machine learning model learns the relation between time series data segments and labels. A collected time series data segment can then be input into the trained model, which outputs the corresponding label, and the label determines whether the segment is abnormal, the corresponding abnormal type, and so on.

Given the importance of intelligent operation to the cloud network, more and more supervised learning algorithms are now needed to monitor network indexes in the cloud network for abnormalities. Training a supervised learning algorithm depends on labeled data, and labeled data can also be used to evaluate the algorithm so that it can be updated iteratively. Labeling time series data segments is therefore crucial for supervised learning algorithms.

In the prior art, time series data segments are generally labeled manually. To balance labeling accuracy and efficiency, two manual labeling modes are used: crowdsourced labeling and expert labeling. Crowdsourced labeling is generally performed by people without domain knowledge; its cost is controllable, but its quality is not, so it is difficult to provide an effective guarantee for training and evaluating the algorithm. Expert labeling employs domain experts; the labeling quality is high and provides a good basis for the algorithm, but the cost is high, so it is not suitable for large-scale data.

Both modes are manual. The time series data that a cloud network currently generates every day is on the order of hundreds of millions of records, and labeling data of this magnitude entirely by hand would be very costly. Fully manual labeling of time series data segments therefore cannot meet current needs, and a more feasible data labeling scheme is required.

Disclosure of Invention

The embodiment of the application provides a method, equipment and a readable storage medium for generating training data based on time sequence data, so as to at least solve the problems of low efficiency and high cost caused by manually labeling the time sequence data corresponding to network indexes.

According to an aspect of the present application, there is provided a time series data labeling processing method, including: acquiring time series data, wherein the time series data is collected when a network index is monitored and is recorded in chronological order; taking the time series data segments other than normal time series data segments in the time series data as time series data segments to be labeled, wherein a normal time series data segment is a segment of time series data generated when the network index is in a normal state; classifying the time series data segments to be labeled to obtain a plurality of types of time series data segments, and labeling the time series data segments of each of the plurality of types, wherein the labeling adds a label to the time series data segments of that type, and the label indicates the abnormal type of the time series data segments of that type; and using the labeled time series data segments of the plurality of types as training data, wherein the training data is used for training a machine learning model, and the machine learning model is used for identifying the abnormal type of a time series data segment.

According to another aspect of the present application, there is also provided a time series data-based machine learning system, including: training data generating means for generating training data according to the above-mentioned method; a server for training a machine learning model using the training data from the training data generating device, wherein the machine learning model is used for identifying the abnormal type of the time series data segment.

According to another aspect of the present application, there is also provided an electronic device comprising a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the above-described method steps.

According to another aspect of the present application, there is also provided a readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement the above-described method steps.

In the embodiments of the application, time series data is acquired, wherein the time series data is collected when a network index is monitored and is recorded in chronological order; the time series data segments other than normal time series data segments in the time series data are taken as time series data segments to be labeled, wherein a normal time series data segment is a segment of time series data generated when the network index is in a normal state; the time series data segments to be labeled are classified to obtain a plurality of types of time series data segments, and the time series data segments of each of the plurality of types are labeled, wherein the labeling adds a label to the time series data segments of that type, and the label indicates the abnormal type of the time series data segments of that type; and the labeled time series data segments of the plurality of types are used as training data, wherein the training data is used for training a machine learning model, and the machine learning model is used for identifying the abnormal type of a time series data segment. The application thereby addresses the low efficiency and high cost of manually labeling the time series data corresponding to network indexes: the data is preprocessed before it is labeled, which improves labeling efficiency to a certain extent compared with purely manual labeling and reduces labeling cost.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:

FIG. 1 is a schematic diagram of labeling time series data for machine learning training according to an embodiment of the present application;

FIG. 2 is a flow chart of a method of generating training data based on time series data according to an embodiment of the present application; and

FIG. 3 is a schematic diagram of a flow of annotation of an abnormal time-series data segment according to an embodiment of the application.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

A network environment contains many different types of devices and services, and different indexes are needed to monitor them; in this embodiment, the indexes used for monitoring network devices are all referred to as network indexes. A network index may be an index generated during network transmission, such as packet loss rate or network bandwidth, or a collected hardware index of a network device, such as memory usage or processor occupancy. For each network index, the corresponding time series data can be collected and then processed. Because time series data is collected in chronological order, it keeps accumulating as time passes. When abnormality is judged, the time series data collected within a period of time is examined to determine whether a network abnormality occurred within that period; in the following description, the time series data corresponding to a period of time is referred to as a time series data segment. Labeling time series data, as referred to in the following embodiments, is the process of labeling a time series data segment as normal or abnormal.

FIG. 1 is a schematic diagram of labeling time series data for machine learning training according to an embodiment of the present application. As shown in FIG. 1, the network device is a running network device for which a network index to be monitored, such as its bandwidth occupancy, can be configured. After the network index is configured, data corresponding to the index can be collected; because the data is collected continuously over time, the data obtained is time series data. After collection, the data can be stored in a data storage server and labeled by a person. For example, time series data segment 1 is labeled "normal", time series data segment 2 is labeled "abnormal type A", time series data segment 3 is labeled "abnormal type B", time series data segment 4 is labeled "abnormal type C", time series data segment 5 is labeled "normal", and time series data segment N is labeled "abnormal type B". These annotations are also referred to as labels of the time series data, and the process of labeling a time series data segment may also be referred to as a marking process. The labeled data can be used as training data and sent to a server for training a machine learning model. The trained model is used for judging whether a time series data segment is abnormal: its input is a time series data segment, and its output is whether the segment is normal or abnormal and, if abnormal, the type of the abnormality. As can be seen from FIG. 1, the quality of the training data greatly affects the quality of the finally trained model, and a large amount of training data is required for training. However, because the amount of collected time series data is very large, labeling it manually is inefficient and costly.

In order to solve the above problem, in the present embodiment, a time series data labeling processing method is provided, fig. 2 is a flowchart of a method for generating training data based on time series data according to an embodiment of the present application, and as shown in fig. 2, the steps included in fig. 2 are described below.

Step S202, obtaining time sequence data, wherein the time sequence data is collected when monitoring network indexes, and the time sequence data is recorded according to a time sequence.

In this step, time series data corresponding to the network index is collected for different network indexes, and then the time series data corresponding to the network index is processed in the following steps to obtain the time series data segment after labeling. These labeled time series data segments can be used to train a supervised machine learning model that can identify whether the network metric is abnormal based on the input time series data segments. It should be noted that, if the time series data of the multiple network indexes have the same or similar characteristics, a supervised machine learning model may also be trained for the multiple network indexes; alternatively, a supervised machine learning model may be trained for each different network metric. In the following embodiments, attention is paid to the labeling process of the time series data segments, and after the labeled time series data segments are obtained, the time series data segments can be used in machine learning training according to actual conditions, which is not described in detail herein.

Step S204, using other time sequence data segments except the normal time sequence data segment in the time sequence data as the time sequence data segment to be marked, wherein the normal time sequence data segment is a segment of time sequence data generated when the network index is in a normal state.

Although the amount of the time series data collected by the network indicator is relatively large, in most cases, it is considered that the network device is in a normal operating state, that is, most of the time series data collected by the network indicator is normal time series data, and the identification of the normal time series data segment is relatively easy to achieve, so that after the normal time series data segment is deleted in step S204, the amount of the time series data is reduced, and then the workload for labeling the remaining time series data is also reduced.

Step S206, classifying the time series data segments to be labeled to obtain a plurality of types of time series data segments, and labeling the time series data segments of each of the plurality of types, wherein the labeling adds a label to the time series data segments of that type, and the label indicates the abnormal type of the time series data segments of that type; a data segment labeled with an abnormal type is an abnormal time series data segment, that is, a segment of time series data generated when the network index is in an abnormal state.

The time series data to be labeled in this step is the time series data from which the normal time series data segments have been removed, so the labeling workload is already reduced. To further improve labeling efficiency, the time series data segments to be labeled can be classified before labeling. Classification does not reduce the amount of data to be labeled, but classified data can be labeled more efficiently: because a network index can exhibit different abnormal situations, classification provides a preliminary grouping of the segments by situation, so that abnormal time series data segments of the same kind can easily be given the same label. For example, if classification A contains 100 time series data segments, all 100 can be labeled directly as abnormal type A, and it is no longer necessary to label each segment individually.
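The batch labeling described in this step can be illustrated with a minimal Python sketch. The function name `label_by_class` and its parameters are hypothetical and do not come from the application; the sketch assumes that a class id has already been assigned to every segment and that an annotator chooses one label per class.

```python
from collections import defaultdict

def label_by_class(segments, class_of, label_for_class):
    """Attach one label per class instead of one label per segment.

    segments:        list of time series data segments (lists of numbers)
    class_of:        list of class ids, one per segment (from a prior classification)
    label_for_class: dict mapping class id -> label chosen once by an annotator
    All three names are illustrative assumptions.
    """
    grouped = defaultdict(list)
    for seg, cls in zip(segments, class_of):
        grouped[cls].append(seg)

    labeled = []
    for cls, segs in grouped.items():
        label = label_for_class[cls]  # a single labeling decision per class
        labeled.extend((seg, label) for seg in segs)
    return labeled

segments = [[1, 9, 1], [2, 8, 2], [0, 0, 5]]
class_of = ["a", "a", "b"]
labels = {"a": "abnormal type A", "b": "abnormal type B"}
training_data = label_by_class(segments, class_of, labels)
```

With this arrangement the number of manual decisions grows with the number of classes rather than with the number of segments.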

Step S208, using the marked time sequence data segments of multiple types as training data, wherein the training data is used for training a machine learning model, and the machine learning model is used for identifying the abnormal type of the time sequence data segments.

In the steps, the time sequence data corresponding to the network indexes are preprocessed before being marked, the preprocessing process comprises the processes of rejecting normal time sequence data segments and classifying, after the two processes are carried out, on one hand, the number of the time sequence data to be marked can be reduced, on the other hand, the possibility of carrying out unified marking on the time sequence data of the same type is provided through classifying, and the marking efficiency is improved. Therefore, the problems of low efficiency and high cost caused by manually labeling the time sequence data corresponding to the network indexes are solved through the steps, the data can be preprocessed, the processed data are labeled, the time sequence data labeling efficiency is improved to a certain extent compared with a pure manual labeling mode, and the labeling cost is reduced.

The above steps involve identifying normal time series data segments. A time series data segment can be regarded as normal if the network index follows a certain regular pattern over time, or if the network index remains stable throughout. As an optional embodiment, whether a time series data segment is a normal time series data segment can be determined according to whether the time series data has periodicity and/or stationarity: time series data segments having periodicity and/or stationarity are identified from the time series data and taken as the normal time series data segments.

Different network indexes correspond to time series data with different patterns. In practical applications, it can be judged whether the time series data corresponding to a network index has periodicity or stationarity; if it does, the time series data collected for that network index in any time period should show the same periodicity or stationarity. The time series data corresponding to the network index can then be screened according to the periodicity or stationarity reflected in the historical pattern of the network index. This approach relies on the property that the characteristics of a network index do not change much over time, and its accuracy in screening normal time series data segments is relatively high.

For a network index, the trend law of the network index can be determined according to the periodicity or stationarity of the network index, and then the time sequence data segment conforming to the trend law is used as a normal time sequence data segment. Firstly, whether the time series data corresponding to the network index has periodicity or not can be determined, the periodicity law of the time series data corresponding to the network index is obtained under the condition of periodicity, and then the time series data section with the periodicity law is identified from the time series data corresponding to the network index.

The normal time sequence data section is screened from the time sequence data corresponding to the periodic network index, and the time sequence data section which accords with the periodic rule is used as the normal time sequence data section as long as the periodic rule of the network index is obtained. The periodic law of the network index can be manually extracted, the periodic law is drawn into a periodic curve, then each data point in the collected time sequence data section is arranged according to time and then is connected by a line, a curve can also be obtained, the curve is taken as a curve to be judged, then the curve to be judged is compared with the periodic curve, and under the condition that the similarity of the two curves exceeds a threshold value, the collected time sequence data section is determined to accord with the periodic law, so the time sequence data section corresponding to the curve to be judged is a normal time sequence data section.

In another optional implementation, multiple sets of time series data collected historically for the network index and confirmed to contain no abnormality can be selected. For example, if the time series data of a network index is periodic with a period of one day, data from multiple days (for example, 10 days) can be selected and the time series data of each day sampled at time points separated by a predetermined interval, for example every minute. The average of the values at each time point across the multiple days is then computed, and a one-day periodic curve of the network index is generated from the averages. The periodic curve can then be used to screen out normal time series data segments: a curve to be compared is generated from the time series data of a given day and compared with the periodic curve, and if the two are consistent, the time series data of that day is a normal time series data segment.
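The averaging and comparison described above can be sketched in Python as follows. The application does not specify how consistency between the two curves is measured; the sketch assumes, purely for illustration, that a day is consistent with the periodic curve when its mean absolute deviation from the curve does not exceed a tolerance. The names `periodic_curve`, `matches_curve` and `tolerance` are hypothetical.

```python
def periodic_curve(days):
    """Average several abnormality-free days point by point.

    days: list of equally long lists, one per day, sampled at the same time points.
    """
    n = len(days[0])
    if any(len(d) != n for d in days):
        raise ValueError("all days must have the same number of samples")
    return [sum(d[i] for d in days) / len(days) for i in range(n)]

def matches_curve(segment, curve, tolerance):
    """Assumed consistency check: mean absolute deviation within tolerance."""
    if len(segment) != len(curve):
        raise ValueError("segment and curve must have the same length")
    deviation = sum(abs(s - c) for s, c in zip(segment, curve)) / len(curve)
    return deviation <= tolerance

history = [[1, 2, 3], [3, 4, 5]]          # two short "days" of samples
curve = periodic_curve(history)
is_normal = matches_curve([2, 3, 4.5], curve, tolerance=0.5)
```

Other similarity measures, such as a dynamic time warping distance, could be substituted for the deviation check without changing the overall flow.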

In the above example, periodicity is determined from the time series data itself; alternatively, values calculated from the time series data can be used to determine whether the time series data corresponding to the network index has periodicity. That is, the periodic trend may be a trend of the time series data itself or a trend of a value calculated from it, for example a trend of the differences between time series data values.

In an optional embodiment, a periodicity detection algorithm can also be used to determine whether a network index has periodicity. Determining whether the time series data corresponding to the network index has periodicity in such an algorithm may include the following steps: dividing the time series data corresponding to the network index into a plurality of time series data segments; obtaining the distance between every two time series data segments in the plurality of time series data segments; and determining that the time series data corresponding to the network index has periodicity when the distances between every two time series data segments are all smaller than a threshold. These steps can be implemented with the dynamic time warping (DTW) algorithm, which calculates the minimum distance between two sequence segments.
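The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the series is cut into consecutive segments of a known candidate period length, the DTW cost between two points is their absolute difference, and the names `dtw_distance`, `is_periodic`, `period_len` and `threshold` are hypothetical.

```python
import math
from itertools import combinations

def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    prev = [0.0] + [math.inf] * len(b)      # row 0 of the cost table
    for x in a:
        cur = [math.inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # extend the cheapest of: insertion, deletion, match
            cur[j] = abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1]

def is_periodic(series, period_len, threshold):
    """Split into consecutive segments and require every pairwise distance < threshold."""
    segments = [series[i:i + period_len]
                for i in range(0, len(series) - period_len + 1, period_len)]
    if len(segments) < 2:
        return False
    return all(dtw_distance(s, t) < threshold
               for s, t in combinations(segments, 2))

repeating = [0, 1, 2, 1] * 3
periodic = is_periodic(repeating, period_len=4, threshold=0.5)
```

The pairwise comparison is quadratic in the number of segments, so in practice the check would likely be run on a limited history window.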

Periodic time series data only needs to be screened according to its periodic pattern. Stationary data, by contrast, can be understood as a process that fluctuates up and down around the average value of the time series data. This fluctuation typically has a limit, which may be an upper limit or a lower limit; in this embodiment, the term "upper and lower limits" denotes the upper limit and/or lower limit of time series data with stationarity. Whether a network index has stationarity can be determined from historically collected time series data of the network index. For example, time series data of multiple historical time periods is obtained, which may be normal time series data segments labeled manually; the maximum value and/or minimum value of the time series data of these time periods is found and taken as the upper limit and/or lower limit. It should be noted that if the maximum values and/or minimum values of the individual time periods differ from one another by no more than a predetermined range, the time series data corresponding to the network index is stationary data (that is, it has stationarity); if they differ greatly, the time series data corresponding to the network index may be non-stationary data (that is, it has no stationarity).

For a network index, it is determined whether the corresponding time series data has stationarity. If it does, the upper and lower limits of the time series data are obtained, and the time series data segments that lie within the upper and lower limits are identified from the time series data corresponding to the network index and taken as the time series data segments with stationarity.
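The limit-based screening can be sketched in Python as follows. The sketch assumes that the historical periods are already known to be normal and that both an upper and a lower limit are used; the names `stationarity_bounds`, `within_bounds` and `max_spread` (standing in for the predetermined range) are hypothetical.

```python
def stationarity_bounds(normal_periods, max_spread):
    """Derive (lower, upper) limits from historical normal periods.

    Returns None when the per-period maxima or minima differ by more than
    max_spread, in which case the data is treated as possibly non-stationary.
    """
    maxima = [max(p) for p in normal_periods]
    minima = [min(p) for p in normal_periods]
    if (max(maxima) - min(maxima) > max_spread
            or max(minima) - min(minima) > max_spread):
        return None
    return min(minima), max(maxima)

def within_bounds(segment, bounds):
    """A segment is treated as normal when every value lies inside the limits."""
    lower, upper = bounds
    return all(lower <= v <= upper for v in segment)

history = [[10, 12, 11], [9, 12, 10], [10, 13, 11]]
bounds = stationarity_bounds(history, max_spread=2)
normal = bounds is not None and within_bounds([10, 11, 12], bounds)
```

Segments that fall outside the limits are not labeled abnormal by this check; they are simply kept as segments to be labeled.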

The above embodiment provides a way of confirming whether time series data has stationarity that is simple to implement but may have a certain misjudgment rate. Alternatively, whether the time series data is stationary can be determined by whether a unit root exists, where an n-th unit root is a complex number whose n-th power equals 1. That is, determining whether the time series data corresponding to the network index has stationarity may include the following steps: obtaining a polynomial corresponding to the time series data, wherein the polynomial represents the time series data corresponding to the network index over time; and determining whether the time series data has stationarity according to whether the polynomial has a unit root.

In practical applications, various algorithms can be used to test for a unit root, for example the ADF test, the DF-GLS test, or the KPSS test. ADF stands for Augmented Dickey-Fuller test, one of the most commonly used unit root tests, which judges whether a sequence is stationary by testing whether it has a unit root. DF-GLS stands for Dickey-Fuller test with GLS detrending, that is, a test that removes the trend using generalized least squares, and is considered among the most effective unit root tests at present. KPSS stands for Kwiatkowski-Phillips-Schmidt-Shin and is a statistical test of the stationarity of data. Other methods, such as statistical methods or graphical analysis, can also be used to determine whether the time series data is stationary, and are not described in detail here.
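To make the idea of a unit root test concrete, the following Python sketch computes the statistic of the basic (non-augmented) Dickey-Fuller regression with a constant term. It is a simplified stand-in for the ADF test named above, not the full procedure: it omits the lagged difference terms and uses the asymptotic 5% critical value of about -2.86. The function names are hypothetical; in practice a statistics library would normally be used, for example the `adfuller` function in statsmodels.

```python
import itertools
import math
import random

DF_CRITICAL_5PCT = -2.86  # asymptotic 5% critical value, regression with a constant

def df_statistic(series):
    """t-statistic of gamma in  dy[t] = alpha + gamma * y[t-1] + e[t]  (plain OLS)."""
    x = series[:-1]
    dy = [b - a for a, b in zip(series, series[1:])]
    n = len(x)
    if n < 3:
        raise ValueError("need at least 4 observations")
    mean_x = sum(x) / n
    mean_dy = sum(dy) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    if sxx == 0:
        raise ValueError("series is constant")
    sxy = sum((v - mean_x) * (d - mean_dy) for v, d in zip(x, dy))
    gamma = sxy / sxx
    alpha = mean_dy - gamma * mean_x
    ssr = sum((d - alpha - gamma * v) ** 2 for v, d in zip(x, dy))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    if std_err == 0:
        raise ValueError("perfect fit; statistic undefined")
    return gamma / std_err

def looks_stationary(series):
    """Reject the unit-root hypothesis when the statistic is below the critical value."""
    return df_statistic(series) < DF_CRITICAL_5PCT

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(300)]   # fluctuates around a mean
walk = list(itertools.accumulate(noise))           # random walk: has a unit root
```

A strongly negative statistic, as is expected for `noise`, argues against a unit root, while a random walk such as `walk` typically yields a statistic close to zero.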

In the above embodiment, a normal time series data segment can be screened out from the time series data, and considering that the normal time series data segment in the time series data occupies a high proportion, the amount of time series data remaining after the above embodiment is greatly reduced, so that the labeling efficiency can be improved. Considering that the types of the network anomalies which actually occur are limited, if the remaining time series data are classified, the classified batch labeling is realized, and the batch labeling manner inevitably improves the labeling efficiency again.

In order to improve the efficiency of classification, a classification algorithm may be used. In an optional implementation, since the number of abnormal situations that occur for a network index is limited, a plurality of types are configured according to the abnormal situations that occur frequently, a typical time series data segment is found for each type, and the time series data segments to be labeled are then classified according to the typical segments found. That is, in this optional embodiment, classifying the time series data segments to be labeled to obtain multiple types of time series data segments may include the following steps: acquiring a plurality of pre-configured types, wherein the characteristics of the time series data segments corresponding to each type are different; obtaining a typical time series data segment corresponding to each of the plurality of pre-configured types, wherein the typical time series data segment is a manually selected time series data segment having the characteristics specific to that type; and comparing each time series data segment to be labeled with the typical time series data segment corresponding to each type, determining the typical time series data segment with the highest similarity to the time series data segment to be labeled, and taking the type corresponding to that typical time series data segment as the type of the time series data segment. The similarity between two time series data segments can be compared using an existing algorithm, for example the dynamic time warping (DTW) algorithm mentioned above, which calculates the minimum distance between two time series data segments; the typical time series data segment at the minimum distance is the one with the highest similarity.
For example, four types are configured in advance, corresponding to typical time series data segment 1, typical time series data segment 2, typical time series data segment 3, and typical time series data segment 4, respectively. The minimum distances between the time series data segment to be classified and the four typical time series data segments are calculated by the DTW algorithm: the minimum distance to typical time series data segment 1 is distance 1, the minimum distance to typical time series data segment 2 is distance 2, the minimum distance to typical time series data segment 3 is distance 3, and the minimum distance to typical time series data segment 4 is distance 4. If distance 2 is the smallest of these four distances, the time series data segment to be classified is assigned to the type corresponding to typical time series data segment 2.
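
The comparison against typical segments can be sketched as follows. This is a minimal illustration, assuming a textbook DTW recurrence with absolute difference as the local cost; the function names and the four example typical segments are hypothetical and are not taken from the embodiment.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    # cost[i][j] = minimum accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]


def classify(segment, typical_segments):
    """Return the type whose typical segment has the smallest DTW distance."""
    return min(typical_segments,
               key=lambda name: dtw_distance(segment, typical_segments[name]))


# Hypothetical typical segments, one per pre-configured type.
typical_segments = {
    "sudden increase": [0, 0, 0, 5, 5, 5],
    "sudden decrease": [5, 5, 5, 0, 0, 0],
    "drift": [0, 1, 2, 3, 4, 5],
    "spike": [0, 0, 5, 0, 0, 0],
}

segment = [0, 0, 0, 0, 5, 5, 5, 5]   # a different length from the typical segments
print(classify(segment, typical_segments))
```

Because DTW may repeat samples while aligning, the segment to be classified does not need to have the same length as the typical segments.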

The above classification requires typical time series data segments to assist the classification. In another mode, a clustering algorithm may be used instead; that is, classifying the time series data segments to be labeled to obtain multiple types of time series data segments includes: clustering the time series data segments to be labeled by using a clustering algorithm to obtain multiple types of time series data segments. Clustering is a typical unsupervised learning technique and is mainly used for automatically grouping similar samples (here, time series data segments) into one category. In a clustering algorithm, samples are divided into different categories according to the similarity between them; different similarity measures can produce different clustering results, and a common measure is the Euclidean distance. Because clustering is unsupervised, only the samples are needed and the results do not need to be labeled; the samples are divided into different categories through learning.

Common clustering algorithms include the k-means algorithm, the k-medoids algorithm (K-center point algorithm), the clustering algorithm based on randomized search (Clustering Large Applications based on RANdomized Search, abbreviated as CLARANS), and the like. The k-means algorithm may be employed in this embodiment. In the k-means algorithm, the data are to be divided into K groups: K objects are randomly selected as the initial cluster centers, then the distance between each object and each cluster center is calculated, and each object is assigned to the cluster center closest to it. A cluster center together with the objects assigned to it represents a cluster. Each time a sample is assigned, the cluster center of that cluster is recalculated based on the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition may be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change again, or that the sum of squared errors reaches a local minimum.

Based on several abnormal situations that occur frequently, namely sudden increase, sudden decrease, drift and spikes, the time series data segments to be labeled can be divided into four groups (each group corresponding to one type). Four time series data segments are randomly selected from the time series data segments to be labeled as the initial cluster centers; the distance between each time series data segment and each initial cluster center is then calculated, and the segment is assigned to the nearest cluster center. Each time a time series data segment is assigned, the cluster center of that group is recalculated, and the process repeats until the termination condition is met.
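
The clustering procedure can be sketched as follows. This is a minimal illustration under several assumptions: segments have equal length, similarity is the squared Euclidean distance, and the centers are recomputed once per pass over all segments (the common batch variant) rather than after every single assignment. The `init_indices` parameter is a hypothetical convenience for reproducible runs; when it is omitted, the initial centers are drawn at random.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_segment(segments):
    """Point-wise mean of equal-length segments."""
    return [sum(values) / len(segments) for values in zip(*segments)]


def kmeans(segments, k, init_indices=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    if init_indices is None:
        # Randomly pick k distinct segments as the initial cluster centers.
        init_indices = rng.sample(range(len(segments)), k)
    centers = [list(segments[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assign every segment to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(seg, centers[c]))
            for seg in segments
        ]
        if new_labels == labels:      # termination: no segment was reassigned
            break
        labels = new_labels
        # Recompute each center from the segments currently assigned to it.
        for c in range(k):
            members = [seg for seg, lab in zip(segments, labels) if lab == c]
            if members:
                centers[c] = mean_segment(members)
    return labels, centers


segments = [
    [0, 0, 10, 10], [0, 1, 10, 10], [1, 0, 10, 11],   # sudden increases
    [10, 10, 0, 0], [10, 11, 0, 0], [11, 10, 1, 0],   # sudden decreases
]
labels, centers = kmeans(segments, k=2, init_indices=[0, 3])
print(labels)
```

With random initialization, k-means can settle in a poor local optimum, so running it from several seeds and keeping the result with the smallest sum of squared errors is a common precaution.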

Of course, algorithms other than the k-means algorithm may also be adopted. In practical applications, the clustering algorithm can be flexibly selected according to the actual situation, and details are not repeated herein.

Through the above embodiment, the time series data segments to be labeled can be classified into different types. In one embodiment, the classified time series data segments can be handed directly to manual labeling. In another optional mode, to further improve efficiency, an existing anomaly detection algorithm can first be used for automatic labeling. Such a detection algorithm may detect abnormal time series data segments according to the characteristics of abnormal segments, or may be a machine learning algorithm trained with a small amount of manually labeled training data. Because these detection algorithms rely on the characteristics of abnormal data segments and on manually labeled data, their detection of abnormal time series data segments is characterized by a high recall rate and a low precision, where the recall rate is the number of correct detections divided by the number of all truly abnormal segments, and the precision is the number of correct detections divided by the number of all detections (correct and incorrect). Because the anomaly detection algorithm is inaccurate to a certain degree, the labels it assigns to abnormal time series data segments need to be checked manually. That is, in this alternative, labeling the time series data segments of each of the multiple types may include the following steps: detecting the time series data segments of each type by using an anomaly detection algorithm, wherein the anomaly detection algorithm is configured in advance; and providing the anomaly type detected by the anomaly detection algorithm for the time series data segments of each type to a human annotator, as a reference for manually labeling the anomaly type.
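
The two ratios defined above can be computed as in the following minimal sketch. The function name and the example segment identifiers are hypothetical.

```python
def precision_recall(detected, truly_abnormal):
    """Precision and recall of a detector, given sets of segment identifiers."""
    detected = set(detected)
    truly_abnormal = set(truly_abnormal)
    correct = len(detected & truly_abnormal)   # detections that are really abnormal
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(truly_abnormal) if truly_abnormal else 0.0
    return precision, recall


# Hypothetical example: the detector flags eight segments, three of them correctly,
# and misses one of the four truly abnormal segments.
detected = {1, 2, 3, 4, 5, 6, 7, 8}
truly_abnormal = {1, 2, 3, 9}
precision, recall = precision_recall(detected, truly_abnormal)
print(precision, recall)   # 0.375 0.75
```

A detector of this kind misses few anomalies but raises many false alarms, which is why its output serves only as a reference for the human annotator.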

In this optional mode, preliminary labeling is first carried out by the anomaly detection algorithm, and the abnormal time series data segments detected by the algorithm are then labeled manually. On the one hand, manual labeling improves the labeling accuracy; on the other hand, the assistance of the anomaly detection algorithm improves the efficiency of manual labeling.

In another optional mode, the anomaly type corresponding to each manually labeled time series data segment can be obtained; the anomaly type detected by the anomaly detection algorithm is compared with the manually labeled anomaly type; and the time series data segments whose manually labeled anomaly type differs from the anomaly type detected by the anomaly detection algorithm are saved. These time series data segments with differing results can be used to optimize the anomaly detection algorithm, so that its accuracy is further improved.
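
The comparison and saving step can be sketched as follows. This is a minimal illustration; the function name, the record layout and the example labels are hypothetical, and how the saved segments are later used to optimize the detector is outside the sketch.

```python
def collect_disagreements(segments, detected_types, manual_types):
    """Keep the segments whose detected and manually labeled anomaly types differ."""
    return [
        {"segment": seg, "detected": det, "manual": man}
        for seg, det, man in zip(segments, detected_types, manual_types)
        if det != man
    ]


segments = [[0, 0, 5, 5], [5, 5, 0, 0], [0, 5, 0, 0]]
detected_types = ["sudden increase", "sudden decrease", "sudden increase"]
manual_types = ["sudden increase", "sudden decrease", "spike"]

saved = collect_disagreements(segments, detected_types, manual_types)
print(saved)
```

Only the third segment is saved here, because it is the only one on which the detector and the annotator disagree.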

Through the optional implementations above, not only are normal time series data segments removed from the network index time series data, but classification is also used and an anomaly detection algorithm assists with preliminary labeling. An optional example is described below with reference to the accompanying drawings. Fig. 3 is a schematic diagram of an abnormal time series data segment labeling process according to an embodiment of the present application. In Fig. 3, original time series data is first collected; after the original time series data is screened, anomaly types can be defined; abnormal time series data segments are then clustered and preliminarily labeled by an anomaly detection algorithm; after these steps, experts perform manual labeling and classification; finally, the manually labeled data can also be fed back to the anomaly detection algorithm used in the preliminary labeling step, forming a closed loop. The steps involved in Fig. 3 are described below.

1. Screening

Labeling the time series data corresponding to a network index means marking the abnormal time series data segments (referred to as abnormal points for short) on a time series index. By observing and analyzing the time series data collected historically in the current cloud network, the following conclusion can be drawn: the abnormal points that need to be labeled account for a relatively small proportion of the whole time series data, that is, most of the time series data consists of normal time series data segments (referred to as normal points for short), and only a small part needs to be labeled as abnormal points. Normal time series data segments are usually regular or stationary. In this example, this characteristic can be used to preliminarily screen the data to be labeled: the regular and stationary time series data segments are removed, and only the irregular and non-stationary time series data segments are kept for the subsequent labeling.
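
The stationarity part of this screening can be sketched as follows, assuming that an upper limit and a lower limit of the stationary series have already been obtained. The function name, the limits and the example segments are hypothetical, and the removal of periodic segments is not shown.

```python
def screen_segments(segments, lower, upper):
    """Return only the segments that leave the [lower, upper] band."""
    to_label = []
    for seg in segments:
        within_limits = all(lower <= value <= upper for value in seg)
        if not within_limits:
            to_label.append(seg)      # not stationary within the band: keep for labeling
    return to_label


segments = [[10, 11, 10], [10, 30, 10], [9, 10, 11]]
print(screen_segments(segments, lower=8, upper=12))   # [[10, 30, 10]]
```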

2. Defining exception types

The labeled abnormal time series data segments will later be used for training machine learning algorithms, and different machine learning algorithms can be used for identifying different anomalies. Considering that the time series data segments generated by different network indexes have different characteristics, different machine learning algorithms can be used to identify abnormal time series data segments for different network indexes. Different algorithms also have different sensitivities to different anomaly types; for example, some algorithms are better at recognizing sudden increases and sudden decreases, while others are better at recognizing drift or spikes. Anomaly types can therefore be defined manually according to the requirements of the different algorithms, and the labeled abnormal time series data segments can then be classified for the training and iteration of the different algorithms.

As an optional mode, considering that most time series data of the cloud network is stationary or periodic under normal conditions, the few abnormal data segments may be divided into several different anomaly types according to certain rules: sudden increase, sudden decrease, drift and spikes.

3. Clustering of anomalous data

The volume of irregular and non-stationary time series data screened out in step 1 is greatly reduced, but it may still be too large to label manually directly. In this example, in order to further reduce the workload of manual labeling, the time series data remaining after the normal time series data segments are removed may be clustered, and subsequent labeling may be performed in batches based on the clustered categories.

Existing algorithms can be used in clustering. For example, the relatively mature DTW algorithm can be used as the similarity measure between time series during clustering; this algorithm can measure the similarity of time series of different lengths. There are many kinds of clustering algorithms; in this example, the k-means algorithm is used, which has been described in the above embodiments and is not described again here. Clustering does not reduce the number of abnormal time series data segments, but the subsequent manual labeling can be performed in batches on the clustered classes, which reduces the workload of manual labeling.

4. Preliminarily marking abnormal points

In this step, the screened data may be preliminarily labeled using existing anomaly detection algorithms, which have a high recall rate but may have a low precision. Because the recall rate is high, most abnormal time series in the data set can be detected, so the data that the anomaly detection algorithm identifies as normal time series need not be labeled by experts. The anomaly detection algorithm has already been described in the above embodiments, and is not described herein again.

5. Marking by experts

After the above steps, experts can manually label the data pre-labeled by the anomaly detection algorithm. Because the abnormal time series data segments have been clustered, the data in a given category can be labeled in batches during manual labeling, which improves the efficiency of manual labeling.

The results of manual labeling can be fed back to the abnormal point detection in step 4, so that the anomaly detection algorithm becomes more accurate and the amount of data that experts need to label later is further reduced.

In this embodiment, the data to be labeled is screened and classified multiple times, so the amount of data that experts need to label can be greatly reduced, the labeling cost can be reduced while the labeling quality is ensured, and data support can be provided for the training and iterative updating of various algorithms at limited cost.

There is also provided in this embodiment a time series data based machine learning system, including: training data generating means for generating training data according to the above-mentioned method; a server for training a machine learning model using the training data from the training data generating device, wherein the machine learning model is used for identifying the abnormal type of the time series data segment. It should be noted that the training data generation apparatus may be any device (such as a server) having a computing capability, or the training data generation apparatus may be a program running on a device having a computing capability.

In this embodiment, an electronic device is provided, comprising a memory in which a computer program is stored and a processor configured to run the computer program to perform the method in the above embodiments.

The programs described above may be run on a processor, or may be stored in memory (also referred to as computer-readable media). Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

These computer programs may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks, and corresponding steps may be implemented by different modules.

Such an apparatus or system is provided in this embodiment. The apparatus is called an apparatus for generating training data based on time series data, and includes: an acquisition module, configured to acquire time series data, wherein the time series data is acquired when a network index is monitored, and the time series data is recorded in time order; a screening module, configured to take the time series data segments other than the normal time series data segments in the time series data as the time series data segments to be labeled, wherein a normal time series data segment is a segment of time series data generated when the network index is in a normal state; a labeling module, configured to classify the time series data segments to be labeled to obtain multiple types of time series data segments and to label the time series data segments of each of the multiple types, wherein the labeling adds a label to the time series data segments of that type, and the label indicates the anomaly type of the time series data segments of that type; and a processing module, configured to use the labeled time series data segments of the multiple types as training data, wherein the training data is used for training a machine learning model, and the machine learning model is used for identifying the anomaly type of a time series data segment.

The system or the apparatus is used for implementing the functions of the method in the foregoing embodiments, and each module in the system or the apparatus corresponds to each step in the method, which has been described in the method and is not described herein again.

Optionally, the screening module is configured to identify a time series data segment with periodicity and/or stationarity from the time series data; and taking the time sequence data segment with the periodicity and/or the stationarity as the normal time sequence data segment.

Optionally, the screening module is configured to determine whether the time-series data corresponding to the network indicator has periodicity, acquire a periodicity law of the time-series data corresponding to the network indicator under the condition that the time-series data has periodicity, and identify a time-series data segment having the periodicity law from the time-series data; and/or determining whether the time sequence data corresponding to the network index has stationarity, acquiring the upper limit and the lower limit of the time sequence data corresponding to the network index under the condition of stationarity, and identifying the time sequence data segment within the range of the upper limit and the lower limit from the time sequence data as the time sequence data segment with stationarity.

Optionally, the determining, by the screening module, whether the time-series data corresponding to the network indicator has periodicity includes: dividing the time sequence data corresponding to the network index into a plurality of time sequence data sections; acquiring the distance between every two time sequence data segments in the plurality of time sequence data segments; and under the condition that the distances between every two time sequence data sections are smaller than a threshold value, determining that the time sequence data corresponding to the network index have periodicity.
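
The periodicity check described above can be sketched as follows. This is a minimal illustration, assuming equal-length, non-overlapping segments and the Euclidean distance as the distance between two segments; the function names, the segment length and the threshold are hypothetical.

```python
import math
from itertools import combinations


def split_into_segments(series, length):
    """Cut the series into consecutive, non-overlapping segments of equal length."""
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, length)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_periodic(series, length, threshold):
    """Periodic when every pair of segments is closer than the threshold."""
    segments = split_into_segments(series, length)
    return all(euclidean(a, b) < threshold
               for a, b in combinations(segments, 2))


periodic = [0, 1, 2, 1] * 5
broken = list(periodic)
broken[9] = 9                      # inject one abnormal value into the third segment
print(is_periodic(periodic, 4, 0.5), is_periodic(broken, 4, 0.5))
```

The segment length is assumed to match the period of the series; with a mismatched length, even a perfectly periodic series would produce large pairwise distances.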

Optionally, the screening module is configured to obtain a polynomial corresponding to the time series data corresponding to the network indicator, where the polynomial is used to represent the time series data corresponding to the network indicator in time; and determining whether the time sequence data corresponding to the network index has stationarity or not according to whether the polynomial has a unit root or not.

Optionally, the labeling module is configured to detect the time series data segment of each category by using an anomaly detection algorithm, where the anomaly detection algorithm is preconfigured; and providing the abnormal type corresponding to the time sequence data segment of each type detected by using the abnormal detection algorithm for manual work, and using the abnormal type as a reference for manually marking the abnormal type.

Optionally, the apparatus further comprises: the storage module is used for acquiring the abnormal type corresponding to the manually marked time sequence data segment; comparing the anomaly type detected by using the anomaly detection algorithm with the manually marked anomaly type; and storing the time sequence data segment with the manually marked exception type different from the exception type detected by the exception detection algorithm.

Optionally, the labeling module is configured to obtain the plurality of pre-configured types, wherein the characteristics of the time series data segments corresponding to each type are different; obtain a typical time series data segment corresponding to each of the plurality of pre-configured types, wherein the typical time series data segment is a manually selected time series data segment having the characteristics specific to that type; and compare each time series data segment to be labeled with the typical time series data segment corresponding to each type, determine the typical time series data segment with the highest similarity to the time series data segment to be labeled, and take the type corresponding to that typical time series data segment as the type of the time series data segment.

Optionally, the labeling module is configured to cluster the time series data segments to be labeled by using a clustering algorithm to obtain multiple types of time series data segments; wherein the clustering algorithm is an unsupervised machine learning algorithm.

The method and the device solve the problems of low efficiency and high cost caused by manually labeling the time series data corresponding to network indexes: the data is preprocessed first and the processed data is then labeled, so that, compared with purely manual labeling, the efficiency of labeling time series data is improved to a certain extent and the labeling cost is reduced.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method of generating training data based on time series data, comprising:

acquiring time sequence data, wherein the time sequence data is acquired when a network index is monitored, and the time sequence data is recorded according to a time sequence;

taking other time sequence data segments except the normal time sequence data segment in the time sequence data as time sequence data segments to be marked, wherein the normal time sequence data segment is a segment of time sequence data generated when the network index is in a normal state;

classifying the time sequence data segments to be labeled to obtain a plurality of types of time sequence data segments, and labeling each type of time sequence data segment in the plurality of types, wherein the labeling is used for adding a label to the type of time sequence data segment, the same label is applied to time sequence data segments of the same type, and the label is used for indicating the abnormal type of the type of time sequence data segment; labeling the time sequence data segment of each type comprises: detecting the time sequence data segments of each type by using an anomaly detection algorithm, wherein the anomaly detection algorithm is configured in advance; providing the abnormal type corresponding to the time sequence data segment of each type detected by the anomaly detection algorithm for manual work, and using the abnormal type as a reference for manually marking the abnormal type; the step of classifying the time sequence data segments to be labeled to obtain a plurality of types of time sequence data segments comprises the following steps: acquiring the plurality of pre-configured types, wherein the time sequence data segments corresponding to each type have different characteristics; acquiring a typical time sequence data segment corresponding to each of the plurality of pre-configured types, wherein the typical time sequence data segment is a manually selected time sequence data segment having the characteristics specific to the type; comparing each time sequence data segment to be marked with the typical time sequence data segment corresponding to each type, determining the typical time sequence data segment with the highest similarity with the time sequence data segment to be marked, and taking the type corresponding to the typical time sequence data segment with the highest similarity as the type of the time sequence data segment;

using the marked time sequence data segments of multiple types as training data, wherein the training data is used for training a machine learning model, and the machine learning model is used for identifying the abnormal type of the time sequence data segments;

acquiring an abnormal type corresponding to the manually marked time sequence data segment;

comparing the anomaly type detected by using the anomaly detection algorithm with the manually marked anomaly type;

and storing the time sequence data segment whose manually marked anomaly type is different from the anomaly type detected by the anomaly detection algorithm.

2. The method according to claim 1, wherein before taking other time series data segments than the normal time series data segment in the time series data as the time series data segment to be labeled, the method further comprises:

identifying a time series data segment with periodicity and/or stationarity from the time series data;

and taking the time sequence data segment with the periodicity and/or the stationarity as the normal time sequence data segment.

3. The method of claim 2, wherein,

identifying a time-series data segment having a periodicity from the time-series data comprises: determining whether the time sequence data corresponding to the network index has periodicity, acquiring the periodicity regularity of the time sequence data corresponding to the network index under the condition of periodicity, and identifying a time sequence data segment with the periodicity regularity from the time sequence data; and/or,

identifying a time series data segment having stationarity from the time series data comprises: and determining whether the time sequence data corresponding to the network index has stationarity, acquiring upper and lower limits of the time sequence data corresponding to the network index under the condition of the stationarity, and identifying time sequence data segments within the upper and lower limits from the time sequence data as the time sequence data segments with the stationarity.

4. The method of claim 3, wherein determining whether the time series data corresponding to the network metric has periodicity comprises:

dividing the time sequence data corresponding to the network index into a plurality of time sequence data sections;

acquiring the distance between every two of the plurality of time sequence data segments;

and under the condition that the distances between every two time sequence data sections are smaller than a threshold value, determining that the time sequence data corresponding to the network index have periodicity.

5. The method of claim 3, wherein determining whether the time series data corresponding to the network metric is stationary comprises:

obtaining a polynomial corresponding to the time sequence data corresponding to the network index, wherein the polynomial is used for representing the time sequence data corresponding to the network index in time;

and determining whether the time sequence data corresponding to the network index has stationarity or not according to whether the polynomial has a unit root or not.

6. A time series data based machine learning system, comprising:

training data generating means for generating training data according to the method of any one of claims 1 to 5;

and the server is used for training a machine learning model by using the training data from the training data generation device, wherein the machine learning model is used for identifying the abnormal type of the time series data segment.

7. An electronic device comprising a memory and a processor; wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method steps of any of claims 1 to 5.

8. A readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement the method steps of any of claims 1 to 5.