CN113836240A - Time sequence data classification method and device, terminal equipment and storage medium

Info

Publication number: CN113836240A
Application number: CN202111047596.7A
Authority: CN
Inventors: 李晓颖; 胡明艳; 吴慧强; 李苏璇
Original assignee: China Merchants Bank Co Ltd
Current assignee: China Merchants Bank Co Ltd
Priority date: 2021-09-07
Filing date: 2021-09-07
Publication date: 2021-12-24
Anticipated expiration: 2041-09-07
Also published as: CN113836240B

Abstract

The invention discloses a time sequence data classification method, which comprises the following steps: acquiring a time sequence set to be classified, wherein the time sequence set comprises a plurality of time sequences; extracting multidimensional characteristics of each time sequence in the time sequence set respectively to obtain multidimensional time sequence characteristics; and classifying each time sequence in the time sequence set according to the multi-dimensional time sequence characteristics. The invention also discloses a time sequence data classification device, terminal equipment and a storage medium. According to the invention, multi-dimensional feature extraction is performed on each time sequence, and the time sequence is classified by combining its multi-dimensional time sequence features, so that the classification precision of the time sequence data can be effectively improved and can meet the prediction requirement of complex business scenes in the financial field.

Description

Time sequence data classification method and device, terminal equipment and storage medium

Technical Field

The invention relates to the technical field of data processing of financial science and technology, in particular to a time sequence data classification method and device, terminal equipment and a storage medium.

Background

Time series data classification aims to analyze standard time series data and classify each time series according to the trend of its historical sequence, so as to assist the prediction of future trends and, in turn, abnormal point detection, business decisions and the like. Most current time series data come from single scenes such as network technology and hydrologic forecasting. The behavior and structure of time series data in these scenes are relatively simple, so simple data classification is enough to meet the prediction requirements.

However, in the financial field, such as banking, business scenes are complex and the various business processes generate many different types of time series. When predicting such time series data, features of different dimensions of the time series need to be combined, which makes prediction considerably more challenging. At present, classification methods designed for time series data from a single scene do not classify these time series precisely enough and cannot meet the prediction requirements of complex business scenes in the financial field.

Disclosure of Invention

The invention mainly aims to provide a time sequence data classification method, a time sequence data classification device, a terminal device and a storage medium, and aims to solve the technical problems that the existing time sequence data classification method is insufficient in time sequence classification precision and cannot meet the prediction requirement of a complex business scene.

To achieve the above object, the present invention provides a time series data classification method, including the steps of:

acquiring a time sequence set to be classified, wherein the time sequence set comprises a plurality of time sequences;

extracting multidimensional characteristics of each time sequence in the time sequence set respectively to obtain multidimensional time sequence characteristics;

and classifying each time sequence in the time sequence set according to the multi-dimensional time sequence characteristics.

Optionally, the step of performing multidimensional feature extraction on each time series in the time series set to obtain multidimensional time series features includes:

inputting the time series set into a preset data classifier, wherein the data classifier comprises a plurality of classification models;

and respectively performing feature extraction processing on each time sequence in the time sequence set by using each classification model to obtain the multi-dimensional time sequence features of each time sequence.

Optionally, the data classifier includes a magnitude classification model, the multi-dimensional time series features include a magnitude ratio, and the step of obtaining the multi-dimensional time series features of each time series by performing feature extraction processing on each time series in the time series set by using each classification model includes:

counting the number of sequence values of each time sequence, which are greater than a preset magnitude reference threshold value of each magnitude, by using the magnitude classification model, wherein the magnitude reference threshold value is obtained by mining a historical time sequence set by using the magnitude classification model;

and calculating the magnitude ratio of the sequence value quantity, wherein the magnitude ratio is the proportion of the quantity of the sequence values corresponding to each magnitude in the time sequence.

Optionally, the data classifier includes an up/down line classification model, the multi-dimensional time series feature includes an up/down line time point, and the step of obtaining the multi-dimensional time series feature of each time series by performing feature extraction processing on each time series in the time series set by using each classification model includes:

carrying out extremum filtering processing on each time sequence in the time sequence set by using the up/down line classification model to obtain a plurality of feature sequences;

and acquiring a subscript set of each feature sequence, traversing the sequence values of the feature sequence according to the subscript set, and calculating the up/down line time point from the feature sequence, wherein the subscripts in the subscript set are the positions of the respective sequence values in the feature sequence.

Optionally, the data classifier includes a fluctuation type classification model, the multi-dimensional time series features include sequence similarity, and the step of obtaining the multi-dimensional time series features of each time series by respectively performing feature extraction processing on each time series in the time series set by using each classification model includes:

acquiring a preset sequence sample set, wherein the sequence sample set is constructed based on a time sequence with a special waveform in a historical time sequence set;

and traversing each sample sequence in the sequence sample set by using the fluctuation type classification model, and calculating the sequence similarity between each time sequence in the time sequence set and each sample sequence.
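The similarity step can be sketched as follows. The text does not say which similarity measure is used, so this minimal sketch assumes the Pearson correlation coefficient between equal-length sequences; the function names `sequence_similarity` and `similarity_features` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sequence_similarity(series, sample):
    """Pearson correlation of two equal-length sequences (an assumed measure)."""
    a = np.asarray(series, dtype=float)
    b = np.asarray(sample, dtype=float)
    if a.shape != b.shape:
        raise ValueError("sequences must have the same length")
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    if denom == 0.0:
        # A constant sequence has no defined correlation; treat it as dissimilar.
        return 0.0
    return float((a_centered * b_centered).sum() / denom)

def similarity_features(series_set, sample_set):
    """Traverse every sample sequence for every time series in the set."""
    return {
        name: {
            sample_name: sequence_similarity(values, sample_values)
            for sample_name, sample_values in sample_set.items()
        }
        for name, values in series_set.items()
    }
```

Under this assumption, a time series whose shape matches a stored special waveform scores close to 1 regardless of its absolute level, which is one reason a correlation-style measure might be preferred over a raw distance here.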

Optionally, the data classifier includes an irregular classification model, the multi-dimensional time-series features include sequence fluctuation factors, and the step of obtaining the multi-dimensional time-series features of each time series by performing feature extraction processing on each time series in the time-series set by using each classification model includes:

acquiring a first round of window parameters, setting the first round of window parameters as target window parameters, and dividing each time sequence in the time sequence set into a plurality of subsequences according to the irregular classification model and the target window parameters to obtain a subsequence set of each time sequence in the time sequence set;

calculating the distance between each subsequence in each subsequence set and the rest subsequences in the subsequence set to obtain a distance set of each time sequence;

calculating the number of targets of the distance values in the distance set, which are greater than a preset sequence standard deviation threshold value, and calculating a first distance characteristic value according to the number of targets;

calculating a second round of window parameters based on the first round of window parameters, setting the second round of window parameters as target window parameters, returning to and executing the step of dividing each time sequence in the time sequence set into a plurality of subsequences according to the irregular classification model and the target window parameters to obtain a subsequence set of each time sequence in the time sequence set to obtain a second distance characteristic value;

and calculating the fluctuation factor of each time series according to the first distance characteristic value and the second distance characteristic value.
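A rough sketch of the two-round window procedure is given below. Several details are not specified in the text and are assumptions here: the distance is Euclidean, the distance characteristic value is the fraction of pairwise distances above the threshold, the second-round window is twice the first, and the fluctuation factor is the mean of the two characteristic values. The names `distance_feature` and `fluctuation_factor` are hypothetical.

```python
import numpy as np

def distance_feature(series, window, std_threshold):
    """Fraction of pairwise subsequence distances above std_threshold (assumed form)."""
    x = np.asarray(series, dtype=float)
    n_sub = len(x) // window
    if n_sub < 2:
        return 0.0
    # Split the series into non-overlapping subsequences of equal length.
    subs = x[: n_sub * window].reshape(n_sub, window)
    distances = [
        float(np.linalg.norm(subs[i] - subs[j]))  # Euclidean distance (assumption)
        for i in range(n_sub)
        for j in range(i + 1, n_sub)
    ]
    target_count = sum(d > std_threshold for d in distances)
    return target_count / len(distances)

def fluctuation_factor(series, first_window, std_threshold):
    """Combine the distance characteristic values of two window rounds."""
    second_window = first_window * 2  # assumption: how round two derives its window
    first_value = distance_feature(series, first_window, std_threshold)
    second_value = distance_feature(series, second_window, std_threshold)
    return (first_value + second_value) / 2.0  # assumption: simple average
```

With these choices, a strictly periodic series yields a factor of 0 once the window matches its period, while a series whose subsequences differ from one another yields a value closer to 1.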

Optionally, the data classifier includes a stationary classification model, the multi-dimensional time series features include an extreme value difference sequence, and the step of obtaining the multi-dimensional time series features of each time series by performing feature extraction processing on each time series in the time series set by using each classification model includes:

carrying out low-pass filtering processing on each time sequence in the time sequence set by using the stationary classification model to obtain a plurality of low-pass filtering sequences;

calculating a stationary magnitude threshold based on each sequence value in the low-pass filtering sequence;

taking each sequence value in the low-pass filtering sequence as a center, and performing sliding window processing on the low-pass filtering sequence to obtain each window sequence of the low-pass filtering sequence;

traversing each window sequence, calculating a target difference value of a maximum value and a minimum value of sequence values in each window sequence, and creating a mark sequence, wherein the sequence length of the mark sequence is the same as that of the low-pass filtering sequence;

and if the target difference value is larger than the stationary magnitude threshold value, marking, in the mark sequence, the central sequence value of the target window sequence corresponding to the target difference value, to obtain an extreme value difference sequence.
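The stationary-model steps above can be sketched as follows. The text names neither the low-pass filter nor the formula for the stationary magnitude threshold, so this sketch assumes a centered moving average as the filter and a threshold equal to a fixed ratio of the mean absolute filtered value; `extreme_difference_sequence` and its parameters `smooth`, `window` and `ratio` are hypothetical.

```python
import numpy as np

def extreme_difference_sequence(series, smooth=3, window=5, ratio=0.5):
    """Return a 0/1 mark sequence with the same length as the filtered series."""
    if smooth % 2 == 0 or window % 2 == 0:
        raise ValueError("smooth and window must be odd")
    x = np.asarray(series, dtype=float)
    # Low-pass filtering: centered moving average with edge padding (assumption).
    padded = np.pad(x, smooth // 2, mode="edge")
    low_pass = np.convolve(padded, np.ones(smooth) / smooth, mode="valid")
    # Stationary magnitude threshold derived from the filtered values (assumed formula).
    threshold = ratio * float(np.mean(np.abs(low_pass)))
    half = window // 2
    marks = np.zeros(low_pass.size, dtype=int)
    for i in range(low_pass.size):
        # Sliding window centered on value i, truncated at the series boundaries.
        win = low_pass[max(0, i - half): i + half + 1]
        if win.max() - win.min() > threshold:
            marks[i] = 1  # mark the central value of this window
    return marks
```

A series whose mark sequence is all zeros has no window in which the max-min spread exceeds the threshold, which is the behavior one would expect of a stationary series under these assumptions.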

In addition, to achieve the above object, the present invention provides a time series data classification apparatus, including:

the data acquisition module is used for acquiring a time sequence set to be classified, wherein the time sequence set comprises a plurality of time sequences;

the characteristic extraction module is used for respectively carrying out multi-dimensional characteristic extraction on each time sequence in the time sequence set to obtain multi-dimensional time sequence characteristics;

and the data classification module is used for classifying each time sequence in the time sequence set according to the multi-dimensional time sequence characteristics.

In addition, to achieve the above object, the present invention also provides a terminal device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the time series data classification method as described above.

Furthermore, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the time-series data classification method as described above.

Furthermore, to achieve the above object, the present invention also provides a computer program product comprising a computer program which, when being executed by a processor, realizes the steps of the time series data classification method as described above.

The embodiment of the invention provides a time sequence data classification method, a time sequence data classification device, terminal equipment and a storage medium. Compared with the prior art, in the embodiment of the invention, a time sequence set to be classified is obtained, wherein the time sequence set comprises a plurality of time sequences; multidimensional features of each time sequence in the time sequence set are extracted respectively to obtain multidimensional time sequence features; and each time sequence in the time sequence set is classified according to the multi-dimensional time sequence features. Multi-dimensional feature extraction is performed on each time sequence, and the time sequence is classified by combining its multi-dimensional time sequence features, so that the classification precision of the time sequence data can be effectively improved and can meet the prediction requirement of complex business scenes in the financial field.

Drawings

FIG. 1 is a schematic hardware structure diagram of an implementation manner of a terminal device according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a time series data classification method according to a first embodiment of the present invention;

FIG. 3 is a functional block diagram of an embodiment of a time series data classification apparatus according to the invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only to facilitate the explanation of the present invention and have no specific meaning in themselves. Thus, "module", "component" and "unit" may be used interchangeably.

The time sequence data classification terminal (also called terminal, equipment or terminal equipment) in the embodiment of the invention can be a PC (personal computer), and can also be mobile terminal equipment with display and data processing functions, such as a smart phone, a tablet computer, a portable computer and the like.

As shown in fig. 1, the terminal may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Optionally, the terminal may further include a camera, a Radio Frequency (RF) circuit, sensors, an audio circuit, a WiFi module, and the like. The sensors include, for example, light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor, which may adjust the brightness of the display screen according to the brightness of ambient light, and a proximity sensor, which may turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when the mobile terminal is stationary, and can be used for applications that recognize the attitude of the mobile terminal (such as horizontal and vertical screen switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer and tapping). Of course, the mobile terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described herein again.

Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, the memory 1005, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a time-series data classification computer program.

In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be used to call a computer program stored in the memory 1005, which when executed by the processor, implements the operations in the time-series data classification method provided by the embodiments described below.

Based on the hardware structure of the above device, the invention provides various embodiments of the time series data classification method. It should be noted that time series data is a series formed by arranging the values of the same statistical index in the chronological order in which they occur, and may be time period data or time point data. Different time series data follow different development patterns. Prediction of time series data analyzes the trend of historical data in order to predict future behavior, and different time series data need to be predicted in different ways, so the time series data must be accurately classified before prediction. Most time series data classification methods in the prior art target single scenes such as network technology and hydrologic prediction, and the classification approach is simple. In the financial field, however, business scenes are complex, the various businesses generate many types of time series data, and different types of time series data have different characteristics. Prediction there requires combining multiple classification-specific prediction models with the characteristics of the time series data in different dimensions to make a comprehensive prediction. A simple classification approach does not classify the time series data accurately enough to determine which prediction models should be used, so the classification accuracy achieved for single scenes cannot meet the prediction requirements of complex scenes.

Based on the above, the embodiment of the invention provides a time series data classification method, which analyzes the rich variety of time series data generated in complex, experience-rich business scenes in the financial field, such as banking, designs different classification rules for time series with different characteristics, and fully characterizes the behavior of each type of time series, so as to effectively assist the baseline prediction function.

Specifically, referring to fig. 2, fig. 2 is a schematic flow chart of a time series data classification method according to a first embodiment of the present invention, in which the time series data classification method includes:

Step S10, acquiring a time sequence set to be classified, wherein the time sequence set comprises a plurality of time sequences;

In this embodiment, the time series data classification method is implemented in a time series data classification terminal, which may be a personal computer, or a mobile terminal device with display and data processing functions, such as a tablet computer. Before the time series data is classified, the time series data to be classified is acquired. This data is a time series set comprising a plurality of time series, and it is mainly used for assisting baseline prediction and, in turn, business decisions and abnormal point detection.

Further, the sources of the time sequences in the acquired time sequence set may be the same or different, and different time sequences may correspond to the same statistical index or different statistical indexes.

Step S20, respectively extracting multidimensional characteristics of each time series in the time series set to obtain multidimensional time series characteristics;

When classifying each time series in the time series set, multi-dimensional feature extraction is first performed on each time series to obtain its multi-dimensional time series features. During feature extraction, different rules may be applied to several time series at the same time. Specifically, the time series in the set are divided into batches according to the number of rules set in the classifier, and the time series of one batch are sent to the classifier together. Within a batch, once each rule has finished extracting features from its current time series, the rules exchange time series with one another, until every rule in the classifier has extracted features from every time series in the batch; the next batch is then input for feature extraction.

Further, the time series in the time series set can instead be input into the classifier one by one, with each rule in the classifier performing feature extraction on the time series in turn, to obtain the multi-dimensional time series features of each time series in the set.

Further, the refinement of step S20 includes:

Step S201, inputting the time series set into a preset data classifier, wherein the data classifier comprises a plurality of classification models;

Step S202, each time series in the time series set is subjected to feature extraction by using each classification model, so that multi-dimensional time series features are obtained.

The classifier used for multi-dimensional feature extraction on each time series in the time series set is provided with a plurality of feature extraction rules, each of which is a classification model. The feature information needed to predict a time series is analyzed on the basis of historical time series, in combination with the actual business prediction requirements, which yields the classification requirements for the time series. Based on these classification requirements, different classification models are designed to classify the acquired time series, so that a different prediction model can be selected to predict each type of time series.

When each time sequence in the acquired time sequence set is classified, the acquired time sequence set is input into a preset data classifier, and each classification model in the data classifier respectively extracts the features of each time sequence in the time sequence set to obtain the multi-dimensional time sequence features of each time sequence.
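The relationship between the data classifier, its classification models and the resulting multi-dimensional features can be illustrated with a minimal sketch. The two toy models below (`mean_level` and `zero_ratio`) are hypothetical stand-ins rather than the models of this embodiment, and the function name `extract_multidimensional_features` is likewise an assumption.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# A classification model is treated here as a function from a series to one feature value.
ClassificationModel = Callable[[Sequence[float]], float]

def extract_multidimensional_features(
    series_set: Dict[str, Sequence[float]],
    models: Dict[str, ClassificationModel],
) -> Dict[str, Dict[str, float]]:
    """Apply every classification model to every time series in the set."""
    features = {}
    for series_name, values in series_set.items():
        features[series_name] = {
            model_name: model(values) for model_name, model in models.items()
        }
    return features

# Toy stand-in models, for illustration only.
models = {
    "mean_level": lambda v: mean(v),
    "zero_ratio": lambda v: sum(1 for x in v if x == 0) / len(v),
}
series_set = {"payments": [120, 130, 125, 128], "night_batch": [0, 0, 3, 0]}
features = extract_multidimensional_features(series_set, models)
```

Each time series ends up with one feature per model, so adding a new classification dimension only requires registering another model in the dictionary.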

Step S30, classifying each time sequence in the time sequence set according to the multi-dimensional time sequence characteristics.

Further, each time series is classified according to its multi-dimensional time series features. In this embodiment, the time series are classified into a plurality of types from different dimensions according to these features: from the dimension of data magnitude, the time series are divided into large magnitude, medium magnitude, small magnitude and micro magnitude; from the dimension of continuity of the time series data, into a continuous type and a discrete type; from the source of the time series data, the time series data of business systems that go up line or down line are divided into an up-line type and a down-line type; and time series with obvious local characteristics are divided into a fluctuation type, an irregular type, a stationary type and the like. It will be appreciated that the time series may be divided into more or fewer types depending on actual prediction needs.

Further, a preferred classification manner is to generate corresponding label information according to the type of each time series, so that the type of each time series can be identified at prediction time and the appropriate prediction model can be determined. It will be appreciated that the same time series may generate multiple pieces of label information and thus correspond to multiple different time series types. For example, a large-magnitude time series generally belongs to the continuous type, whereas micro-magnitude and small-magnitude time series generally exhibit irregularity, frequent fluctuation, discreteness and the like, and generally belong to the irregular, fluctuation or discrete type. Furthermore, within the same time series, the sequence values of different time periods may follow different patterns and have different characteristics, so the same time series may be divided into several sub-sequences of different types for classification. It should be noted that, depending on the relevance between the features of the same time series, features may also be extracted selectively using only some of the classification models; the actual classification manner is not limited to the above and is not described in detail here.
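Label generation of this kind can be sketched as a set of independent rules, so that one time series may receive several labels. The feature names (`large_ratio`, `max_similarity`, `fluctuation_factor`) and the cut-offs for the fluctuation and irregular labels are hypothetical; only the 30% ratio for the large magnitude follows the example values given in this description.

```python
from typing import Callable, Dict, List

LabelRule = Callable[[Dict[str, float]], bool]

# Each rule inspects the extracted features and decides whether its label applies.
LABEL_RULES: Dict[str, LabelRule] = {
    "large-magnitude": lambda f: f.get("large_ratio", 0.0) > 0.30,
    "fluctuation": lambda f: f.get("max_similarity", 0.0) > 0.90,    # hypothetical cut-off
    "irregular": lambda f: f.get("fluctuation_factor", 0.0) > 0.50,  # hypothetical cut-off
}

def assign_labels(features: Dict[str, float],
                  rules: Dict[str, LabelRule] = LABEL_RULES) -> List[str]:
    """Return every label whose rule is satisfied; a series may get several."""
    return [label for label, rule in rules.items() if rule(features)]
```

A downstream predictor could then choose its prediction model by looking up the returned labels, which keeps the classification stage and the prediction stage loosely coupled.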

In this embodiment, a time series set to be classified is obtained, where the time series set includes a plurality of time series; multidimensional features of each time sequence in the time sequence set are extracted respectively to obtain multidimensional time sequence features; and each time sequence in the time sequence set is classified according to the multi-dimensional time sequence features. Multi-dimensional feature extraction is performed on each time series, and the time series is classified by combining its multi-dimensional time series features, so that the classification precision of the time series data can be effectively improved and can meet the prediction requirement of complex business scenes in the financial field.

Further, on the basis of the above embodiments of the present invention, a second embodiment of the time-series data classification method of the present invention is provided.

The present embodiment refines step S202 of the first embodiment, that is, the feature extraction processing performed on the time series by the different classification models in the classifier. The preset classification models in the data classifier include a magnitude classification model, an up/down line classification model, a fluctuation classification model, a stationary classification model and the like, which classify the time series from different dimensions. Specifically, the step of respectively performing feature extraction processing on each time series in the time series set by using each classification model in a preset data classifier to obtain the multidimensional time series features of each time series includes:

Step A1, counting the number of sequence values in each time sequence, which are greater than a preset magnitude reference threshold value of each magnitude, by using the magnitude classification model, wherein the magnitude reference threshold value is obtained by mining a historical time sequence set by using the magnitude classification model;

Step A2, calculating the magnitude ratio of the sequence value quantity, wherein the magnitude ratio is the proportion of the sequence value quantity corresponding to each magnitude in the time sequence.

In this embodiment, a magnitude classification model is provided in the data classifier, and time series are divided into different magnitude types according to their magnitude, mainly large magnitude, medium magnitude, small magnitude and micro magnitude. During feature extraction, the main feature extracted is the magnitude ratio of the sequence values of each magnitude in the time series. The specific extraction method is as follows:

The ratio of sequence values at different percentiles in the time series is counted, and each time series is assigned to a magnitude according to these ratios. In this embodiment, the magnitude reference threshold of each magnitude is obtained by analyzing a historical time series set. Based on that analysis, the magnitude reference threshold of the large magnitude is generally taken as the 50-70 percentile value, and that of the medium magnitude as the 10-30 percentile value. For example, for each time series, the proportion of sequence values greater than the 50-70 percentile value among all sequence values in the time series is counted; if the proportion is greater than 30%, the time series is a large-magnitude time series. Here, the percentile of a sequence value is its relative position, expressed as a proportion, when the sequence values are sorted from small to large.

Taking a minute-level time series as an example: for a data series in which data is collected once per minute, the data collected each day form one time series of length 1440, that is, the time series contains 1440 sequence values. The 1440 sequence values are sorted from small to large, and the number of sequence values greater than the 50-70 percentile value is counted; if the number exceeds 432, that is, the proportion exceeds 30%, the time series is considered a large-magnitude time series. The historical behavior of the time series over a historical period is evaluated as a whole to determine its magnitude. A large-magnitude time series is characterized by a large overall magnitude, an obvious period and trend, and small relative fluctuation.

If the proportion of large-magnitude sequence values does not exceed the preset magnitude ratio threshold, the number and magnitude ratio of sequence values greater than the 10-30 percentile value are counted; if the proportion is greater than 30%, the time series is determined to be a medium-magnitude time series. A medium-magnitude time series is characterized by a moderate overall magnitude, a period and trend less obvious than those of the large magnitude, and large relative fluctuation. Because the medium-magnitude and large-magnitude types are mutually exclusive, the large-magnitude case must be ruled out first when identifying the medium-magnitude type.
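The large-magnitude and medium-magnitude checks can be sketched as a short cascade. This sketch assumes the reference thresholds are percentile values mined from a historical set (here the 60th and 20th percentiles of a synthetic `history` array, both picked arbitrarily from the 50-70 and 10-30 ranges); the function names are hypothetical.

```python
import numpy as np

def magnitude_ratio(series, reference_threshold):
    """Share of sequence values strictly greater than the reference threshold."""
    x = np.asarray(series, dtype=float)
    return np.count_nonzero(x > reference_threshold) / x.size

def large_or_medium(series, large_ref, medium_ref, ratio_threshold=0.30):
    """Check the large magnitude first; fall back to medium only if it fails."""
    if magnitude_ratio(series, large_ref) > ratio_threshold:
        return "large"
    if magnitude_ratio(series, medium_ref) > ratio_threshold:
        return "medium"
    return None  # candidate for the small or micro magnitude

# Hypothetical reference thresholds mined from a synthetic historical set.
history = np.arange(1000.0)
large_ref = float(np.percentile(history, 60))
medium_ref = float(np.percentile(history, 20))

# One day of minute-level data: 1440 values, 433 of them above large_ref.
day = [700.0] * 433 + [0.0] * 1007
label = large_or_medium(day, large_ref, medium_ref)
```

With 1440 values the 30% boundary falls at 432, so a day with 433 values above the large-magnitude reference is labeled large, while a day with exactly 432 is not.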

If the time sequence is neither large magnitude nor medium magnitude, it is considered that it may be small magnitude. The magnitude ratio to be extracted for a small-magnitude time sequence is not entirely the same as for the large and medium magnitudes. When judging small-magnitude characteristics, the sequence values of non-normal business periods in the time sequence are extracted, such as the nighttime data from 0:00 to 7:00. The zero values among the extracted sequence values are then counted, the length of each run of consecutive zero values is taken as a zero value interval, and the magnitude ratio of the zero value interval is determined. When classifying small-magnitude and micro-magnitude time sequences, if the magnitude ratio of the zero value interval is greater than the magnitude ratio threshold of the micro magnitude and less than the magnitude ratio threshold of the small magnitude, and the ratio of large-magnitude sequence values is less than a preset first magnitude ratio threshold (such as 0.001), the time sequence is considered a small-magnitude time sequence. If the magnitude ratio of the zero value interval is greater than or equal to the micro-magnitude ratio threshold, and the ratio of large-magnitude sequence values is less than a preset second magnitude ratio threshold (such as 0.0001), the time sequence is considered a micro-magnitude time sequence. A small-magnitude time sequence is characterized by a small overall magnitude and a discrete distribution, with a period that can be seen from the density of the data. The characteristics of such discrete data are more obvious at night, so the characteristic information can be extracted, calculated and identified from the perspective of the time period. A micro-magnitude time sequence is characterized by a small overall magnitude, a discrete distribution and no obvious period.

In summary, when feature extraction and classification are performed on the time series from the magnitude dimension, the extracted feature information includes at least the large-magnitude ratio, the medium-magnitude ratio, the zero value interval ratio for the small magnitude, and the zero value interval ratio for the micro magnitude. It is understood that magnitude classification is generally performed by extracting features uniformly from the time series data of a relatively long historical period. For example, for minute-level time series data, the feature ratios are calculated uniformly over 21 days of historical data rather than classifying each day separately, and the overall performance of the sequence is then evaluated to determine its magnitude class. It should be noted that, in this embodiment, specific values such as the magnitude thresholds and the magnitude ratio thresholds are given only as examples, and in practical applications each threshold may be customized according to actual conditions or experience.
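As an illustration of the large and medium magnitude rules described above, the following is a minimal Python sketch rather than the implementation of this embodiment. The function names `ratio_above` and `classify_magnitude` are hypothetical, the two reference thresholds are assumed to have been derived beforehand from historical percentile analysis, and the small and micro magnitude rules are left out.

```python
from typing import Sequence


def ratio_above(values: Sequence[float], threshold: float) -> float:
    """Fraction of sequence values strictly greater than the threshold."""
    if not values:
        return 0.0
    return sum(1 for v in values if v > threshold) / len(values)


def classify_magnitude(
    values: Sequence[float],
    large_thres: float,
    medium_thres: float,
    ratio_thres: float = 0.30,
) -> str:
    """Return 'large', 'medium' or 'other'.

    large_thres and medium_thres are assumed inputs: reference values taken
    from historical data (for example around the 50th-70th and 10th-30th
    percentiles). ratio_thres mirrors the 30% example in the text.
    """
    # The large-magnitude check comes first, so the two classes stay exclusive.
    if ratio_above(values, large_thres) > ratio_thres:
        return "large"
    if ratio_above(values, medium_thres) > ratio_thres:
        return "medium"
    # Small and micro magnitudes need the zero value interval features,
    # which this sketch does not compute.
    return "other"


# One day of minute-level data: 1440 values, 500 of them above the threshold.
day = [1000.0] * 500 + [10.0] * 940
print(classify_magnitude(day, large_thres=500.0, medium_thres=50.0))
```

With 500 of the 1440 values above the large-magnitude reference value, the ratio is about 34.7%, which exceeds 30%, so the example day is reported as large magnitude.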

Further, the preset data classifier further includes an upper/lower line (that is, online/offline) type classification model, the extracted multidimensional timing characteristics further include upper/lower line time points, and the step of performing characteristic extraction processing on each time sequence in the time sequence set by using each classification model to obtain the multidimensional timing characteristics of each time sequence further includes:

step B1, carrying out extremum filtering processing on each time sequence in the time sequence set by using the upper/lower line classification model to obtain a plurality of characteristic sequences;

step B2, obtaining a subscript set of the characteristic sequence, traversing each sequence value of the characteristic sequence according to the subscript set, and calculating an upper/lower line time point from the characteristic sequence, wherein each subscript in the subscript set is the position of a sequence value in the characteristic sequence.

It can be understood that an online-type time series corresponds to data from a service that has just gone online: the magnitude of the time series data increases suddenly, so the magnitude difference between the data before and after the online time is large. Likewise, the magnitude difference between the data before and after the offline time is large. As a result, the abrupt change in data magnitude around the online/offline time is often falsely detected as an abnormal point and generates a false alarm. In this embodiment, an online/offline classification model is provided for extracting the online/offline time point, so as to classify the time series that contain sequence values at the online/offline time. Specifically, extremum filtering is first performed on the time series by using the online/offline classification model. The purpose of the extremum filtering is to eliminate the influence of the daily fluctuation of time series that are not of the online type on the classification. In this embodiment, the extremum filtering retains, for example, only the sequence values below the 85th percentile of the time sequence. After extremum filtering is performed on each time sequence, a plurality of characteristic sequences are obtained, and a subscript set of each characteristic sequence is then obtained, where the subscript set records the position of each sequence value in the sequence. The sequence values in the characteristic sequences are traversed according to the subscripts in the subscript set, thereby extracting the online/offline time points.

Further, with reference to equations 1-3:

ave_before < ave_after × α, ave_after > medium_thres (2)

ave_after < ave_before × β, ave_before > medium_thres (3)

A preferred extraction method of the upper/lower line time points is as follows: each sequence value in the time sequence is traversed, the time sequence is divided into a front subsequence and a rear subsequence according to the subscript of each sequence value in the time sequence, and the sequence means ave_before and ave_after of the two subsequences are calculated respectively. A break_alpha value is then calculated according to formula 1, the corresponding sequence value subscript index_i is recorded, and the minimum value of break_alpha is updated during the traversal. After the traversal is completed, if a minimum value of break_alpha exists such that the two sequence means satisfy the condition in formula 2, the subscript corresponding to the minimum value of break_alpha is the online time break_index; if a minimum value of break_alpha exists such that the two sequence means satisfy the condition in formula 3, the subscript corresponding to the minimum value of break_alpha is the offline time break_index.

Further, in the above formulas 2 and 3, medium_thres is the medium-magnitude sequence threshold, and α and β are adjustable parameters that can be customized according to the analysis of historical time sequence data. In this embodiment, α is set to 0.05 and β is set to 0.1.
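The traversal described above can be sketched in Python as follows. This is only an illustration under stated assumptions: formula 1 is not reproduced in this text, so the sketch uses the ratio of the smaller mean to the larger mean as a stand-in for break_alpha and breaks ties by the larger gap between the two means. The function name `find_break_index` is hypothetical, and the values and medium_thres are assumed to be non-negative.

```python
from typing import Optional, Sequence, Tuple


def find_break_index(
    values: Sequence[float],
    medium_thres: float,
    alpha: float = 0.05,
    beta: float = 0.1,
) -> Optional[Tuple[str, int]]:
    """Return ('online' | 'offline', break_index) or None if no break is found.

    break_index is the 0-based subscript of the first value of the rear
    subsequence. The score below is an ASSUMED stand-in for break_alpha.
    """
    n = len(values)
    total = float(sum(values))
    best = None  # (score, negative mean gap, kind, index)
    prefix = 0.0
    for i in range(1, n):
        prefix += values[i - 1]
        ave_before = prefix / i
        ave_after = (total - prefix) / (n - i)
        if ave_before < ave_after * alpha and ave_after > medium_thres:
            # Condition of formula 2: the series has just gone online.
            cand = (ave_before / ave_after, ave_before - ave_after, "online", i)
        elif ave_after < ave_before * beta and ave_before > medium_thres:
            # Condition of formula 3: the series has just gone offline.
            cand = (ave_after / ave_before, ave_after - ave_before, "offline", i)
        else:
            continue
        # Keep the smallest score; on ties prefer the larger gap between means.
        if best is None or cand[:2] < best[:2]:
            best = cand
    return None if best is None else (best[2], best[3])


series = [1.0] * 10 + [100.0] * 10
print(find_break_index(series, medium_thres=50.0))
```

In the example, the mean jumps from 1 to 100 at subscript 10, so the sketch reports an online break at that position. A series with no abrupt change in mean yields `None`.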

Further, the preset data classifier further includes a fluctuation type classification model, the extracted multidimensional time sequence features further include sequence similarity, and the step of performing feature extraction processing on each time sequence in the time sequence set by using each classification model to obtain the multidimensional time sequence features of each time sequence further includes:

step C1, acquiring a preset sequence sample set, wherein the sequence sample set is constructed based on a time sequence with a special waveform in a historical time sequence set;

and step C2, traversing each sample sequence in the sequence sample set by using the fluctuation type classification model, and calculating the sequence similarity between each time sequence in the time sequence set and each sample sequence.

The fluctuation type is mainly used for identifying time sequences of special shape, namely time sequences that have special waveform characteristics and a certain regularity but fluctuate randomly at local times. A set of special sample sequences is constructed, the similarity between the input time sequence and each sample sequence is calculated, and when the similarity is higher than a preset threshold the input sequence is defined as a similar sequence and classified as a fluctuation-type time sequence. A looser alarm strategy can be configured for fluctuation-type time sequences in subsequent stages such as anomaly detection. Further, in this embodiment, a time sequence with a special waveform refers to a time sequence with a certain regularity, and includes sample sequences customized according to actual requirements, sequences with a certain regularity summarized from historical time sequence data, and the like.

In this embodiment, a Fast Dynamic Time Warping (FastDTW) algorithm model is used, which is insensitive to stretching and compression of the sequence and can effectively and dynamically measure similar sequences that differ in length and in local fluctuation. Specifically, a sequence sample set of special sequences is constructed in advance based on historical time series data, the sample sequences in the sequence sample set are traversed by the fluctuation type classification model, and the similarity between the input time series and each sample sequence is calculated.
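To make the similarity step concrete, the following Python sketch computes a dynamic time warping distance and compares an input sequence with a sample set. It is an illustration only: it uses plain dynamic time warping with quadratic cost rather than the FastDTW approximation named above, and the mapping from distance to similarity, `1 / (1 + distance)`, the threshold value and the function names are all assumptions.

```python
from typing import Iterable, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity score in (0, 1]; 1.0 means a perfect warped match."""
    return 1.0 / (1.0 + dtw_distance(a, b))


def is_fluctuation_type(
    series: Sequence[float],
    samples: Iterable[Sequence[float]],
    sim_thres: float = 0.5,
) -> bool:
    """True if the series is similar enough to any sample sequence."""
    return any(similarity(series, s) > sim_thres for s in samples)


# The second sample is a stretched copy of the input, so it matches exactly.
samples = [[10.0, 20.0, 30.0], [1.0, 2.0, 2.0, 3.0]]
print(is_fluctuation_type([1.0, 2.0, 3.0], samples, sim_thres=0.9))
```

The example shows why warping matters here: the input and the second sample have different lengths, yet their warped distance is zero because one is a locally stretched version of the other.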

Further, the preset data classifier further includes an irregular classification model, the extracted multidimensional time sequence features further include sequence fluctuation factors, and the step of performing feature extraction processing on each time sequence in the time sequence set by using each classification model to obtain the multidimensional time sequence features of each time sequence further includes:

step D1, acquiring a first round of window parameters, setting the first round of window parameters as target window parameters, and dividing each time sequence in the time sequence set into a plurality of subsequences according to the irregular classification model and the target window parameters to obtain a subsequence set of each time sequence in the time sequence set;

step D2, calculating the distance between each subsequence in each subsequence set and the rest subsequences in the subsequence set to obtain the distance set of each time sequence;

step D3, calculating the target number of the distance values in the distance set which are larger than a preset sequence standard deviation threshold value, and calculating a first distance characteristic value according to the target number;

step D4, calculating a second round window parameter based on the first round window parameter, setting the second round window parameter as a target window parameter, returning and executing the step of dividing each time sequence in the time sequence set into a plurality of subsequences according to the target window parameter to obtain a subsequence set of each time sequence in the time sequence set, and obtaining a second distance characteristic value;

step D5, calculating a fluctuation factor of each time series according to the first distance feature value and the second distance feature value.

The irregular type is a time sequence type whose overall state is chaotic and whose fluctuation is irregular. A sequence fluctuation factor, namely a sample entropy value, is calculated by applying a sample entropy algorithm to measure the degree of irregular fluctuation of the sequence, and whether the time sequence is an irregular time sequence is identified based on a fluctuation statistical threshold derived from a large amount of historical time sequence data. The sequence fluctuation factor is calculated as follows (formulas 4-8):

Let the sequence values of the time sequence X be x(t), t = 1, 2, 3, ..., n.

Firstly, a first round of window parameters is obtained, where the window parameters include the window length. The first-round window length m is obtained and set as the target window parameter. Taking m as the fixed window length, sliding window processing is performed on the time sequence, dividing it into k = n - m + 1 subsequences as shown in formula 4, to obtain a subsequence set:

$$X_i(t) = \big(x_i(t),\, x_{i+1}(t),\, \ldots,\, x_{i+m-1}(t)\big) \qquad (4)$$

The distance between each subsequence in the subsequence set and the remaining n - m subsequences is calculated to obtain a distance set for each subsequence, where each distance in the distance set is the maximum absolute difference between the corresponding sequence values of the two subsequences, as shown in formula 5:

$$d_{ij} = \max \left| x_{i+k}(t) - x_{j+k}(t) \right|, \quad k = 0, 1, 2, \ldots, m-1 \qquad (5)$$

defining a sequence standard deviation threshold:

F＝r×SD (6)

where r is an adjustable coefficient that can be tuned according to the actual allowable deviation F and, based on analysis of historical time sequences, takes a value in the range 0.1 to 0.25, and SD is the standard deviation of the time sequence. Each subsequence X_i(t) is traversed, the target number of distance values greater than F in the distance set corresponding to that subsequence is counted, and a distance characteristic value is calculated from the target number. The distance characteristic value is the ratio of the target number to the number of distances in the distance set, recorded here as $B_i^m(t)$, i = 1, 2, ..., k. The k values $B_i^m(t)$ are averaged according to the following formula 7 to obtain the first distance characteristic value $\Phi^m(t)$:

$$\Phi^m(t) = \frac{1}{k} \sum_{i=1}^{k} B_i^m(t) \qquad (7)$$

A second round of window parameters is calculated based on the first-round window parameter m, for example by increasing the window length from m to m + 1 and setting the new window length m + 1 as the target window parameter. The step of dividing each time sequence in the time sequence set into a plurality of subsequences according to the irregular type classification model and the target window parameter is then executed again, and the above steps are repeated until the second distance characteristic value $\Phi^{m+1}(t)$ is calculated. A fluctuation factor is then calculated from the first distance characteristic value and the second distance characteristic value of each time sequence. One way of calculating the fluctuation factor is shown in formula 8:

$$\mathrm{SampEn}(t) = \ln \Phi^{m}(t) - \ln \Phi^{m+1}(t) \qquad (8)$$

The larger the value of the fluctuation factor, that is, the sample entropy SampEn(t), the greater the fluctuation degree of the time series and the more irregular its behavior, so a time series whose sample entropy is greater than the preset sample entropy threshold is classified as an irregular time series. In addition, when calculating the sequence fluctuation factor, the time series may be processed with further rounds of sliding windows, so that more distance characteristic values are obtained from sliding windows of different lengths and used to calculate the sample entropy that measures the fluctuation degree of the time series, which is not described herein again.
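The two-round window procedure can be sketched in Python as below. This is a hedged illustration, not the implementation of this embodiment. One assumption needs stating plainly: the sketch counts the distances that do not exceed F, as in the conventional sample entropy definition, so that larger results correspond to more irregular series; this differs from the "greater than F" wording used above, and the comparison can be changed in one line. The names `distance_feature` and `sample_entropy`, the default m = 2 and r = 0.2, and the use of the population standard deviation are also assumptions.

```python
import math
from statistics import pstdev
from typing import Sequence


def distance_feature(x: Sequence[float], m: int, tol: float) -> float:
    """Average, over all subsequences of length m, of the share of other
    subsequences whose Chebyshev distance is within tol (assumed direction)."""
    k = len(x) - m + 1
    if k < 2:
        raise ValueError("series too short for this window length")
    subs = [x[i:i + m] for i in range(k)]
    ratios = []
    for i in range(k):
        hits = 0
        for j in range(k):
            if i == j:
                continue  # a subsequence is not compared with itself
            d = max(abs(a - b) for a, b in zip(subs[i], subs[j]))
            if d <= tol:
                hits += 1
        ratios.append(hits / (k - 1))
    return sum(ratios) / k


def sample_entropy(x: Sequence[float], m: int = 2, r: float = 0.2) -> float:
    """ln(phi_m) - ln(phi_{m+1}) with tolerance F = r * SD."""
    tol = r * pstdev(x)
    phi_m = distance_feature(x, m, tol)
    phi_m1 = distance_feature(x, m + 1, tol)
    if phi_m == 0.0 or phi_m1 == 0.0:
        return math.inf  # no matches at all: treat as maximally irregular
    return math.log(phi_m) - math.log(phi_m1)


# A strictly periodic series is highly regular, so its value is close to zero.
periodic = [0.0, 1.0] * 20
print(sample_entropy(periodic))
```

The cost is quadratic in the series length, which is acceptable for a sketch. A series of unpredictable values produces a clearly larger result than the periodic example.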

Further, the preset data classifier further includes a stationary classification model, the extracted multidimensional timing characteristics further include extreme difference sequences, and the step of performing characteristic extraction processing on each time sequence in the time sequence set by using each classification model to obtain the multidimensional timing characteristics of each time sequence further includes:

step E1, performing low-pass filtering processing on each time sequence in the time sequence set by using the stable classification model to obtain a plurality of low-pass filtering sequences;

step E2, calculating a stationary magnitude threshold based on each sequence value in the low-pass filtering sequence;

step E3, taking each sequence value in the low-pass filtering sequence as a center, performing sliding window processing on the low-pass filtering sequence to obtain each window sequence of the low-pass filtering sequence;

step E4, traversing each window sequence, calculating a target difference value between a maximum value and a minimum value of sequence values in each window sequence, and creating a marker sequence, wherein the sequence length of the marker sequence is the same as the sequence length of the low-pass filtering sequence;

and step E5, if the target difference value is greater than the stationary magnitude threshold, marking the central sequence value of the target window sequence corresponding to the target difference value in the marking sequence to obtain an extreme value difference sequence.

When a stable classification model in a data classifier is used for carrying out feature extraction processing on a time sequence, firstly, whether the time sequence is a large-scale time sequence is judged, if not, the time sequence is not of a stable type, and if the time sequence is the large-scale time sequence, whether the time sequence is the stable time sequence needs to be further judged.

Specifically, each time series is first low-pass filtered to remove waveform components above a predetermined frequency, and a plurality of low-pass filtered sequences are obtained. Based on the sequence values of the low-pass filtered sequences, a stationary magnitude threshold is calculated. One way of calculating the stationary magnitude threshold is shown in the following formula 9:

y_thres＝max(200,ave(Xsmooth)×0.6) (9)

In formula 9, y_thres is the stationary magnitude threshold, Xsmooth is the low-pass filtered sequence, and ave() is the average of the sequence values of the low-pass filtered sequence. A weighted sequence average is calculated with a coefficient of 0.6. If the weighted sequence average is greater than 200, it is used as the stationary magnitude threshold; if it is less than or equal to 200, the stationary magnitude threshold is set to 200. The weighting coefficient 0.6 and the value 200 are adjustable parameters obtained by analyzing historical time sequences. They are given only as examples and can be adjusted according to actual needs in practical applications.

Sliding window processing is performed on the low-pass filtered sequence with the position of each sequence value as the center, that is, with the subscript i of each sequence value as the center and a fixed window length, to obtain the window sequences of each low-pass filtered sequence. At the same time, a marking sequence with the same length as the input low-pass filtered sequence is created. The window is slid from front to back according to the subscripts of the sequence values in the low-pass filtered sequence to obtain the corresponding window sequences, and the difference between the maximum value and the minimum value in each window sequence is calculated by traversing all sequence values in that window sequence. If the difference is greater than the stationary magnitude threshold, the window sequence is marked in the marking sequence according to its center subscript i. The marking mode includes but is not limited to initializing all sequence values in the marking sequence to 0 and setting the sequence value in the marking sequence whose subscript equals the window sequence center i to 1. Finally, whether the large-magnitude time sequence is a stationary time sequence is judged according to the proportion of marked sequence values in the marking sequence. Taking 0.07 as an example, if the number of sequence values equal to 1 in the marking sequence, as a proportion of the sequence length, is less than 0.07, the time sequence is regarded as a stationary time sequence; otherwise it does not belong to the stationary type.

In practical application, other marking modes may also be adopted. For example, after all sequence values in the marking sequence are initialized to 0, the sequence value in the marking sequence that has the same subscript as the central sequence value of the window sequence is set to a non-zero value, and the number of sequence values in the marking sequence that are not 0 is then counted through subscript traversal. In general, therefore, the low-pass filtered sequence is marked by means of the marking sequence, and different sequence values in the marking sequence are used to distinguish whether the extreme value difference in the window sequence centered on the sequence value with the same subscript in the low-pass filtered sequence is greater than the stationary magnitude threshold, so as to judge whether the large-magnitude time sequence is a stationary time sequence. In this embodiment, the window length of the sliding window processing may be customized according to historical experience or actual needs, which is not described herein again.
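Steps E1 to E5 and the final ratio check can be sketched in Python as follows. The sketch is illustrative only: a centered moving average stands in for the low-pass filter, which this text does not specify, the window lengths are arbitrary choices, the function names are hypothetical, and the preliminary check that the series is large magnitude is omitted.

```python
from typing import List, Sequence


def moving_average(x: Sequence[float], width: int = 5) -> List[float]:
    """Centered moving average, used here as an ASSUMED simple low-pass filter."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def stationary_threshold(smooth: Sequence[float],
                         floor: float = 200.0, coef: float = 0.6) -> float:
    """Formula 9: y_thres = max(200, ave(Xsmooth) * 0.6)."""
    return max(floor, coef * sum(smooth) / len(smooth))


def extreme_difference_marks(smooth: Sequence[float], window: int = 11) -> List[int]:
    """Mark position i with 1 when max - min of the window centered at i
    exceeds the stationary magnitude threshold, otherwise leave it at 0."""
    y_thres = stationary_threshold(smooth)
    half = window // 2
    marks = [0] * len(smooth)
    for i in range(len(smooth)):
        w = smooth[max(0, i - half): i + half + 1]
        if max(w) - min(w) > y_thres:
            marks[i] = 1
    return marks


def is_stationary(x: Sequence[float], mark_ratio_thres: float = 0.07) -> bool:
    """True when the share of marked positions is below the ratio threshold."""
    marks = extreme_difference_marks(moving_average(x))
    return sum(marks) / len(marks) < mark_ratio_thres


flat = [1000.0 + (i % 3) for i in range(200)]
steps = [1000.0 if (i // 10) % 2 == 0 else 5000.0 for i in range(200)]
print(is_stationary(flat), is_stationary(steps))
```

The first series barely moves relative to its mean, so no position is marked and it is reported as stationary. The second alternates between two very different levels, so most windows exceed the threshold and it is not.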

Furthermore, the preset data classifier further includes a constant type classification model, the extracted multidimensional time series features further include constant feature values, and the step of performing feature extraction processing on each time series in the time series set by using each classification model to obtain the multidimensional time series features of each time series further includes:

performing extremum filtering on the time sequence in a standard format and retaining only part of its sequence values, for example only the sequence values between the 5th and 95th percentiles, and then calculating the standard deviation and the mean of the resulting extremum filtering sequence as the constant characteristic values.

Whether the time series is a constant-type time series is judged according to whether the calculated constant characteristic values satisfy the condition shown in the following formula 11:

wherein Xi is the extremum filtering sequence obtained by filtering extreme values out of the time sequence in the standard format, std(Xi) is the standard deviation of the extremum filtering sequence, ave(Xi) is the mean of the extremum filtering sequence, and 0.5 and 50 are adjustable parameters given here only as examples. A sequence that satisfies the above condition is a constant time sequence; otherwise the sequence does not belong to the constant type.
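The extraction of the constant characteristic values can be sketched in Python as follows. The sketch covers only the feature extraction, that is, the percentile filtering followed by the standard deviation and mean; it does not encode the decision condition, which is not reproduced in this text. The percentile convention (nearest rank on the sorted values), the use of the population standard deviation and the function names are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def percentile_value(sorted_vals: Sequence[float], p: float) -> float:
    """Nearest-rank style percentile on a pre-sorted list (assumed convention)."""
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def extremum_filter(values: Sequence[float],
                    lo: float = 5, hi: float = 95) -> List[float]:
    """Keep only values between the lo-th and hi-th percentile values."""
    s = sorted(values)
    lo_v, hi_v = percentile_value(s, lo), percentile_value(s, hi)
    return [v for v in values if lo_v <= v <= hi_v]


def constant_features(values: Sequence[float]) -> Tuple[float, float]:
    """Return (std, mean) of the extremum filtering sequence."""
    kept = extremum_filter(values)
    return pstdev(kept), mean(kept)


# 98 identical readings plus one low and one high outlier.
series = [10.0] * 98 + [0.0, 1000.0]
print(constant_features(series))
```

In the example, both outliers fall outside the retained percentile range, so the filtered sequence contains only the value 10 and its standard deviation is zero.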

Further, time series that do not belong to the above-described various types are classified into special types of time series.

In this embodiment, feature extraction processing is performed on the time sequences by using different classification models to obtain the multi-dimensional time sequence features of each time sequence, and the multi-dimensional time sequence features are combined to classify each time sequence, so that the classification of the time sequences is effectively refined and the prediction requirements of complex scenes can be better met.

In addition, referring to fig. 3, an embodiment of the present invention further provides a time series data classification apparatus, where the time series data classification apparatus includes:

a data obtaining module 10, configured to obtain a time sequence set to be classified, where the time sequence set includes multiple time sequences;

the feature extraction module 20 is configured to perform multi-dimensional feature extraction on each time series in the time series set to obtain multi-dimensional time series features;

and the data classification module 30 is configured to classify each time sequence in the time sequence set according to the multidimensional time sequence feature.

Optionally, the feature extraction module 20 is further configured to:

Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the operations in the time series data classification method provided by the above-mentioned embodiments.

In addition, an embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the operations in the time series data classification method provided in the foregoing embodiments.

For the embodiments of the apparatus, the computer program product and the computer-readable storage medium of the present invention, reference may be made to the embodiments of the time series data classification method of the present invention, which are not described herein again.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity/action/object from another entity/action/object without necessarily requiring or implying any actual such relationship or order between such entities/actions/objects; the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, in that elements described as separate components may or may not be physically separate. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the time series data classification method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A time series data classification method is characterized by comprising the following steps:

2. The method for classifying time series data according to claim 1, wherein the step of extracting the multidimensional feature from each time series in the time series set to obtain the multidimensional time series feature comprises:

3. The method for classifying time series data according to claim 2, wherein the data classifier includes a magnitude classification model, the multi-dimensional time series features include a magnitude ratio, and the step of obtaining the multi-dimensional time series features of each time series by performing feature extraction processing on each time series in the time series set by using each classification model includes:

4. The method for classifying time series data according to claim 2, wherein the data classifier includes an upper/lower line classification model, the multi-dimensional time series feature includes an upper/lower line time point, and the step of obtaining the multi-dimensional time series feature of each time series by performing feature extraction processing on each time series in the time series set by using each classification model includes:

5. The method for classifying time series data according to claim 2, wherein the data classifier includes a fluctuation type classification model, the multi-dimensional time series features include sequence similarity, and the step of obtaining the multi-dimensional time series features of each time series by performing feature extraction processing on each time series in the time series set by using each classification model includes:

6. The method for classifying time series data according to claim 2, wherein the data classifier includes an irregular classification model, the multi-dimensional time series features include a series fluctuation factor, and the step of obtaining the multi-dimensional time series features of each time series by performing feature extraction processing on each time series in the time series set by using each classification model comprises:

7. The method for classifying time series data according to claim 2, wherein the data classifier includes a stationary classification model, the multi-dimensional time series features include an extreme value difference sequence, and the step of obtaining the multi-dimensional time series features of each time series by performing feature extraction processing on each time series in the time series set by using each classification model includes:

8. A time-series data classification apparatus, characterized by comprising:

9. A terminal device, characterized in that the terminal device comprises: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the time-series data classification method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the time-series data classification method according to any one of claims 1 to 7.