CN111814897A - Time series data classification method based on multi-level shapelet - Google Patents

Time series data classification method based on multi-level shapelet

Info

Publication number
CN111814897A
Authority
CN
China
Prior art keywords
shapelet
candidate
time series
data
subsequence
Prior art date
Legal status
Pending
Application number
CN202010696976.2A
Other languages
Chinese (zh)
Inventor
丁琳琳
脱乃元
曹鲁杰
张翰林
宋宝燕
Current Assignee
Liaoning University
Original Assignee
Liaoning University
Priority date
Filing date
Publication date
Application filed by Liaoning University
Priority to CN202010696976.2A
Publication of CN111814897A
Legal status: Pending


Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F18/00 Pattern recognition > G06F18/20 Analysing
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23213 Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A time series data classification method based on multi-level shapelets includes: step 1) preprocessing the time series data: performing dimensionality reduction on the original time series with the SAX method; step 2) obtaining the initial subsequences of the time series: extracting the subsequence set from the time series with a sliding window, and indirectly controlling the extracted subsequence length by adjusting the window size; step 3) discovering and extracting the multi-level shapelet candidate set: filtering and merging candidates through the proposed multi-level shapelet framework, and selecting the shapelets with large information gain as the candidate set; step 4) shapelet transformation and classifier construction. The method provides an efficient multi-level shapelet candidate-set filtering model that effectively reduces the number of shapelet candidates and rapidly screens out shapelet sets with strong classification ability, and then achieves effective classification of the time series data through an ELM classifier.

Description

Time series data classification method based on multi-level shapelet
Technical Field
The invention belongs to the field of time series data mining, relates to a time series data classification method, and particularly relates to a time series data classification method based on multi-level shapelets.
Background
Time series data generally represent observations of an underlying process sampled at a set frequency over equally spaced time periods, and arise in fields such as medical diagnosis, disaster prediction, and commercial monitoring. Time series data are typically voluminous, high-dimensional, and rapidly updated. Time series classification has long been a central problem in time series data mining and has a wide range of applications. Within this task, the shapelet technique is an effective approach: shapelet-based classification offers strong interpretability, high operating efficiency, and high classification accuracy, can clearly reflect the category to which the data belong, and makes the expected classification effect intuitive. However, existing shapelet acquisition methods still suffer from overly large candidate sets and excessive computation. On the one hand, the shapelet method must judge the discriminative ability of the subsequences in the time series one by one, and the distance-similarity calculations incur a large computational cost, increasing the complexity of the discovery process; on the other hand, the excessive number of subsequences generates a large number of candidate and alternative shapelet sequences, so direct calculation consumes a great deal of time and poses a major challenge to shapelet discovery and extraction. In addition, the ELM, a single-hidden-layer feedforward neural network (SLFN), trains quickly and classifies accurately, is widely applied in many fields, and allows the time series classification problem to be combined with an existing classifier. The invention therefore proposes a multi-level shapelet candidate-set extraction method addressing the huge cost of shapelet candidate-set computation, and applies an ELM classifier to the classification of time series data.
Since time series data generally lack direct features, and even latent features remain high-dimensional after complex feature selection, dimensionality reduction is usually required before time series classification. Widely used dimensionality-reduction and classification methods include PAA, SAX, and the shapelet method. The essence of the shapelet method is to map the time series data from the original input space to a new feature space; however, it ignores the temporal order of different variables in the original series, and in generating the shapelet candidate set, judging subsequences one by one may overlook approximate relations between shapelets, at a very large cost in time and computation.
Disclosure of Invention
In order to overcome the defects of conventional shapelet classification methods for time series, the invention provides a time series data classification method based on multi-level shapelets, which can quickly and effectively solve the problem of accurately classifying high-dimensional time series data.
The purpose of the invention is realized by the following technical scheme:
A time series data classification method based on multi-level shapelets is characterized by comprising the following steps:
step 1) preprocessing the time series data: performing dimensionality reduction on the original time series with the SAX method;
step 2) obtaining the initial subsequences of the time series: extracting the subsequence set from the time series with a sliding window, and indirectly controlling the extracted subsequence length by adjusting the window size;
step 3) discovering and extracting the multi-level shapelet candidate set: filtering and merging candidates through the proposed multi-level shapelet framework, and selecting the shapelets with large information gain as the candidate set;
step 4) shapelet transformation and classifier construction:
4.1) shapelet transformation: first, a simple initialized data matrix is established for the initial N time series data sets according to their number, and all shapelet candidate sets obtained by the multi-level framework method are arranged into a matrix in the order of the time series to which they belong; second, according to the many-to-many mapping between the initial N time series sets and the shapelet matrix, Euclidean-distance similarity is computed to obtain the feature values of each time series, where each feature attribute represents one shapelet and the value of each attribute is the distance from that shapelet to the original series; finally, the feature values are assembled into N feature vectors, completing the feature-vector representation of the time series data set;
4.2) after the classifier is established for the time series, the subsequent training samples are fed into the classifier for training; during training, the ELM first randomly generates the input weights and hidden-node thresholds, and then computes the output weights of the SLFN from the training data.
In step 1), the concrete steps are as follows:
1.1) normalized piecewise approximation of the data: the initial time series data are transformed into a data set with mean 0 and variance 1 by the zero-mean (z-score) standardization method;
1.2) character representation of the processed data: the mean of each segment is mapped into a Gaussian distribution table whose range represents the expression range of the dimensionality-reduced time series, and the symbolization operation is performed according to the initialized parameters: the number of segments w, the alphabet size r, and the breakpoints β, completing the symbolic aggregate approximation.
In step 2), the concrete steps are as follows:
First, the size of the sliding window is set, fixing the length and range of each extracted subsequence; second, the window is slid one position to the right at a time, changing its location in the time series and extracting the subsequences at different positions; finally, the window size is adjusted to extract all subsequences of different lengths, and the extracted subsequences are stored in a set.
In step 3), the concrete steps are as follows:
3.1) initial subsequence clustering based on k-means: after the subsequences of all time series are extracted, the candidate subsequences are clustered. The DTW distance is introduced as the measurement index to filter and screen the subsequence set: the DTW distance expresses the similarity of subsequence shapes, and all alternative shapelet candidates are partitioned with the DTW-based algorithm so that the candidates within the same cluster have similar shape characteristics;
the method comprises the steps of calculating the similarity of shape based on DTW distance, setting two different shape sequences, namely X1 { X1, X2, … xM }, Y1 { Y1, Y2, …, yN }, and firstly calculating a distance matrix
Figure BDA0002591600970000031
Then calculating the accumulated distance matrix Sij=Dij+min(si,j-1,Si,j-1,Si-1,j-1)
3.2) updating the clustering result: after the subsequence candidate set is clustered by combining the k-means and DTW methods, the clustering result is iterated and updated in real time to ensure that the subsequence clusters satisfy the shape-approximation property, enabling a definite partition in the subsequent shapelet candidate extraction;
3.3) establishing the multi-level shapelet extraction framework.
In step 3.3), the concrete steps are as follows:
3.3.1) intra-level candidate set merging: first, the hierarchical division of all clustered subsequences is completed according to the 'heaps' (clusters) produced by subsequence clustering; second, the candidates within a level are consolidated through their inherent 'approximation' relation: screening by the approximate shape characteristics, candidates with similar shapes are merged, while candidates with distinctive shape features, which carry discriminative ability and interpretability, are updated and retained. A threshold on the DTW distance is given; two candidates whose distance is below the threshold are very similar in shape, and only a representative of the candidates within the threshold range is retained, reducing the set. Finally, a simplified shapelet candidate set is obtained at each level;
3.3.2) inter-level candidate set merging: in the SH-ELM model, the Levenshtein Distance algorithm is used to merge candidate sets across levels. Let the lengths of two strings a and b be |a| and |b|; the Levenshtein distance is computed by the recurrence

$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0\\ \min\bigl(\mathrm{lev}_{a,b}(i-1,j)+1,\ \mathrm{lev}_{a,b}(i,j-1)+1,\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\bigr), & \text{otherwise}\end{cases}$$

where the indicator $1_{(a_i\neq b_j)}$ takes the value 0 when $a_i=b_j$ and 1 otherwise, and $\mathrm{lev}_{a,b}(i,j)$ is the edit distance between the first i characters of a and the first j characters of b. The similarity of a and b is then

$$\mathrm{Sim}_{a,b}=1-\mathrm{lev}_{a,b}(|a|,|b|)/\max(|a|,|b|)$$
In the merging process, candidate sets in adjacent levels of the framework are joined and compared: candidates between levels are screened with the Levenshtein Distance method by means of this character-based approximate distance calculation;
3.3.3) multi-level top-k candidate set confirmation: the information gain is taken as the criterion for measuring classification ability, and the k shapelets with the largest information gain within a single level are selected to complete the extraction task; the extracted candidate set is finally confirmed and used to complete the classification of the candidate time series. This is the process of extracting the k best shapelets from the data set: initially the k-shapelet set is empty; then a candidate shapelet sequence is obtained in each level, and its distance to the series in that level must be computed; once the distance values are obtained, the corresponding information gain is computed, the candidates are sorted by information gain, the replacement of the best shapelet candidates is completed, and the best k shapelets are finally output.
The beneficial effects of the invention are as follows: the invention provides a time series data classification method based on multi-level shapelets and designs an efficient multi-level shapelet candidate-set filtering model, which effectively reduces the number of shapelet candidates, rapidly screens out shapelet sets with strong classification ability, and then achieves effective classification of time series data through an ELM classifier.
Drawings
FIG. 1: workflow of the time series SH-ELM model of the invention.
FIG. 2: staged schematic diagram of the time series data classification model of the invention.
FIG. 3: schematic diagram of the time series SAX symbolization dimensionality-reduction representation method of the invention.
FIG. 4: schematic diagram of the execution of subsequence candidate set extraction in the invention.
FIG. 5: schematic diagram of the model structure of the multi-level shapelet framework of the invention.
FIG. 6a: effect of varying the k value on classification running time on the real data set in the experiments of the invention.
FIG. 6b: effect of varying the k value on classification running time on the synthetic data set in the experiments of the invention.
FIG. 7a: effect of varying the k value on classification accuracy on the real data set in the experiments of the invention.
FIG. 7b: effect of varying the k value on classification accuracy on the synthetic data set in the experiments of the invention.
FIG. 8a: effect of varying the number of levels on classification running time on the real data set in the experiments of the invention.
FIG. 8b: effect of varying the number of levels on classification running time on the synthetic data set in the experiments of the invention.
FIG. 9: influence of time series length on classification running time in the experiments of the invention.
FIG. 10: influence of time series length on classification accuracy in the experiments of the invention.
Detailed Description
First, some related concepts are given. If the neighborhood of a shapelet object contains at least a minimum number of data objects, that shapelet object is the cluster center in its shape direction. Definitions 1 and 2 give the meaning of time series and subsequences, and Definitions 3 and 4 define the candidate set and the distance calculation.
Definition 1: time series and subsequence. Given a time series T of length m, each subsequence S of T is a continuous segment starting from any position of T. Assuming the subsequence has length l and extraction begins at point p, the subsequence is denoted $S = t_p, \ldots, t_{p+l-1}$, where the extraction range of p is $1 \le p \le m - l + 1$.
Definition 2: sliding window. Given a time series T of length m and a defined subsequence length l, all possible subsequences can be extracted by sliding a window of size l over T. The superscript l and the subscript p denote the subsequence extraction length and the starting position of the sliding window in the time series, respectively. The set of all subsequences of length l extracted from T is defined as

$$ST^{l}=\{\,S^{l}_{p}\mid 1\le p\le m-l+1\,\}$$

and the set of subsequences represented by sliding windows of all lengths is

$$ST=\bigcup_{l} ST^{l}$$
Definition 3: distance between time series. Dist(T, R) is a distance function that performs the distance operation on two time series T and R of the same length and returns a distance value d, the distance between the two series. The Dist function can also measure the distance between two subsequences of the same length.
Definition 4: distance from a time series to a subsequence. SubsequenceDist(T, S) is a distance function that takes the time series T and the subsequence S as inputs and returns a non-negative value d, the distance between the time series and the subsequence: SubsequenceDist(T, S) = min(Dist(S, S′)) over all subsequences S′ of T of the same length as S.
Definition 5: dimensionality-reduction normalization of the time series. To obtain an effective reduced-dimension feature representation of the time series data, a z-normalization is applied and the time series of length m is then converted into w symbols. The time series data T are initialized and converted into a standard sequence, and the series is divided into w equal-sized segments, i.e. $C = c_1, c_2, \ldots, c_w$. The dimension-reduction normalization (the mean of the i-th segment) is expressed as

$$\bar{c}_i = \frac{w}{m}\sum_{j=\frac{m}{w}(i-1)+1}^{\frac{m}{w}i} t_j$$
A time series data classification method based on multi-level shapelets is characterized by comprising the following steps:
step 1) preprocessing the time series data: performing dimensionality reduction on the original time series with the SAX method:
1.1) normalized piecewise approximation of the data: the initial time series data are transformed into a data set with mean 0 and variance 1 by the zero-mean (z-score) standardization method;
1.2) character representation of the processed data: the mean of each segment is mapped into a Gaussian distribution table whose range represents the expression range of the dimensionality-reduced time series, and the symbolization operation is performed according to the initialized parameters: the number of segments w, the alphabet size r, and the breakpoints β, completing the symbolic aggregate approximation.
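As an illustration of step 1, the following is a minimal Python sketch of the SAX preprocessing just described: z-normalization, piecewise aggregate segmentation into w segments, and symbolization against Gaussian breakpoints. The breakpoints shown are the standard quartiles of N(0, 1) for an alphabet of size r = 4; the function name and parameter defaults are illustrative assumptions, not values from the patent.

```python
import numpy as np

def sax_transform(series, w=8, breakpoints=(-0.6745, 0.0, 0.6745)):
    """Z-normalize, reduce to w PAA segment means, map each mean to a symbol."""
    t = np.asarray(series, dtype=float)
    t = (t - t.mean()) / t.std()                 # 1.1) mean 0, variance 1
    segments = np.array_split(t, w)              # piecewise aggregate segments
    means = np.array([seg.mean() for seg in segments])
    # 1.2) symbolize: breakpoints split N(0,1) into r equiprobable regions
    indices = np.searchsorted(breakpoints, means)
    return "".join(chr(ord("a") + i) for i in indices)
```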
Step 2) obtaining the initial subsequences of the time series: extracting the subsequence set from the time series with a sliding window, and indirectly controlling the extracted subsequence length by adjusting the window size; the concrete steps are as follows:
First, the size of the sliding window is set, fixing the length and range of each extracted subsequence; second, the window is slid one position to the right at a time, changing its location in the time series and extracting the subsequences at different positions; finally, the window size is adjusted to extract all subsequences of different lengths, and the extracted subsequences are stored in a set.
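A minimal sketch of step 2, continuing the assumptions above: a stride-1 window slides over the symbolic series, and the window size is varied to collect subsequences of every candidate length (the length bounds here are illustrative choices).

```python
def sliding_subsequences(symbolic_series, min_len=3, max_len=5):
    """Extract every subsequence of every candidate length with stride 1."""
    subsequences = []
    for l in range(min_len, max_len + 1):              # adjust the window size
        for p in range(len(symbolic_series) - l + 1):  # slide right by one
            subsequences.append(symbolic_series[p:p + l])
    return subsequences

# e.g. sliding_subsequences("dcbbacd", 3, 3) yields ['dcb', 'cbb', 'bba', 'bac', 'acd']
```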
Step 3), discovery and extraction of a multi-level shape candidate set: filtering and combining the candidate set through the proposed multi-level shape frame, and selecting a shape with large information gain as a candidate set;
3.1) initial subsequence clustering based on k-means: after extracting subsequences of all time sequences, clustering candidate subsequences, introducing a DTW distance measurement calculation mode as a measurement index, filtering and screening a subsequence set, wherein DTW distances represent the similarity of subsequence shapes, and dividing all alternative shape candidate sets by adopting a DTW algorithm to enable the shape candidate sets in the same cluster to have similar characteristics in shape;
The shapelet similarity is computed from the DTW distance. Given two different shapelet sequences $X = \{x_1, x_2, \ldots, x_M\}$ and $Y = \{y_1, y_2, \ldots, y_N\}$, first compute the distance matrix

$$D_{ij} = (x_i - y_j)^2$$

and then the accumulated distance matrix

$$S_{ij} = D_{ij} + \min(S_{i-1,j},\ S_{i,j-1},\ S_{i-1,j-1})$$
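The following sketch implements the two matrices just defined: the pointwise distance matrix D and the accumulated matrix S, whose final entry is the DTW distance used in step 3.1. The squared pointwise distance is an assumption consistent with the recurrence above; in the clustering step this function would replace the Euclidean metric when assigning subsequences to k-means cluster centers.

```python
import numpy as np

def dtw_distance(x, y):
    """D[i,j] = (x_i - y_j)^2, then S[i,j] = D[i,j] + min of the three neighbors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = (x[:, None] - y[None, :]) ** 2
    S = np.full((len(x), len(y)), np.inf)
    S[0, 0] = D[0, 0]
    for i in range(len(x)):
        for j in range(len(y)):
            if i == 0 and j == 0:
                continue
            prev = min(S[i - 1, j] if i > 0 else np.inf,
                       S[i, j - 1] if j > 0 else np.inf,
                       S[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            S[i, j] = D[i, j] + prev
    return S[-1, -1]    # accumulated warping cost between the two shapelets
```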
3.2) updating the clustering result: after the subsequence candidate set is clustered by combining the k-means and DTW methods, the clustering result is iterated and updated in real time to ensure that the subsequence clusters satisfy the shape-approximation property, enabling a definite partition in the subsequent shapelet candidate extraction;
3.3) establishing the multi-level shapelet extraction framework:
3.3.1) intra-level candidate set merging: first, the hierarchical division of all clustered subsequences is completed according to the 'heaps' (clusters) produced by subsequence clustering; second, the candidates within a level are consolidated through their inherent 'approximation' relation: screening by the approximate shape characteristics, candidates with similar shapes are merged, while candidates with distinctive shape features, which carry discriminative ability and interpretability, are updated and retained. A threshold on the DTW distance is given; two candidates whose distance is below the threshold are very similar in shape, and only a representative of the candidates within the threshold range is retained, reducing the set. Finally, a simplified shapelet candidate set is obtained at each level;
3.3.2) inter-level candidate set merging: in the SH-ELM model, the Levenshtein Distance algorithm is used to merge candidate sets across levels. Let the lengths of two strings a and b be |a| and |b|; the Levenshtein distance is computed by the recurrence

$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0\\ \min\bigl(\mathrm{lev}_{a,b}(i-1,j)+1,\ \mathrm{lev}_{a,b}(i,j-1)+1,\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\bigr), & \text{otherwise}\end{cases}$$

where the indicator $1_{(a_i\neq b_j)}$ takes the value 0 when $a_i=b_j$ and 1 otherwise, and $\mathrm{lev}_{a,b}(i,j)$ is the edit distance between the first i characters of a and the first j characters of b. The similarity of a and b is then

$$\mathrm{Sim}_{a,b}=1-\mathrm{lev}_{a,b}(|a|,|b|)/\max(|a|,|b|)$$
In the merging process, candidate sets in adjacent levels of the framework are joined and compared: candidates between levels are screened with the Levenshtein Distance method by means of this character-based approximate distance calculation;
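A minimal sketch of the inter-level merge test of step 3.3.2, implementing the recurrence and the similarity Sim above with the usual two-row dynamic program; the 0.8 merge threshold is an illustrative assumption, not a value given in the patent.

```python
def levenshtein(a, b):
    """Edit distance lev_{a,b}(|a|, |b|) via the recurrence above."""
    prev = list(range(len(b) + 1))                 # lev(0, j) = j
    for i, ca in enumerate(a, 1):
        cur = [i]                                  # lev(i, 0) = i
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1            # indicator 1_(a_i != b_j)
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + cost))    # substitution
        prev = cur
    return prev[-1]

def should_merge(a, b, threshold=0.8):
    """Merge two candidates from adjacent levels when Sim_{a,b} is high."""
    sim = 1 - levenshtein(a, b) / max(len(a), len(b))
    return sim >= threshold
```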
3.3.3) multi-level top-k candidate set confirmation: the information gain is taken as the criterion for measuring classification ability, and the k shapelets with the largest information gain within a single level are selected to complete the extraction task; the extracted candidate set is finally confirmed and used to complete the classification of the candidate time series. This is the process of extracting the k best shapelets from the data set: initially the k-shapelet set is empty; then a candidate shapelet sequence is obtained in each level, and its distance to the series in that level must be computed; once the distance values are obtained, the corresponding information gain is computed, the candidates are sorted by information gain, the replacement of the best shapelet candidates is completed, and the best k shapelets are finally output.
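A minimal sketch of the top-k confirmation of step 3.3.3: each surviving candidate is scored by the information gain of its best distance split over the labelled training series, and the k highest-gain candidates are kept. Here `distances_to`, assumed to return the SubsequenceDist from every training series to a candidate as a NumPy array, is a hypothetical callback standing in for the rest of the model.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split_gain(distances, labels):
    """Try each threshold d_th; return the largest information gain Gain(S, d_th)."""
    distances, labels = np.asarray(distances), np.asarray(labels)
    base, n, best = entropy(labels), len(labels), 0.0
    for d_th in np.unique(distances):
        left, right = labels[distances < d_th], labels[distances >= d_th]
        if len(left) == 0 or len(right) == 0:
            continue                               # degenerate split, skip
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        best = max(best, gain)
    return best

def top_k_shapelets(candidates, distances_to, labels, k):
    """Rank candidates by best-split information gain and keep the top k."""
    return sorted(candidates,
                  key=lambda c: best_split_gain(distances_to(c), labels),
                  reverse=True)[:k]
```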
Step 4) shapelet transformation and classifier construction:
4.1) shapelet transformation: first, a simple initialized data matrix is established for the initial N time series data sets according to their number, and all shapelet candidate sets obtained by the multi-level framework method are arranged into a matrix in the order of the time series to which they belong; second, according to the many-to-many mapping between the initial N time series sets and the shapelet matrix, Euclidean-distance similarity is computed to obtain the feature values of each time series, where each feature attribute represents one shapelet and the value of each attribute is the distance from that shapelet to the original series; finally, the feature values are assembled into N feature vectors, completing the feature-vector representation of the time series data set;
4.2) after the classifier is established for the time series, the subsequent training samples are fed into the classifier for training; during training, the ELM first randomly generates the input weights and hidden-node thresholds, and then computes the output weights of the SLFN from the training data.
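A minimal sketch of the ELM training of step 4.2, assuming a single hidden layer with sigmoid activation: the input weights W and hidden-node biases b are drawn at random and never trained, and the output weights beta are the least-squares solution obtained with the pseudo-inverse of the hidden-layer output H. The layer size and seed are illustrative. Here X would be the N x k shapelet-transformed feature matrix of step 4.1 and Y a one-hot label matrix, with the predicted class taken as the argmax of each output row.

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))  # sigmoid layer H

    def fit(self, X, Y):
        # input weights and hidden-node thresholds: random, never trained
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ Y            # output weights of the SLFN
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta           # argmax per row gives the class
```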
Example 1:
(1) A multi-level time series shapelet classification model is constructed, comprising three stages; the overall workflow of the model is shown in FIG. 1 and its staged work distribution in FIG. 2, the stages being a candidate set acquisition stage, a multi-level screening stage, and a shapelet conversion stage. First, the time series is reduced in dimension with the SAX algorithm and subsequences are extracted from the series with the sliding-window method; then the shapelet candidate set is clustered with the DTW-based clustering method.
FIG. 3 shows an example of extracting subsequences with the sliding-window method. The symbolic representation of the time series data after SAX dimensionality reduction is {dcbbacdcbdcacd}; the sliding window size W is set to 3, and the subsequences extracted in order from the left are {dcb, cbb, bba, ..., acd}. The extracted subsequences are assembled into a set, which is then partitioned into shape candidates by the DTW-based clustering method.
(2) The multi-level shapelet framework is constructed; candidate subsequences with similar shapes can form within each time series. These very similar candidate subsequences are mapped into one level by the DTW clustering method, and through the hierarchical combination of multiple levels the SH-ELM framework model is constructed.
Definition 6: information gain. Given a splitting strategy sp that divides the data set D into two subsets D1 and D2, let I(D) be the entropy before splitting and $\hat{I}(D)$ the weighted entropy after splitting, where f(D1) and f(D2) are the fractions of objects assigned to D1 and D2. The information gain of the splitting rule is

$$\mathrm{Gain}(sp) = I(D) - \hat{I}(D) = I(D) - \bigl(f(D_1)\,I(D_1) + f(D_2)\,I(D_2)\bigr)$$
Definition 7: optimal split point. The time series data set D consists of two classes A and B. For a shapelet candidate S, a distance threshold $d_{th}$ is selected that divides D into D1 and D2 such that for each time series object T1 in D1, SubsequenceDist(T1, S) < $d_{th}$, and for each time series object T2 in D2, SubsequenceDist(T2, S) ≥ $d_{th}$. The optimal split point $d_{osp}(D, S)$ is the distance threshold satisfying Gain(S, $d_{osp}(D, S)$) ≥ Gain(S, $d'_{th}$) for every other threshold $d'_{th}$.
Definition 8: shapelet with the optimal split point. Given a time series data set D consisting of two classes A and B, shapelet(D) is the subsequence whose corresponding optimal split point satisfies Gain(shapelet(D), $d_{osp}(D, \mathrm{shapelet}(D))$) ≥ Gain(S, $d_{osp}(D, S)$) for every candidate subsequence S.
FIG. 4 shows the architecture of the multi-level shapelet framework model. It can be seen that candidate sets with very similar shapes in the 5 time series (T1, T2, T3, T4, T5) are assigned to the same level, and the candidate shapelet set of each time series is effectively partitioned by shape, forming the layout after DTW clustering; the shapelet candidates in each level are then selected with the information-gain computation of Definitions 6-8, keeping the k shapelets with the largest information gain.
FIG. 5 shows the hierarchical relations generated in the multi-level model; the candidate shapelet sets in each level are processed with the Levenshtein Distance algorithm. Computing the distance between the subsequences dbdbdb and dbdcb produced in FIG. 4 allows the candidate set to be filtered further, so that the filtered subsequence set still preserves the information-gain effect: the retained subsequence reduces the later computation through these string operations, again reducing the later classification time to a certain extent and improving classification efficiency.
(3) Using the shapelet transformation technique, matrix conversion is performed between the obtained shapelet set and the initial time series set to obtain the spatial feature vectors used as the input of the ELM classifier, directly linking the constructed SH-ELM model to the classifier. For each time series $T_i$, the distances to the k shapelets are computed in turn and combined into a distance vector; this vector is the representation of the series among the shapelets, and each value in the vector reflects the sequential relation of the data in the time series.
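A minimal sketch of this transformation step, under the same assumptions as the sketches above and with numeric (or symbol-decoded) sequences: SubsequenceDist of Definition 4 is realized as the minimum Euclidean distance over all alignments of the shapelet within the series, assuming each shapelet is no longer than the series, and each series becomes its k-dimensional distance vector. The resulting N x k matrix is the feature representation fed to the ELM classifier.

```python
import numpy as np

def subsequence_dist(series, shapelet):
    """SubsequenceDist(T, S): minimum Euclidean distance over all alignments."""
    t, s = np.asarray(series, float), np.asarray(shapelet, float)
    l = len(s)
    return min(float(np.linalg.norm(t[p:p + l] - s))
               for p in range(len(t) - l + 1))

def shapelet_transform(dataset, shapelets):
    """Map each series T_i to its distance vector over the k shapelets."""
    return np.array([[subsequence_dist(t, s) for s in shapelets]
                     for t in dataset])
```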
In terms of parameter settings, the shapelet candidate generation process involves many factors. Because the influence of shapelet length is removed by clustering the shapelets with the DTW distance, the influence of the k value is considered instead: the number of shapelets is determined by the key parameter k, and k is also a factor affecting the experiment time, so the effect of different k values on classification accuracy is tested.
(4) Performing time series data classification
The performance of the screening optimization method is evaluated in detail under various experimental settings, including the influence of different parameters of the SH-ELM model on classification accuracy and extraction speed. The parameter settings include the choice of k, the size of the data sets, and the length of each data set; the classification experiments of the SH-ELM model cover many types of data sets. The proposed SH-ELM model is evaluated against several current mainstream time series shapelet extraction and classification algorithms, including FSH (Fast-Shapelet), DivShap-ELM (Diversified Top-k Shapelet), and FLAG. The experiments cover 4 aspects, described below.
varying influence of changing k-value parameter
FIGS. 6a-6b show the effect of varying the k value of the SH-ELM model on running time on the real and synthetic data sets, with the k value on the horizontal axis and the running time on the vertical axis. As k increases, the computation time of the proposed SH-ELM algorithm on both data sets is better than that of the DivShap-ELM algorithm, both early and late in the range of k values. The proposed SH-ELM algorithm screens redundant candidates while building the subsequence candidate set and reduces the amount of shapelet-extraction computation, thereby reducing the time consumed.
The experimental graphs in FIGS. 7a-7b show the influence of varying k on time series classification accuracy and compare the proposed SH-ELM model with the other algorithms. As k varies, classification accuracy on the time series data sets improves, and the algorithm's accuracy is better than that of the FSH and FLAG algorithms, indicating that the shapelet set at this point has good discriminative ability on the data sets. The SH-ELM algorithm applies the candidate-set update strategy, replacing and optimizing the shapelet set in the model, which improves the adaptivity and accuracy of the classifier.
Influence of varying the number of levels
In the experimental graphs of FIGS. 8a-8b, the horizontal axis is the number of levels and the vertical axis the running time; the figures show the effect of the number of levels on classification running time on the real and synthetic data sets, respectively. As the number of levels increases, the running time of classification rises: with more levels, the assignment of candidates to levels becomes more precise and the required computation grows, so the running time of the model increases accordingly.
Comparison of model classification time and speed
Since the SH-ELM model spends the most time in the shapelet classification experiment stage, the experiments compare the time and speed of this stage with other mainstream shapelet extraction methods. FIG. 9 compares the classification running times; the algorithm of the invention is compared with the Fast-Shapelet (FSH), DivShap-ELM, and FLAG algorithms. In the SH-ELM model, the SAX dimensionality-reduction method first effectively reduces the dimension of the whole time series data set, making the search in the shapelet discovery phase faster; second, in the shapelet search stage, the DTW-clustering shapelet method replaces the original algorithm, traversing the subsequences in turn and then searching out the candidate set during shapelet discovery. The DivShap-ELM algorithm starts computing subsequences at the initial part of the whole time series. As the length of the time series data increases, the running time of the proposed algorithm is shorter than that of the original Fast-Shapelet and DivShap-ELM algorithms.
Comparison of model classification accuracy
FIG. 10 shows the classification accuracy of the different time series classification algorithms. As the time series length increases, the accuracy of the various algorithms fluctuates, and the other comparison algorithms outperform the proposed SH-ELM model in the intermediate stage of the classification task, because changing the length of the time series affects the extraction of shapelet subsequences to a certain extent. It can also be seen that, as the length of the time series data increases, the classification accuracy of the SH-ELM model is higher in the initial and later stages, because the multi-level model can select subsequences of lower mutual similarity, making time series classification more accurate, slightly above FSH and DivShap-ELM.

Claims (5)

1. A time series data classification method based on multi-level shapelets, characterized by comprising the following steps:
step 1) preprocessing the time series data: performing dimensionality reduction on the original time series with the SAX method;
step 2) obtaining the initial subsequences of the time series: extracting the subsequence set from the time series with a sliding window, and indirectly controlling the extracted subsequence length by adjusting the window size;
step 3) discovering and extracting the multi-level shapelet candidate set: filtering and merging candidates through the proposed multi-level shapelet framework, and selecting the shapelets with large information gain as the candidate set;
step 4) shapelet transformation and classifier construction:
4.1) shapelet transformation: first, a simple initialized data matrix is established for the initial N time series data sets according to their number, and all shapelet candidate sets obtained by the multi-level framework method are arranged into a matrix in the order of the time series to which they belong; second, according to the many-to-many mapping between the initial N time series sets and the shapelet matrix, Euclidean-distance similarity is computed to obtain the feature values of each time series, where each feature attribute represents one shapelet and the value of each attribute is the distance from that shapelet to the original series; finally, the feature values are assembled into N feature vectors, completing the feature-vector representation of the time series data set;
4.2) after the classifier is established for the time series, the subsequent training samples are fed into the classifier for training; during training, the ELM first randomly generates the input weights and hidden-node thresholds, and then computes the output weights of the SLFN from the training data.
2. The time series data classification method based on multi-level shapelets according to claim 1, characterized in that in step 1) the concrete steps are as follows:
1.1) normalized piecewise approximation of the data: the initial time series data are transformed into a data set with mean 0 and variance 1 by the zero-mean (z-score) standardization method;
1.2) character representation of the processed data: the mean of each segment is mapped into a Gaussian distribution table whose range represents the expression range of the dimensionality-reduced time series, and the symbolization operation is performed according to the initialized parameters: the number of segments w, the alphabet size r, and the breakpoints β, completing the symbolic aggregate approximation.
3. The time series data classification method based on multi-level shapelets according to claim 1, characterized in that in step 2) the concrete steps are as follows:
First, the size of the sliding window is set, fixing the length and range of each extracted subsequence; second, the window is slid one position to the right at a time, changing its location in the time series and extracting the subsequences at different positions; finally, the window size is adjusted to extract all subsequences of different lengths, and the extracted subsequences are stored in a set.
4. The time series data classification method based on multi-level shapelets according to claim 1, characterized in that in step 3) the concrete steps are as follows:
3.1) initial subsequence clustering based on k-means: after the subsequences of all time series are extracted, the candidate subsequences are clustered. The DTW distance is introduced as the measurement index to filter and screen the subsequence set: the DTW distance expresses the similarity of subsequence shapes, and all alternative shapelet candidates are partitioned with the DTW-based algorithm so that the candidates within the same cluster have similar shape characteristics;
The shapelet similarity is computed from the DTW distance. Given two different shapelet sequences $X = \{x_1, x_2, \ldots, x_M\}$ and $Y = \{y_1, y_2, \ldots, y_N\}$, first compute the distance matrix

$$D_{ij} = (x_i - y_j)^2$$

and then the accumulated distance matrix

$$S_{ij} = D_{ij} + \min(S_{i-1,j},\ S_{i,j-1},\ S_{i-1,j-1})$$
3.2) updating the clustering result: after the subsequence candidate set is clustered by combining the k-means and DTW methods, the clustering result is iterated and updated in real time to ensure that the subsequence clusters satisfy the shape-approximation property, enabling a definite partition in the subsequent shapelet candidate extraction;
3.3) establishing the multi-level shapelet extraction framework.
5. The time series data classification method based on multi-level shapelets according to claim 4, characterized in that in step 3.3) the concrete steps are as follows:
3.3.1) intra-level candidate set merging: first, the hierarchical division of all clustered subsequences is completed according to the 'heaps' (clusters) produced by subsequence clustering; second, the candidates within a level are consolidated through their inherent 'approximation' relation: screening by the approximate shape characteristics, candidates with similar shapes are merged, while candidates with distinctive shape features, which carry discriminative ability and interpretability, are updated and retained. A threshold on the DTW distance is given; two candidates whose distance is below the threshold are very similar in shape, and only a representative of the candidates within the threshold range is retained, reducing the set. Finally, a simplified shapelet candidate set is obtained at each level;
3.3.2) inter-level candidate set merging: in the SH-ELM model, the Levenshtein Distance algorithm is used to merge candidate sets across levels. Let the lengths of two strings a and b be |a| and |b|; the Levenshtein distance is computed by the recurrence

$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0\\ \min\bigl(\mathrm{lev}_{a,b}(i-1,j)+1,\ \mathrm{lev}_{a,b}(i,j-1)+1,\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\bigr), & \text{otherwise}\end{cases}$$

where the indicator $1_{(a_i\neq b_j)}$ takes the value 0 when $a_i=b_j$ and 1 otherwise, and $\mathrm{lev}_{a,b}(i,j)$ is the edit distance between the first i characters of a and the first j characters of b. The similarity of a and b is then

$$\mathrm{Sim}_{a,b}=1-\mathrm{lev}_{a,b}(|a|,|b|)/\max(|a|,|b|)$$
In the merging process, candidate sets in adjacent levels of the framework are joined and compared: candidates between levels are screened with the Levenshtein Distance method by means of this character-based approximate distance calculation;
3.3.3) multi-level top-k candidate set confirmation: the information gain is taken as the criterion for measuring classification ability, and the k shapelets with the largest information gain within a single level are selected, the top-k shapelets completing the extraction task; the extracted candidate set is finally confirmed and used to complete the classification of the candidate time series. This is the process of extracting the k best shapelets from the data set: initially the k-shapelet set is empty; then a candidate shapelet sequence is obtained in each level, and its distance to the series in that level must be computed; once the distance values are obtained, the corresponding information gain is computed, the candidates are sorted by information gain, the replacement of the best shapelet candidates is completed, and the best k shapelets are finally output.
CN202010696976.2A 2020-07-20 2020-07-20 Time series data classification method based on multi-level shapelet Pending CN111814897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010696976.2A CN111814897A (en) 2020-07-20 2020-07-20 Time series data classification method based on multi-level shapelet


Publications (1)

Publication Number Publication Date
CN111814897A true CN111814897A (en) 2020-10-23

Family

ID=72864946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010696976.2A Pending CN111814897A (en) 2020-07-20 Time series data classification method based on multi-level shapelet

Country Status (1)

Country Link
CN (1) CN111814897A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975932A (en) * 2016-05-04 2016-09-28 广东工业大学 Gait recognition and classification method based on time sequence shapelet
CN110389975A (en) * 2019-08-01 2019-10-29 中南大学 Time series early stage classification method and equipment based on shapelet

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CE0B74704937: "Levenshtein distance (edit distance)", Jianshu
QIUYAN YAN et al.: "Adapting ELM to Time Series Classification: A Novel Diversified Top-k Shapelets Extraction Method", ADC 2016: Databases Theory and Applications, vol. 9877, pages 1-13
余思琴; 闫秋艳; 闫欣鸣: "Time series clustering algorithm based on best u-shapelets", Journal of Computer Applications, no. 08
孙其法; 闫秋艳; 闫欣鸣: "Time series classification method based on diversified top-k shapelets transformation", Journal of Computer Applications, no. 02
文必龙; 李菲; 马强: "Research on K-means clustering algorithm for linear text", Computer Technology and Development, no. 09
闫欣鸣; 孟凡荣; 闫秋艳: "Shapelet classification method based on trend feature representation", Journal of Computer Applications, no. 08

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416661B (en) * 2020-11-18 2022-02-01 清华大学 Multi-index time sequence anomaly detection method and device based on compressed sensing
CN112416661A (en) * 2020-11-18 2021-02-26 清华大学 Multi-index time sequence anomaly detection method and device based on compressed sensing
CN113239990A (en) * 2021-04-27 2021-08-10 中国银联股份有限公司 Method and device for performing feature processing on sequence data and storage medium
CN113409054A (en) * 2021-06-18 2021-09-17 浙大城市学院 Suspicious transaction identification model construction method
CN113435381A (en) * 2021-07-07 2021-09-24 贵州东方世纪科技股份有限公司 Method for identifying flood in field aiming at different watersheds
CN113608146A (en) * 2021-08-06 2021-11-05 云南电网有限责任公司昆明供电局 Fault line selection method suitable for forest fire high-resistance grounding condition
CN113608146B (en) * 2021-08-06 2023-12-19 云南电网有限责任公司昆明供电局 Fault line selection method suitable for forest fire under high-resistance grounding condition
CN113759278B (en) * 2021-08-10 2023-12-19 云南电网有限责任公司昆明供电局 Ground fault line selection method suitable for small-current grounding system
CN113759278A (en) * 2021-08-10 2021-12-07 云南电网有限责任公司昆明供电局 Ground fault line selection method suitable for small current grounding system
CN113836240B (en) * 2021-09-07 2024-02-20 招商银行股份有限公司 Time sequence data classification method, device, terminal equipment and storage medium
CN113836240A (en) * 2021-09-07 2021-12-24 招商银行股份有限公司 Time sequence data classification method and device, terminal equipment and storage medium
CN113986674A (en) * 2021-10-28 2022-01-28 建信金融科技有限责任公司 Method and device for detecting abnormity of time sequence data and electronic equipment
CN114726589A (en) * 2022-03-17 2022-07-08 南京科技职业学院 Alarm data fusion method
CN114372538A (en) * 2022-03-22 2022-04-19 中国海洋大学 Method for convolution classification of scale vortex time series in towed sensor array
CN115357716B (en) * 2022-08-30 2023-07-04 中南民族大学 Learning time sequence data classification method integrating word bag model and graph embedding
CN115357716A (en) * 2022-08-30 2022-11-18 中南民族大学 Time sequence data representation learning method integrating bag-of-words model and graph embedding
CN117493857A (en) * 2023-11-15 2024-02-02 国网四川省电力公司眉山供电公司 Electric energy metering abnormality judging method, system, equipment and medium
CN117407733A (en) * 2023-12-12 2024-01-16 南昌科晨电力试验研究有限公司 Flow anomaly detection method and system based on countermeasure generation shapelet
CN117407733B (en) * 2023-12-12 2024-04-02 南昌科晨电力试验研究有限公司 Flow anomaly detection method and system based on countermeasure generation shapelet


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201023