CN113705715A - Time sequence classification method based on LSTM and multi-scale FCN - Google Patents

Time sequence classification method based on LSTM and multi-scale FCN

Info

Publication number
CN113705715A
CN113705715A (application CN202111034788.4A; granted as CN113705715B)
Authority
CN
China
Prior art keywords
scale
lstm
fcn
time
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111034788.4A
Other languages
Chinese (zh)
Other versions
CN113705715B (en)
Inventor
陈志奎 (Chen Zhikui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Juzhi Information Technology Co ltd
Original Assignee
Dalian Juzhi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Juzhi Information Technology Co ltd filed Critical Dalian Juzhi Information Technology Co ltd
Priority to CN202111034788.4A
Publication of CN113705715A
Application granted
Publication of CN113705715B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a time series classification method based on LSTM and a multi-scale FCN, belonging to the field of time series classification. The method first sets the general structure of the multi-modal network; extracts time-dependence features using a long short-term memory network; fully mines the multi-granularity geometric spatial features of the time series curve using a fully convolutional module; integrates the spatio-temporal features and classifies samples according to them; and fully trains the model using the back-propagation algorithm. The method can comprehensively explore spatial features with large-size, multi-scale receptive fields, can adaptively learn long- and short-term time dependence, and learns more comprehensive discriminative information than existing models. Because the distinguishing features of the time series are grasped comprehensively, more accurate judgments can be given.

Description

Time sequence classification method based on LSTM and multi-scale FCN
Technical Field
The invention relates to the field of time series classification, in particular to a time series classification method based on LSTM and multi-scale FCN.
Background
In the big-data age, various types of structured and unstructured data are ubiquitous. Like images and text, time series data is a very common form: numerical data obtained by successively sampling one or more physical quantities at equal time intervals. Time Series Classification (TSC) is therefore a ubiquitous and significant topic. For example, in the industrial field, the data acquired by pressure and vibration sensors mounted on mechanical equipment are time series, from which one can judge whether the current part or the complete machine has a fault and what kind of fault has occurred, and thus give maintenance suggestions. In the medical field, waveform data such as electrocardiograms are also time series, and classifying them with artificial-intelligence methods can improve the efficiency of medical workers. In the financial field, analyzing the historical trend data of products such as securities and stocks by prediction, classification and similar means can assist investors in making decisions. In addition, other non-sequential data can be converted into a time series representation and the problem then solved by classification; for example, researchers have extracted the edge curves of pictures of plant leaves and animal bones and classified the curves to determine the species to which they belong.
Current methods applied to time series classification tasks can be categorized into the following categories:
Distance-measurement-based methods: classification methods based on distance measurement rely on the distances between the samples to be classified as the information on which the classification task is based. Current research on distance-based TSC methods mainly focuses on optimizing the distance measure and innovating in how distance information is used; the classifiers employed are conventional, and there is little innovation on the classifier side because it is not the key to improving performance. The distance between samples is generally computed with an elastic metric, most typically dynamic time warping (DTW), and many schemes alleviate the ill-conditioned alignment problem of DTW by imposing strict constraints on the warping path, such as weighting, window limiting, improved step-size increase patterns, and limiting the time-step difference between alignment points. These measures make up for the defects of DTW to a certain extent, but at present no method can perfectly solve the fundamental problem of measuring the distance between unaligned time series samples, so distance-based TSC methods are simple to operate but perform relatively poorly on the task.
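As an illustration of the elastic metrics discussed above, the following is a minimal sketch of DTW with an optional Sakoe-Chiba window constraint on the warping path (the function name and the window parameterization are our own illustrative choices, not taken from the patent):

```python
def dtw_distance(a, b, window=None):
    """Dynamic time warping distance between two numeric sequences.

    `window` is an optional Sakoe-Chiba band half-width, one of the warping-path
    restrictions mentioned in the text; None means an unconstrained path.
    """
    n, m = len(a), len(b)
    w = max(window, abs(n - m)) if window is not None else max(n, m)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]   # cumulative cost matrix
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # best of match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] ** 0.5
```

For the shifted pair `[1, 2, 3, 3]` and `[1, 1, 2, 3]`, a pointwise (Euclidean) comparison reports a nonzero difference, while DTW warps the time axis and finds a distance of 0, which is exactly why elastic metrics tolerate misalignment.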
Feature-representation-based methods: these convert the original sequence into a feature space in which differences are easier to detect. In that space the representation of a time series can be discrete and lower-dimensional, which overcomes the defects that time series data is too high-dimensional and samples are not aligned in time, so that more conventional classifiers can be used to solve the time series classification problem. A great number of feature representations have been constructed for time series classification, the most common being Shapelet transformation and symbolic representation; variation-trend features, spike-signal extraction, domain transformation, segmented statistical features and others have also been proposed. However, this approach has some limitations: the extraction of some features is complex, tedious, and time-consuming; some detail information of the original data is inevitably lost in the conversion; and the result is also greatly affected by whether the manually designed feature-selection scheme is sound.
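As a concrete instance of the segmented statistical features mentioned above, Piecewise Aggregate Approximation (PAA), which is also the first step of common symbolic representations such as SAX, replaces each roughly equal-length segment of the series with its mean, yielding a discrete, lower-dimensional representation. A minimal sketch (the function name is ours, not from the patent):

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: represent the series by the mean of
    each of n_segments (roughly) equal-length segments."""
    n = len(series)
    out = []
    for i in range(n_segments):
        start = i * n // n_segments        # segment boundaries by integer division
        end = (i + 1) * n // n_segments
        seg = series[start:end]
        out.append(sum(seg) / len(seg))
    return out
```

A 100-step series reduced to 8 segment means can then be fed to an ordinary classifier, trading some detail for alignment tolerance and low dimensionality, which is precisely the trade-off the paragraph above describes.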
Ensemble-learning-based methods: ensemble learning combines multiple weak base classifiers into a stronger classification model, which can reduce variance or improve task performance. After the accuracy of TSC methods reached a certain level, ensembling them became a classical way to improve performance further. Methods based on ensemble learning include COTE and HIVE-COTE, built on domain-transformation feature representations; NNE, which ensembles neural networks; and PROP, which ensembles different elastic distance measures. TSC models based on ensemble learning possess relatively high accuracy, but their accuracy is bounded by the base models, and ensembles typically require training tens of base models or extracting a variety of feature information, so ensemble-based approaches are typically bulky and relatively complex.
Deep-learning-based methods: deep models commonly used in the time series classification field include the multilayer perceptron (MLP), convolutional neural network (CNN), recurrent neural network (RNN), autoencoder, and others. In addition, many methods combine these basic models into multi-modal neural networks, such as MC-DCNN, MCNN, and LSTM-FCN. These methods have a certain fault tolerance to the input data, so they are less affected by sequence misalignment and can still reach high accuracy when the raw time series is taken directly as input; they learn the features in the data automatically, avoiding the blindness of manual extraction; and they can learn directly from the raw data without losing detail, while also accepting manually extracted important features to strengthen the learning of key information. Compared with non-deep-learning ensemble models, a single deep model achieves comparable classification accuracy with a simpler, lighter structure, and multi-modal frameworks can obtain the current best results in the TSC field.
Time series classification can solve many practical problems in a variety of fields, but time series data has some unfavorable properties:
(1) The attribute dimension is too high (i.e., the time steps are too long). Long sequences make the global differences between sequences less obvious and increase the difficulty of learning. In addition, the high dimensionality makes some methods so computationally complex as to be impractical.
(2) The samples are not aligned in time. Time series data is sampled at equal intervals in a real environment, and different degrees of delay may occur during sampling. The resulting sequences are then not perfectly aligned over time steps and are often misaligned. This makes it impossible to directly use the differences exhibited by different samples at the same position to distinguish their types, and also makes it difficult to compute the global similarity between samples.
Both of the above drawbacks make it difficult to obtain information that is beneficial to the classification task directly from the original time series.
Disclosure of Invention
The fully convolutional network (FCN) is one of the most powerful tools in the TSC field and has a strong feature-learning capability. Adding an LSTM module to supplement time-dependent features, or organizing several FCNs with different structures to fully mine multi-scale spatial features, can each improve task performance, but at present no method combines the advantages of the two to give a more comprehensive description of the time series. In view of this, the present invention proposes a time series classification method based on LSTM and multi-scale FCN (LSTM Multi-Scale FCN, abbreviated LSTM-MFCN). The model consists of an FCN module that performs convolution at multiple scales and an LSTM module; it can perceive shape features of the time series curve at various scales while retaining the gain brought by the temporal features of the sequence. This multi-modal framework can fully mine the multi-scale characteristics contained in a high-dimensional time series from two angles, and thus give more accurate judgments.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
a time sequence classification method based on LSTM and multi-scale FCN is composed of a multi-scale FCN module and an LSTM module which perform multi-scale convolution, and specifically comprises the following steps:
(1) Extracting time-dependence features using a long short-term memory network: obtain the dependence between the current value of the observed variable and its historical data, describing the temporal correlation inside the sequence. The associations present in instances of different classes also differ to some extent, so temporal dependence is itself an important distinguishing feature for the classification task.
(2) Fully mining the multi-granularity spatial features of the time series curve using a fully convolutional module: split the first two layers of convolution kernels in the classic FCN structure into multiple groups, the number of scales being denoted M; the larger-scale part is realized by dilated convolution and the smaller-scale part by ordinary convolution. The multi-granularity abstract features extracted at each layer of the MFCN module are merged and then passed on together to the next layer. In the last layer of the MFCN module, feature extraction uses convolution at a single scale, and global pooling integrates the features across the multiple convolution kernels as the final output features.
Time series data typically contains shape features of various sizes. Large-scale features reflect the trend of the sequence over a long range, while small-scale features indicate subtle changes in local regions; an excellent TSC model should be able to capture features at different scales.
In general, the range of data a convolutional layer can perceive can be enlarged by pooling the data or by increasing the kernel size, but these approaches either lose information or increase the number of parameters to be trained. Dilated (atrous, "cavity") convolution, by contrast, can enlarge the receptive field without compressing information and without adding parameters. The invention realizes multi-scale receptive fields by using fixed-size convolution kernels with different dilation rates, so that multiple, larger scales can be constructed at the same parameter scale.
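The trade-off can be checked numerically. Below is a minimal pure-Python sketch of a 1-D dilated ("cavity") convolution; the names and the "valid" boundary handling are our illustrative choices, not taken from the patent. A kernel of size k with dilation rate d spaces its taps d steps apart, covering (k - 1) * d + 1 input points with only k weights:

```python
def receptive_field(k, d):
    """Input span covered by one convolution kernel of size k with dilation d."""
    return (k - 1) * d + 1

def dilated_conv1d(x, kernel, d=1):
    """1-D 'valid' convolution with dilation rate d (d=1 is ordinary convolution)."""
    k = len(kernel)
    span = receptive_field(k, d)
    # each output taps k inputs spaced d apart, starting at position i
    return [sum(kernel[j] * x[i + j * d] for j in range(k))
            for i in range(len(x) - span + 1)]
```

With the size-8 kernels used later in the document, an ordinary kernel sees 8 input points, while the same 8 weights at d = 4 see receptive_field(8, 4) = 29 points, i.e. a much wider view at identical parameter cost.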
Specifically, the first two layers of convolution kernels in the classic FCN structure are split into multiple groups, with the number of scales denoted M; the larger-scale part is realized by dilated convolution and the smaller-scale part by ordinary convolution. Considering that the feature extraction performed at each level of a deep convolutional neural network may depend on the previous level's features at multiple scales, in the MFCN (Multi-scale FCN) module of the model the multi-granularity abstract features extracted at each level are gathered and then passed on together to the next level. At the last level of the MFCN module, since the features are already sufficiently abstract, a single-scale convolution is used for feature extraction and global pooling integrates the features across the convolution kernels as the final output features.
(3) And integrating and distinguishing the space-time characteristics, splicing and integrating the time and space characteristics, and using a fully-connected neural network to adaptively learn the mapping relation between the space-time characteristics and the sample characteristics to obtain an LSTM-MFCN model. The above two parts respectively give geometric spatial and temporal dependency characteristics of the time series data, which are information obtained by learning the time series from different angles. The invention integrates the time and space characteristics and uses the fully-connected neural network to adaptively learn the mapping relation between the time-space characteristics and the sample characteristics.
Preferably, the convolution kernels of the three layers of the multi-scale FCN module have sizes 8, 5 and 3 and total counts 128, 256 and 128 respectively; the dilation rate d does not exceed 4, and the dilation rates across layers form a pyramid structure.
Preferably, the proportion of convolution kernels assigned to each scale of receptive field in the multi-scale FCN module is variable, with a hyper-parameter

$w = (w_1, w_2, \ldots, w_M)$

set as the adjustable proportion. The number $NF_{L,i}$ of convolution kernels of the $i$-th scale in the $L$-th layer is calculated by the formula

$$NF_{L,i} = \frac{w_i}{\sum_{j=1}^{M} w_j} \, NF_L$$

where $NF_L$ is the total number of convolution kernels in the $L$-th layer.
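Under the proportional-allocation reading of the rule above, the per-scale kernel counts can be computed as follows. This is a sketch under our own assumptions: integer division, with any remainder assigned to the first scale, since the patent does not state a rounding rule (for the ratios and kernel counts it actually uses, the division is exact):

```python
def kernels_per_scale(nf_total, weights):
    """Split a layer's nf_total convolution kernels across the M scales in the
    proportion w_1 : ... : w_M (floor division; remainder goes to scale 1)."""
    s = sum(weights)
    counts = [nf_total * w // s for w in weights]
    counts[0] += nf_total - sum(counts)   # keep the layer's total unchanged
    return counts
```

For the 128-kernel first layer, a 3:1 dual-scale split gives 96 and 32 kernels, and a 2:1:1 triple-scale split gives 64, 32 and 32.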
Preferably, the LSTM module performs a dimension transposition on the univariate time series data, transforming it into serial input of one value per time step, and then adjusts the number of LSTM neurons according to the complexity of the temporal features in the specific data and the training capacity of the model.
Preferably, the number of LSTM neurons is 8, 64 or 128.
Preferably, the LSTM module also performs pruning operations.
Preferably, the Dropout rate of the pruning operation is set to 0.8.
Preferably, the splicing and integrating of the temporal and spatial features in the step (3) specifically refers to splicing and integrating the temporal and spatial features by using a layer of fully-connected structure in cooperation with a SoftMax activation function.
Preferably, step (3) further comprises training the LSTM-MFCN model using an error back-propagation algorithm, and keeping the model with the minimum error.
The invention has the beneficial effects that: aiming at the problem that difference information is difficult to obtain from an original time sequence directly, the invention designs a time sequence classification method based on LSTM and multi-scale FCN. The method can comprehensively explore spatial features with large-scale and multi-scale receptive fields, can adaptively learn long-term and short-term time dependence, and has more comprehensive learned beneficial information than the existing model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of the LSTM-MFCN structure of the present invention;
FIG. 2 is a flow chart of a time series classification method based on LSTM and multi-scale FCN according to the present invention;
FIG. 3 is a graph of model critical differences involved in experimental comparisons;
FIG. 4 is a graph of accuracy versus a model or structure of an FCN correlation series;
FIGS. 5a and 5b show the results of the LSTM-MFCN experiment and their comparison with the baseline model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.
As shown in FIGS. 1-5b, the invention provides a time series classification method based on LSTM and multi-scale FCN, taking dual-scale and triple-scale convolution structures as specific embodiments. It consists of a multi-scale FCN module, which performs multi-scale convolution and fully extracts the multi-granularity geometric spatial features of the time series curve, and an LSTM module, which learns how the sequence values change over time; the feature vectors output by the two modules are spliced, learned by a layer of neurons, and converted into the classification result. The method specifically comprises the following steps:
(1) extracting time dependence characteristics by using a long-term and short-term memory network, obtaining the dependence relation existing between the current value of the observed variable and historical data, and describing the time sequence correlation inside the sequence;
(2) fully mining spatial features of multiple granularities of a time sequence curve by using a full convolution module, splitting former two layers of convolution kernels in a classic FCN structure into multiple groups, wherein the adopted multi-scale quantity is expressed by M, the larger scale part is realized by cavity convolution, the smaller scale part is realized by common convolution, converging the multi-granularity abstract features extracted from each layer in an MFCN module and then uniformly transmitting the converged multi-granularity abstract features to the next layer, and performing feature extraction by using convolution of a single scale and integrating the features in the multiple convolution kernels by using global pooling as finally output features in the last layer of the MFCN module;
(3) and integrating and distinguishing the space-time characteristics, splicing and integrating the time and space characteristics, and using a fully-connected neural network to adaptively learn the mapping relation between the space-time characteristics and the sample characteristics to obtain an LSTM-MFCN model.
The process according to the invention is described in detail below:
A deep neural network must be structurally configured before training. For neural networks, and especially multi-modal networks, the usual practice is to first fix the general structure of the model, keep those structural hyper-parameters unchanged for all data sets, then perform a restricted grid search over the remaining hyper-parameters for each data set, and evaluate according to overall performance across all data sets. In previous studies, the network structure of the FCN and LSTM-FCN models is close to optimal; the present invention builds on this basic structure, and the constant hyper-parameters are detailed in Table 1.
TABLE 1 structural hyper-parameters of the classical multimodal neural network LSTM-FCN
[Table 1 appears as an image in the original publication and is not reproduced here.]
The multi-scale FCN convolution module of the invention uses kernel sizes 8, 5 and 3 in its three layers, with 128, 256 and 128 kernels in total respectively. Multi-scale is realized by adjusting the dilation rate: the larger-scale part uses dilated convolution and the smaller-scale part ordinary convolution. This enables more diverse features to be extracted at limited depth and equivalent parameter scale.
Too high a dilation rate makes the span between adjacent perceived data points too large, so that the convolution operation cannot extract valid features; the dilation rate d therefore does not exceed 4 in this method. In addition, CNN-family models mostly keep the per-layer receptive fields in a pyramid structure, extracting smaller and smaller features layer by layer. The choice of dilation rate per layer follows this principle, e.g. the three-layer combination 4, 2, 1.
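To see what the 4, 2, 1 pyramid buys, the receptive field of a stack of stride-1 convolution layers can be computed as 1 + Σ (k_l − 1) · d_l. A small sketch (the helper name is ours):

```python
def stacked_receptive_field(layers):
    """Receptive field, in input points, of stride-1 convolution layers applied
    in sequence; `layers` is a list of (kernel_size, dilation) pairs."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d   # each layer widens the view by (k-1)*d points
    return rf
```

With the document's kernel sizes 8, 5, 3: the all-ordinary stack (d = 1 throughout) sees 14 input points, while the large-scale branch with dilations 4, 2, 1 sees 39; the wider view comes at no extra parameter cost.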
Considering that the proportions of different-scale features displayed by each time series data set are not necessarily the same, the proportion of receptive fields of the various scales is also kept variable, with the hyper-parameter $w = (w_1, w_2, \ldots, w_M)$ as its adjustable ratio. The number $NF_{L,i}$ of convolution kernels of the $i$-th scale in the $L$-th layer can be calculated according to Equation 1:

$$NF_{L,i} = \frac{w_i}{\sum_{j=1}^{M} w_j} \, NF_L \qquad (1)$$

where $NF_L$ is the total number of convolution kernels in the $L$-th layer.
On the basis of the above structural arrangement, two specific multi-scale structures are taken as embodiments of the proposed method:
1) When M = 2, the model can be denoted LSTM-DFCN (Dual-scale FCN). In the multi-scale FCN module of LSTM-DFCN, each layer has receptive fields at two scales. The large-receptive-field kernels of the first layer use a dilation rate of 4 or 2, those of the second layer use a dilation rate of 2, and the small receptive fields of the first two layers use conventional convolution. The kernel ratio w1 : w2 of the two scales is one of 2:2, 3:1 and 1:3, but ultimately only one of the cases with first-layer d = 2 or d = 4 is kept as the representative of the method at dual scale.
2) The structure when M = 3 can be denoted LSTM-TFCN (Triple-scale FCN). The large-scale receptive fields of the first two layers use dilation rate 4, the medium receptive fields use dilation rate 2, and the small receptive fields again use conventional convolution, i.e. d = 1. The ratio w1 : w2 : w3 takes one of 2:1:1, 1:2:1 and 1:1:2.
The long short-term memory network (LSTM) can adaptively learn the dependency between each current value of the observed variable of the time series and its historical data.
Since the LSTM is not a point-to-point model but depends on a history state, the univariate time series data is first dimension-transposed into serial input, one value per step. The number of LSTM neurons can then be adjusted among 8, 64 and 128 depending on the complexity of the temporal features in the particular data and the training capacity of the model. To reduce the complexity of this part, pruning is used to compress it, discarding the less-contributing parts while enhancing the robustness of the extracted features; the Dropout rate of the pruning operation is set to 0.8.
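The two preparation steps just described can be sketched as follows. The function names are ours, and the inverted-dropout rescaling of survivors is a common convention we assume, not something the patent states:

```python
import random

def to_serial_input(series):
    """Dimension-transpose a univariate series of length T into T timesteps of
    one value each, the serial form the LSTM consumes step by step."""
    return [[v] for v in series]

def dropout(features, rate=0.8, rng=None):
    """Training-time dropout on the LSTM output: each feature is zeroed with
    probability `rate`; survivors are rescaled (inverted dropout)."""
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    return [f / keep if rng.random() < keep else 0.0 for f in features]
```

With rate = 0.8, only about a fifth of the extracted temporal features survive each training pass, which is the aggressive pruning the text describes.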
Integrating the temporal and spatial features and making classification predictions with a fully connected layer: the two features extracted by the LSTM and the MFCN are not direct classification results and may differ in data scale, so the two kinds of information must be adaptively integrated and converted to obtain a predicted category. A traditional fully connected neural network can fit complex nonlinear functional relationships; here a single fully connected layer together with a SoftMax activation function serves as the final output layer, so the number of neurons in this layer matches the number of classes of the samples to be classified. During training, the probability values it outputs are compared directly with the one-hot class labels to compute the model error; during testing, the class with the maximum probability is output as the prediction.
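The output layer's two roles, producing a probability distribution over classes at test time and an error signal against the one-hot label at training time, can be sketched in a few lines (the function names are ours):

```python
import math

def softmax(logits):
    """Convert the output layer's raw scores into class probabilities."""
    m = max(logits)                        # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, one_hot):
    """Error between predicted probabilities and a one-hot class label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t)

def predict(probs):
    """At test time, output the class with the maximum probability."""
    return max(range(len(probs)), key=probs.__getitem__)
```

Because the labels are one-hot, the cross-entropy reduces to the negative log-probability assigned to the true class, so driving the error down directly sharpens the predicted distribution.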
Based on the above steps, a complete LSTM-MFCN model is obtained; the flow of a time series through it is shown in Algorithm 1, and Table 2 gives the specific structural parameters of the two proposed embodiments.
In the LSTM-MFCN the time series data is first processed in two parts. In the first two layers of the MFCN module (lines 2-8 of Algorithm 1), the data undergoes convolution under the configured multi-scale structure, and the results pass through batch normalization and ReLU activation to give multi-granularity features (lines 4-5), which are merged as the input of the next layer (line 7). At the last layer of the MFCN module, feature extraction uses single-scale convolution, and global pooling integrates and splices the features across the convolution kernels (lines 9-10). The LSTM network learns the data and outputs time-dependent features (lines 11-13). Finally, the fully connected layer integrates the spatial and temporal features output by the two parts and converts them into the classification result through the SoftMax function (lines 14-15).
Algorithm 1: Data flow in the LSTM-MFCN
[Algorithm 1 is presented as an image in the source and is not reproduced here.]
TABLE 2 structural parameters of LSTM-DFCN and LSTM-TFCN
[Table 2 is presented as an image in the source and is not reproduced here.]
The model is then trained with the error back-propagation algorithm, and the model with the smallest error is retained. Algorithm 2 gives the model-building procedure.
Some structural parameters of the LSTM-MFCN are constant and are given in the first line of Algorithm 2. The parameters to be set include the multi-scale number M, the hole rates d1, d2, …, dM of the scales, and the set W of all possible multi-scale ratio combinations. In the implementation, a candidate multi-scale ratio w is first selected from W and the undetermined hyper-parameters of the MFCN module under that ratio are computed (lines 3-7 of Algorithm 2); then the number of LSTM neurons is searched restrictively, each selected number N determining a specific model structure. Under each such structure, the LSTM-MFCN is trained on the training set D and evaluated on the test set T (lines 8-14 of Algorithm 2); the best result found during this structure exploration is retained as the model's performance at the given number of scales M.
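A minimal sketch of this restricted search; the training-and-scoring call is a stand-in (no data set is attached here), and the candidate ratios and neuron counts merely echo the shape of Algorithm 2:

```python
import itertools
import random

# Hypothetical search space in the spirit of Algorithm 2: candidate
# multi-scale ratio combinations W and a restricted list of LSTM sizes.
ratio_combos = [(1, 1), (2, 1), (3, 1)]   # candidate w1:w2 ratios (illustrative)
neuron_counts = [8, 64, 128]

def train_and_score(ratio, n_neurons, seed=0):
    """Stand-in for 'train LSTM-MFCN on D, predict T'; returns a mock
    accuracy so the search loop is runnable without real data."""
    random.seed((hash((ratio, n_neurons)) ^ seed) & 0xFFFF)
    return random.random()

# Exhaustive loop over structures; keep the best-scoring one.
best = max(
    ((train_and_score(w, n), w, n)
     for w, n in itertools.product(ratio_combos, neuron_counts)),
    key=lambda r: r[0],
)
score, ratio, n = best
print(ratio, n, round(score, 3))
```

In the actual method the inner call would be a full training run with back-propagation, and the retained model is the one with the least error.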
Algorithm 2 establishing process of time series classification method based on LSTM and multi-scale FCN
[Algorithm 2 is presented as an image in the source and is not reproduced here.]
In conjunction with the protocol of the present invention, the experimental analysis was performed as follows:
To verify the validity of the model, the two specific structures of the proposed LSTM-MFCN were tested on the UCR time series classification benchmark data sets. Since many deep learning TSC models have been compared on the data sets used in the baseline experiments in the literature, this experiment likewise selects a total of 44 data sets from that group.
(1) Verification experiment
For the representative structures with M = 2 and M = 3 (for M = 2, only one of the hole-rate cases is kept as the result), there are 3 possible ratios of the numbers of receptive fields at different scales, and under each ratio the LSTM network can be given different numbers of neurons for restrictive tuning; the specific flow is shown in Algorithm 2. To ensure fairness of comparison, an equal number of 6 repeated experiments were performed on LSTM-FCN (for each LSTM-MFCN structure: 3 multi-scale ratios × 2 repetitions = 6 in total). Note that LSTM-FCN and the proposed LSTM-DFCN and LSTM-TFCN all evaluate the test set with the model having the least training loss, and the best result on each data set is taken as that model's performance in the experiment. The other hyper-parameter settings of the experiment are given in Table 3.
TABLE 3 selection of hyper-parameters for LSTM-MFCN validation experiments
[Table 3 is presented as an image in the source and is not reproduced here.]
Whenever the validation result does not improve over 100 training epochs, the learning rate is reduced by a fixed factor (given as an image in the source) until the final learning rate is reached.
Since the LSTM-MFCN provided by the invention is a multi-modal neural network, the baseline models are mainly deep-learning-based TSC methods, together with some representative, well-performing non-deep-learning methods for comparison. FCN and ResNet represent basic single-modal deep learning methods; MCNN, MFCN, LSTM-FCN and GRU-FCN represent classical multi-modal neural network TSC models, where the input of MCNN contains representations of the time series at various down-sampling rates, which is equivalent to multi-scale spatial feature learning. MFCN is a multi-scale convolutional FCN model whose structure differs from the algorithm in this chapter and which is not enhanced by an LSTM module. LSTM-FCN is one of the sources of inspiration for the proposed model, and GRU-FCN is an optimization attempt based on LSTM-FCN. LWDTW, BOSS, COTE, PROP and Hive-COTE represent single-modal or ensemble non-deep-learning TSC models based on distance metrics, shapelets, symbolization, frequency-domain information, and so on.
Common evaluation indexes in the TSC field include the overall classification accuracy or error rate, the number of times the best result is obtained, the average rank of the results, and the mean per-class error rate MPEC (mean per-class error). MPEC estimates a model's classification error rate for a single class and is defined by equations 2 and 3.
MPEC = (1/N) · Σ_{i=1}^{N} PEC_i (2)
PEC_i = e_i / c_i (3)
where N denotes the number of data sets, i indexes the i-th data set, e_i represents the error rate on the i-th data set, c_i the number of classes of the i-th data set, and PEC_i that error rate spread evenly over each class.
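As a runnable illustration of the MPEC definition (PEC_i = e_i/c_i averaged over data sets; the error rates and class counts below are made up):

```python
import numpy as np

# MPEC per equations 2-3: PEC_i = e_i / c_i, MPEC = mean of the PEC_i.
e = np.array([0.10, 0.30, 0.05])   # per-data-set error rates (illustrative)
c = np.array([2, 5, 10])           # class counts of the three data sets
pec = e / c                        # error rate spread evenly over each class
mpec = pec.mean()
print(round(mpec, 4))
```

A data set with many classes thus contributes a smaller per-class error for the same overall error rate, which is the point of the index.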
In addition to scoring each model with the four indexes above, a Friedman rank-sum test based on algorithm ordering can be used to evaluate, as a whole, whether multiple models differ in performance across multiple data sets. The rank of a model on a data set is its position among all compared models when ordered by classification accuracy, so the rank sum can be computed from the average ranks and the total number of data sets. Assuming k algorithms are compared on N data sets, the statistic F_F constructed in the Friedman test is calculated as follows:
τ_{χ²} = (12N / (k(k+1))) · ( Σ_{j=1}^{k} R_j² − k(k+1)²/4 ) (4)
F_F = ((N − 1) · τ_{χ²}) / (N(k − 1) − τ_{χ²}) (5)
where R_j is the average rank of the j-th algorithm over all data sets.
The null hypothesis H0 of the Friedman test is that the samples come from populations with no significant differences, i.e., that there is no difference among the compared models. When F_F exceeds the critical value, H0 is rejected, indicating significant performance differences among the algorithms. The differences should then be analyzed further, using the Nemenyi post-hoc test in place of pairwise comparisons between the algorithms. The critical difference CD (critical distance) of the average-rank gap, the key parameter of the Nemenyi test, is calculated by equation 6, where q_α is the test coefficient obtained from a look-up table. Two algorithms are considered significantly different when their rank gap exceeds CD. The performance differences of all compared models are usually presented in the form of a critical difference diagram.
CD = q_α · √( k(k+1) / (6N) ) (6)
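The Friedman and Nemenyi quantities are easy to check numerically; the average ranks below are illustrative, and q_α = 2.569 is the standard table entry for k = 4 models at α = 0.05:

```python
import math

def friedman_ff(avg_ranks, n_datasets):
    """F_F statistic from the average ranks R_j of k algorithms on N data sets."""
    k, N = len(avg_ranks), n_datasets
    chi2 = 12.0 * N / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical distance of the Nemenyi post-hoc test; q_alpha from a look-up table."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

ranks = [1.8, 2.4, 3.1, 2.7]        # illustrative average ranks of k = 4 models
ff = friedman_ff(ranks, n_datasets=44)
cd = nemenyi_cd(q_alpha=2.569, k=4, n_datasets=44)
print(round(ff, 3), round(cd, 3))
```

Models whose average-rank gap is below `cd` would be covered by the same line in a critical difference diagram.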
Experiments show that the dual-scale structure performs slightly better when the hole rate of its first layer is 4, so that case is taken as the performance of LSTM-DFCN. Figs. 5a and 5b present the full experimental results of the two proposed LSTM-MFCN structures on the UCR data sets. Max denotes the better result of the two but does not participate in the rank comparison. Bold font indicates that the model achieved the highest classification accuracy among all compared models on that data set.
After the rank of each algorithm is calculated, the difference analysis can proceed. At significance level α = 0.05, the Nemenyi post-hoc test yields the critical difference diagram shown in Fig. 3, where CD = 2.591. The algorithms in the figure are aligned by their average rank R_j; groups of algorithms covered by the same line (whose length is the CD value) show no statistically significant difference. LSTM-TFCN shows a significant improvement over LSTM-FCN. Although the difference between LSTM-MFCN (used in this section as the collective name for LSTM-DFCN and LSTM-TFCN) and the stronger GRU-FCN and Hive-COTE is not significant, the two LSTM-MFCN structures perform best on the above evaluation indexes of average accuracy, average rank and MPEC, and obtain the highest accuracy.
Combining the detailed scores of Figs. 5a and 5b with the intuitive ranking and performance differences given in Fig. 3, the following analysis can be made:
Among the baseline models, the distance-based LWDTW is an excellent improved algorithm based on DTW, and PROP integrates 11 classifiers based on different elastic distance measures, yet they achieve the worst performance in the comparison, because distance-based methods struggle to give a correct distance measure when sequences are misaligned. The feature-representation-based and ensemble non-deep-learning models are slightly inferior to the deep learning methods: the classical BOSS method and the COTE ensemble perform close to the single-modal neural networks ResNet and FCN but worse than the multi-modal networks, while Hive-COTE, a newer breakthrough, maintains performance similar to the multi-modal networks MFCN and LSTM-FCN by means of a large and complex structure, but falls short of the latest methods proposed in this chapter.
The LSTM-MFCN is superior to LSTM-FCN and GRU-FCN in average accuracy, rank, expected class error rate and other indexes, demonstrating that the large receptive field realized by hole convolution is effective on time series data, and that extracting multi-granularity spatial features with multi-scale receptive fields outperforms extracting fixed-scale features with a single-scale receptive field. LSTM-FCN and LSTM-MFCN both outperform MFCN and MCNN, showing that how the data change over time is important information for the TSC problem; attending to both temporal features and multi-scale spatial features allows the data to be learned more comprehensively and thus yields better performance. Notably, GRU-FCN exceeds all other models in the number of times it obtains the highest classification accuracy but is slightly inferior in average accuracy, rank and other indexes, indicating that it excels on some data sets while degrading considerably on others; in robustness and range of application it is inferior to the LSTM-MFCN proposed in this chapter.
For the two specific structures of the proposed model, both LSTM-DFCN and LSTM-TFCN improve on the single-scale LSTM-FCN, but their performance differs. This indirectly shows that under multi-scale convolution, receptive fields of different scales each play their own role and learn spatial features of different granularities. LSTM-TFCN, with three scales, covers the diversity of spatial features better, whereas the dual-scale structure, whose large-field convolution kernel has a hole rate of 2 or 4, pays less attention to features of certain lengths. If the best results of LSTM-DFCN at the two hole rates of the first convolutional layer are combined, its average accuracy reaches 0.925, even slightly higher than that of LSTM-TFCN, because LSTM-DFCN then also perceives the three scales d = 1, 2 and 4 and tries 6 hyper-parameter combinations, making its search more detailed. In Figs. 5a and 5b, Max represents the effect achievable with a more detailed structure search under the multi-scale structure. Taking Max as representative, the model proposed in this chapter has an even larger lead from the perspective of the three evaluation indexes.
(2) Ablation experiment
To demonstrate the effect of multi-scale convolution and of the multi-scale receptive fields realized by hole convolution, this section conducts ablation experiments around the LSTM-DFCN model and two similar structures of it implemented with traditional convolution kernels, as follows:
1) keeping the same parameter scale as LSTM-DFCN, realize the multi-scale convolution with traditional convolution kernels;
2) keeping the same receptive field size as LSTM-DFCN, realize the multi-scale convolution with traditional convolution kernels.
The experiment for each structure was repeated twice, and the best result was kept as representative of that structure's performance. The specific structural settings are shown in Table 4, where LSTM-DFCN(1) and LSTM-DFCN(2) denote the two similar comparison structures. Three comparisons can be constructed from the experimental results. The first, between LSTM-DFCN(1) and the single-scale LSTM-FCN, explores whether multi-scale convolution has an advantage over single-scale convolution. The second, between LSTM-DFCN(1) and LSTM-DFCN, shows whether the larger receptive-field combination realized by hole convolution brings a further improvement to the model. The third, between LSTM-DFCN(2) and LSTM-DFCN, verifies whether the large-field combination formed by hole convolution can be replaced by simply enlarging the convolution kernel size.
TABLE 4 structural setup and experimental results relating to models in ablation experiments
[Table 4 is presented as an image in the source and is not reproduced here.]
Table 4 also gives the average accuracy of the above models over the 44 UCR data sets; the overall classification accuracy of both comparison structures is lower than that of the proposed LSTM-DFCN. In the first comparison, at the same parameter scale, the multi-scale LSTM-DFCN(1) achieves higher classification accuracy than LSTM-FCN repeated the same number of times. In fact, a convolution kernel can extract features of its own size or smaller, but not larger ones; therefore, when part of the kernels are shrunk, small-scale detail features can still be extracted, and when the other part is enlarged, the increased receptive field learns larger-granularity features that could not be perceived before. Multi-scale convolution thus amounts to reallocating kernel sizes and proportions more reasonably, yielding better feature extraction. In the second comparison, the overall classification accuracy of the LSTM-DFCN(1) structure implemented with traditional convolution is lower than that of LSTM-DFCN. As Table 4 shows, the multi-scale receptive-field combinations achievable without hole convolution are approximately (10, 6) and (6, 4), while those achievable with holes are (32, 8) and (10, 5). Using hole convolution, LSTM-DFCN achieves multiple scales at a larger receptive-field level with a limited number of parameters. Meanwhile, hole convolution suits time series data with continuous values: a kernel with holes perceives the data in equally spaced jumps and can still roughly learn its characteristics. LSTM-DFCN therefore has the opportunity to learn larger-scale features, which LSTM-DFCN(1) cannot.
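The receptive-field figures quoted above can be reproduced with a small helper. Note that two bookkeeping conventions exist: the exact span of a dilated kernel is d·(k−1)+1, while the (32, 8) and (10, 5) combinations quoted here appear to follow the coarser k·d accounting (8×4 = 32, 8×1 = 8, 5×2 = 10, 5×1 = 5) — this is an inference from the surrounding numbers, not a statement from the source:

```python
# Two common ways to book-keep the receptive field of a 1-D convolution
# kernel of size k with hole (dilation) rate d.
def span_exact(k: int, d: int) -> int:
    return d * (k - 1) + 1   # exact count of input positions the kernel spans

def span_kd(k: int, d: int) -> int:
    return k * d             # coarser k*d accounting

print([span_kd(8, 4), span_kd(8, 1), span_kd(5, 2), span_kd(5, 1)])
print(span_exact(8, 4))
```

Either way, the qualitative conclusion is the same: a hole rate of 4 roughly quadruples the field of an 8-tap kernel without adding parameters.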
Comparing LSTM-DFCN(2) with LSTM-DFCN shows that enlarging the receptive field simply by traditional convolution is less effective than using hole convolution, because the enlargement increases the total number of parameters to be trained. Many data sets in the TSC field have few training samples and limited training capacity, so the training of this structure is constrained, and LSTM-DFCN(2) obtains relatively poor results under its large parameter count.
The results and analysis of the ablation experiments confirm the correctness and effectiveness of the optimization ideas proposed in this chapter. Under the same conditions, extracting features from time series data with multi-scale receptive fields outperforms single-scale convolution, and the large-scale receptive field realized by hole convolution can effectively perceive the diverse characteristics of time series data without compromising the training of the model.
In addition, the ablation results can be compared with all the preceding related studies. Fig. 4 shows more intuitively the differences in classification accuracy among the FCN-based family of models and the gains of the various improvement ideas. FCN is the basic single-modal neural network; MFCN and LSTM-FCN refine it from different angles and improve accuracy; LSTM-DFCN(1) and LSTM-DFCN(2) combine both improvement ideas and achieve a further breakthrough in task performance; LSTM-DFCN and LSTM-TFCN retain both optimizations while introducing hole convolution, realizing a larger receptive field under the same training-capacity constraints. They can therefore perceive multi-scale geometric features at a larger granularity level while also learning temporal features, integrating all the advantages and obtaining the highest classification accuracy.
In light of the foregoing description of the preferred embodiments of the present invention, those skilled in the art can now make various alterations and modifications without departing from the scope of the invention. The technical scope of the present invention is not limited to the contents of the specification, and must be determined according to the scope of the claims.

Claims (9)

1. A time series classification method based on LSTM and multi-scale FCN, characterized by comprising a multi-scale FCN module performing multi-scale convolution and an LSTM module, and specifically comprising the following steps:
(1) extracting time-dependent features with a long short-term memory network, obtaining the dependence between the current value of the observed variable and historical data, and describing the temporal correlation inside the sequence;
(2) fully mining the multi-granularity spatial features of the time series curve with a full convolution module: the convolution kernels of the first two layers of the classic FCN structure are split into multiple groups, the number of scales adopted being denoted by M, with the larger-scale part realized by hole convolution and the smaller-scale part by ordinary convolution; the multi-granularity abstract features extracted at each layer of the MFCN module are concatenated and passed uniformly to the next layer; and in the last layer of the MFCN module, features are extracted by single-scale convolution and the features of the multiple convolution kernels are integrated by global pooling as the finally output features;
(3) integrating and discriminating the spatio-temporal features: the temporal and spatial features are spliced and integrated, and a fully connected neural network adaptively learns the mapping between the spatio-temporal features and the sample categories, yielding the LSTM-MFCN model.
2. The method according to claim 1, wherein the kernel sizes of the layers of the multi-scale FCN module are 8, 5 and 3, the total numbers of convolution kernels are 128, 256 and 128, the hole rate d is not more than 4, and the hole rates of the layers form a pyramid structure.
3. The method for classifying time series based on LSTM and multi-scale FCN as claimed in claim 1, wherein the number ratio of the receptive fields of each scale in each layer of the multi-scale FCN module is variable, with hyper-parameters w_1, w_2, … as the adjustable scale ratio, and the actual number NF_{Li} of convolution kernels of the i-th scale of the L-th layer is calculated by the formula
NF_{Li} = NF_L · w_i / (w_1 + w_2 + … + w_M)
where NF_L is the overall number of convolution kernels of the L-th layer.
4. The method of claim 1, wherein the LSTM module performs dimension transposition on univariate time series data, converting it into a serial input of one value per time step, and then adjusts the number of LSTM neurons according to the complexity of the temporal features in the specific data and the training capacity of the model.
5. The method of claim 4, wherein the number of LSTM neurons is 8, 64 or 128.
6. The method for classifying time series based on LSTM and multi-scale FCN as claimed in claim 4, wherein said LSTM module further performs pruning operation.
7. The method for classifying time series based on LSTM and multi-scale FCN as claimed in claim 6, wherein Dropout rate of pruning operation is set to 0.8.
8. The method for classifying time series based on LSTM and multi-scale FCN as claimed in claim 1, wherein the splicing and integrating of the temporal and spatial features in step (3) specifically refers to splicing and integrating them with a single fully connected layer combined with a SoftMax activation function.
9. The LSTM and multi-scale FCN-based time series classification method of claim 1, wherein step (3) further comprises training the LSTM-MFCN model using an error back-propagation algorithm, preserving the model with the least error.
CN202111034788.4A 2021-09-04 2021-09-04 Time sequence classification method based on LSTM and multi-scale FCN Active CN113705715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111034788.4A CN113705715B (en) 2021-09-04 2021-09-04 Time sequence classification method based on LSTM and multi-scale FCN


Publications (2)

Publication Number Publication Date
CN113705715A true CN113705715A (en) 2021-11-26
CN113705715B CN113705715B (en) 2024-04-19

Family

ID=78659645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111034788.4A Active CN113705715B (en) 2021-09-04 2021-09-04 Time sequence classification method based on LSTM and multi-scale FCN

Country Status (1)

Country Link
CN (1) CN113705715B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169504A (en) * 2022-09-06 2022-10-11 山东洲蓝环保科技有限公司 Equipment abnormity identification method in coal gas fine desulfurization process
CN116628473A (en) * 2023-05-17 2023-08-22 国网上海市电力公司 Power equipment state trend prediction method based on multi-factor neural network algorithm

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096531A (en) * 2016-05-31 2016-11-09 安徽省云力信息技术有限公司 A kind of traffic image polymorphic type vehicle checking method based on degree of depth study
CN108776969A (en) * 2018-05-24 2018-11-09 复旦大学 Breast ultrasound image lesion segmentation approach based on full convolutional network
CN110632572A (en) * 2019-09-30 2019-12-31 中国人民解放军战略支援部队信息工程大学 Radar radiation source individual identification method and device based on unintentional phase modulation characteristics
CN111275113A (en) * 2020-01-20 2020-06-12 西安理工大学 Skew time series abnormity detection method based on cost sensitive hybrid network
CN111626267A (en) * 2019-09-17 2020-09-04 山东科技大学 Hyperspectral remote sensing image classification method using void convolution
CN111985533A (en) * 2020-07-14 2020-11-24 中国电子科技集团公司第三十六研究所 Incremental underwater sound signal identification method based on multi-scale information fusion
CN112101220A (en) * 2020-09-15 2020-12-18 哈尔滨理工大学 Rolling bearing service life prediction method based on unsupervised model parameter migration
US10950352B1 (en) * 2020-07-17 2021-03-16 Prince Mohammad Bin Fahd University System, computer-readable storage medium and method of deep learning of texture in short time series
CN112668494A (en) * 2020-12-31 2021-04-16 西安电子科技大学 Small sample change detection method based on multi-scale feature extraction
CN112712117A (en) * 2020-12-30 2021-04-27 银江股份有限公司 Full convolution attention-based multivariate time series classification method and system
CN112989107A (en) * 2021-05-18 2021-06-18 北京世纪好未来教育科技有限公司 Audio classification and separation method and device, electronic equipment and storage medium
CN113035361A (en) * 2021-02-09 2021-06-25 北京工业大学 Neural network time sequence classification method based on data enhancement


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
FAZLE KARIM et al.: "LSTM Fully Convolutional Networks for Time Series Classification", IEEE Access, vol. 6, pages 1662-1669, XP011677431, DOI: 10.1109/ACCESS.2017.2779939 *
KEIICHI TAMURA et al.: "(n, m)-Layer MC-MHLF: Deep Neural Network for Classifying Time Series", 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 198-204 *
LIANG ZHAO et al.: "LSTM-MFCN: A time series classifier based on multi-scale spatial-temporal features", Computer Communications, pages 52-59 *
WENSHUO ZHOU et al.: "Time Series Classification Based on FCN Multi-scale Feature Ensemble Learning", 2019 IEEE 8th Data Driven Control and Learning Systems Conference, pages 901-906 *
YANG YONGJIAO et al.: "Improving fully convolutional networks for insulator defect detection", Machinery Design & Manufacture, no. 3, pages 177-180 *
MO CHUNYANG: "Research on time series classification methods based on multi-modal neural networks", China Master's Theses Full-text Database, Information Science and Technology, no. 2022, pages 138-728 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169504A (en) * 2022-09-06 2022-10-11 山东洲蓝环保科技有限公司 Equipment abnormity identification method in coal gas fine desulfurization process
CN115169504B (en) * 2022-09-06 2022-11-25 山东洲蓝环保科技有限公司 Equipment abnormity identification method in coal gas fine desulfurization process
CN116628473A (en) * 2023-05-17 2023-08-22 国网上海市电力公司 Power equipment state trend prediction method based on multi-factor neural network algorithm

Also Published As

Publication number Publication date
CN113705715B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN111967343B (en) Detection method based on fusion of simple neural network and extreme gradient lifting model
CN111832647A (en) Abnormal flow detection system and method
Naz et al. Intelligent routing between capsules empowered with deep extreme machine learning technique
CN113705715B (en) Time sequence classification method based on LSTM and multi-scale FCN
CN111931505A (en) Cross-language entity alignment method based on subgraph embedding
CN112732921B (en) False user comment detection method and system
CN109241199B (en) Financial knowledge graph discovery method
Li et al. Nuclear norm regularized convolutional Max Pos@ Top machine
CN110868414B (en) Industrial control network intrusion detection method and system based on multi-voting technology
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN111338950A (en) Software defect feature selection method based on spectral clustering
Gohar et al. Terrorist group prediction using data classification
CN116781346A (en) Convolution two-way long-term and short-term memory network intrusion detection method based on data enhancement
Wu et al. Semantic transfer between different tasks in the semantic communication system
CN113361590A (en) Feature fusion method based on multivariate time sequence
CN113641821A (en) Value orientation identification method and system for opinion leaders in social network
Du et al. Research on stock forecasting based on random forest
CN116702132A (en) Network intrusion detection method and system
CN112465054B (en) FCN-based multivariate time series data classification method
CN113935413A (en) Distribution network wave recording file waveform identification method based on convolutional neural network
Zhang Research on quantitative investment based on machine learning
Wotaifi et al. Modified random forest based graduates earning of higher education mining
Wang et al. Prediction poverty levels of needy college students using RF-PCA model
Li et al. A new random forest method based on belief decision trees and its application in intention estimation
CN113673627B (en) Automatic commodity classification method and system with interpretation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant