CN109656887B

CN109656887B - Distributed time series mode retrieval method for mass high-speed rail shaft temperature data

Info

Publication number: CN109656887B
Application number: CN201811510849.8A
Authority: CN
Inventors: 徐泉; 解军帅
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2023-03-21
Anticipated expiration: 2038-12-11
Also published as: WO2020118928A1; CN109656887A

Abstract

The invention provides a distributed time series mode retrieval method for massive high-speed rail shaft temperature data, and relates to the technical field of time series analysis. The method first sets a reference time series pattern for the retrieval and the number k of most similar patterns to retrieve. The historical shaft temperature time series data to be searched is read into a distributed system to construct a distributed data set X, and an index is constructed for the elements of X. Several auxiliary distributed data sets are then constructed according to the length of the reference time series and joined with X to construct a distributed data set Z, each element of which is a short time series. The Euclidean distance between the reference time series and each element of Z is calculated to construct a distributed data set R. R is sorted, the k smallest elements are taken, and their indexes are returned; the corresponding elements of Z are then obtained from those indexes. The method improves the efficiency of similarity retrieval over massive shaft temperature time series data.

Description

Distributed time series mode retrieval method for mass high-speed rail shaft temperature data

Technical Field

The invention relates to the technical field of time series analysis, in particular to a distributed time series mode retrieval method for massive high-speed rail shaft temperature data.

Background

A time series is an ordered sequence of numerical values or symbols associated with time, and is widely used in fields such as finance, weather, and fault diagnosis. High-speed rail shaft temperature data, an important component of daily high-speed rail operation and maintenance data, has typical time series characteristics. The analysis and processing of such time series, including the retrieval of abnormal patterns, pattern mining, and clustering, is an important direction of current research on high-speed rail fault diagnosis. Because high-speed rail shaft temperature sensors are numerous and sample at a high frequency, the data is large in volume, high in dimensionality, and frequently updated; that is, it has typical big data characteristics. How to efficiently process the huge time series formed by massive high-speed rail shaft temperature data is therefore a problem that currently needs to be studied and solved.

The similar pattern retrieval problem for time series can be stated as follows: given a time series pattern, find the subsequences of a large time series that are most similar to it. Time series similarity retrieval is a precondition for other time series analysis tasks such as anomaly detection and pattern recognition. Current similar pattern retrieval methods mainly run on a single machine, searching the time series serially to find all subsequence patterns whose similarity meets the requirement. However, limited by the performance of one machine, the single-machine approach can process only a limited amount of data and has low computational efficiency, so it is difficult for it to meet the retrieval requirements of massive shaft temperature time series data.

At present, the development of big data and cloud computing makes distributed parallel computing of data possible, greatly improves the data processing capacity and efficiency, and provides a solution idea for the analysis problem of massive shaft temperature time series data. In order to improve the retrieval efficiency of massive shaft temperature time series data, a distributed time series similar mode retrieval method needs to be researched.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a distributed time series pattern retrieval method for massive high-speed rail shaft temperature data. The method uses a big data processing cluster composed of distributed computing nodes: the data is distributed across the memory of different nodes of the cluster, and when the cluster runs a computing task, the task is decomposed and distributed to different nodes, thereby realizing parallel retrieval of similar time series patterns.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a distributed time series mode retrieval method for massive high-speed rail shaft temperature data comprises the following steps:

step 1, setting a retrieved shaft temperature data reference time sequence mode s, determining the length m of a retrieval mode, and setting the number of most similar modes needing to be retrieved as k;

the shaft temperature data is bearing temperature data of a high-speed rail, and the sampling period is 1s;

the reference time sequence mode is data of a continuous time period selected by self-definition in the shaft temperature data, and the length of the reference time sequence mode is m;

step 2, reading mass historical shaft temperature time sequence data to be retrieved into a memory of a computing node of the distributed system, constructing an initial distributed data set X, and determining the number n of parallel computing tasks, wherein the method specifically comprises the following steps:

step 2.1, uploading the historical shaft temperature data stored on hard disk storage media to the Hadoop Distributed File System (HDFS) over the network;

step 2.2, reading mass shaft temperature time sequence data stored in the HDFS by using a distributed Spark calculation engine;

step 2.3, setting the number of partitions of the distributed data set to be created to n; the Spark computing engine divides the shaft temperature data stored on HDFS into n data blocks according to the set number of partitions, creates a partition for each data block, and constructs a Resilient Distributed Dataset (RDD) object X with n partitions, wherein each element of X is the shaft temperature value at a given moment, and X maintains the order of the partitions and the offset of the first element of each partition within the whole data set;

step 2.4, determining the number of parallel computing tasks to be n according to the number of partitions, creating a computing task for each data partition of the RDD object X by the Spark computing engine, distributing the computing task to different computing nodes, enabling the tasks to be independent from one another, and enabling each task to process data of one partition in parallel;

step 3, constructing indexes for the distributed data set X in the step 2, numbering each element in the data set X from 0 according to a time sequence, wherein the numbering task of each partition is calculated in parallel at different nodes; converting each element in the data set X into a key value pair record form, wherein the key is an index number, and the value is an axial temperature time sequence numerical value at a corresponding moment;

step 4, constructing m-1 auxiliary distributed data sets Y_j according to the length of the reference time series pattern, where j = 1, ..., m-1, and joining the distributed data set X from step 3 with the auxiliary distributed data sets to construct a distributed data set Z, each element of which is a short time series of length m; the specific steps are:

step 4.1, according to the length m of the reference time series pattern, constructing m-1 auxiliary distributed data sets Y_j identical to the data set X from step 3, where j = 1, ..., m-1;

step 4.2, constructing an index for each of the m-1 auxiliary distributed data sets Y_j, with the elements of Y_j numbered starting from -j;

step 4.3, performing a Cartesian product operation between the distributed data set X and each of the m-1 auxiliary distributed data sets Y_j in turn, and joining the elements that have the same index key in the two data sets of each Cartesian product;

step 4.4, after step 4.3, constructing the distributed data set Z from the data set X of step 3, the elements of Z having the form <key_1, (value_1, value_2, ..., value_m)>, where key_1 is the number of the element and (value_1, value_2, ..., value_m) is a short time series of length m starting at time key_1;

step 5, creating calculation tasks through a Spark calculation engine, wherein the logic of each calculation task is to calculate the Euclidean distance between a reference time sequence mode and elements of each data partition, distribute each calculation task to a plurality of nodes in a distributed system for parallel calculation, and construct a distributed data set R, and the specific steps comprise:

step 5.1, creating a calculation task for each data partition of the data set Z in the step 4.4 through a Spark calculation engine, wherein the logic of completion of each calculation task is to calculate a Euclidean distance between a reference data sequence s and a short time sequence l represented by each element of the data partition;

step 5.2, scheduling the computing tasks created in the step 5.1 through a distributed Spark computing engine, distributing the computing tasks to computing nodes where data partitions are located, and processing partition data of the nodes where each computing task is located;

step 5.3, creating a distributed data set R containing the same number of elements as the data set Z, each element being a key-value pair <key_R, value_R>, where key_R is the element number and value_R is the computed Euclidean distance between the corresponding element of Z and the reference time series;

step 6, fully sorting the distributed data set R in the step 5.3 according to Euclidean distance values, taking the first k elements with the minimum distance, and returning the index of each element, wherein the specific steps comprise:

step 6.1, performing parallel computing and sequencing on the data of each partition of the distributed data set R through a Spark computing engine according to the value of each element, and taking the minimum k elements of each partition;

step 6.2, collecting n multiplied by k elements obtained by n partitions of R on a node in the distributed system, summarizing and sorting, taking the minimum k elements, and recording the index value of the element obtained by calculation;

and 7, acquiring corresponding elements in the data set Z in the step 4 according to the index obtained in the step 6, and obtaining k sub-time sequences which are most similar to the reference time sequence in the massive shaft temperature historical time sequence data set.
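As a point of reference for steps 1 to 7, the following is a minimal single-machine sketch in Python of the result the distributed method is meant to produce. It is not the Spark implementation described here; the function name `top_k_similar` and the sample values are assumptions made for illustration. A sketch like this can be useful for checking a distributed implementation on small inputs.

```python
import heapq
import math


def top_k_similar(series, pattern, k):
    """Return the k (start_index, distance, window) triples closest to pattern.

    Hypothetical helper: a serial stand-in for the distributed pipeline.
    """
    m = len(pattern)
    # Steps 3-4: every start index i that has a full window of length m
    # (these windows play the role of the elements of Z).
    windows = ((i, series[i:i + m]) for i in range(len(series) - m + 1))
    # Step 5: Euclidean distance of each window to the reference pattern
    # (these pairs play the role of the elements of R).
    scored = ((i, math.dist(w, pattern), w) for i, w in windows)
    # Steps 6-7: keep the k smallest distances, with index and window.
    return heapq.nsmallest(k, scored, key=lambda t: t[1])


# Made-up sample data, not taken from the shaft temperature records.
sample_series = [1, 2, 3, 10, 11, 12, 1, 2, 4]
sample_result = top_k_similar(sample_series, [1, 2, 3], k=2)
```

Note that a window identical to the reference pattern, such as the stretch of the series the pattern was cut from, has distance 0 and is therefore always returned first.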

The beneficial effects of the above technical solution are as follows: the distributed time series mode retrieval method for massive high-speed rail shaft temperature data provided by the invention uses the distributed Spark computing engine and the distributed file system HDFS to reorganize and transform the distributed data set formed by massive high-speed rail shaft temperature time series data. Each element of that data set is converted into a short time series that is independent of the others and preserves time order, so that the Spark computing engine can create a computing task for each data partition and distribute the tasks to different computing nodes of the distributed system. Parallel similarity retrieval over the time series is thereby realized, the problem of similarity retrieval over massive high-speed rail shaft temperature time series that a single machine cannot process is solved, and the efficiency of similarity retrieval over massive shaft temperature time series data is improved.

Drawings

Fig. 1 is a flowchart of a distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to an embodiment of the present invention;

Fig. 2 is a diagram of the time series retrieval results according to the embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

The high-speed rail shaft temperature data is typical time series data, and in this embodiment, taking high-speed rail stator shaft temperature data collected by a certain railway bureau as an example, the Spark-oriented parallel time series similar mode retrieval method for massive high-speed rail shaft temperature data is adopted to retrieve the similar mode of the time series.

The high-speed rail stator shaft temperature data is temperature data acquired by a shaft temperature sensor installed at a high-speed rail stator end, the data volume of the used historical shaft temperature is 20GB, the number of data records is 213002710, the sampling period is 1s, the data content comprises acquisition time and the bearing temperature value at the moment, and part of the data is shown in table 1.

TABLE 1 partial high-speed rail stator shaft temperature data

Because the historical shaft temperature data volume is large, only part of the stator shaft temperature data is listed to illustrate the form of the high-speed rail stator shaft temperature data.

A Spark-based parallel time series retrieval method for massive high-speed rail stator shaft temperature data comprises the following steps as shown in figure 1:

step 1, setting the reference time series pattern s of the shaft temperature data to be retrieved: taking the 160th data point as the starting point, continuous data of length m = 12 is used as the reference time series pattern, so that s is [103, 103, 105, 105, 110, 159, 144, 127, 116, 116, 117, 113], and the number k of most similar patterns to retrieve is set to 2; the task of time series pattern retrieval is to find, in the large time series represented by the massive shaft temperature historical data, the sub-time series of length m that are most similar to the reference time series pattern;

step 2, reading the historical high-speed rail stator shaft temperature time series data to be retrieved into the memory of the distributed computing nodes, constructing a distributed data set X, and setting the number n of parallel computing tasks to 10. This specifically comprises the following steps:

step 2.1, uploading historical data of the temperature of the high-speed rail stator shaft stored in a hard disk storage medium to a distributed file system HDFS (Hadoop distributed file system) through a network;

step 2.2, reading the high-speed rail stator shaft temperature data stored on the HDFS by using a distributed Spark calculation engine;

step 2.3, setting the number of data partitions to 10 when Spark reads the stator shaft temperature data; the Spark computing engine divides the data read into 10 data blocks according to the number of partitions, and these data partitions form the distributed data set RDD object X. The partitions of X have the form {partition1, partition2, ..., partition10}, where each partition represents one data partition and the partitions are ordered in time.

In this embodiment, the internal data format of a partition is illustrated with partition1. partition1 holds 20001036 shaft temperature records, and its internal data format is {x_1, x_2, ..., x_20001036}, where x_n denotes a stator shaft temperature value. The elements of the partition are arranged in time order, and the other partitions have the same form as partition1.

step 2.4, the Spark computing engine creates a computing task for each data partition according to the set number of data partitions of the RDD object and distributes the tasks to different nodes in the distributed system, each task processing only the stator shaft temperature data of its own data partition;
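The partitioning in steps 2.3 and 2.4 can be pictured with a small Python sketch. It is an illustration only: the helper names are hypothetical, and the even split by record count is an assumption, since Spark decides the actual partition boundaries when it reads the file from HDFS. What the sketch does show is the bookkeeping that matters later, namely that partitions are contiguous, ordered in time, and each has a known offset into the whole data set.

```python
def split_into_partitions(values, n):
    """Split values into n contiguous partitions, preserving time order.

    Hypothetical helper; the near-even split is an assumption.
    """
    size, extra = divmod(len(values), n)
    partitions, start = [], 0
    for p in range(n):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if p < extra else 0)
        partitions.append(values[start:end])
        start = end
    return partitions


def partition_offsets(partitions):
    """Offset of the first element of each partition in the whole data set."""
    offsets, total = [], 0
    for part in partitions:
        offsets.append(total)
        total += len(part)
    return offsets


# Made-up sample: 10 readings split into 3 partitions.
sample_parts = split_into_partitions(list(range(10)), 3)
sample_offsets = partition_offsets(sample_parts)
```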

step 3, constructing an index for the distributed data set X formed from the stator shaft temperature data in step 2, numbering the elements of X from 0 in time order, and converting the data of each partition of X into the form

{<index, x_1>, <index+1, x_2>, ..., <index+n-1, x_n>}

The index is the offset of the first element of the partition relative to the whole data set, n is the number of data records of the partition, an index is created for each element of the distributed data set, and the creation of the index of the element of each partition is independently completed at different nodes.

In this embodiment, partition1 is taken as an example to describe the conversion logic of data, and after an index is constructed, the data form of partition1 is converted into

{<0, x_1>, <1, x_2>, ..., <20001035, x_20001036>}

The data form of the other partitions is the same as partition 1.
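The index construction of step 3 can be sketched as follows, assuming the partitions are already held as ordered Python lists. The helper names are hypothetical. Each partition only needs its own offset, which is why the numbering of different partitions can run independently on different nodes.

```python
def index_partition(partition, offset):
    """Turn one partition into <index, value> pairs, starting at its offset."""
    return [(offset + i, x) for i, x in enumerate(partition)]


def build_index(partitions):
    """Number all elements from 0 in time order, partition by partition."""
    offsets, total = [], 0
    for part in partitions:
        offsets.append(total)
        total += len(part)
    # Each call below depends only on (partition, offset), so in a
    # distributed setting the calls could run in parallel.
    return [index_partition(p, off) for p, off in zip(partitions, offsets)]


# Made-up sample with two small partitions.
sample_indexed = build_index([[103, 103, 105], [105, 110]])
```

In Spark, the RDD method `zipWithIndex` offers a comparable numbering (ordered first by partition, then by position within the partition); whether the embodiment uses it is not stated here.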

Step 4, constructing, according to the length of the reference time series s, 11 auxiliary distributed data sets Y_j identical to X from step 3, where j = 1, ..., 11, and joining the distributed data set X with the auxiliary distributed data sets to construct a distributed data set Z in which each element is a short time series of length 12.

Step 4.1, constructing 11 distributed data sets Y_j identical to the data set X from step 3, where j = 1, ..., 11;

step 4.2, constructing an index for each of the 11 auxiliary distributed data sets from step 4.1: for Y_j, j is subtracted from the indexes of all its elements, so that the index of the first element after conversion is -j. Taking Y_1 as an example, the index of its first element is -1, and the indexes of the following elements increase from there;

step 4.3, performing the Cartesian product operation between the data set X from step 3 and each of the 11 auxiliary distributed data sets; for any two data sets subjected to the Cartesian product, only the element pairs with the same index are kept and joined. The Cartesian product of two sets is expressed mathematically as

A × B = {(x, y) | x ∈ A ∧ y ∈ B}

Wherein, A and B represent two data sets, x and y represent arbitrary elements in A and B, respectively, and (x, y) represent element forms in sets obtained after Cartesian product calculation is carried out on the two sets.

Joining two elements <key, value1> and <key, value2> that have the same index in the two data sets gives <key, (value1, value2)>.

This example uses X and Y_1 to illustrate the join logic. The elements of X and Y_1 with index 10 are <10, 99> and <10, 100> respectively, and joining them gives <10, (99, 100)>. The other elements are joined in the same way.
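The join of X with Y_1 can be reproduced on a handful of elements. The sketch below is a plain-Python illustration with hypothetical helper names. The values 99 and 100 at indexes 10 and 11 match the example above, while the neighbouring values 98 and 101 are made up to complete the sample.

```python
from itertools import product


def shift_index(indexed, j):
    """Build Y_j from X: same values, every index reduced by j."""
    return [(key - j, value) for key, value in indexed]


def cartesian_join(left, right):
    """Cartesian product of two indexed sets, keeping only equal-key pairs."""
    return [
        (kl, (vl, vr))
        for (kl, vl), (kr, vr) in product(left, right)
        if kl == kr
    ]


# A few elements of X around index 10 (98 and 101 are invented values).
x_sample = [(9, 98), (10, 99), (11, 100), (12, 101)]
y1_sample = shift_index(x_sample, 1)  # Y_1: index 10 now holds X's index 11
joined_sample = cartesian_join(x_sample, y1_sample)
```

A full Cartesian product compares every element of one set with every element of the other, so its cost grows with the product of the two sizes. A join keyed on the index, such as the `join` operation on Spark pair RDDs, yields the same equal-key pairs and may be worth considering as an alternative.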

Step 4.4, after step 4.3, a distributed data set Z is constructed, each element of Z becoming

<key, (value1, value2, ..., value12)>

where key is the index of the element, numbered from 0, and the value is a short time series of length 12.

In this embodiment, the form of the data in Z is illustrated with the 160th element of Z, that is, the element with index 159. The element of Z with index key 159 has the form

<159, (103, 103, 105, 105, 110, 159, 144, 127, 116, 116, 117, 113)>
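Taken together, steps 4.1 to 4.4 turn the indexed data set X into windows of length m. The following Python sketch shows that construction on a small in-memory example. The function name `build_windows` is hypothetical, and dictionaries stand in for the distributed data sets, with a key lookup replacing the Cartesian product and filter.

```python
def build_windows(indexed, m):
    """Build Z: each key maps to the m consecutive values starting at it."""
    # Start from X, with each value wrapped in a 1-tuple.
    z = {key: (value,) for key, value in indexed}
    for j in range(1, m):
        # Y_j is X with every key reduced by j, so y_j[key] == X[key + j].
        y_j = {key - j: value for key, value in indexed}
        # Keep only keys present on both sides, as the equal-key join does.
        z = {key: vals + (y_j[key],) for key, vals in z.items() if key in y_j}
    return sorted(z.items())


# First five values of the reference pattern, used as a tiny series.
sample_x = list(enumerate([103, 103, 105, 105, 110]))
sample_z = build_windows(sample_x, m=3)
```

The last m-1 indexes of X have no complete window and drop out of the result, so Z built this way has m-1 fewer elements than X.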

And step 5, scheduling the computing tasks through a Spark computing engine, creating 10 computing tasks for 10 partitions of the data set Z in the step 4.4, and distributing the 10 computing tasks to 5 nodes of the distributed system for parallel computing. The method comprises the following specific steps:

step 5.1, using the Spark computing engine to create one computing task for each of the 10 data partitions of the data set Z from step 4.4, wherein the logic of each computing task is to calculate the Euclidean distance between the reference data sequence s and each short time series element of length 12 in its partition of Z. For two k-dimensional vectors s = (x_{11}, x_{12}, ..., x_{1k}) and l = (x_{21}, x_{22}, ..., x_{2k}), the Euclidean distance between them is:

d(s, l) = \sqrt{\sum_{i=1}^{k} (x_{1i} - x_{2i})^2}

step 5.2, scheduling the computing tasks created in the step 5.1 through a distributed Spark computing engine, distributing the computing tasks to computing nodes where data partitions are located, and processing partition data of the nodes where the computing tasks are located by each computing task;

step 5.3, creating a distributed data set R containing the same number of elements as the data set Z, each element being a key-value pair <key_R, value_R>, where key_R is the element number and value_R is the computed Euclidean distance between the corresponding element of Z and the reference time series.
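The per-partition logic of step 5 is small enough to show directly. The sketch below is illustrative only, with a hypothetical function name. It maps one partition of Z to the corresponding partition of R using the Euclidean distance from Python's standard library.

```python
import math


def distances_for_partition(z_partition, pattern):
    """Map <key, window> elements of Z to <key, distance> elements of R."""
    return [(key, math.dist(window, pattern)) for key, window in z_partition]


# The reference pattern s from step 1 of this embodiment.
s = (103, 103, 105, 105, 110, 159, 144, 127, 116, 116, 117, 113)

# One tiny made-up partition: the element with index 159 equals s itself,
# and the element with index 160 is an invented window that differs from s
# by 3 in the first position and 4 in the second.
sample_partition = [
    (159, s),
    (160, (106, 107) + s[2:]),
]
sample_r = distances_for_partition(sample_partition, s)
```

Because each element of Z already carries its full window, a task needs no data from other partitions, which is what allows the tasks to run independently.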

In this embodiment, the number of partitions in the constructed distributed data set R is also 10, and the form of data elements in R is as shown in table 2:

table 2 partial dataform of distributed data set R

Step 6, fully sorting the distributed data set created in step 5.3 by Euclidean distance value, and obtaining the index values of the 2 elements with the smallest values. The specific steps are as follows:

step 6.1, creating a computing task for each partition of the data set R from step 5.3 through the Spark computing engine, and distributing the tasks to 5 nodes for computation; each task finds the 2 elements with the smallest values in its partition and records their index values;

step 6.2, collecting the 2 × 10 local minimum elements obtained in step 6.1 on the master node of the distributed system through the Spark computing engine, sorting them together, obtaining the 2 globally smallest elements, and recording their index values;

step 6.3, obtaining from the data set Z of step 4.4 the element values corresponding to the 2 index values calculated in step 6.2, namely the 2 sub-time series most similar to the reference time series s. In this embodiment, the two most similar sub-time series retrieved from the shaft temperature historical data for the reference time series s have starting indexes 401 and 95 respectively, and the retrieval result is shown in table 3:

table 3 similar pattern search results

Index	Euclidean distance	Shaft temperature time series values
70	36.318	{105, 105, 112, 117, 122, 121, 178, 132, 116, 113, 115, 111}
401	38.716	{106, 107, 107, 108, 108, 119, 128, 136, 112, 110, 110, 111}
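The two-stage selection of steps 6.1 and 6.2 can be sketched in Python as follows. The function names and the sample distances are made up for illustration. The key observation is that the k globally smallest elements must each be among the k smallest of their own partition, so collecting at most n × k candidates is enough.

```python
import heapq


def local_top_k(partition, k):
    """Step 6.1: the k elements of one partition with the smallest distance."""
    return heapq.nsmallest(k, partition, key=lambda kv: kv[1])


def global_top_k(partitions, k):
    """Step 6.2: merge the local candidates and keep the k smallest overall."""
    candidates = [kv for part in partitions for kv in local_top_k(part, k)]
    return heapq.nsmallest(k, candidates, key=lambda kv: kv[1])


# Made-up <index, distance> pairs spread over three partitions of R.
sample_r_parts = [
    [(0, 5.0), (1, 1.0), (2, 9.0)],
    [(3, 0.5), (4, 7.0)],
    [(5, 3.0)],
]
sample_top = global_top_k(sample_r_parts, k=2)
```

In Spark, the RDD method `takeOrdered(num, key)` returns the smallest elements of a data set in a comparable way; it is mentioned here as a possible counterpart, not as the call used in this embodiment.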

The result of the retrieval method is shown in fig. 2, in which the data marked by the rectangular box is the stator shaft temperature reference time series pattern, and the data marked by the oval boxes is the 2 most similar sub-time series patterns retrieved from the stator shaft temperature historical time series.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit of the invention, which is defined by the claims.

Claims

1. A distributed time series mode retrieval method for massive high-speed rail shaft temperature data, characterized by comprising the following steps:

step 2, reading mass historical shaft temperature time sequence data to be retrieved into a memory of a computing node of the distributed system, constructing an initial distributed data set X, and determining the number n of parallel computing tasks;

step 3, constructing indexes for the distributed data set X in the step 2, numbering each element in the data set X from 0 according to a time sequence, wherein the numbering task of each partition is calculated in parallel at different nodes; converting each element in the data set X into a key value pair recording form, wherein the key is an index number, and the value is an axial temperature time series numerical value at a corresponding moment;

step 4, constructing m-1 auxiliary distributed data sets Y_j according to the length of the reference time series pattern, where j = 1, ..., m-1, and joining the distributed data set X from step 3 with the auxiliary distributed data sets to construct a distributed data set Z in which each element is a short time series of length m;

step 4.2, constructing an index for each of the m-1 auxiliary distributed data sets Y_j, in which the elements of Y_j are numbered starting from -j;

step 4.4, after step 4.3, constructing a distributed data set Z from the data set X in step 3, where each element of Z becomes <key, (value_1, value_2, ..., value_m)>;

wherein, key is the index of the element, numbering is started from 0, and value is converted into a short time sequence with the length of m;

step 5, creating calculation tasks through a Spark calculation engine, wherein the logic of each calculation task is to calculate the Euclidean distance between a reference time sequence mode and elements of each data partition, distribute each calculation task to a plurality of nodes in a distributed system for parallel calculation, and construct a distributed data set R;

step 6, carrying out full sequencing on the distributed data set R in the step 5 according to Euclidean distance values, taking the first k elements with the minimum distance, and returning the index of each element;

and 7, acquiring corresponding elements in the data set Z in the step 4 according to the indexes obtained in the step 6, and obtaining k sub-time sequences which are most similar to the reference time sequence in the mass axle temperature historical time sequence data set.

2. The distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to claim 1, characterized in that: the specific method of the step 2 comprises the following steps:

step 2.1, uploading historical axle temperature data stored in a hard disk storage medium to a distributed file system HDFS through a network;

step 2.3, setting the number of partitions of the distributed data set to be created to n; the Spark computing engine divides the shaft temperature data stored on HDFS into n data blocks according to the set number of partitions, creates a partition for each data block, and constructs a Resilient Distributed Dataset (RDD) object X with n partitions, wherein each element of X is the shaft temperature value at a given moment, and X maintains the order of the partitions and the offset of the first element of each partition within the whole data set;

and 2.4, determining the number of the parallel computing tasks to be n according to the number of the partitions, creating a computing task for each data partition of the RDD object X by the Spark computing engine, distributing the computing task to different computing nodes, enabling the tasks to be independent from one another, and enabling each task to process data of one partition in parallel.

3. The distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to claim 2, characterized in that: the specific method of the step 5 comprises the following steps:

step 5.3, creating a distributed data set R containing the same number of elements as the data set Z, each element being a key-value pair <key_R, value_R>, where key_R is the element number and value_R is the computed Euclidean distance between the corresponding element of Z and the reference time series.

4. The distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to claim 3, characterized in that: the specific method of the step 6 comprises the following steps:

and 6.2, collecting n multiplied by k elements obtained by the n partitions of the R to a node in the distributed system, summarizing and sorting, taking the minimum k elements, and recording the index value of the element obtained by calculation.