CN109656887B - Distributed time series mode retrieval method for mass high-speed rail shaft temperature data - Google Patents

Info

Publication number
CN109656887B
Authority
CN
China
Prior art keywords
data
distributed
data set
time sequence
shaft temperature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811510849.8A
Other languages
Chinese (zh)
Other versions
CN109656887A (en
Inventor
徐泉
解军帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811510849.8A priority Critical patent/CN109656887B/en
Priority to PCT/CN2019/077741 priority patent/WO2020118928A1/en
Publication of CN109656887A publication Critical patent/CN109656887A/en
Application granted granted Critical
Publication of CN109656887B publication Critical patent/CN109656887B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a distributed time series pattern retrieval method for massive high-speed rail shaft temperature data and relates to the technical field of time series analysis. First, a reference time series pattern and the number k of most similar patterns to be retrieved are set. The historical shaft temperature time series data to be searched is read into a distributed system to construct a distributed data set X, and an index is constructed for the elements of X. Several auxiliary distributed data sets are then constructed according to the length of the reference time series and joined with X to construct a distributed data set Z, each element of which is a short time series. The Euclidean distance between the reference time series and each element of Z is calculated to construct a distributed data set R. R is sorted, the k smallest elements are taken, and their indexes are returned. Finally, the corresponding elements of Z are fetched according to these indexes. The method improves the efficiency of similarity retrieval over massive shaft temperature time series data.

Description

Distributed time series mode retrieval method for mass high-speed rail shaft temperature data
Technical Field
The invention relates to the technical field of time series analysis, in particular to a distributed time series mode retrieval method for massive high-speed rail shaft temperature data.
Background
A time series is an ordered sequence of numerical values or symbols associated with time, and is widely used in fields such as finance, meteorology, and fault diagnosis. High-speed rail shaft temperature data, an important component of daily high-speed rail operation and maintenance data, has typical time series characteristics. Analyzing and processing such time series, including abnormal pattern retrieval, pattern mining, and clustering, is an important direction in current high-speed rail fault diagnosis research. Because high-speed rail shaft temperature sensors are numerous and sample at a high frequency, the data is large in volume, high in dimensionality, and frequently updated; that is, it has typical big data characteristics. How to efficiently process the huge time series formed by massive high-speed rail shaft temperature data is therefore a problem that needs to be studied and solved.
The similarity pattern retrieval problem for time series can be stated as follows: given a time series pattern, find the several subsequences of a large time series that are most similar to it. Similarity retrieval is a prerequisite for other time series analysis tasks, such as anomaly detection and pattern recognition. Current methods mainly run on a single machine, searching the time series serially to find all subsequence patterns whose similarity meets the requirement. Limited by the performance of one machine, such methods can process only a limited amount of data and are computationally inefficient, so they struggle to meet the retrieval requirements of massive shaft temperature time series data.
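For orientation, the single-machine serial search described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names `euclidean` and `serial_top_k` and the toy data are assumptions made here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def serial_top_k(series, pattern, k):
    """Scan every window of len(pattern) and return the k closest ones
    as (distance, start_index) pairs, smallest distance first."""
    m = len(pattern)
    scored = [
        (euclidean(series[i:i + m], pattern), i)
        for i in range(len(series) - m + 1)
    ]
    return sorted(scored)[:k]

# Hypothetical toy series, not data from the document.
toy_series = [1, 2, 3, 10, 2, 3, 4, 9]
toy_pattern = [2, 3, 4]
result = serial_top_k(toy_series, toy_pattern, k=2)
```

Every window is visited in one process, so both run time and memory grow with the length of the series on a single machine, which is the limitation the distributed method targets.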
The development of big data and cloud computing has made distributed parallel computation over data possible, greatly improving data processing capacity and efficiency, and suggests an approach to analyzing massive shaft temperature time series data. To improve retrieval efficiency for such data, a distributed method for retrieving similar time series patterns needs to be studied.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a distributed time series pattern retrieval method for massive high-speed rail shaft temperature data. The method uses a big data processing cluster composed of distributed computing nodes: the data is distributed across the memory of different cluster nodes, and when the cluster runs a computing task, the task is decomposed and dispatched to different nodes, so that similar time series patterns are retrieved in parallel.
To solve this technical problem, the invention adopts the following technical solution: a distributed time series pattern retrieval method for massive high-speed rail shaft temperature data, comprising the following steps:
step 1, setting a reference time series pattern s of the shaft temperature data to be retrieved, determining the length m of the retrieval pattern, and setting the number of most similar patterns to be retrieved to k;
the shaft temperature data is bearing temperature data of a high-speed rail, and the sampling period is 1 s;
the reference time series pattern is a user-selected continuous time period of the shaft temperature data, with length m;
step 2, reading the massive historical shaft temperature time series data to be searched into the memory of the computing nodes of the distributed system, constructing an initial distributed data set X, and determining the number n of parallel computing tasks, specifically:
step 2.1, uploading the historical shaft temperature data stored on a hard disk storage medium to the Hadoop Distributed File System (HDFS) over the network;
step 2.2, reading the massive shaft temperature time series data stored in HDFS using the distributed Spark computing engine;
step 2.3, setting the number of partitions of the distributed data set to be created to n; the Spark computing engine divides the shaft temperature data stored on HDFS into n data blocks according to this setting, creates one partition for each data block, and constructs a Resilient Distributed Dataset (RDD) object X with n partitions, wherein each element of X is the shaft temperature value at a certain moment, and X maintains the order of the partitions and the offset of the first element of each partition within the whole data set;
step 2.4, setting the number of parallel computing tasks to n according to the number of partitions; the Spark computing engine creates one computing task for each data partition of the RDD object X and dispatches the tasks to different computing nodes, the tasks being independent of one another and each task processing the data of one partition in parallel;
step 3, constructing an index for the distributed data set X of step 2 by numbering each element of X from 0 in time order, the numbering of each partition being computed in parallel on different nodes; each element of X is converted into a key-value pair record, in which the key is the index number and the value is the shaft temperature time series value at the corresponding moment;
step 4, constructing m-1 auxiliary distributed data sets Y_j, where j = 1, 2, ..., m-1, according to the length of the reference time series pattern, and joining the distributed data set X of step 3 with the auxiliary distributed data sets to construct a distributed data set Z, each element of which is a short time series of length m, specifically:
step 4.1, constructing, according to the length m of the reference time series pattern, m-1 auxiliary distributed data sets Y_j identical to the data set X of step 3, where j = 1, 2, ..., m-1;
step 4.2, constructing an index for each of the m-1 auxiliary distributed data sets Y_j, the elements of Y_j being numbered starting from -j;
step 4.3, performing a Cartesian product operation between the distributed data set X and each of the m-1 auxiliary distributed data sets Y_j in turn, and joining the elements that have the same index key in the two distributed data sets of each Cartesian product;
step 4.4, after step 4.3, the data set X of step 3 has been turned into a distributed data set Z whose elements have the form <key_1, (value_1, value_2, ..., value_m)>, where key_1 is the number of the element and (value_1, value_2, ..., value_m) is a short time series of length m starting at the key_1-th moment;
step 5, creating computing tasks through the Spark computing engine, the logic of each task being to calculate the Euclidean distance between the reference time series pattern and the elements of one data partition; dispatching the tasks to multiple nodes of the distributed system for parallel computation; and constructing a distributed data set R, specifically:
step 5.1, creating, through the Spark computing engine, one computing task for each data partition of the data set Z of step 4.4, the logic of each task being to calculate the Euclidean distance between the reference sequence s and the short time series l represented by each element of the partition;
step 5.2, scheduling the computing tasks created in step 5.1 through the distributed Spark computing engine and dispatching each task to the computing node holding its data partition, so that each task processes the partition data on its own node;
step 5.3, creating a distributed data set R with the same number of elements as the data set Z, whose elements are key-value pairs <key_R, value_R>, where key_R is the element number and value_R is the calculated Euclidean distance between the corresponding element of Z and the reference time series;
step 6, fully sorting the distributed data set R of step 5.3 by Euclidean distance, taking the k elements with the smallest distance, and returning the index of each, specifically:
step 6.1, sorting the data of each partition of the distributed data set R by element value in parallel through the Spark computing engine, and taking the k smallest elements of each partition;
step 6.2, collecting the n × k elements obtained from the n partitions of R onto one node of the distributed system, sorting them together, taking the k smallest elements, and recording their index values;
step 7, fetching the elements of the data set Z of step 4 that correspond to the indexes obtained in step 6, thereby obtaining the k sub time series of the massive historical shaft temperature data set that are most similar to the reference time series.
The beneficial effects of the above technical solution are as follows. Building on the distributed Spark computing engine and the distributed file system HDFS, the method reorganizes and transforms the distributed data set formed by massive high-speed rail shaft temperature time series data, so that each element of the data set becomes a short time series that is independent of the others yet preserves time order. The Spark computing engine can then create one computing task for each data partition and dispatch the tasks to different computing nodes of the distributed system. Time series similarity retrieval is thereby performed in parallel, which handles massive high-speed rail shaft temperature time series that a single machine cannot process and improves the efficiency of similarity retrieval over such data.
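The data sets built in steps 3 to 7 can be written compactly as follows. This is an interpretive summary rather than notation from the patent: N (the total number of samples), the lowercase element names, and the 1-based indexing of s are assumptions introduced here.

```latex
\begin{aligned}
X &= \{\langle i,\ x_i\rangle : 0 \le i \le N-1\}, \qquad s = (s_1, \dots, s_m),\\
Y_j &= \{\langle i,\ x_{i+j}\rangle : -j \le i \le N-1-j\}, \qquad j = 1, \dots, m-1,\\
Z &= \{\langle i,\ (x_i, x_{i+1}, \dots, x_{i+m-1})\rangle : 0 \le i \le N-m\},\\
R &= \{\langle i,\ r_i\rangle\}, \qquad r_i = \sqrt{\sum_{t=1}^{m} \left(s_t - x_{i+t-1}\right)^2},\\
\text{result} &= \{\, Z_i : r_i \text{ is among the } k \text{ smallest values in } R \,\}.
\end{aligned}
```

Because each Y_j carries the value x_{i+j} under key i, joining X with Y_1, ..., Y_{m-1} on the key gathers the m consecutive values of one window into a single element, and the keys N-m+1, ..., N-1 drop out because no complete window starts there.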
Drawings
Fig. 1 is a flowchart of a distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to an embodiment of the present invention;
Fig. 2 shows the time series retrieval result according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
High-speed rail shaft temperature data is typical time series data. This embodiment takes high-speed rail stator shaft temperature data collected by a railway bureau as an example and uses the Spark-based parallel time series similar pattern retrieval method for massive high-speed rail shaft temperature data to retrieve similar patterns in the time series.
The high-speed rail stator shaft temperature data is temperature data acquired by a shaft temperature sensor installed at the stator end. The historical shaft temperature data used amounts to 20 GB and contains 213002710 records with a sampling period of 1 s. Each record comprises the acquisition time and the bearing temperature value at that moment. Part of the data is shown in Table 1.
Table 1. Partial high-speed rail stator shaft temperature data
[Table 1 appears as an image in the original publication and is not reproduced here.]
Because the volume of historical shaft temperature data is large, only part of the stator shaft temperature data is listed to illustrate its form.
As shown in Fig. 1, the Spark-based parallel time series retrieval method for massive high-speed rail stator shaft temperature data comprises the following steps:
step 1, setting the reference time series pattern s of the shaft temperature data to be retrieved: the 160th data point is selected as the starting point and the continuous data of length m = 12 is used as the reference time series pattern, so that s = [103, 103, 105, 105, 110, 159, 144, 127, 116, 116, 117, 113]; the number k of most similar patterns to be retrieved is set to 2; the task of time series pattern retrieval is to find, in the large time series formed by the massive historical shaft temperature data, the several sub time series of length m that are most similar to the reference time series pattern;
step 2, reading the historical high-speed rail stator shaft temperature time series data to be searched into the memory of the distributed computing nodes, constructing a distributed data set X, and setting the number n of parallel computing tasks to 10, specifically:
step 2.1, uploading the historical high-speed rail stator shaft temperature data stored on a hard disk storage medium to the distributed file system HDFS (Hadoop Distributed File System) over the network;
step 2.2, reading the high-speed rail stator shaft temperature data stored on the HDFS by using a distributed Spark calculation engine;
step 2.3, setting the number of data partitions to 10 when Spark reads the stator shaft temperature data; the Spark computing engine divides the data read into 10 data blocks accordingly, and these partitions form a distributed data set RDD object X of the form {partition1, partition2, ..., partition10}, where each partition represents one data partition and the partitions are ordered in time.
In this embodiment, the internal data format of a partition is described using partition1, which holds 20001036 shaft temperature records and has the form {x_1, x_2, ..., x_20001036}, where x_n denotes a stator shaft temperature value. The elements of a partition are arranged in time order, and the other partitions have the same form as partition1.
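The partition layout can be pictured with a small stand-in written in plain Python. This is only a sketch under stated assumptions: Spark derives its partitions from HDFS blocks rather than from an even element count, and the function name `split_into_partitions` and the toy values are hypothetical.

```python
def split_into_partitions(values, n):
    """Split values into n contiguous, order-preserving partitions.

    Returns (partitions, offsets), where offsets[p] is the position of the
    first element of partition p in the whole data set.
    Assumption: near-equal sizes, which a real HDFS read need not produce.
    """
    base, extra = divmod(len(values), n)
    partitions, offsets, start = [], [], 0
    for p in range(n):
        size = base + (1 if p < extra else 0)  # spread the remainder
        partitions.append(values[start:start + size])
        offsets.append(start)
        start += size
    return partitions, offsets

toy = list(range(100, 110))  # 10 hypothetical temperature samples
parts, offs = split_into_partitions(toy, 3)
```

Concatenating the partitions in order reproduces the original series, which is the property the later indexing step relies on.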
Step 2.4, according to the number of data partitions set for the RDD object, the Spark computing engine creates one computing task for each data partition and dispatches the tasks to different nodes of the distributed system, each task processing only the stator shaft temperature data of its own partition;
step 3, constructing an index for the distributed data set X formed by the stator shaft temperature data in step 2: the elements of X are numbered from 0 in time order, and the data of each partition of X is converted into
{<index, x_1>, <index+1, x_2>, ..., <index+n-1, x_n>}
where index is the offset of the first element of the partition relative to the whole data set and n is the number of data records in the partition. An index is thus created for every element of the distributed data set, and the indexes of each partition's elements are created independently on different nodes.
In this embodiment, partition1 is used to illustrate the conversion. After the index is constructed, the data of partition1 takes the form
{<0, x_1>, <1, x_2>, ..., <20001035, x_20001036>}
The other partitions have the same data form as partition1.
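The offset-based numbering of step 3 can be sketched in plain Python as below. The function name `index_partitions` and the toy partitions are assumptions for illustration; PySpark's `RDD.zipWithIndex()` produces a comparable global numbering, although the patent does not say which call is used.

```python
from itertools import accumulate

def index_partitions(partitions):
    """Attach a global, time-ordered index to every element of every partition."""
    sizes = [len(p) for p in partitions]
    # Offset of a partition = number of elements in all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Each partition is numbered on its own once its offset is known,
    # so the inner list comprehensions could run on different nodes.
    return [
        [(offset + i, x) for i, x in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]

# Hypothetical partitions of unequal size.
toy_partitions = [[103, 103, 105], [105, 110], [159, 144, 127]]
indexed = index_partitions(toy_partitions)
```

Only the partition sizes have to be exchanged between nodes before numbering starts; the element values never leave their partition.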
Step 4, according to the length of the reference time series s, constructing 11 auxiliary distributed data sets Y_j identical to X of step 3, where j = 1, 2, ..., 11, and joining the distributed data set X with the auxiliary distributed data sets to construct a distributed data set Z in which each element is a short time series of length 12.
Step 4.1, constructing 11 distributed data sets Y_j identical to the data set X of step 3, where j = 1, 2, ..., 11;
step 4.2, constructing an index for the 11 auxiliary distributed data sets of step 4.1: for Y_j, j is subtracted from the index of every element, so that the index of the first element becomes -j; for Y_1, for example, the index of the first element is -1, and the indexes of the subsequent elements increase from there;
step 4.3, performing a Cartesian product operation between the data set X of step 3 and the 11 auxiliary distributed data sets; for any two data sets combined by the Cartesian product, only the pairs of elements with the same index are kept and joined. The Cartesian product of two sets is expressed mathematically as
A × B = {(x, y) | x ∈ A ∧ y ∈ B}
where A and B denote two data sets, x and y denote arbitrary elements of A and B respectively, and (x, y) is the form of the elements in the set obtained from the Cartesian product of the two sets.
Joining two elements <key, value1> and <key, value2> that have the same index in the two data sets gives <key, (value1, value2)>.
This embodiment uses X and Y_1 to illustrate the join: the elements of X and Y_1 with index 10 are <10, 99> and <10, 100> respectively, and joining them gives <10, (99, 100)>. Other elements are joined in the same way.
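The join of X with Y_1 can be reproduced on a few values in plain Python. The sketch below is illustrative only: the helper names are hypothetical, and the values around index 10 are made up except for the pair 99 and 100 taken from the example above. It also shows that a keyed lookup gives the same result as filtering the full Cartesian product, without forming all |A| × |B| pairs.

```python
from itertools import product

def shift_index(indexed, j):
    """Build Y_j: the same values as X with every index decreased by j."""
    return [(key - j, value) for key, value in indexed]

def join_via_cartesian(a, b):
    """Literal reading of step 4.3: form all pairs, keep those with equal keys."""
    return [(ka, (va, vb)) for (ka, va), (kb, vb) in product(a, b) if ka == kb]

def join_via_hash(a, b):
    """Same result using a dictionary lookup on the key."""
    lookup = dict(b)
    return [(k, (v, lookup[k])) for k, v in a if k in lookup]

# X holds 99 at index 10 and 100 at index 11; the other values are hypothetical.
x = [(9, 98), (10, 99), (11, 100), (12, 101)]
y1 = shift_index(x, 1)
joined = join_via_cartesian(x, y1)
```

The element of X with index 12 has no partner in Y_1 here and is dropped, which is how incomplete windows at the end of the series disappear from the result.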
Step 4.4, after step 4.3, the distributed data set Z has been constructed, and each element of Z has the form
<key, (value1, value2, ..., value12)>
where key is the index of the element, numbered from 0, and the value has become a short time series of length 12.
In this embodiment, the form of the data in Z is illustrated by the 160th element of Z, that is, the element with index 159, which is
<159,(103,103,105,105,110,159,144,127,116,116,117,113)>
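Repeating the join for j = 1, ..., m-1 yields one window per index. The following plain-Python sketch assumes a small m and reuses the first eight values of the reference pattern purely as sample input placed at indexes 0 to 7; the function name `build_windows` is hypothetical.

```python
def build_windows(indexed, m):
    """Join X with Y_1 .. Y_{m-1} on the index to obtain length-m windows."""
    z = {key: (value,) for key, value in indexed}
    for j in range(1, m):
        # Y_j: same values, indexes shifted down by j.
        y_j = {key - j: value for key, value in indexed}
        # Inner join: keys without a partner in Y_j (the tail) drop out.
        z = {key: vals + (y_j[key],) for key, vals in z.items() if key in y_j}
    return z

values = [103, 103, 105, 105, 110, 159, 144, 127]  # sample input only
indexed = list(enumerate(values))
z = build_windows(indexed, m=3)
```

With 8 samples and m = 3 the result has 8 - 3 + 1 = 6 elements, and the element with key i equals the slice `values[i:i + 3]`.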
Step 5, scheduling the computing tasks through the Spark computing engine: 10 computing tasks are created for the 10 partitions of the data set Z of step 4.4 and dispatched to 5 nodes of the distributed system for parallel computation, specifically:
step 5.1, creating, through the Spark computing engine, one computing task for each of the 10 data partitions of the data set Z of step 4.4, the logic of each task being to calculate the Euclidean distance between the reference sequence s and each short time series element of length 12 in its partition of Z; for two k-dimensional vectors s = (x_11, x_12, ..., x_1k) and l = (x_21, x_22, ..., x_2k), the Euclidean distance between them is
$$d(s, l) = \sqrt{\sum_{i=1}^{k} (x_{1i} - x_{2i})^2}$$
step 5.2, scheduling the computing tasks created in step 5.1 through the distributed Spark computing engine and dispatching each task to the computing node holding its data partition, so that each task processes the partition data on its own node;
step 5.3, creating a distributed data set R with the same number of elements as Z, whose elements are key-value pairs <key_R, value_R>, where key_R is the element number and value_R is the calculated Euclidean distance between the corresponding element of Z and the reference time series.
In this embodiment, the constructed distributed data set R also has 10 partitions, and the form of its data elements is shown in Table 2:
Table 2. Partial data form of the distributed data set R
[Table 2 appears as an image in the original publication and is not reproduced here.]
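The per-partition work of step 5 amounts to mapping each element of Z to a key and a distance. The sketch below is a plain-Python stand-in for one such task: the function names are assumptions, the first element is the index-159 window quoted in the embodiment, and the second element (key 9999) is a hypothetical window invented for the example.

```python
import math

def euclidean(s, l):
    """Euclidean distance between the reference s and a window l."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, l)))

def distances_for_partition(z_partition, s):
    """Turn one partition of Z into the matching partition of R."""
    return [(key, euclidean(s, window)) for key, window in z_partition]

s = [103, 103, 105, 105, 110, 159, 144, 127, 116, 116, 117, 113]
z_partition = [
    (159, (103, 103, 105, 105, 110, 159, 144, 127, 116, 116, 117, 113)),
    # Hypothetical window: differs from s by 3 in the first value
    # and by 4 in the last value.
    (9999, (106, 103, 105, 105, 110, 159, 144, 127, 116, 116, 117, 117)),
]
r_partition = distances_for_partition(z_partition, s)
```

The window at index 159 is the reference pattern itself, so its distance is 0; the hypothetical window differs by 3 and 4 in two positions, giving a distance of 5.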
Step 6, fully sorting the distributed data set created in step 5.3 by Euclidean distance and obtaining the index values of the 2 elements with the smallest values, specifically:
step 6.1, creating, through the Spark computing engine, one computing task for each partition of the data set R of step 5.3 and dispatching the tasks to the 5 nodes; each task finds the 2 elements with the smallest values in its partition and records their index values;
step 6.2, collecting the 2 × 10 locally smallest elements obtained in step 6.1 onto the master node of the distributed system through the Spark computing engine, sorting them together, finding the 2 globally smallest elements, and recording their index values;
step 6.3, fetching from the data set Z of step 4.4 the element values corresponding to the 2 index values calculated in step 6.2, which are the 2 sub time series most similar to the reference time series s. In this embodiment, the two most similar sub time series retrieved from the historical shaft temperature data for the reference time series s have starting indexes 401 and 95, and the retrieval result is shown in Table 3:
Table 3. Similar pattern retrieval results
Index    Euclidean distance    Shaft temperature time series values
70       36.318                {105,105,112,117,122,121,178,132,116,113,115,111}
401      38.716                {106,107,107,108,108,119,128,136,112,110,110,111}
The effect of the retrieval method is shown in Fig. 2, where the data enclosed in the rectangular frame is the stator shaft temperature reference time series pattern, and the data enclosed in the oval frames are the 2 most similar sub time series patterns retrieved from the historical stator shaft temperature time series.
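The two-stage selection of step 6 (k smallest per partition, then k smallest overall) can be sketched with the standard library. The function names and the distance values are hypothetical; in PySpark, `RDD.takeOrdered` offers a comparable operation, but the patent does not name the call it uses.

```python
import heapq

def local_top_k(r_partition, k):
    """The k (key, distance) pairs with the smallest distance in one partition."""
    return heapq.nsmallest(k, r_partition, key=lambda kv: kv[1])

def global_top_k(r_partitions, k):
    """Merge the per-partition candidates and keep the k smallest overall."""
    candidates = [kv for part in r_partitions for kv in local_top_k(part, k)]
    return heapq.nsmallest(k, candidates, key=lambda kv: kv[1])

# Hypothetical distances spread over three partitions.
r = [
    [(0, 9.0), (1, 4.0), (2, 7.5)],
    [(3, 1.5), (4, 8.0), (5, 2.0)],
    [(6, 6.0), (7, 3.0)],
]
best = global_top_k(r, k=2)
```

Any element among the k globally smallest is also among the k smallest of its own partition, so at most n × k candidates reach the collecting node instead of the whole of R.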
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit of the invention, which is defined by the claims.

Claims (4)

1. A distributed time series mode retrieval method for massive high-speed rail shaft temperature data, characterized by comprising the following steps:
step 1, setting a retrieved shaft temperature data reference time sequence mode s, determining the length m of a retrieval mode, and setting the number of most similar modes needing to be retrieved as k;
the shaft temperature data is bearing temperature data of a high-speed rail, and the sampling period is 1s;
the reference time sequence mode is data of a continuous time period selected by self-definition in the shaft temperature data, and the length of the reference time sequence mode is m;
step 2, reading mass historical shaft temperature time sequence data to be retrieved into a memory of a computing node of the distributed system, constructing an initial distributed data set X, and determining the number n of parallel computing tasks;
step 3, constructing indexes for the distributed data set X in the step 2, numbering each element in the data set X from 0 according to a time sequence, wherein the numbering task of each partition is calculated in parallel at different nodes; converting each element in the data set X into a key value pair recording form, wherein the key is an index number, and the value is an axial temperature time series numerical value at a corresponding moment;
step 4, constructing m-1 auxiliary distributed data sets Y_j according to the length of the reference time sequence mode, wherein j = 1, 2, ..., m-1, and connecting the distributed data set X in the step 3 with the auxiliary distributed data sets to construct a distributed data set Z with each element being a short time sequence with the length of m;
step 4.1, according to the length m of the set reference time sequence mode, constructing m-1 auxiliary distributed data sets Y_j which are the same as the data set X in the step 3, wherein j = 1, 2, ..., m-1;
step 4.2, constructing an index for the m-1 auxiliary distributed data sets Y_j, wherein the elements of Y_j are numbered starting from -j;
step 4.3, sequentially carrying out a Cartesian product operation between the distributed data set X and the m-1 auxiliary distributed data sets Y_j, and connecting elements with the same index key value in the two distributed data sets subjected to the Cartesian product;
step 4.4, after step 4.3, constructing a distributed data set Z from the data set X in step 3, where each element of Z becomes <key, (value_1, value_2, ..., value_m)>;
wherein key is the index of the element, numbered from 0, and the value is converted into a short time sequence with the length of m;
step 5, creating calculation tasks through a Spark calculation engine, wherein the logic of each calculation task is to calculate the Euclidean distance between a reference time sequence mode and elements of each data partition, distribute each calculation task to a plurality of nodes in a distributed system for parallel calculation, and construct a distributed data set R;
step 6, carrying out full sequencing on the distributed data set R in the step 5 according to Euclidean distance values, taking the first k elements with the minimum distance, and returning the index of each element;
step 7, acquiring corresponding elements in the data set Z in the step 4 according to the indexes obtained in the step 6, and obtaining k sub-time sequences which are most similar to the reference time sequence in the mass shaft temperature historical time sequence data set.
2. The distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to claim 1, characterized in that: the specific method of the step 2 comprises the following steps:
step 2.1, uploading the historical shaft temperature data stored on a hard disk storage medium to the distributed file system HDFS over a network;
step 2.2, reading the massive shaft temperature time series data stored in HDFS using the distributed Spark calculation engine;
step 2.3, setting the number of partitions of the distributed data set to be created to n, wherein the Spark calculation engine divides the shaft temperature data stored on HDFS into n data blocks according to the set number of partitions, creates one partition for each data block, and constructs a resilient distributed dataset (RDD) object X with n partitions, each element of X being the shaft temperature value at a certain moment, and X maintaining the order of the partitions and the offset of the first element of each partition within the whole data set;
step 2.4, determining the number of parallel calculation tasks to be n according to the number of partitions, wherein the Spark calculation engine creates one calculation task for each data partition of the RDD object X and distributes the tasks to different calculation nodes, the tasks being independent of one another and each task processing the data of one partition in parallel.
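The bookkeeping described in step 2.3, namely ordered partitions plus the offset of each partition's first element, can be sketched in plain Python as follows. This is an illustration under simplifying assumptions: the data is split into n contiguous blocks by element count, whereas a real Spark job reading from HDFS decides its splits itself. The helper names `partition_with_offsets` and `global_index` are hypothetical.

```python
def partition_with_offsets(values, n):
    """Split values into n contiguous blocks, kept in order.
    Returns a list of (offset, block), where offset is the position of the
    block's first element in the whole data set."""
    size, extra = divmod(len(values), n)
    partitions, start = [], 0
    for i in range(n):
        # The first `extra` partitions receive one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append((start, values[start:end]))
        start = end
    return partitions

def global_index(partitions, part_no, local_pos):
    """Map a position inside one partition back to a position in the data set."""
    offset, _ = partitions[part_no]
    return offset + local_pos

# Illustrative values only: ten readings split into n = 3 partitions.
readings = [float(t) for t in range(10)]
parts = partition_with_offsets(readings, 3)
```

With the offsets recorded, an element found inside any partition can be traced back to its position in the original series, which is what later steps rely on when they return element indexes.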
3. The distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to claim 2, characterized in that: the specific method of the step 5 comprises the following steps:
step 5.1, creating one calculation task for each data partition of the data set Z of step 4.4 through the Spark calculation engine, wherein the logic of each calculation task is to calculate the Euclidean distance between the reference data sequence s and the short time series l represented by each element of the data partition;
step 5.2, scheduling the calculation tasks created in step 5.1 through the distributed Spark calculation engine and distributing them to the calculation nodes where the data partitions are located, each calculation task processing the partition data of the node on which it runs;
step 5.3, creating a distributed data set R, wherein R contains the same number of elements as the data set Z, the elements are in the key-value pair form <key_R, value_R>, key_R is the element number, and value_R is the calculated Euclidean distance between the corresponding element of Z and the reference time series.
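The per-partition tasks of steps 5.1 to 5.3 can be sketched as below. This is a plain-Python illustration, not Spark code: a thread pool stands in for the cluster scheduler, and the partitions of Z are ordinary lists. The names `distance_task` and `build_r` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def distance_task(partition, s):
    """One calculation task: Euclidean distance between the reference
    sequence s and the short time series of every element in one partition."""
    return [
        (key, math.sqrt(sum((p - q) ** 2 for p, q in zip(s, value))))
        for key, value in partition
    ]

def build_r(z_partitions, s):
    """Run one task per partition and gather <key_R, value_R> pairs into R."""
    with ThreadPoolExecutor(max_workers=len(z_partitions)) as pool:
        # map() returns results in partition order, so R keeps the order of Z.
        results = pool.map(lambda part: distance_task(part, s), z_partitions)
        return [pair for part in results for pair in part]

# Illustrative values only: Z held in two partitions.
z_partitions = [
    [(0, [1.0, 2.0]), (1, [2.0, 3.0])],
    [(2, [3.0, 7.0])],
]
s = [2.0, 3.0]
r = build_r(z_partitions, s)
```

Each task touches only its own partition, so the tasks stay independent and R ends up with exactly one entry per element of Z.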
4. The distributed time series pattern retrieval method for massive high-speed rail shaft temperature data according to claim 3, characterized in that: the specific method of the step 6 comprises the following steps:
step 6.1, sorting the data of each partition of the distributed data set R in parallel through the Spark calculation engine according to the value of each element, and taking the k smallest elements of each partition;
step 6.2, collecting the n × k elements obtained from the n partitions of R onto one node of the distributed system, sorting them together, taking the k smallest elements, and recording the index values of the resulting elements.
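The two-stage selection of steps 6.1 and 6.2 can be sketched in plain Python as follows. The partitions of R are modelled as ordinary lists of (key, distance) pairs, and the names `local_top_k` and `global_top_k` are hypothetical. The sketch assumes nothing about how Spark itself would execute the sort.

```python
import heapq

def local_top_k(partition, k):
    """Step 6.1: the k elements with the smallest distance in one partition."""
    return heapq.nsmallest(k, partition, key=lambda kv: kv[1])

def global_top_k(r_partitions, k):
    """Step 6.2: gather at most n * k candidates on one node, then keep the
    k smallest overall and return their indexes."""
    candidates = [pair for part in r_partitions for pair in local_top_k(part, k)]
    best = heapq.nsmallest(k, candidates, key=lambda kv: kv[1])
    return [key for key, _ in best]

# Illustrative values only: R held in n = 2 partitions.
r_partitions = [
    [(0, 5.0), (1, 0.5), (2, 3.0)],
    [(3, 0.2), (4, 4.0), (5, 0.1)],
]
top = global_top_k(r_partitions, k=2)
```

Because every globally smallest element is also among the k smallest of its own partition, only n × k candidates need to be moved to the collecting node rather than the whole of R.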
CN201811510849.8A 2018-12-11 2018-12-11 Distributed time series mode retrieval method for mass high-speed rail shaft temperature data Active CN109656887B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811510849.8A CN109656887B (en) 2018-12-11 2018-12-11 Distributed time series mode retrieval method for mass high-speed rail shaft temperature data
PCT/CN2019/077741 WO2020118928A1 (en) 2018-12-11 2019-03-12 Distributed time sequence pattern retrieval method for massive equipment operation data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811510849.8A CN109656887B (en) 2018-12-11 2018-12-11 Distributed time series mode retrieval method for mass high-speed rail shaft temperature data

Publications (2)

Publication Number Publication Date
CN109656887A CN109656887A (en) 2019-04-19
CN109656887B CN109656887B (en) 2023-03-21

Family

ID=66113798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811510849.8A Active CN109656887B (en) 2018-12-11 2018-12-11 Distributed time series mode retrieval method for mass high-speed rail shaft temperature data

Country Status (2)

Country Link
CN (1) CN109656887B (en)
WO (1) WO2020118928A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579661B (en) * 2019-09-29 2023-04-14 杭州海康威视数字技术股份有限公司 Method and device for determining specific target pair, computer equipment and storage medium
CN112732165A (en) * 2019-10-28 2021-04-30 北京沃东天骏信息技术有限公司 Offset management method, device and storage medium
CN113688877B (en) * 2021-07-30 2024-04-16 联合汽车电子有限公司 Test data processing method and device, storage medium, instrument and vehicle

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103032A (en) * 2017-03-21 2017-08-29 中国科学院计算机网络信息中心 A mass data paging query method that avoids global sorting in a distributed environment
CN108549696A (en) * 2018-04-16 2018-09-18 安徽工业大学 A time series data similarity query method based on in-memory computing

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114331A1 (en) * 2003-11-26 2005-05-26 International Business Machines Corporation Near-neighbor search in pattern distance spaces
US10503732B2 (en) * 2013-10-31 2019-12-10 Micro Focus Llc Storing time series data for a search query
CN104182460B (en) * 2014-07-18 2017-06-13 浙江大学 Time Series Similarity querying method based on inverted index
US20160328432A1 (en) * 2015-05-06 2016-11-10 Squigglee LLC System and method for management of time series data sets
CN107590143B (en) * 2016-07-06 2020-04-03 北京金山云网络技术有限公司 Time series retrieval method, device and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
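MapReduce parallel acceleration of multi-pattern similarity search over data streams; 付晨 et al.; 《计算机应用》; 20170110 (No. 01); full text *
Time series similarity search algorithm based on Map/Reduce; 王会青 et al.; 《山东大学学报(工学版)》; 20160122 (No. 01); full text *
Spark-based similarity join for high-dimensional data; 成小海; 《计算机技术与发展》; 20180428 (No. 08); full text *
Industrial cloud-based monitoring system and key technologies for fused magnesium furnaces; 冉振莉 et al.; 《计算机集成制造系统》; 20180109 (No. 11); full text *
Research on time series similarity query and indexing methods; 邱均平 et al.; 《山东图书馆学刊》; 20091230 (No. 06); full text *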

Also Published As

Publication number Publication date
WO2020118928A1 (en) 2020-06-18
CN109656887A (en) 2019-04-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant