CN108470068A

CN108470068A - A summary index generation method for time-series key-value industrial process data

Info

Publication number: CN108470068A
Application number: CN201810270729.9A
Authority: CN
Inventors: 张可; 韩载道; 李媛
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2018-08-31

Abstract

The invention discloses a summary index generation method for time-series key-value industrial process data, comprising: S1: acquire time-series key-value industrial process data; S2: pre-process the acquired time-series data by smoothing noise to obtain time-stamped time-series data; S3: represent the pre-processed time-series data using the symbolic aggregate approximation (SAX) method; S4: perform pattern clustering on the SAX representation and form an index from the clustering result using a prefix-tree algorithm. The beneficial effects of the invention are as follows: on the basis of data pre-processing, the SAX method and the prefix-tree algorithm are combined into a summary index generation method for time-series key-value industrial process data; the method reduces the dimensionality of the original time-series data, effectively extracts the features of the original data, and realizes summary index generation using the prefix-tree algorithm.

Description

A summary index generation method for time-series key-value industrial process data

Technical field

The present invention relates to the technical field of time-series data mining, and in particular to a summary index generation method for time-series key-value industrial process data.

Background art

Time-series data is widely present in fields such as industrial processes, weather monitoring, and medical diagnosis. As a typical kind of time-series data, time-series key-value industrial process data is high-dimensional and massive, so traditional summary index generation methods cannot analyze it well. Symbolic aggregate approximation (SAX) is a mature symbolic representation method that is widely used in the pre-processing and pattern discovery of time-series data. Its advantage is that mature, efficient data mining algorithms designed for string operations can be applied. A prefix tree (trie) is a key-tree structure and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though not only strings), so search engine systems often use it for text word-frequency statistics. The shortcoming of the prior art is that current time-series indexing methods are mostly based on a single dimensionality-reduction representation or a single symbolic representation method, which makes it difficult to query time-series data quickly and efficiently. A new summary index generation method for time-series key-value industrial process data is therefore urgently needed.

Summary of the invention

In view of the above defects of the prior art, the object of the invention is to provide a summary index generation method for time-series key-value industrial process data that builds an index by encoding the data with a prefix-tree algorithm. The pre-processed time-series data is represented using the symbolic aggregate approximation (SAX) method; pattern clustering is then performed on the SAX representation; finally, the clustering result is formed into an index using the prefix-tree algorithm.

The object of the invention is achieved by the following technical solution: a summary index generation method for time-series key-value industrial process data, comprising:

S1: Acquire time-series key-value industrial process data;

S2: Pre-process the acquired time-series data by smoothing noise to obtain time-stamped time-series data;

S3: Represent the pre-processed time-series data using the symbolic aggregate approximation (SAX) method;

S4: Perform pattern clustering on the SAX representation, and form an index from the clustering result using the prefix-tree algorithm.

Further, the specific steps of the noise-smoothing pre-processing of the acquired time-series data in step S2 are as follows:

S21: Perform deviation detection on the original time-series data to find noise, outliers, and unusual values, examining the domain and data type of each attribute and the range of acceptable values for each attribute;

S22: By examining the values in the data domain, smooth the ordered data by replacing values with the smoothed value obtained by the bin-mean method of binning; discretize the continuous data to obtain the pre-processed time-series data, increasing the granularity.

Further, step S3 is as follows:

S31: Divide the time-series data obtained from the pre-processing in step S2 into segments of equal length, and take the mean of each segment to form a new time series that represents the time series of the original dimension;

S32: For the dimension-reduced time-series data, apply the symbolic aggregate approximation (SAX) method to obtain a discretized approximate representation of the time-series data.
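The equal-length segmentation and averaging of step S31 can be sketched as follows. This is a minimal illustration only, assuming the series length is an exact multiple of the number of segments; the function name `paa` and its parameters are hypothetical and do not come from the patent.

```python
def paa(series, n_segments):
    """Reduce a series to n_segments values by averaging equal-length segments.

    Assumption for this sketch: len(series) is divisible by n_segments.
    """
    n = len(series)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("series length must be a positive multiple of n_segments")
    width = n // n_segments
    # Mean of each consecutive block of `width` samples.
    return [
        sum(series[i * width:(i + 1) * width]) / width
        for i in range(n_segments)
    ]


# A 6-point series reduced to 3 segment means.
print(paa([1, 2, 3, 4, 5, 6], 3))  # [1.5, 3.5, 5.5]
```

Series whose length is not a multiple of the segment count need a weighting or padding rule, which the text does not specify.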

Further, step S4 includes:

S41: For the symbolic representation of the time-series data obtained in step S3, cluster the result of S3 using the K-means pattern clustering method to obtain discretized string pattern results;

S42: Based on the above result, encode using the prefix-tree algorithm to form the index.

Further, step S31 includes:

The time-series data obtained in step S2 has dimension n, and the dimension after processing is N. The mean of the i-th segment can be determined by the following formula:
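The formula itself does not survive in this text. Assuming step S31 is the standard piecewise aggregate approximation, with the n-point series x_1, ..., x_n divided into N equal segments and n divisible by N, the mean of the i-th segment would take the usual form below. The symbols x_j and \bar{x}_i are notation introduced for this illustration, not taken from the patent.

```latex
\bar{x}_i = \frac{N}{n} \sum_{j=\frac{n}{N}(i-1)+1}^{\frac{n}{N}\,i} x_j ,
\qquad i = 1, \dots, N
```

For example, with n = 6 and N = 3, each segment holds n/N = 2 samples, and the first segment mean is (x_1 + x_2)/2.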

By adopting the above technical solution, the invention has the following advantages. The invention uses the piecewise aggregate approximation (PAA) method for dimensionality reduction of time-series data, which guarantees the distance lower-bounding criterion and thereby avoids false dismissals in subsequent similarity queries. The invention applies a classical symbolic representation, so that distances can be computed on the dimension-reduced data, providing a theoretical basis for subsequent applications such as similarity query and anomaly detection. Most importantly, by applying the prefix-tree algorithm, the invention minimizes meaningless string comparisons and greatly improves query efficiency.

Other advantages, objects, and features of the invention will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the following description and claims.

Brief description of the drawings

The drawings of the invention are described as follows:

Fig. 1 is a flow diagram of the summary index generation method for time-series key-value industrial process data.

Fig. 2 is an example flow diagram of the prefix-tree algorithm based on piecewise aggregate approximation.

Detailed description

The invention is further described below with reference to the drawings and embodiments.

Embodiment: As shown in Fig. 1 and Fig. 2, a summary index generation method for time-series key-value industrial process data comprises:

S1: Acquire time-series key-value industrial process data;

The specific steps of the noise-smoothing pre-processing of the acquired time-series data in step S2 are as follows:

S22: By examining the values in the data domain, smooth the ordered data by replacing values with the smoothed value obtained by the bin-mean method of binning; discretize the continuous data to obtain the pre-processed time-series data, increasing the granularity. For example, if the data in a bin are 6, 8, 10, the smoothed value given by the bin-mean method is 8, so every value in that bin is replaced by 8.
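The bin-mean smoothing in S22 can be sketched as follows. This is a minimal illustration, assuming fixed-size bins taken over the data in the order supplied; the function name `smooth_by_bin_means` and the `bin_size` parameter are hypothetical, since the patent does not state how bins are formed.

```python
def smooth_by_bin_means(values, bin_size):
    """Replace every value with the mean of the bin it falls in.

    Assumption for this sketch: bins are consecutive runs of `bin_size`
    values in the order given; the last bin may be shorter.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    smoothed = []
    for start in range(0, len(values), bin_size):
        bin_values = values[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        # Every member of the bin takes the bin mean.
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# The bin from the example above: 6, 8, 10 all become 8.
print(smooth_by_bin_means([6, 8, 10], bin_size=3))  # [8.0, 8.0, 8.0]
```

Larger bins smooth more aggressively and leave fewer distinct values, which is one way to read the "increased granularity" mentioned in S22.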

Step S3 is as follows:

Step S31 includes:

First determine the size of the alphabet, i.e. define the number of symbol kinds as α = 5: the sequence obtained in step 2, which follows a Gaussian distribution, is divided into 5 equiprobable intervals according to the breakpoints, and each interval corresponds to one symbol. The relationship between the breakpoints and the alphabet size is shown in Table 1. Symbols are assigned from low to high. The mean of each segment is then compared with the breakpoints: if the segment mean falls within an interval, the segment is represented by the symbol corresponding to that interval. That is, values below -0.84 correspond to symbol A, values from -0.84 to -0.25 to symbol B, values from -0.25 to 0.25 to symbol C, values from 0.25 to 0.84 to symbol D, and values above 0.84 to symbol E, so that the symbols from low to high are A, B, C, D, E. See Table 1:

Table 1: Breakpoints corresponding to alphabet sizes from 5 to 10
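The breakpoints quoted above for α = 5 are the points that split the standard normal distribution into equiprobable intervals, so they can be recomputed rather than looked up. The sketch below illustrates this and the mapping of segment means to symbols. It assumes the segment means come from a series normalized to zero mean and unit variance, and it assigns a value that equals a breakpoint to the upper interval, a boundary choice the text does not specify. The function names are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def sax_breakpoints(alphabet_size):
    """Breakpoints that split the standard normal into equiprobable intervals."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def to_symbols(segment_means, alphabet_size=5):
    """Map each segment mean to a letter, assigned from low to high (A, B, C, ...)."""
    cuts = sax_breakpoints(alphabet_size)
    letters = "ABCDEFGHIJ"[:alphabet_size]  # supports alphabet sizes up to 10
    # bisect_right counts how many breakpoints lie at or below the mean,
    # which is the index of the interval the mean falls in.
    return "".join(letters[bisect_right(cuts, m)] for m in segment_means)


print([round(b, 2) for b in sax_breakpoints(5)])  # [-0.84, -0.25, 0.25, 0.84]
print(to_symbols([-1.2, -0.5, 0.0, 0.5, 1.2]))    # ABCDE
```

Calling `sax_breakpoints` with sizes 5 through 10 yields the kind of values a table such as Table 1 would list.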

Step S4 includes:

S41: For the symbolic sequence of the time-series data obtained in step S3, cluster the result of S3 using the K-means pattern clustering method to obtain discretized string pattern results. First, arbitrarily select k objects from the n data objects as the initial cluster centers. Then, using the mean (i.e. the center object) of each cluster, compute the Euclidean distance between each object and these center objects, and reassign each object to the cluster with the nearest center. Next, recalculate the mean (center object) of every cluster that has changed. Repeat this cycle until no cluster changes any further.

S42: Based on the above clustering result, encode the symbol sequences of each class separately using the prefix-tree algorithm to form the index.
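The K-means loop described in S41 can be sketched as follows. The patent does not say how symbol strings are turned into points for the Euclidean distance; this sketch assumes an ordinal encoding (A = 0, B = 1, ...) of equal-length strings, which is an illustrative choice only. All function names are hypothetical.

```python
import random


def encode(word):
    """Ordinal encoding assumed for this sketch: 'A' -> 0, 'B' -> 1, ..."""
    return [ord(ch) - ord("A") for ch in word]


def squared_distance(p, q):
    # Squared Euclidean distance; it has the same nearest center as the true distance.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    """Plain K-means: arbitrary initial centers, reassign, recompute, repeat."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


words = ["AAAA", "AAAB", "EEEE", "EEED"]
labels, _ = k_means([encode(w) for w in words], k=2)
# The two A-dominated words share one label and the two E-dominated words the other.
print(dict(zip(words, labels)))
```

The cluster numbering depends on the random initial centers, so only the grouping, not the label values, is meaningful.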
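A prefix-tree index over symbol strings, as in S42, can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, the `series_id` payload, and the prefix query are hypothetical, and the patent does not describe the node layout.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # symbol -> TrieNode
        self.series_ids = []  # ids of series whose word ends at this node


class TrieIndex:
    """Prefix tree mapping symbol strings to the series they summarize."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, series_id):
        node = self.root
        for symbol in word:
            node = node.children.setdefault(symbol, TrieNode())
        node.series_ids.append(series_id)

    def search_prefix(self, prefix):
        """Return the ids of all series whose word starts with `prefix`."""
        node = self.root
        for symbol in prefix:
            node = node.children.get(symbol)
            if node is None:
                return []  # no word shares this prefix
        found, stack = [], [node]
        while stack:  # collect every id in the subtree below the prefix
            current = stack.pop()
            found.extend(current.series_ids)
            stack.extend(current.children.values())
        return sorted(found)


index = TrieIndex()
index.insert("ABCD", "s1")
index.insert("ABCE", "s2")
index.insert("EEDC", "s3")
print(index.search_prefix("ABC"))  # ['s1', 's2']
print(index.search_prefix("D"))    # []
```

To encode each class separately, one such index could be kept per cluster label, for example in a dictionary keyed by label. A lookup then walks only the branch that shares the query prefix, so words that diverge early are never compared in full.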

It should be understood that the parts not elaborated in this specification belong to the prior art.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the invention. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements may be made to the technical solution of the invention without departing from its spirit and scope, and all such changes shall be covered by the claims of the invention.

Claims

1. A summary index generation method for time-series key-value industrial process data, characterized in that the method comprises the following steps:

S1: Acquire time-series key-value industrial process data;

S2: Pre-process the acquired time-series data by smoothing noise to obtain time-stamped time-series data;

S3: Represent the pre-processed time-series data using the symbolic aggregate approximation (SAX) method;

S4: Perform pattern clustering on the SAX representation, and form an index from the clustering result using the prefix-tree algorithm.

2. The summary index generation method for time-series key-value industrial process data according to claim 1, characterized in that the specific steps of the noise-smoothing pre-processing of the acquired time-series data in step S2 are as follows:

S21: Perform deviation detection on the original time-series data to find noise, outliers, and unusual values, examining the domain and data type of each attribute and the range of acceptable values for each attribute;

S22: By examining the values in the data domain, smooth the ordered data by replacing values with the smoothed value obtained by the bin-mean method of binning; discretize the continuous data to obtain the pre-processed time-series data, increasing the granularity.

3. The summary index generation method for time-series key-value industrial process data according to claim 1, characterized in that step S3 is as follows:

S31: Divide the time-series data obtained from the pre-processing in step S2 into segments of equal length, and take the mean of each segment to form a new time series that represents the time series of the original dimension;

S32: For the dimension-reduced time-series data, apply the symbolic aggregate approximation (SAX) method to obtain a discretized approximate representation of the time-series data.

4. The summary index generation method for time-series key-value industrial process data according to claim 1, characterized in that step S4 includes:

S41: For the symbolic representation of the time-series data obtained in step S3, cluster the result of S3 using the K-means pattern clustering method to obtain discretized string pattern results;

S42: Based on the above result, encode using the prefix-tree algorithm to form the index.

5. The summary index generation method for time-series key-value industrial process data according to claim 3, characterized in that step S31 includes:

The time-series data obtained in step S2 has dimension n, and the dimension after processing is N; the mean of the i-th segment can be determined by the following formula: