CN108470068A - Summary index generation method for time-series key-value industrial process data - Google Patents
- Publication number: CN108470068A
- Application number: CN201810270729.9A
- Authority: CN (China)
- Prior art keywords: time series; data; series data; industrial process; key-value
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a summary index generation method for time-series key-value industrial process data, comprising: S1: acquiring time-series key-value industrial process data; S2: preprocessing the acquired time series data by noise smoothing to obtain time series data with timestamps; S3: representing the preprocessed time series data using the symbolic aggregate approximation (SAX) method; S4: performing pattern clustering on the SAX result and building an index from the clustering result using a prefix tree algorithm. The beneficial effect of the invention is that, on top of data preprocessing, SAX and the prefix tree algorithm are combined into a summary index generation method for time-series key-value industrial process data. The method reduces the dimensionality of the original time series data, effectively extracts the features of the original data, and generates the summary index using the prefix tree algorithm.
Description
Technical field
The present invention relates to the technical field of time series data mining, and in particular to a summary index generation method for time-series key-value industrial process data.
Background
Time series data is ubiquitous in fields such as industrial processes, weather monitoring, and medical diagnosis. As a typical kind of time series data, time-series key-value industrial process data is high-dimensional and massive, so traditional summary index generation methods cannot analyze it well. Symbolic aggregate approximation (SAX) is a mature symbolic representation method that is widely used for preprocessing and pattern discovery in time series data. Its advantage is that mature, efficient data mining algorithms for string operations can then be applied. A prefix tree (trie) is a keyed tree structure and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though not only strings), so search engine systems often use it for text word-frequency statistics. The drawback of the prior art is that current time series indexing methods mostly rely on a single dimensionality-reduction or symbolic representation method, which makes it hard to query time series data quickly and efficiently. A new summary index generation method for time-series key-value industrial process data is therefore urgently needed.
Summary of the invention
In view of the above defects of the prior art, the object of the invention is to provide a summary index generation method for time-series key-value industrial process data that builds an index encoded with a prefix tree algorithm. The preprocessed time series data is represented using the symbolic aggregate approximation (SAX) method; pattern clustering is then performed on the SAX result; finally, the clustering result is turned into an index using the prefix tree algorithm.
The object of the invention is achieved by the following technical solution: a summary index generation method for time-series key-value industrial process data, comprising:
S1: acquiring time-series key-value industrial process data;
S2: preprocessing the acquired time series data by noise smoothing to obtain time series data with timestamps;
S3: representing the preprocessed time series data using the symbolic aggregate approximation method;
S4: performing pattern clustering on the SAX result and building an index from the clustering result using the prefix tree algorithm.
Further, the noise-smoothing preprocessing of the acquired time series data in step S2 comprises the following steps:
S21: performing deviation detection on the original time series data to find noise, outliers, and unusual values, and examining the domain and data type of each attribute and the range of acceptable values for each attribute;
S22: examining the values in each data field and smoothing the ordered data with the bin-mean variant of the binning method, in which each value is replaced by the mean of its bin; this discretizes the continuous data, increases its granularity, and yields the preprocessed time series data.
Further, step S3 comprises the following steps:
S31: dividing the time series data obtained from the preprocessing of step S2 into segments of equal length and taking the mean of each segment to form a new time series that represents the original high-dimensional time series;
S32: for the dimension-reduced time series data, using the symbolic aggregate approximation method to obtain a discretized approximate representation of the time series data.
Further, step S4 comprises:
S41: for the symbolic representation of the time series data obtained in step S3, clustering the result of S3 using the K-means pattern clustering method to obtain a series of discretized string patterns;
S42: based on the above result, encoding with the prefix tree algorithm to form the index.
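Further, step S31 comprises:
The time series data obtained in step S2 has dimension n, and the dimension after processing is N. The mean of the i-th segment is determined by the following formula (the standard piecewise aggregate approximation mean):
$$\bar{x}_i = \frac{N}{n} \sum_{j=\frac{n}{N}(i-1)+1}^{\frac{n}{N}i} x_j, \quad i = 1, \dots, N$$
where $x_j$ is the j-th value of the preprocessed series and $\bar{x}_i$ is the mean of the i-th segment.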
By adopting the above technical solution, the invention has the following advantages. The invention uses piecewise aggregate approximation (PAA) to reduce the dimensionality of the time series data, which preserves the lower-bounding distance criterion and thus avoids false dismissals in subsequent similarity queries. The invention applies the classic symbolic representation, so distances can be computed on the dimension-reduced data, which provides a theoretical basis for subsequent applications such as similarity queries and anomaly detection. Most importantly, by applying the prefix tree algorithm, the invention minimizes meaningless string comparisons and greatly improves query efficiency.
Other advantages, objects, and features of the invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the following description and claims.
Description of the drawings
The drawings of the invention are described as follows:
Fig. 1 is a flow diagram of the summary index generation method for time-series key-value industrial process data.
Fig. 2 is an example flow diagram of the prefix tree algorithm based on piecewise aggregate approximation.
Detailed description
The invention is further described below with reference to the drawings and an embodiment.
Embodiment: as shown in Fig. 1 and Fig. 2, a summary index generation method for time-series key-value industrial process data comprises:
S1: acquiring time-series key-value industrial process data;
S2: preprocessing the acquired time series data by noise smoothing to obtain time series data with timestamps;
The noise-smoothing preprocessing of the acquired time series data in step S2 comprises the following steps:
S21: performing deviation detection on the original time series data to find noise, outliers, and unusual values, and examining the domain and data type of each attribute and the range of acceptable values for each attribute;
S22: examining the values in each data field and smoothing the ordered data with the bin-mean variant of the binning method, in which each value is replaced by the mean of its bin; this discretizes the continuous data, increases its granularity, and yields the preprocessed time series data. For example, if the data in a bin are 6, 8, 10, the smoothed value obtained by the bin-mean method is 8, so each value in the bin is replaced by 8.
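The bin-mean smoothing of step S22 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `smooth_by_bin_means` and the `bin_size` parameter are hypothetical, and the sketch assumes that bins are consecutive fixed-size windows over the ordered series, which the text does not specify.

```python
def smooth_by_bin_means(values, bin_size):
    """Replace every value with the mean of the bin it falls into.

    Assumption: bins are consecutive windows of `bin_size` values;
    the last bin may be shorter if the length is not a multiple.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    smoothed = []
    for start in range(0, len(values), bin_size):
        bin_values = values[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        # Every member of the bin is replaced by the bin mean.
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# The bin from the embodiment: 6, 8, 10 all become 8.
print(smooth_by_bin_means([6, 8, 10], bin_size=3))
```

A larger `bin_size` smooths more aggressively and yields coarser data, which matches the stated goal of increasing granularity.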
S3: representing the preprocessed time series data using the symbolic aggregate approximation (SAX) method;
Step S3 comprises the following steps:
S31: dividing the time series data obtained from the preprocessing of step S2 into segments of equal length and taking the mean of each segment to form a new time series that represents the original high-dimensional time series;
Step S31 comprises:
The time series data obtained in step S2 has dimension n, and the dimension after processing is N. The mean of the i-th segment is determined by the following formula (the standard piecewise aggregate approximation mean):
$$\bar{x}_i = \frac{N}{n} \sum_{j=\frac{n}{N}(i-1)+1}^{\frac{n}{N}i} x_j, \quad i = 1, \dots, N$$
where $x_j$ is the j-th value of the preprocessed series and $\bar{x}_i$ is the mean of the i-th segment.
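The segment-mean reduction of step S31 can be sketched as below. The function name `paa` is hypothetical, and the sketch assumes that the series length n is an exact multiple of the target length N; the text does not say how a remainder would be handled.

```python
def paa(series, n_segments):
    """Reduce a series of length n to n_segments equal-length segment means.

    Assumption: len(series) is divisible by n_segments.
    """
    n = len(series)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("series length must be a positive multiple of n_segments")
    seg_len = n // n_segments
    return [
        sum(series[i * seg_len:(i + 1) * seg_len]) / seg_len
        for i in range(n_segments)
    ]


# Six points reduced to three segment means.
print(paa([1, 2, 3, 4, 5, 6], 3))
```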
S32: for the dimension-reduced time series data, using the symbolic aggregate approximation method to obtain a discretized approximate representation of the time series data.
First, the alphabet size is determined, that is, the number of symbol types is defined as α = 5. The sequence obtained in step 2, which follows a Gaussian distribution, is divided by breakpoints into 5 equiprobable intervals, each of which corresponds to one symbol; the relationship between the breakpoints and the alphabet size is shown in Table 1. Symbols are assigned from low to high. The mean of each segment is then compared with the breakpoints: if the mean falls within an interval, the segment is represented by the symbol of that interval. That is, values below -0.84 are represented by A, values in the interval from -0.84 to -0.25 by B, values from -0.25 to 0.25 by C, values from 0.25 to 0.84 by D, and values above 0.84 by E, giving A, B, C, D, E from low to high.
Table 1. Breakpoints for alphabet sizes from 5 to 10
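The symbol assignment of step S32 can be sketched as follows. The breakpoints are computed as quantiles of the standard normal distribution, which reproduces the values -0.84, -0.25, 0.25, 0.84 quoted above for α = 5. The function names are hypothetical, and the sketch assumes the segment means have already been standardized to zero mean and unit variance.

```python
import string
from bisect import bisect_right
from statistics import NormalDist


def gaussian_breakpoints(alphabet_size):
    """Breakpoints that split the standard normal into equiprobable intervals."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def to_symbols(segment_means, alphabet_size=5):
    """Map each segment mean to a letter, assigned from low (A) to high.

    Assumption: segment_means are already standardized.
    """
    breakpoints = gaussian_breakpoints(alphabet_size)
    letters = string.ascii_uppercase[:alphabet_size]
    # bisect_right counts how many breakpoints lie at or below the value,
    # which is the index of the interval the value falls into.
    return "".join(letters[bisect_right(breakpoints, m)] for m in segment_means)


print([round(b, 2) for b in gaussian_breakpoints(5)])
print(to_symbols([-1.0, -0.5, 0.0, 0.5, 1.0]))
```

Calling `gaussian_breakpoints` with sizes 5 through 10 gives the breakpoint sets that Table 1 refers to.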
S4: performing pattern clustering on the SAX result and building an index from the clustering result using the prefix tree algorithm.
Step S4 comprises:
S41: for the symbolic sequences obtained in step S3, clustering the result of S3 using the K-means pattern clustering method to obtain a series of discretized string patterns. First, k objects are arbitrarily selected from the n data objects as initial cluster centers. The Euclidean distance between each object and each cluster center (the mean of the objects in that cluster) is then computed,
$$d(x, c) = \sqrt{\sum_{i=1}^{N} (x_i - c_i)^2},$$
and each object is reassigned to the cluster whose center is nearest. The mean (center) of each changed cluster is then recalculated, and the process repeats until no cluster changes.
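The clustering loop of step S41 can be sketched as below. Two points are assumptions rather than part of the text: symbols are mapped to integer ranks (A to 0, B to 1, and so on) so that a Euclidean distance is defined on symbol sequences, and the initial centers are passed in explicitly instead of being chosen arbitrarily, so that the run is reproducible. All names are hypothetical.

```python
import math


def encode(word):
    """Map 'A'->0, 'B'->1, ... (an assumed numeric encoding of symbols)."""
    return [ord(ch) - ord("A") for ch in word]


def euclidean(x, c):
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def k_means(words, initial_centers, max_iter=100):
    """Cluster equal-length symbol words; returns {cluster_label: [words]}."""
    points = [encode(w) for w in words]
    centers = [encode(c) for c in initial_centers]
    assignment = []
    for _ in range(max_iter):
        # Assign every object to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no cluster changed, so the loop has converged
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    clusters = {}
    for word, label in zip(words, assignment):
        clusters.setdefault(label, []).append(word)
    return clusters


print(k_means(["AAB", "ABB", "EED", "EDD"], initial_centers=["AAB", "EED"]))
```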
S42: based on the above clustering result, encoding the symbol sequences of each cluster separately using the prefix tree algorithm to form the index.
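The per-cluster prefix tree index of step S42 can be sketched as follows. The class and function names, the use of a series identifier stored at the end of each word, and the prefix lookup are illustrative assumptions; the text only states that the symbol sequences of each cluster are encoded with a prefix tree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # symbol -> TrieNode
        self.series_ids = []  # ids of series whose word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, series_id):
        node = self.root
        for symbol in word:
            node = node.children.setdefault(symbol, TrieNode())
        node.series_ids.append(series_id)

    def search_prefix(self, prefix):
        """Return the sorted ids of all series whose word starts with prefix."""
        node = self.root
        for symbol in prefix:
            node = node.children.get(symbol)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:  # walk the whole subtree below the prefix
            current = stack.pop()
            found.extend(current.series_ids)
            stack.extend(current.children.values())
        return sorted(found)


def build_index(clusters):
    """Build one trie per cluster from {label: [(series_id, word), ...]}."""
    index = {}
    for label, entries in clusters.items():
        trie = Trie()
        for series_id, word in entries:
            trie.insert(word, series_id)
        index[label] = trie
    return index


index = build_index({0: [(1, "AAB"), (2, "ABB")], 1: [(3, "EED")]})
print(index[0].search_prefix("A"))
```

Words that share a prefix share nodes, so a lookup compares each symbol of the query once instead of comparing the query against every stored string.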
It should be understood that the parts not elaborated in this specification belong to the prior art.
Finally, the above embodiment is intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or equivalently replaced without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the invention.
Claims (5)
1. A summary index generation method for time-series key-value industrial process data, characterized in that the method comprises the following steps:
S1: acquiring time-series key-value industrial process data;
S2: preprocessing the acquired time series data by noise smoothing to obtain time series data with timestamps;
S3: representing the preprocessed time series data using the symbolic aggregate approximation method;
S4: performing pattern clustering on the symbolic aggregate approximation result and building an index from the clustering result using the prefix tree algorithm.
2. The summary index generation method for time-series key-value industrial process data according to claim 1, characterized in that the noise-smoothing preprocessing of the acquired time series data in step S2 comprises the following steps:
S21: performing deviation detection on the original time series data to find noise, outliers, and unusual values, and examining the domain and data type of each attribute and the range of acceptable values for each attribute;
S22: examining the values in each data field and smoothing the ordered data with the bin-mean variant of the binning method, in which each value is replaced by the mean of its bin, thereby discretizing the continuous data, increasing its granularity, and obtaining the preprocessed time series data.
3. The summary index generation method for time-series key-value industrial process data according to claim 1, characterized in that step S3 comprises the following steps:
S31: dividing the time series data obtained from the preprocessing of step S2 into segments of equal length and taking the mean of each segment to form a new time series that represents the original high-dimensional time series;
S32: for the dimension-reduced time series data, using the symbolic aggregate approximation method to obtain a discretized approximate representation of the time series data.
4. The summary index generation method for time-series key-value industrial process data according to claim 1, characterized in that step S4 comprises:
S41: for the symbolic representation of the time series data obtained in step S3, clustering the result of S3 using the K-means pattern clustering method to obtain a series of discretized string patterns;
S42: based on the above result, encoding with the prefix tree algorithm to form the index.
5. The summary index generation method for time-series key-value industrial process data according to claim 3, characterized in that step S31 comprises:
The time series data obtained in step S2 has dimension n, and the dimension after processing is N; the mean of the i-th segment is determined by the following formula (the standard piecewise aggregate approximation mean):
$$\bar{x}_i = \frac{N}{n} \sum_{j=\frac{n}{N}(i-1)+1}^{\frac{n}{N}i} x_j, \quad i = 1, \dots, N$$
where $x_j$ is the j-th value of the preprocessed series and $\bar{x}_i$ is the mean of the i-th segment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810270729.9A CN108470068A (en) | 2018-03-29 | 2018-03-29 | Summary index generation method for time-series key-value industrial process data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810270729.9A CN108470068A (en) | 2018-03-29 | 2018-03-29 | Summary index generation method for time-series key-value industrial process data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108470068A (en) | 2018-08-31 |
Family
ID=63262296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810270729.9A Pending CN108470068A (en) | Summary index generation method for time-series key-value industrial process data | 2018-03-29 | 2018-03-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108470068A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180831 |