CN115514376A - High-frequency time sequence data compression method and device based on improved symbol aggregation approximation - Google Patents
- Publication number
- CN115514376A (application number CN202211043071.0A)
- Authority
- CN
- China
- Prior art keywords
- time sequence
- segments
- clustering
- segmentation
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Abstract
The invention discloses a high-frequency time series data compression method and device based on improved symbol aggregation approximation, belonging to the technical field of time series compression. The method comprises the following steps: dividing the time series by using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments; clustering the plurality of time series segments by using a Gaussian mixture model initialized based on improved peak density to obtain a plurality of cluster centers, sub-model variances and cluster labels; dividing each cluster center again into equal-length segments according to the proportion of the sub-model variances; converting the mean value of each segment of each cluster center into a symbolic representation by using the SAX method, the first character of each class being a capital letter; and cutting out repeated identical symbolic representations of the same class, retaining the leading capital letter, to finally obtain the compressed time series data. The method uses Gaussian clustering of the time series segments to provide feature extraction and dimension reduction for the SAX method.
Description
Technical Field
The invention relates to the technical field of time sequence compression, in particular to a high-frequency time sequence data compression method based on improved symbol aggregation approximation.
Background
Time series compression is an important topic in time series research. With the rapid development of science and technology, intelligent systems have permeated production, manufacturing, monitoring and other work, and companies, platforms and systems generate data at every moment. Such data come from a large number of acquisition devices, are sampled at high frequency, are complex and varied in type, and show a certain correlation between earlier and later values. There is therefore a need for an efficient compression method to enable the storage of time series data.
Research on time series compression is relatively mature and continues to be updated. It covers both lossless and lossy compression models, and most time series compression methods focus on lossy compression. Sequence representation is the main approach, and includes the discrete Fourier transform, the discrete wavelet transform, singular value decomposition, piecewise linear representation, symbolization methods and the like.
To solve the similarity search problem in large time series databases, Keogh et al. introduced a dimension reduction technique, Piecewise Aggregate Approximation (PAA). Many developments and improvements followed, including Adaptive Piecewise Constant Approximation (APCA). Particularly worth mentioning is the Symbolic Aggregate approXimation (SAX) method, which builds on PAA by partitioning the Gaussian distribution into equal-probability intervals and mapping each interval to a symbol; this discretization provides a new direction for data representation and compression. SAX belongs to the symbolization methods and is simple, fast and widely applicable, but it also has certain defects. In addition, few methods build a compression model that takes the correlation between earlier and later values of a time series as the starting point. It is therefore desirable to combine the two to compress time series data.
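As a point of reference for the background above, the following is a minimal sketch of the classic PAA and SAX steps, not of the improved method claimed here. The function names `paa` and `sax`, the equal-length slicing rule and the default alphabet are illustrative assumptions.

```python
from statistics import NormalDist, mean, pstdev


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length slice."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    return [mean(series[s * n // n_segments:(s + 1) * n // n_segments])
            for s in range(n_segments)]


def sax(series, n_segments, alphabet="abcd"):
    """Classic SAX word: z-normalise, apply PAA, then map to symbols."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    a = len(alphabet)
    # Breakpoints split the standard normal distribution into a
    # equal-probability intervals.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # The symbol index is the number of breakpoints below the segment mean.
    return "".join(alphabet[sum(v > b for b in breakpoints)]
                   for v in paa(z, n_segments))
```

For example, `sax([1, 1, 1, 1, 9, 9, 9, 9], 2, "ab")` yields the word `"ab"`: the first half of the series lies below the overall mean and the second half above it.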
Disclosure of Invention
The present invention is directed to solving, at least in part, one of the technical problems in the related art.
To this end, a first objective of the present invention is to propose a high frequency time series data compression method based on improved symbol aggregation approximation, which can achieve lower compression rate and better data recovery capability.
The second purpose of the present invention is to provide a high frequency time series data compression device based on improved symbol aggregation approximation.
The third objective of the present invention is to provide a high frequency time series data decompression method based on improved symbol aggregation approximation.
The fourth purpose of the present invention is to provide a high frequency time series data decompression device based on the improved symbol aggregation approximation.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a high-frequency time series data compression method based on improved symbol aggregation approximation, including the following steps: step S101, dividing a time series by using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments; step S102, clustering the plurality of time series segments by using a Gaussian mixture model initialized based on improved peak density to obtain a plurality of cluster centers, sub-model variances and cluster labels; step S103, dividing each cluster center again into equal-length segments according to the proportion of the sub-model variances; step S104, converting the mean value of each segment of each cluster center into a symbolic representation by using the SAX method, the first character of each class being a capital letter; and step S105, cutting out repeated identical symbolic representations of the same class, retaining the leading capital letter, to finally obtain the compressed time series data.
According to the high-frequency time series data compression method based on improved symbol aggregation approximation of the embodiment of the invention, the time series is segmented with the segmented Gaussian model, which improves on the arbitrary segmentation of the SAX method; and Gaussian clustering of the time series segments provides feature extraction and dimension reduction for the SAX method, so that a lower compression rate and better data restoration capability are obtained.
In addition, the high-frequency time series data compression method based on the improved symbol aggregation approximation according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the Gaussian segmentation model based on the improved elephant herding optimization algorithm in step S101 is obtained as follows:
adding a disturbance to the clan update operator of the original elephant herding optimization algorithm, where, in the resulting update rule, x_{i,j} represents the position of the j-th elephant in clan i; the global best position is the position of the elephant with the optimal fitness function value among all elephants, and that elephant is called the elder; one term represents the influence of the matriarch of clan i on the individual elephant, with influence parameter α; another term represents the influence of the global best elephant on the individual, with influence parameter 1-α; Levy(λ) denotes the mutation mechanism; and round denotes rounding the value in parentheses;
adjusting the clan separating operator of the original elephant herding optimization algorithm, where, in the adjusted operator, the updated quantity is the new position of the elephant separated from clan i, x_min is the minimum selectable position of an elephant, x_max is the maximum selectable position, round is the rounding operation, T is the time series length, and rand is a random number.
Further, in an embodiment of the present invention, step S101 specifically includes: initializing the parameters and the population of the improved elephant herding optimization algorithm; calculating the fitness values of all time series segments in the time series with the segmented Gaussian model and starting the iteration; if the preset maximum number of iterations has been reached, outputting the optimal position and its fitness value; otherwise, sorting the fitness values of all time series segments and retaining a number of the best ones; executing the clan update operator of the improved elephant herding optimization algorithm, updating the positions of all time series segments and of the matriarchs until the time series segments in all clans have been updated; executing the clan separating operator of the improved elephant herding optimization algorithm, updating the positions and fitness values of the worst time series segments until the worst segments in all clans have been separated; and sorting the fitness values of all time series segments to obtain a number of the worst segments that need to be separated, moving them to the positions of the retained segments, and calculating the fitness values to update the optimal position of the elder, namely the optimal segmentation point among all time series segments.
Further, in an embodiment of the present invention, step S102 specifically includes: running the initialization clustering algorithm based on the improved peak density to obtain the initial mean, the initial coefficient and the initial variance of each sub-model of the time series Gaussian mixture model;
inputting a preset maximum number of iterations, a preset threshold, the plurality of segmentation points and the plurality of time series segments;
iteratively executing the E-step of the EM algorithm to calculate the probability that time series segment S_k belongs to the m-th sub-model, and iteratively executing the M-step of the EM algorithm to calculate the updated mean, updated variance and updated coefficient of each cluster;
judging whether the difference between two successive log-likelihood function values satisfies the preset threshold and, if not, whether the variance of a sub-model of the Gaussian mixture model is 0; if either condition holds, ending the iteration and outputting the mean, variance, coefficient and probability corresponding to the optimal clustering; otherwise, judging whether the number of iterations is less than the preset maximum number of iterations; if so, continuing the iteration; otherwise, ending the iteration and outputting the mean, variance, coefficient and probability corresponding to the optimal clustering.
Further, in an embodiment of the present invention, the initialization clustering algorithm based on the improved peak density proceeds as follows: calculating the local density of every time series segment with the adjusted local density formula; calculating the relative distance of each time series segment whose local density ρ(S_j) is not the maximum, and calculating the relative distance of the time series segment that has the maximum local density ρ(S_j); normalizing the local density and the relative distance, and taking their product as the clustering criterion, sorted in descending order; selecting the time series segments whose criterion value is far from zero, taking their mean values as the initial means of the time series Gaussian mixture model, and taking their number as the number of clusters; calculating each sub-model coefficient according to the number of clusters; and initializing the variance of each sub-model to an identity matrix.
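The density-based initialization can be illustrated with a minimal sketch of the standard density-peak criterion. The "adjusted local density formula" of this embodiment is not reproduced in the text, so a Gaussian-kernel density is assumed here; the function name `density_peak_scores`, the precomputed distance matrix `dist` and the cutoff `d_c` are likewise illustrative assumptions.

```python
import math


def density_peak_scores(dist, d_c):
    """Return (gamma, order) for a symmetric pairwise distance matrix.

    gamma[i] is the product of the normalised local density and the
    normalised relative distance of item i; order lists the item indices
    by descending gamma, so likely cluster centers come first.
    """
    n = len(dist)
    # Assumed local density: Gaussian kernel over all other items.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2)
               for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The densest item has no denser neighbour, so it takes its
        # largest distance to any other item.
        delta.append(min(higher) if higher else max(dist[i]))

    def normalise(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in values]

    gamma = [r * d for r, d in zip(normalise(rho), normalise(delta))]
    order = sorted(range(n), key=lambda i: gamma[i], reverse=True)
    return gamma, order
```

Items whose `gamma` value stands well clear of zero would be taken as initial cluster centers, and their count as the number of clusters.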
Further, in an embodiment of the present invention, the solution formulas of the iterative EM algorithm use the following quantities: the E-step yields the probability that a time series segment belongs to the m-th sub-model; α_m is the coefficient of sub-model m; x_t is the time series; S_k is a time series segment; φ is the simplified log-likelihood function of each time series segment; θ_m denotes the model parameters; the M-step yields the updated mean, the updated variance and the updated coefficient of each cluster; and μ_m is the previous mean of each cluster.
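The formulas themselves are not reproduced in this text. For orientation only, the textbook EM updates for a Gaussian mixture are given below, written with the symbols defined above. This is a sketch under stated assumptions: φ is treated here as a Gaussian density evaluated on a segment, whereas the text describes it as a simplified log-likelihood; N (the number of segments) and the hatted symbols are introduced for illustration; and the embodiment's exact formulas may differ.

```latex
% E-step: responsibility of sub-model m for segment S_k
\hat{\gamma}_{km} = \frac{\alpha_m \, \phi(S_k \mid \theta_m)}
                         {\sum_{j=1}^{M} \alpha_j \, \phi(S_k \mid \theta_j)}

% M-step: updated mean, variance and coefficient of sub-model m
\hat{\mu}_m = \frac{\sum_{k=1}^{N} \hat{\gamma}_{km} \, S_k}
                   {\sum_{k=1}^{N} \hat{\gamma}_{km}}, \qquad
\hat{\sigma}_m^{2} = \frac{\sum_{k=1}^{N} \hat{\gamma}_{km} \, (S_k - \mu_m)^{2}}
                          {\sum_{k=1}^{N} \hat{\gamma}_{km}}, \qquad
\hat{\alpha}_m = \frac{1}{N} \sum_{k=1}^{N} \hat{\gamma}_{km}
```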
Further, in an embodiment of the present invention, step S103 specifically includes:
when the total number of segments w is preset, allocating the segments among the classes in proportion to the variance of each class, which determines the number of segments of each class, where M is the number of clusters, c_j is the variance of sub-model j, and v_j is the number of segments of class j;
and dividing the cluster center of each class into equal-length segments according to that number.
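The allocation formula is not reproduced in this text. A minimal sketch of a proportional allocation follows, assuming v_j = round(w · c_j / Σ c_i) with at least one segment per class; that rounding rule, the floor of one segment and the names `allocate_segments` and `equal_length_means` are assumptions.

```python
def allocate_segments(w, variances):
    """Split a preset total of w segments among classes by variance share."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("variances must sum to a positive value")
    # Assumed rule: proportional share, rounded, with at least one segment.
    return [max(1, round(w * c / total)) for c in variances]


def equal_length_means(center, v):
    """Cut a cluster center into v equal-length pieces and average each."""
    n = len(center)
    if not 1 <= v <= n:
        raise ValueError("v must be between 1 and len(center)")
    means = []
    for s in range(v):
        piece = center[s * n // v:(s + 1) * n // v]
        means.append(sum(piece) / len(piece))
    return means
```

Because each share is rounded independently, the allocated counts may not sum exactly to w; a real implementation would need a rule for distributing the remainder.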
In order to achieve the above object, an embodiment of a second aspect of the present invention provides a high-frequency time series data compression device based on improved symbol aggregation approximation, including: a dividing module, configured to divide the time series by using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments; a clustering module, configured to cluster the plurality of time series segments by using a Gaussian mixture model initialized based on improved peak density to obtain a plurality of cluster centers, sub-model variances and cluster labels; an equidistant segmentation module, configured to divide each cluster center again into equal-length segments according to the proportion of the sub-model variances; a conversion module, configured to convert the mean value of each segment of each cluster center into a symbolic representation by using the SAX method, the first character of each class being a capital letter; and a cutting module, configured to cut out repeated identical symbolic representations of the same class, retaining the leading capital letter, to finally obtain the compressed time series data.
The high-frequency time series data compression device based on improved symbol aggregation approximation of the embodiment of the invention segments the time series with the segmented Gaussian model, which improves on the arbitrary segmentation of the SAX method; and Gaussian clustering of the time series segments provides feature extraction and dimension reduction for the SAX method, so that a lower compression rate and better data restoration capability are obtained.
In order to achieve the above object, a third aspect of the present invention provides a high frequency time series data decompression method based on improved symbol aggregation approximation, including the following steps: step S201, scanning time sequence value compressed data, identifying capital characters, and obtaining symbolic representation of each cluster center; step S202, calculating the length of the sequence segment of each clustering center, and determining a segmentation point; step S203, restoring symbolic representation of each segment of sequence fragment according to the division points, and further obtaining symbolic representation of the whole time sequence; step S204, each symbol is inversely transformed by an SAX method to obtain a time sequence.
According to the high-frequency time series data decompression method based on improved symbol aggregation approximation of the embodiment of the invention, the time series is segmented with the segmented Gaussian model, which improves on the arbitrary segmentation of the SAX method; and Gaussian clustering of the time series segments provides feature extraction and dimension reduction for the SAX method, so that a lower compression rate and better data restoration capability are obtained.
In order to achieve the above object, a fourth aspect of the present invention provides a high frequency time series data decompression apparatus based on improved symbol aggregation approximation, including: the scanning and identifying module is used for scanning the compressed data of the time sequence value, identifying capitalized characters and obtaining symbolic representation of each clustering center; a division point determining module, configured to calculate a sequence segment length of each clustering center, and determine a division point; the symbolic representation restoring module is used for restoring the symbolic representation of each segment of the sequence fragment according to the division points so as to obtain the symbolic representation of the whole time sequence; and the inverse transformation module is used for carrying out inverse transformation on each symbol by an SAX method to obtain a time sequence.
The high-frequency time series data decompression device based on improved symbol aggregation approximation of the embodiment of the invention segments the time series with the segmented Gaussian model, which improves on the arbitrary segmentation of the SAX method; and Gaussian clustering of the time series segments provides feature extraction and dimension reduction for the SAX method, so that a lower compression rate and better data restoration capability are obtained.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a high frequency time series data compression method based on improved symbol aggregation approximation according to one embodiment of the invention;
FIG. 2 is a flow block diagram of time series Gaussian segmentation based on an improved elephant herding optimization algorithm according to an embodiment of the present invention;
FIG. 3 is a block flow diagram of Gaussian mixture model time series segment clustering based on improved peak density initialization, according to one embodiment of the invention;
FIG. 4 is a block flow diagram of high frequency time series data compression based on improved symbol aggregation approximation according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a high frequency time series data compression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention;
FIG. 6 is a flow chart of a high frequency time series data decompression method based on improved symbol aggregation approximation according to an embodiment of the present invention;
FIG. 7 is a block diagram of the decompression flow of time series values based on an improved symbol aggregation approximation, in accordance with an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a high-frequency time-series data decompression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
It should be noted that although the SAX method gives good data representation and compression for time series, it has a major defect. It can divide the time series into different numbers of segments to obtain different compression rates, but it cannot guarantee that the values within each resulting segment have similar characteristics. If the data vary over a wide range, the mean value of a segment is not enough to describe that segment, so even when the compression rate is low the error after decompression is large. The SAX method therefore needs further improvement for time series representation and compression, and the present invention provides a high-frequency time series data compression method and apparatus based on improved symbol aggregation approximation, and a high-frequency time series data decompression method and apparatus based on improved symbol aggregation approximation.
The high-frequency time series data compression method and apparatus based on improved symbol aggregation approximation and the high-frequency time series data decompression method and apparatus based on improved symbol aggregation approximation proposed according to the embodiments of the present invention will be described below with reference to the accompanying drawings, and first, the high-frequency time series data compression method based on improved symbol aggregation approximation proposed according to the embodiments of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a flow chart of a high frequency time series data compression method based on improved symbol aggregation approximation according to an embodiment of the invention.
As shown in fig. 1, the high frequency time series data compression method based on improved symbol aggregation approximation comprises the following steps:
in step S101, the time series is divided by using a Gaussian segmentation model based on an improved elephant herding optimization algorithm, and a plurality of segmentation points and a plurality of time series segments are obtained.
That is, the time series is divided with a segmented Gaussian model solved by the improved elephant herding optimization algorithm, which ensures that the resulting sequence segments share certain characteristics.
Specifically, the segmented Gaussian model is solved first, that is, equation (1) is solved. In equation (1), the objective is the simplified log-likelihood function, K+1 is the number of segments, |S_k| is the length of segment S_k, i.e., its number of sequence values, Σ_k is the covariance, λ is the regularization coefficient, and the remaining operator is the matrix trace. This is an optimization problem and is therefore solved with the improved elephant herding optimization algorithm.
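Equation (1) is not reproduced in this text. The sketch below shows one way a fitness value for a candidate segmentation could be computed, assuming a univariate series and an objective of the form "Gaussian log-likelihood per segment minus λ times the trace of the inverse covariance", which matches the symbols listed but is not confirmed by the source. The function names, the shrinkage of the variance by λ/n and the 0-based breakpoints are assumptions.

```python
import math


def segment_score(values, lam):
    """Trace-regularised Gaussian log-likelihood of one univariate segment."""
    n = len(values)
    mu = sum(values) / n
    sq = sum((x - mu) ** 2 for x in values)
    # Assumed regularised variance: the sample variance plus lam / n,
    # which keeps the score finite for constant segments when lam > 0.
    var = sq / n + lam / n
    log_lik = -0.5 * n * math.log(2 * math.pi * var) - sq / (2 * var)
    # In one dimension the trace of the inverse covariance is 1 / var.
    return log_lik - lam / var


def segmentation_fitness(series, breakpoints, lam=0.1):
    """Sum of segment scores; breakpoints are 0-based interior indices."""
    bounds = [0] + sorted(breakpoints) + [len(series)]
    return sum(segment_score(series[a:b], lam)
               for a, b in zip(bounds, bounds[1:]))
```

Under these assumptions, a breakpoint placed at a genuine level shift scores higher than one placed elsewhere, which is the property the optimizer relies on when it searches over breakpoint positions.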
Further, regarding the update of individuals: the clan update operator of the original elephant herd optimization algorithm considers only the influence of a clan's matriarch on the elephants within that clan and ignores the influence of the best elephant in the whole herd, so its search capability leaves room for improvement. The embodiment of the invention therefore improves the clan update operator as follows:
a disturbance is added to the clan update operator of the original elephant herd optimization algorithm, giving:
where $x_{i,j}$ is the position of the $j$-th elephant in clan $i$, and the elephant with the best fitness function value among all elephants is called the elder. One term of the update expresses the influence of the matriarch of clan $i$ on the individual elephant, with a certain disturbance added; another term expresses the influence of the elder on the individual elephant, likewise with a certain disturbance added. The influence on an individual comes mainly from its matriarch (the best segmentation point within each clan of the time series) and from the elder (the best segmentation point across all clans of the time series), so the two influences are weighted by $\alpha$ and $1-\alpha$. To account for other sudden factors, $\mathrm{Levy}(\lambda)$ is added as a mutation mechanism so that the algorithm escapes local extrema more easily. $\mathrm{round}$ rounds the value in parentheses so that the time series segmentation point is an integer.
The improved clan update operator considers the influence of both the local optimum and the global optimum and exploits the Lévy flight's characteristic mix of short and long step lengths. It moves individual elephants toward the optimum while widening the search range, which makes it easier to escape local extrema and accelerates convergence.
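The improved clan update described above can be sketched in Python. The update equation itself is not reproduced in this text, so the form below is an assumption pieced together from the description: a pull toward the clan matriarch weighted by `alpha`, a pull toward the elder weighted by `1 - alpha`, uniform random disturbances on both pulls, a Lévy mutation term, and rounding to an integer. The function names, the Mantegna procedure for drawing the Lévy step, and the clamping to the range 2..T are illustrative choices and are not taken from the patent.

```python
import math
import random


def levy_step(lam=1.5, rng=random):
    """Draw one heavy-tailed step with Mantegna's procedure (assumed here)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0) or 1e-12  # avoid dividing by an exact zero
    return u / abs(v) ** (1 / lam)


def clan_update(x, matriarch, elder, alpha, T, lam=1.5, rng=random):
    """Move one candidate segmentation point x (hypothetical form).

    matriarch: best position inside the elephant's own clan
    elder:     best position over the whole herd
    alpha:     weight of the matriarch's influence; 1 - alpha goes to the elder
    """
    move = (alpha * rng.random() * (matriarch - x)
            + (1 - alpha) * rng.random() * (elder - x)
            + levy_step(lam, rng))
    new_x = round(x + move)          # segmentation points must be integers
    return min(max(new_x, 2), T)     # assumption: keep the point inside 2..T


if __name__ == "__main__":
    rng = random.Random(0)
    print([clan_update(50, 60, 80, alpha=0.5, T=100, rng=rng) for _ in range(5)])
```

Most Lévy steps are small, so the update usually behaves like a weighted pull toward the two leaders, while the occasional large step gives the jump out of a local extremum that the text describes.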
Note that the matriarch update is expressed as follows:
wherein
where $n_i$ is the number of elephants in clan $i$, including the matriarch; the clan center is the mean position of clan $i$; the updated position of the best elephant in clan $i$ is the new position of the matriarch of clan $i$; and $\beta \in [0,1]$ indicates the degree to which the center position of clan $i$ influences the new position of the matriarch. Like $\alpha$, it is an influence parameter.
Then, to optimize over integer positions, the clan separation operator of the original elephant herd optimization algorithm is adjusted as follows:
where the left-hand side is the updated position of the worst elephant in clan $i$, $x_{\min}$ is the minimum selectable position of an elephant, $x_{\max}$ is the maximum selectable position of an elephant, $\mathrm{round}$ is the rounding operation, $T$ is the length of the time series, and $\mathrm{rand}$ is a random number.
Since the segments of the segmented Gaussian model are left-closed and right-open intervals, the internal segmentation points of a time series of length $T$ are integer positions selected from $2, \dots, T$.
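The matriarch update and the integer separation operator can be sketched as below. Both equations are missing from this text, so the code assumes the forms used in the standard elephant herd optimization algorithm (new matriarch position equal to `beta` times the clan center, and a new random position `x_min + (x_max - x_min + 1) * rand`), with rounding added as the text describes. The function names and the clamping to the valid range are hypothetical.

```python
import random


def matriarch_update(clan_positions, beta, x_min=2, x_max=None):
    """New matriarch position from the clan center (assumed form)."""
    center = sum(clan_positions) / len(clan_positions)  # mean position of the clan
    new_x = round(beta * center)                        # integer segmentation point
    if x_max is not None:
        new_x = min(new_x, x_max)
    return max(new_x, x_min)


def separate(T, x_min=2, x_max=None, rng=random):
    """Re-draw the position of a clan's worst elephant (assumed form)."""
    if x_max is None:
        x_max = T                                       # internal points lie in 2..T
    new_x = round(x_min + (x_max - x_min + 1) * rng.random())
    return min(max(new_x, x_min), x_max)                # assumption: clamp to range


if __name__ == "__main__":
    print(matriarch_update([10, 20, 30], beta=0.5))
    print(separate(T=100, rng=random.Random(0)))
```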
Then, to accelerate convergence toward the optimal solution, in addition to applying the separation operator to the individual with the worst fitness function value in each clan, the whole herd is sorted after the clan update operator and the clan separation operator have been executed. A certain number of individuals with the worst fitness function values are selected for separation and moved to the positions of the elite individuals retained from the previous iteration. This yields the improved elephant herd optimization algorithm.
As shown in Fig. 2 and Table 1 below, the basic steps of applying the improved elephant herd optimization algorithm to the segmented Gaussian model, which gives time series Gaussian segmentation based on the improved algorithm, are as follows:
initializing the parameters and the population of the improved elephant herd optimization algorithm;
calculating the fitness values of all time series segments in the time series using the segmented Gaussian model and starting the iteration; if the preset maximum number of iterations has been reached, outputting the optimal position and its fitness value; otherwise, sorting the fitness values of all time series segments and retaining several of the best time series segments;
executing the clan update operator of the improved elephant herd optimization algorithm, updating the positions of all time series segments and of the matriarchs until the time series segments in all clans have been updated;
executing the clan separation operator of the improved elephant herd optimization algorithm, updating the positions and fitness values of the several worst time series segments until all of them have been separated;
sorting the fitness values of all time series segments to obtain the several worst time series segments to be separated, moving them to the positions of the retained time series segments, and recalculating the fitness values to update the optimal position of the elder, that is, the optimal segmentation point among all time series segments.
TABLE 1 Basic procedure of time series Gaussian segmentation based on the improved elephant herd optimization algorithm
In step S102, a Gaussian mixture model initialized by improved peak density clustering is used to cluster the plurality of time series segments, giving a plurality of cluster centers, sub-model variances, and cluster labels.
That is, the subsequences obtained by segmentation are taken as the objects and clustered with a Gaussian clustering model for time series segments that is initialized by improved peak density clustering.
Specifically, Gaussian clustering of time series segments assumes that the time series variables in the segments $S_1, S_2, \dots, S_{K+1}$ obtained by Gaussian segmentation are independent and each follow their own Gaussian distribution, and that each new class obtained by clustering similar segments follows a broader Gaussian distribution. The embodiment of the present invention therefore uses a Gaussian mixture model to describe all time series segments, in the following form:
$$P(S \mid \theta) = \sum_{m=1}^{M} \alpha_m \, \phi(S \mid \theta_m) \tag{6}$$
where $\alpha_m > 0$ is the coefficient of the $m$-th partial model, with $\sum_{m=1}^{M} \alpha_m = 1$; $M$ is the number of clusters; $\theta = (\alpha_m; \theta_m) = (\alpha_m; \mu_m, \Sigma_m \mid m = 1, 2, \dots, M)$ denotes the parameters of model (6); $P(S \mid \theta)$ is the probability distribution of the time series segments $S_1, S_2, \dots, S_{K+1}$; and $\phi(S \mid \theta_m)$ is the Gaussian distribution of the time series segments under the parameters $\theta_m = (\mu_m, \Sigma_m)$, which is also the $m$-th partial model of model (6), with the following expression:
The Gaussian mixture model of the time series has hidden variables, defined as follows:
$k = 1, 2, \dots, K+1;\ m = 1, 2, \dots, M$
Then the parameters of the Gaussian mixture model of the time series segments are estimated with the EM algorithm, which gives:
where the first quantity is the probability that time series segment $S_k$ belongs to the $m$-th partial model, $\alpha_m$ is the coefficient of the partial model, $x_t$ is the time series, $S_k$ is a time series segment, $\phi$ is the simplified log-likelihood function of each time series segment, $\theta_m$ are the parameters of the model, $\mu_m$ is the original mean of each cluster, and the remaining quantities are the updated mean, the updated variance, and the updated coefficient of each cluster.
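Because the derived update equations are not reproduced in this text, the following is only a minimal sketch of EM for a Gaussian mixture whose observations are whole segments. It assumes univariate data, treats the values inside a segment as independent draws from one component, and weights each segment by its length in the M-step. The function name `em_segments` and these modelling choices are assumptions for illustration, not the patent's derivation.

```python
import math


def em_segments(segments, mu, var, alpha, n_iter=50, tol=1e-6):
    """EM over segments; returns means, variances, coefficients and labels."""
    mu, var, alpha = list(mu), list(var), list(alpha)
    M = len(mu)
    prev = -math.inf
    for _ in range(n_iter):
        # E-step: probability that each segment belongs to each partial model
        gamma, loglik = [], 0.0
        for seg in segments:
            logp = [math.log(alpha[m])
                    - 0.5 * sum(math.log(2 * math.pi * var[m])
                                + (x - mu[m]) ** 2 / var[m] for x in seg)
                    for m in range(M)]
            top = max(logp)
            norm = top + math.log(sum(math.exp(v - top) for v in logp))
            gamma.append([math.exp(v - norm) for v in logp])
            loglik += norm
        # M-step: updated mean, variance and coefficient of each cluster
        for m in range(M):
            weight = sum(g[m] * len(s) for g, s in zip(gamma, segments))
            mu[m] = sum(g[m] * sum(s) for g, s in zip(gamma, segments)) / weight
            var[m] = sum(g[m] * sum((x - mu[m]) ** 2 for x in s)
                         for g, s in zip(gamma, segments)) / weight
            alpha[m] = sum(g[m] for g in gamma) / len(segments)
        # Stop on a small log-likelihood change or a collapsed variance
        if abs(loglik - prev) < tol or min(var) <= 0:
            break
        prev = loglik
    labels = [max(range(M), key=lambda m: g[m]) for g in gamma]
    return mu, var, alpha, labels


if __name__ == "__main__":
    segs = [[0.0, 0.1, -0.1], [0.2, 0.0, -0.2, 0.1], [5.0, 5.1, 4.9], [5.2, 4.8, 5.0]]
    print(em_segments(segs, mu=[1.0, 4.0], var=[1.0, 1.0], alpha=[0.5, 0.5]))
```

The sketch does not guard against a component losing all of its segments, in which case its coefficient would reach zero; a practical implementation would need to handle that case.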
The EM algorithm is sensitive to its initial values and may be unable to escape once it is trapped in a local extremum. For Gaussian clustering of time series, the mean, the variance, and the coefficient of each partial model need to be initialized. The initialization of the partial-model coefficients does not affect the clustering result of the time series Gaussian mixture model, so the embodiment of the invention initializes them with a uniform distribution, as shown in the following formula:
$$\alpha_m^{(0)} = \frac{1}{M}, \quad m = 1, 2, \dots, M$$
where M represents the number of clusters.
To reduce the influence of the variance on the clustering, the variance is initialized with an identity matrix. The most important step is the initialization of the mean, for which an improved peak density clustering algorithm is used.
For time series segments $S_j$ and $S_k$, the Bhattacharyya distance is defined as
The smaller the Bhattacharyya distance $D_B(S_j, S_k)$, the more similar the time series segments $S_j$ and $S_k$. The Bhattacharyya distance between time series segment $S_j$ and all time series segments is
The smaller the Bhattacharyya distance $D_B(S_j, S)$, the more similar segment $S_j$ is to most time series segments and the better suited it is to serve as a cluster center. However, it is data points with a higher local density that are better suited as cluster centers, so $D_B(S_j, S)$, for which smaller is better, is not suitable for direct use as the local density of a data point. The local density is therefore defined as follows:
The relative distance $\delta(S_j)$ of time series segment $S_j$ can be expressed as
where equation (13) gives the relative distance of a time series segment $S_j$ whose local density $\rho(S_j)$ is not the maximum; equation (14) gives the relative distance of the time series segment $S_j$ whose local density $\rho(S_j)$ is the maximum; and $d_{ji}$ is the Euclidean distance between the means of the Gaussian distributions followed by time series segments $S_j$ and $S_i$.
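As an illustration of the distance used above, the Bhattacharyya distance between two univariate Gaussian distributions has a well-known closed form, which the sketch below applies to Gaussians fitted to two segments. The patent's own equation is not reproduced in this text; the restriction to the univariate case, the use of a plain sum for the distance to all segments, and the function names are assumptions.

```python
import math
from statistics import fmean, pvariance


def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Closed-form Bhattacharyya distance between N(mu1, var1) and N(mu2, var2)."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2 * math.sqrt(var1 * var2))))


def segment_distance(seg_a, seg_b):
    """Distance between the Gaussians fitted to two segments."""
    return bhattacharyya_gaussian(fmean(seg_a), pvariance(seg_a),
                                  fmean(seg_b), pvariance(seg_b))


def distance_to_all(j, segments):
    """Assumed aggregate: sum of distances from segment j to every segment."""
    return sum(segment_distance(segments[j], s) for s in segments)


if __name__ == "__main__":
    segs = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [10.0, 11.0, 12.0]]
    print([round(distance_to_all(j, segs), 3) for j in range(len(segs))])
```

A segment with zero variance would make the logarithm undefined, so constant segments need separate handling in practice.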
Min-max normalization is applied, and the product of the normalized local density and the normalized relative distance is used as the criterion for selecting cluster centers, as follows:
$$\eta'(S_j) = \rho'(S_j) \times \delta'(S_j), \quad j = 1, 2, \dots, K+1 \tag{16}$$
The calculated values of $\eta'(S_j)$ are sorted in descending order and plotted, and the time series segments $S_m$ ($m = 1, 2, \dots, M$) whose values lie far from 0 are selected. If $M$ segments are selected, the segments are grouped into $M$ classes.
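The selection criterion of equation (16) can be sketched as follows. The local densities are taken as given inputs because their defining equation is not reproduced in this text; the relative distance follows the usual density-peaks rule that the description of equations (13) and (14) suggests, and picking the top `M` values stands in for reading the cut-off from the plot. All function names are hypothetical.

```python
def min_max(values):
    """Min-max normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def relative_distances(rho, means):
    """Distance to the nearest denser segment, or the largest distance for the densest."""
    n = len(rho)
    delta = []
    for j in range(n):
        denser = [abs(means[j] - means[i]) for i in range(n) if rho[i] > rho[j]]
        if denser:
            delta.append(min(denser))
        else:
            delta.append(max(abs(means[j] - means[i]) for i in range(n)))
    return delta


def pick_centers(rho, means, M):
    """Indices of the M segments with the largest eta' = rho' * delta'."""
    eta = [r * d for r, d in zip(min_max(rho), min_max(relative_distances(rho, means)))]
    order = sorted(range(len(eta)), key=lambda j: eta[j], reverse=True)
    return order[:M], eta


if __name__ == "__main__":
    rho = [5.0, 4.0, 1.0, 5.5, 3.0]       # hypothetical local densities
    means = [0.0, 0.1, 0.3, 10.0, 10.2]   # means of the fitted Gaussians
    print(pick_centers(rho, means, M=2))
```

The means of the selected segments would then serve as the initial means of the Gaussian mixture model.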
The clustering of time series segments based on the Gaussian mixture model yields the cluster center of each class, the variance of each sub-model of the Gaussian mixture model, and a label indicating the class to which each sequence segment belongs. The cluster labels require additional storage and play an important role in representing the entire time series and in decompression. Since each subsequence is represented only by its cluster center, only the cluster-center data need to be retained during compression.
As shown in Fig. 3 and Table 2 below, the basic flow of time series segment clustering based on the Gaussian mixture model is:
initializing with the improved peak density clustering algorithm to obtain the initial values of the means of the time series Gaussian mixture model, of the coefficients of each partial model, and of the variances of each partial model;
inputting the preset maximum number of iterations, the preset threshold, the segmentation points, and the time series segments;
iteratively executing the E-step of the EM algorithm to compute the probability that time series segment $S_k$ belongs to the $m$-th partial model, and the M-step of the EM algorithm to compute the updated mean, updated variance, and updated coefficient of each cluster;
judging whether the difference between the log-likelihood function values of two successive iterations is smaller than the preset threshold, or whether the variance of a sub-model of the Gaussian mixture model is 0; if so, ending the iteration and outputting the mean, variance, coefficient, and probability corresponding to the optimal clustering; otherwise, judging whether the number of iterations is smaller than the preset maximum number of iterations; if so, continuing the iteration; otherwise, ending the iteration and outputting the mean, variance, coefficient, and probability corresponding to the optimal clustering.
TABLE 2 basic procedure for time series segment clustering based on Gaussian mixture model
In step S103, each cluster center is divided again into equal-length segments, with the number of segments allocated in proportion to the sub-model variances.
Specifically, the variance of each cluster indicates how the class behaves: the larger the variance, the stronger the fluctuation and the more segments should be allocated. Given a preset total number of segments $w$, the segments are distributed in proportion to the variance of each class, which determines the number of segments of each class by the following formula:
where $M$ is the number of clusters, $c_j$ is the variance of partial model $j$, and $v_j$ is the number of segments of class $j$.
The cluster center of each class is then divided into equal-length segments according to its number of segments, as follows:
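The allocation and the equidistant division can be sketched as below. Since the formulas are not reproduced in this text, the sketch assumes that class $j$ receives $w \cdot c_j / \sum_i c_i$ segments, rounded to an integer and never fewer than one, and that each cluster center is cut at evenly spaced indices. The rounding rule and the function names are assumptions.

```python
def allocate_segments(variances, w):
    """Share w segments among the classes in proportion to their variances."""
    total = sum(variances)
    return [max(1, round(w * c / total)) for c in variances]


def equal_split_means(center, v):
    """Cut one cluster center into v equal-length pieces and return their means."""
    n = len(center)
    v = min(v, n)                                  # never more pieces than values
    bounds = [round(i * n / v) for i in range(v + 1)]
    return [sum(center[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    counts = allocate_segments([1.0, 3.0], w=8)
    print(counts)
    print(equal_split_means([1, 2, 3, 4, 5, 6], 3))
```

With independent rounding the allocated counts may not sum exactly to `w`; a largest-remainder scheme would be one way to enforce the total if that matters.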
In step S104, the mean of each segment of each class center is converted into a symbolic representation using the SAX method, and the first character of each class is written as a capital letter.
Specifically, because the symbolic representations of all classes are concatenated, the first symbol of each class center is converted to a capital letter to mark where a new class begins. The Gaussian probability range is commonly divided into 3 to 8 intervals, so lowercase letters suffice for the SAX symbols, and capital letters are therefore available to mark the start of a cluster center in the embodiment of the present invention.
In step S105, identical symbols within the same class are clipped and the leading capital letter is retained, giving the compressed time series value data.
Specifically, because each class center is subdivided, slowly changing data are likely to fall within the same SAX interval after conversion, so clipping is required. In that case only the leading capital-letter symbol of the class is retained.
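Steps S104 and S105 can be sketched as follows. The sketch assumes that the segment means have already been z-normalized, uses equiprobable breakpoints of the standard normal distribution as in the usual SAX method, and interprets clipping as collapsing a class to its single capital letter when every symbol in that class is the same. That reading of the clipping rule, and the function names, are assumptions.

```python
from statistics import NormalDist


def sax_symbols(means, alphabet_size):
    """Map z-normalized segment means to lowercase SAX letters."""
    cuts = [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]
    return "".join(chr(ord("a") + sum(m > c for c in cuts)) for m in means)


def encode_class(means, alphabet_size):
    """Symbolize one class center, clip identical symbols, capitalize the first."""
    word = sax_symbols(means, alphabet_size)
    if len(set(word)) == 1:        # every piece fell into the same interval
        word = word[0]
    return word[0].upper() + word[1:]


def compress(class_means, alphabet_size):
    """Concatenate the encoded words of all class centers."""
    return "".join(encode_class(m, alphabet_size) for m in class_means)


if __name__ == "__main__":
    print(compress([[-1.0, 0.0, 1.0], [0.0, 0.1, 0.2]], alphabet_size=3))
```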
Fig. 4 and Table 3 below show the basic flow of high-frequency time series data compression based on the improved symbol aggregation approximation.
TABLE 3 basic flow of high frequency time series data compression based on improved symbol aggregation approximation
In addition, corresponding to the above compression method, an embodiment of the present invention provides a high-frequency time series data decompression method based on the improved symbol aggregation approximation. As shown in Figs. 5 and 6 and Table 4 below, the method includes the following steps: step S201, scanning the compressed time series value data and identifying the capital letters to obtain the symbolic representation of each cluster center; step S202, calculating the sequence segment length of each cluster center and determining the segmentation points; step S203, restoring the symbolic representation of each sequence segment according to the segmentation points, and thereby the symbolic representation of the whole time series; step S204, applying the inverse SAX transformation to each symbol to obtain the time series.
TABLE 4 basic flow of high frequency time series data decompression based on improved symbol aggregation approximation
The above steps implement the decompression of the entire time series. Note that Step 1 and Step 2 have no required order and may be performed simultaneously or one after the other. Step 3.3 uses the length of each sequence segment and the number of symbols in its cluster center: if these are $l$ and $n$ respectively, each symbol covers $l/n$ sequence values, so each symbol is expanded into $l/n$ consecutive symbols to give the complete symbolic representation of the segment.
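The decompression steps can be sketched as below. The sketch assumes that the stored cluster labels and segment lengths are available as lists, that a word of $n$ symbols is stretched evenly over a segment of length $l$, and that the inverse SAX transformation maps each letter to the median of its equiprobable interval of the standard normal distribution. That choice of representative value and the function names are assumptions.

```python
from statistics import NormalDist


def split_classes(compressed):
    """Split the compressed string at capital letters, one word per class."""
    words = []
    for ch in compressed:
        if ch.isupper():
            words.append(ch.lower())   # a capital letter starts a new class
        else:
            words[-1] += ch
    return words


def expand(word, length):
    """Stretch a word of n symbols over `length` positions (about l/n each)."""
    n = len(word)
    return "".join(word[min(i * n // length, n - 1)] for i in range(length))


def symbol_value(symbol, alphabet_size):
    """Assumed inverse SAX value: median of the symbol's probability interval."""
    k = ord(symbol) - ord("a")
    return NormalDist().inv_cdf((k + 0.5) / alphabet_size)


def decompress(compressed, labels, lengths, alphabet_size):
    """Rebuild the series from the class words, segment labels and lengths."""
    words = split_classes(compressed)
    series = []
    for label, length in zip(labels, lengths):
        series.extend(symbol_value(s, alphabet_size) for s in expand(words[label], length))
    return series


if __name__ == "__main__":
    print(decompress("AbcB", labels=[0, 1], lengths=[6, 4], alphabet_size=3))
```

The reconstructed values are on the z-normalized scale; restoring the original scale would additionally need the stored mean and standard deviation of the series.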
The high-frequency time series data compression method based on the improved symbol aggregation approximation provided by the embodiment of the invention is evaluated below by experimental simulation and analysis of the results.
(I) Determining the evaluation indices. For the time series compression experiments, four indices are used to describe the compression effect: compression ratio, mean square error, root mean square error, and mean absolute error. The compression ratio evaluates the compression, and the other three evaluate the decompression. They are expressed as follows:
where CR is the ratio of the size after compression to the size $w$ before compression. The smaller the compression ratio, the more the space occupied by the compressed data is reduced and the better the compression effect.
$$\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left( x_t - \hat{x}_t \right)^2$$
where MSE denotes the mean square error, $x_t$ is the true time series data, $\hat{x}_t$ is the decompressed data, and $T$ is the number of sequence values in the current time series. The mean square error describes the quality of fit: the smaller it is, the better the decompressed data restore the original data and the better the compression effect.
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$
where RMSE denotes the root mean square error, the square root of the mean square error MSE; the smaller this index, the better the compression effect on the data.
$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| x_t - \hat{x}_t \right|$$
where MAE denotes the mean absolute error. Like the previous indices, it describes the difference between the decompressed data and the original data; the smaller its value, the smaller that difference and the better the compression effect.
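The four evaluation indices can be computed with a few lines of Python. The sketch below uses the standard definitions of MSE, RMSE, and MAE and takes the compression ratio as the compressed size divided by the original size, as described above; how the two sizes are measured (for example, as counts of stored symbols and of sequence values) is left to the caller and is an assumption of this sketch.

```python
import math


def compression_ratio(compressed_size, original_size):
    """Size after compression divided by size before compression."""
    return compressed_size / original_size


def mse(x, x_hat):
    """Mean square error between original and decompressed values."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def rmse(x, x_hat):
    """Root mean square error."""
    return math.sqrt(mse(x, x_hat))


def mae(x, x_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0, 4.0]
    restored = [1.0, 2.0, 3.0, 6.0]
    print(compression_ratio(4, 400), mse(original, restored),
          rmse(original, restored), mae(original, restored))
```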
(II) To compare the practicality of the compression models, 3 comparison models are selected: the SAX method, the SAX + segmentation method, and the SAX + segmentation + clustering method, wherein:
The SAX method: the model proposed in the embodiment of the present invention addresses the defects that the SAX method cannot identify data features and that its arbitrary segmentation leads to poor decompression. Comparing against the SAX method therefore shows the effect of the improvement and further illustrates the applicability of the method proposed in the embodiment of the present invention.
The SAX + segmentation method segments the time series by Gaussian segmentation based on the improved elephant herd optimization algorithm provided by the embodiment of the invention, which extracts some characteristics of the data. Comparing against it shows whether the method provided by the embodiment of the invention achieves a better compression effect.
The SAX + segmentation + clustering method is similar to the SAX + segmentation method but additionally clusters the segments and uses the cluster center of each class to represent every sequence segment in that class. Compared with the SAX + segmentation method, which represents each sequence segment by its own mean, it has a smaller compression ratio and produces a larger error after decompression. This method is therefore needed to show how re-segmenting and clipping the cluster centers, as done in the embodiment of the present invention, affect the compression effect.
(III) Validation and analysis using the Gesture dataset
The first-dimension data of the Gesture dataset, with lengths 400 and 1300, are used to illustrate the proposed compression strategy.
To illustrate the compression effect of the method provided by the embodiment of the invention, 3 different numbers of segments are selected for the data of each length: one is the number of segments obtained by Gaussian segmentation of the time series, and the other two are values greater than and less than that number. Thus, for the first-dimension data of the Gesture dataset of length 400, the numbers of segments are 5, 20, and 3, and for the long time series of length 1300, the numbers of segments are 15, 50, and 10. The α value takes the 3 values 3, 5, and 8 within the usual range.
Tables 5 to 8 give the compression evaluation and decompression evaluation of the time series values of the Gesture first-dimension data of length 400, where Tables 5 to 7 show the results of the method proposed by the embodiment of the present invention and of the SAX method when the number of segments is 5, 20, and 3 respectively. Tables 9 to 12 give the compression evaluation and decompression evaluation of the time series values of the Gesture first-dimension data of length 1300, where Tables 9 to 11 show the results of the proposed method and of the SAX method when the number of segments is 15, 50, and 10 respectively. Tables 8 and 12 show the evaluation results of the SAX + segmentation method and the SAX + segmentation + clustering method; these two methods are not affected by the number of segments but depend on the α value. All 8 tables show the compression and decompression effects for α values of 3, 5, and 8.
TABLE 5 compression of Gesture first dimension data time series value of length 400 at number of segments of 5
TABLE 6 compression of 400-Length Gesture first-dimension data time-series values at a number of segments of 20
TABLE 7 compression results for Gesture first dimension data time series value of length 400 at number of segments of 3
TABLE 8 compression results of Gesture first dimension data time series values of length 400
For short sequences, whether the number of segments is 5, 20, or 3 and whether the α value is 3, 5, or 8, as shown in Tables 5 to 7, the method provided by the embodiment of the present invention has smaller MSE, RMSE, and MAE values after decompression while maintaining the same compression ratio as the SAX method. This indicates that the proposed method fits the data more closely after decompression and restores the original data to a greater extent. For long sequences, whether the number of segments is 15, 50, or 10 and whether the α value is 3, 5, or 8, as shown in Tables 9 to 11, the proposed method not only has a lower compression rate than the SAX method but also has smaller MSE, RMSE, and MAE values after decompression. This indicates that the proposed method both compresses more and restores the original data better, and therefore has a better compression effect than the SAX method.
Comparing the SAX + segmentation method with the SAX + segmentation + clustering method, since the former is compressed by segmenting the segments according to time series and the latter is clustered on the basis of the former, the latter has a lower compression ratio, as shown in tables 8 and 12. For short sequences, when α =3, the SAX + segmentation method and the SAX + segmentation + clustering method have the same evaluation index value after decompression, and when α =5 and 8, the three index values of the SAX + segmentation method are smaller; for long sequences, the MSE, RMSE and MAE index values for the SAX + segmentation method at 3 α values are all smaller than those for the SAX + segmentation + clustering method. This shows that the SAX + segmentation + clustering method achieves higher compression by sacrificing the data reduction capability, regardless of short sequences or long sequences, and whether the method achieves the same reduction capability as the SAX + segmentation method, depends on the data itself and the alpha value. So the emphasis of the two methods is different. But the compression ratio of both methods is fixed.
The method proposed by the embodiment of the present invention is compared with the SAX + segmentation method. For short sequences at the same compression rate, as shown in Tables 6 and 8, the proposed method has smaller MSE, RMSE, and MAE values, which indicates that it fits the original time series data better and has a better compression effect. When the number of segments is 3, the compression rate of the proposed method is lower; its 3 index values match those of the SAX + segmentation method when α = 3 but are larger when α = 5 and 8. Since the proposed method fits better at the same compression rate, and at a lower compression rate reaches the same fit for some α values, it can be considered stronger than the SAX + segmentation method.
For long-time sequences, as shown in tables 9 to 12, when the number of segments is 15, the compression rate and 3 evaluation indexes of the method proposed by the embodiment of the present invention are lower than those of the SAX + segmentation method at α =5, and the index value is higher at α =3 and 8, but the compression rate is lower; when the number of segments is 50, the values of the 4-item index of the proposed method are all higher than the index value of the SAX + segmentation method when α =3, and at α =5 and 8, although the compression rate is higher, the other 3-item index value is lower than the index value of the latter. At a segment number of 10, the MSE, RMSE, and MAE index values of the proposed method are all larger than the latter, but the compression rate is lower.
In summary, compared with the SAX + segmentation method, the method provided by the embodiment of the present invention can ensure better data reduction capability and have a lower compression rate.
The method proposed by the embodiment of the invention is compared with the SAX + segmentation + clustering method. For short sequences, when the compression rates are the same, the other 3 index values of the two methods are the same, as shown in Tables 8 and 9; when the compression rate of the proposed method is higher, its other 3 index values are lower. For long sequences, the two methods have the same index values when the number of segments is 15 and α = 3, and when the number of segments is 10 and α = 3 and 5. For the other numbers of segments and α values, the proposed method has a higher compression ratio and lower values of the other 3 indices. Therefore, compared with the SAX + segmentation + clustering method, the method provided by the embodiment of the invention has lower compression ratio and better data reduction capability.
TABLE 9 compression of Gesture first dimension data time series values of length 1300 at 15 segments
TABLE 10 compression of Gesture first dimension data time series values of length 1300 at a segment number of 50
TABLE 11 compression results of Gesture first dimension data time series values of length 1300 at a segment number of 10
TABLE 12 compression results of Gesture first dimension data time series values of length 1300
In addition, whereas the compression ratios of these two methods are fixed, the method provided by the embodiment of the invention can adjust its compression ratio: for the short sequences, the compression ratio obtained with 20 or 3 segments differs from that obtained with 5 segments, and the same holds for the long sequences. The larger the number of segments, the higher the compression ratio and the smaller the fitting error; the smaller the number of segments, the lower the compression ratio. The number of segments can therefore be chosen to compress as much as the tolerable fitting error allows.
In summary, the method provided in the embodiment of the present invention is suitable for compressing the Gesture data. It combines the high data restoration capability of the SAX + segmentation method with the low compression rate of the SAX + segmentation + clustering method, and if the lowest compression rate is desired, it is only necessary to set the number of segments equal to the number of clusters. The method is therefore adaptable and has a certain applicability.
(IV) Validation and analysis using the PSCADA dataset
The la data in the PSCADA dataset are used, with lengths of 100 and 1500 respectively.
Similarly, since the SAX method depends on the number of segments and the α value, 3 numbers of segments are selected for the data of each length in order to illustrate the compression effect of the proposed method on data with a certain periodicity. For the short time series of length 100, one number of segments is 6, the number obtained by the Gaussian segmentation, and the other two are the values 10 and 4 on either side of it; for the long time series, 109, 50, and 150 are selected, where 109 is the number obtained by the Gaussian segmentation. The α value takes the 3 values 3, 5, and 8 within the usual range.
Tables 13 to 15 show the compression evaluation and decompression evaluation of the la data of length 100, and Tables 16 to 18 show those of the la data of length 1500; these 6 tables give the evaluation results for the different numbers of segments, α values, and methods. With the method proposed in the embodiment of the present invention, the short sequences have the same compression effect when the number of segments is 6, 4, and 10, and the long sequences have the same compression effect when the number of segments is 109, 50, and 150, so these results are shown only in Tables 13 and 16 respectively. Likewise, the SAX + segmentation method and the SAX + segmentation + clustering method are not affected by the number of segments but depend on the α value, as shown in Tables 15 and 18.
TABLE 13 compression results of la data time series values of length 100 under the proposed method and SAX method
TABLE 14 compression results of la data time series values of length 100 under SAX method
TABLE 15 compression results of la data time series values of length 100
For a short sequence with 6 segments, whether the α value is 3, 5, or 8, as shown in Table 13, the method proposed by the embodiment of the present invention achieves a smaller compression rate while maintaining the same MSE, RMSE, and MAE values as the SAX method, which indicates that it can store the data in less space. When the number of segments is 10 and 4, the compression rate of the proposed method is lower than that of the SAX method at the corresponding α value and its MSE, RMSE, and MAE values are smaller, which clearly shows that the proposed method has a better compression effect than the SAX method. With 6 segments the SAX method has a low compression rate and its smallest values of the other indices because the la data of length 100 are periodic and this number of segments happens to split the data along its periods, whereas the other two numbers of segments break up the characteristics of the data and give a poorer compression effect.
For a long sequence, whether the number of segments is 109, 50, or 150 and whether the α value is 3, 5, or 8, the compression rate and the MSE, RMSE, and MAE values obtained by the method provided by the embodiment of the present invention are lower than those obtained by the SAX method, which shows that for longer la data the proposed method still obtains a lower compression rate and a higher data reduction capability. The SAX method does not reach its own lowest compression ratio and best values of the other 3 indices at the optimal number of segments 109 because, although the time series has a certain periodicity, the periodicity is not strong and there may be data between periods that belong to no period. When averaging the sequence segments, the SAX method therefore places data from different periods in the same segment, and its data reduction capability is poor.
In summary, the compression ratio and the decompression effect of the SAX method have a great relationship with the number of segments, the α value, and the characteristics of the data itself, but the method provided by the embodiment of the present invention can better extract the characteristics of the data, and compress the data by using the characteristics, so that a good compression effect can be obtained.
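The evaluation indices used in this comparison can be computed as in the following minimal sketch. MSE, RMSE and MAE follow their standard definitions. The compression rate is assumed here to be the compressed size divided by the original size (so lower is better), which matches how the text uses the term but is not defined in this section. The function names are illustrative only.

```python
import math


def reconstruction_errors(original, restored):
    """Return (MSE, RMSE, MAE) between two equal-length sequences."""
    if len(original) != len(restored) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = [o - r for o, r in zip(original, restored)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, math.sqrt(mse), mae


def compression_rate(compressed_size, original_size):
    """Assumed definition: compressed size / original size."""
    return compressed_size / original_size


# Toy data, not taken from the tables in the source.
original = [1.0, 2.0, 3.0, 4.0]
restored = [1.0, 2.0, 2.0, 6.0]
mse, rmse, mae = reconstruction_errors(original, restored)
print(mse, rmse, mae, compression_rate(25, 100))
```

Under this assumed definition, a method that reaches the same MSE, RMSE and MAE with a smaller compression rate stores the same reconstruction quality in less space.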
TABLE 16 compression results of la data time series values of length 1500 under the proposed method and SAX method
TABLE 17 compression results of la data time series values of length 1500 under SAX method
TABLE 18 compression results for la data time-series values of length 1500
Comparing the SAX + segmentation method with the SAX + segmentation + clustering method, the compression ratio of the former is higher than that of the latter, which is consistent with theory, as shown in Tables 15 and 18. For short sequences, the three index values of the two methods are the same at α = 3 and 5, while those of the SAX + segmentation method are slightly smaller at α = 8; for long sequences, the MSE, RMSE and MAE values are the same at all three α values. These results show that the SAX + segmentation method focuses more on data restoration capability, while the SAX + segmentation + clustering method focuses more on a lower compression ratio and depends on the characteristics of the data; when the data are relatively stable, the SAX + segmentation + clustering method is better than the SAX + segmentation method.
The method proposed by the embodiment of the present invention is compared with the SAX + segmentation method. For short sequences, as shown in Tables 13 and 15, at α = 3 and 5 the proposed method achieves the same MSE, RMSE and MAE values as the latter with a smaller compression rate; at α = 8, although the three index values are slightly larger than those of the latter, the compression rate is much lower. For long sequences, as shown in Tables 16 and 18, the proposed method has the same MSE, RMSE and MAE values as the latter with a smaller compression rate. Therefore, the proposed method achieves a lower compression rate at the same data restoration capability, and still achieves a very low compression rate when its data restoration capability is slightly worse.
The method proposed by the embodiment of the present invention is compared with the SAX + segmentation + clustering method. For both short and long sequences, the two methods yield the same compression ratio and the same three index values, because the data are substantially unchanged.
Based on the above, the method provided by the embodiment of the invention combines the high data restoration capability of the SAX + segmentation method with the low compression ratio of the SAX + segmentation + clustering method.
In summary, according to the high-frequency time series data compression method based on the improved symbol aggregation approximation provided by the embodiment of the invention, segmentation separates the features of the time series well, and clustering groups time series segments with similar features into one class, which reduces the dimension of and compresses the time series data; combining segmentation and clustering with the SAX method compresses the data again. Experiments show that the method provided by the embodiment of the invention can achieve a lower compression ratio and a higher data restoration capability at the same time, or at least one of the two.
Next, a high-frequency time-series data compression apparatus based on an improved symbol aggregation approximation proposed according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 7 is a schematic structural diagram of a high-frequency time series data compression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention.
As shown in fig. 7, the apparatus 10 includes: a partitioning module 101, a clustering module 102, an equidistant segmentation module 103, a transformation module 104, and a clipping module 105.
The partitioning module 101 is configured to divide the time series using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments. The clustering module 102 is configured to cluster the plurality of time series segments using a Gaussian mixture model initialized by improved density peak clustering to obtain a plurality of cluster centers, a plurality of sub-model variances, and a plurality of cluster labels. The equidistant segmentation module 103 is configured to segment each cluster center again at equal intervals according to the proportion of the sub-model variances. The transformation module 104 is configured to convert the mean value of each segment of each cluster center into a symbolic representation using the SAX method, where the first character of each class is a capital letter. The clipping module 105 is configured to clip identical symbolic representations of the same class, retaining the first capital letter, thereby obtaining the compressed time series value data.
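The work of modules 103 and 104 can be illustrated with the following minimal sketch. Several details are assumptions that this section does not state: the segment budget `w` is split across classes in proportion to their variances using largest-remainder rounding, the cluster centers are taken to be z-normalized so that the standard SAX Gaussian breakpoints apply, and each cluster center is at least as long as its segment count. All function names are hypothetical, and the clipping performed by module 105 is not shown.

```python
from bisect import bisect_right
from statistics import NormalDist
import string


def allocate_segments(variances, w):
    """Split w segments across classes in proportion to their variances.

    Largest-remainder rounding is an assumption made for this sketch.
    """
    total = sum(variances)
    raw = [w * c / total for c in variances]
    counts = [int(r) for r in raw]
    # Hand the remaining segments to the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda j: raw[j] - counts[j], reverse=True)
    for j in order[: w - sum(counts)]:
        counts[j] += 1
    return counts


def segment_means(center, v):
    """Means of v consecutive, near-equal-length pieces (assumes v <= len(center))."""
    n = len(center)
    bounds = [i * n // v for i in range(v + 1)]
    return [sum(center[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def sax_word(means, alpha):
    """Map segment means to letters using equiprobable Gaussian breakpoints."""
    cuts = [NormalDist().inv_cdf(i / alpha) for i in range(1, alpha)]
    letters = string.ascii_lowercase[:alpha]
    word = "".join(letters[bisect_right(cuts, m)] for m in means)
    # The first character of each class is written as a capital letter.
    return word[0].upper() + word[1:]


counts = allocate_segments([1.0, 3.0], w=8)
means = segment_means([1, 2, 3, 4, 5, 6], 3)
word = sax_word([-1.0, 0.0, 1.0], alpha=3)
print(counts, means, word)
```

Allocating more segments to classes with larger variance gives the more variable cluster centers a finer symbolic resolution, which is the stated purpose of segmenting by the proportion of the sub-model variances.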
It should be noted that the above explanation of the embodiment of the high frequency time series data compression method based on the improved symbol aggregation approximation is also applicable to the apparatus of the embodiment, and is not repeated herein.
According to the high-frequency time series data compression device based on the improved symbol aggregation approximation provided by the embodiment of the invention, segmentation separates the features of the time series well, and clustering groups time series segments with similar features into one class, which reduces the dimension of and compresses the time series data; combining segmentation and clustering with the SAX method compresses the data again. Experiments show that the method provided by the embodiment of the invention can achieve a lower compression ratio and a higher data restoration capability at the same time, or at least one of the two.
Next, the high-frequency time series data decompression device based on the improved symbol aggregation approximation proposed according to the embodiment of the invention is described with reference to the accompanying drawings.
Fig. 8 is a schematic structural diagram of a high-frequency time-series data decompression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention.
As shown in Fig. 8, the apparatus 20 includes: a scanning and identification module 201, a segmentation point determination module 202, a symbolic representation restoring module 203, and an inverse transformation module 204.
The scanning and identification module 201 is configured to scan the compressed time series value data, identify the capital characters, and obtain the symbolic representation of each cluster center. The segmentation point determination module 202 is configured to calculate the sequence segment length of each cluster center and determine the segmentation points. The symbolic representation restoring module 203 is configured to restore the symbolic representation of each sequence segment according to the segmentation points, thereby obtaining the symbolic representation of the entire time series. The inverse transformation module 204 is configured to apply the inverse SAX transformation to each symbol to obtain the time series.
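The final inverse transformation step can be sketched as follows. This section does not state which value each symbol is mapped back to, so the sketch assumes the median of the symbol's equiprobable region under a standard normal distribution, which is one common choice for inverting SAX on z-normalized data. The function name and the `seg_lengths` argument (the number of samples covered by each symbol) are hypothetical.

```python
from statistics import NormalDist
import string


def inverse_sax(word, seg_lengths, alpha):
    """Expand a SAX word into a piecewise-constant series.

    Each symbol is replaced by an assumed representative level and
    repeated for the number of samples its segment covered.
    """
    # Assumed representative level: median of each equiprobable region.
    levels = [NormalDist().inv_cdf((i + 0.5) / alpha) for i in range(alpha)]
    series = []
    # The capital first letter only marks a class, so lowercase before lookup.
    for symbol, length in zip(word.lower(), seg_lengths):
        level = levels[string.ascii_lowercase.index(symbol)]
        series.extend([level] * length)
    return series


restored = inverse_sax("Abc", [2, 2, 2], alpha=3)
print(restored)
```

Because every sample in a segment is restored to the same level, the reconstruction error depends on how well the segmentation points align with changes in the data, which is why the segmentation points must be recovered before the inverse transformation.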
It should be noted that the foregoing explanation of the embodiment of the high-frequency time series data decompression method based on the improved symbol aggregation approximation is also applicable to the apparatus of the embodiment, and is not repeated here.
According to the high-frequency time series data decompression device based on the improved symbol aggregation approximation provided by the embodiment of the invention, segmentation separates the features of the time series well, and clustering groups time series segments with similar features into one class, which reduces the dimension of and compresses the time series data; combining segmentation and clustering with the SAX method compresses the data again. Experiments show that the method provided by the embodiment of the invention can achieve a lower compression ratio and a higher data restoration capability at the same time, or at least one of the two.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A high-frequency time series data compression method based on improved symbol aggregation approximation, characterized by comprising the following steps:
step S101, dividing a time series using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments;
step S102, clustering the plurality of time series segments using a Gaussian mixture model initialized by improved density peak clustering to obtain a plurality of cluster centers, sub-model variances and cluster labels;
step S103, segmenting each cluster center again at equal intervals according to the proportion of the sub-model variances;
step S104, converting the mean value of each segment of each cluster center into a symbolic representation using the SAX method, wherein the first character of each class is a capital letter;
and step S105, clipping identical symbolic representations of the same class, retaining the first capital letter, and finally obtaining the compressed time series value data.
2. The high-frequency time series data compression method based on improved symbol aggregation approximation according to claim 1, wherein the Gaussian segmentation model based on the improved elephant herding optimization algorithm in step S101 is obtained by:
adding a perturbation to the clan update operator of the original elephant herding optimization algorithm to obtain:
wherein $x_{i,j}$ denotes the position of the $j$-th elephant of clan $i$; the global-best position is the position of the elephant with the best fitness function value among all elephants, referred to as the senior leader; the clan-leader term represents the influence of the leader of clan $i$ on the individual elephant, with $\alpha$ as its influence parameter; the global term represents the influence of the best individual of the whole herd on the individual elephant, with $1-\alpha$ as its influence parameter; Levy($\lambda$) denotes the mutation mechanism; and round denotes rounding the value in parentheses;
adjusting the clan separation operator of the original elephant herding optimization algorithm to obtain:
3. The high frequency time series data compression method based on improved symbol aggregation approximation as claimed in claim 1, wherein said step S101 specifically comprises:
initializing the parameters and the population of the improved elephant herding optimization algorithm;
calculating the fitness values of all time series segments in the time series using the Gaussian segmentation model and starting the iteration; if the preset maximum number of iterations has been reached, outputting the optimal position and its fitness value; otherwise, sorting the fitness values of all time series segments and retaining a plurality of the best time series segments;
executing the clan update operator of the improved elephant herding optimization algorithm to update the positions of all time series segments and the position of the clan leader, until the time series segments in all clans have been updated;
executing the clan separation operator of the improved elephant herding optimization algorithm to update the positions and fitness values of the poor time series segments, until the poor time series segments in all clans have been separated;
and sorting the fitness values of all time series segments to obtain a plurality of poor time series segments to be separated, replacing the positions of the segments to be separated with those of the retained time series segments, and calculating the fitness values to update the optimal position, namely the optimal segmentation points of the time series.
4. The high-frequency time series data compression method based on improved symbol aggregation approximation as claimed in claim 1, wherein said step S102 specifically comprises:
initializing with the improved density peak clustering algorithm to obtain the initial values of the means of the time series Gaussian mixture model, the initial values of the sub-model coefficients, and the initial values of the sub-model variances;
inputting a preset maximum iteration number, a preset threshold value, the plurality of segmentation points and the plurality of time sequence segments;
iteratively executing the E-step of the EM algorithm to calculate the probability that the time series segment $S_k$ belongs to the $m$-th sub-model, and iteratively executing the M-step of the EM algorithm to calculate the updated mean, updated variance and updated coefficient of each cluster;
judging whether the difference between the log-likelihood function values of two successive iterations is smaller than the preset threshold, or whether the variance of a sub-model of the Gaussian mixture model is 0; if so, ending the iteration and outputting the mean, variance, coefficient and probability corresponding to the optimal clustering; otherwise, judging whether the number of iterations is smaller than the preset maximum number of iterations; if so, continuing the iteration; otherwise, ending the iteration and outputting the mean, variance, coefficient and probability corresponding to the optimal clustering.
5. The high-frequency time series data compression method based on improved symbol aggregation approximation as claimed in claim 4, wherein the specific process of initialization with the improved density peak clustering algorithm is as follows:
calculating the local density of all time series segments using an adjusted local density formula;
calculating the relative distance of the time series segments that do not have the maximum local density $\rho(S_j)$, and calculating the relative distance of the time series segment that has the maximum local density $\rho(S_j)$;
normalizing the local density and the relative distance, calculating their product as the cluster decision criterion, and sorting the products in descending order;
selecting the time series segments whose criterion values are far from zero, taking the means of these time series segments as the means of the time series Gaussian mixture model, and taking the number of these segments as the number of clusters;
6. The improved symbol aggregation approximation-based high frequency time series data compression method according to claim 4, wherein the specific solution formula in the iterative EM algorithm is as follows:
wherein the probability of the $m$-th sub-model is the quantity calculated in the E-step; $\alpha_m$ is the coefficient of the sub-model; $x_t$ is the time series; $S_k$ is a segment of the time series; $\phi$ is the simplified log-likelihood function of each time series segment; $\theta_m$ denotes the model parameters; $\mu_m$ is the original mean of each cluster; and the remaining quantities are the updated mean, the updated variance and the updated coefficient of each cluster.
7. The high-frequency time series data compression method based on improved symbol aggregation approximation as claimed in claim 1, wherein said step S103 specifically comprises:
when the number of segments $w$ is preset, allocating the segments according to the proportion of the variance of each class to determine the number of segments of each class, wherein the formula is as follows:
wherein $M$ is the number of clusters, $c_j$ is the variance of sub-model $j$, and $v_j$ is the number of segments of class $j$;
and segmenting the cluster center of each class at equal intervals according to its number of segments.
8. A high frequency time series data compression apparatus based on improved symbol aggregation approximation, comprising:
the partitioning module, configured to divide the time series using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments;
the clustering module, configured to cluster the plurality of time series segments using a Gaussian mixture model initialized by improved density peak clustering to obtain a plurality of cluster centers, sub-model variances and cluster labels;
the equidistant segmentation module, configured to segment each cluster center again at equal intervals according to the proportion of the sub-model variances;
the transformation module, configured to convert the mean value of each segment of each cluster center into a symbolic representation using the SAX method, wherein the first character of each class is a capital letter;
and the clipping module, configured to clip identical symbolic representations of the same class, retain the first capital letter, and finally obtain the compressed time series value data.
9. A high-frequency time series data decompression method based on improved symbol aggregation approximation, characterized by comprising the following steps:
step S201, scanning the compressed time series value data, identifying the capital characters, and obtaining the symbolic representation of each cluster center;
step S202, calculating the sequence segment length of each cluster center, and determining the segmentation points;
step S203, restoring the symbolic representation of each sequence segment according to the segmentation points, thereby obtaining the symbolic representation of the entire time series;
step S204, applying the inverse SAX transformation to each symbol to obtain the time series.
10. A high frequency time series data decompression apparatus based on improved symbol aggregation approximation, comprising:
the scanning and identification module, configured to scan the compressed time series value data, identify the capital characters, and obtain the symbolic representation of each cluster center;
the segmentation point determination module, configured to calculate the sequence segment length of each cluster center and determine the segmentation points;
the symbolic representation restoring module, configured to restore the symbolic representation of each sequence segment according to the segmentation points, thereby obtaining the symbolic representation of the entire time series;
and the inverse transformation module, configured to apply the inverse SAX transformation to each symbol to obtain the time series.
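For orientation, the E-step and M-step referred to in claims 4 and 6 can be written in the textbook form for a one-dimensional Gaussian mixture with $M$ sub-models, shown below. This is the standard point-wise formulation and is an assumption, not the formulas of the claims, which are not reproduced in this text and which operate on time series segments $S_k$. The symbols $\gamma_{tm}$, $\sigma_m^2$ and $N$ (the number of samples) are notation introduced here; $\alpha_m$, $\mu_m$ and $x_t$ follow claim 6.

```latex
% E-step: responsibility of sub-model m for sample x_t
\gamma_{tm} = \frac{\alpha_m \, \mathcal{N}(x_t \mid \mu_m, \sigma_m^2)}
                   {\sum_{l=1}^{M} \alpha_l \, \mathcal{N}(x_t \mid \mu_l, \sigma_l^2)}

% M-step: updated mean, variance and coefficient of sub-model m
\hat{\mu}_m = \frac{\sum_{t=1}^{N} \gamma_{tm} \, x_t}{\sum_{t=1}^{N} \gamma_{tm}}, \qquad
\hat{\sigma}_m^2 = \frac{\sum_{t=1}^{N} \gamma_{tm} \, (x_t - \hat{\mu}_m)^2}{\sum_{t=1}^{N} \gamma_{tm}}, \qquad
\hat{\alpha}_m = \frac{1}{N} \sum_{t=1}^{N} \gamma_{tm}
```

Claim 6 lists the original mean $\mu_m$ among its symbols, so the variance update in the claims may be taken about $\mu_m$ rather than $\hat{\mu}_m$; that detail cannot be confirmed from this text.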
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211043071.0A CN115514376A (en) | 2022-08-29 | 2022-08-29 | High-frequency time sequence data compression method and device based on improved symbol aggregation approximation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211043071.0A CN115514376A (en) | 2022-08-29 | 2022-08-29 | High-frequency time sequence data compression method and device based on improved symbol aggregation approximation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115514376A true CN115514376A (en) | 2022-12-23 |
Family
ID=84501827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211043071.0A Pending CN115514376A (en) | 2022-08-29 | 2022-08-29 | High-frequency time sequence data compression method and device based on improved symbol aggregation approximation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115514376A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115733498A (en) * | 2023-01-10 | 2023-03-03 | 北京四维纵横数据技术有限公司 | Compression method and device of time sequence data, computer equipment and medium |
CN116166978A (en) * | 2023-04-23 | 2023-05-26 | 山东民生集团有限公司 | Logistics data compression storage method for supply chain management |
CN116760908A (en) * | 2023-08-18 | 2023-09-15 | 浙江大学山东(临沂)现代农业研究院 | Agricultural information optimization management method and system based on digital twin |
CN116775692A (en) * | 2023-04-21 | 2023-09-19 | 清华大学 | Segmented aggregation query method and system for time sequence database |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115733498A (en) * | 2023-01-10 | 2023-03-03 | 北京四维纵横数据技术有限公司 | Compression method and device of time sequence data, computer equipment and medium |
CN116775692A (en) * | 2023-04-21 | 2023-09-19 | 清华大学 | Segmented aggregation query method and system for time sequence database |
CN116775692B (en) * | 2023-04-21 | 2024-01-30 | 清华大学 | Segmented aggregation query method and system for time sequence database |
CN116166978A (en) * | 2023-04-23 | 2023-05-26 | 山东民生集团有限公司 | Logistics data compression storage method for supply chain management |
CN116166978B (en) * | 2023-04-23 | 2023-07-25 | 山东民生集团有限公司 | Logistics data compression storage method for supply chain management |
CN116760908A (en) * | 2023-08-18 | 2023-09-15 | 浙江大学山东(临沂)现代农业研究院 | Agricultural information optimization management method and system based on digital twin |
CN116760908B (en) * | 2023-08-18 | 2023-11-10 | 浙江大学山东(临沂)现代农业研究院 | Agricultural information optimization management method and system based on digital twin |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115514376A (en) | High-frequency time sequence data compression method and device based on improved symbol aggregation approximation | |
CN115459782A (en) | Industrial Internet of things high-frequency data compression method based on time sequence segmentation and clustering | |
AU2020200997B2 (en) | Optimization of audio fingerprint search | |
US7930281B2 (en) | Method, apparatus and computer program for information retrieval | |
CN108804731B (en) | Time series trend feature extraction method based on important point dual evaluation factors | |
WO2006004797A2 (en) | Methods and systems for feature selection | |
US6023673A (en) | Hierarchical labeler in a speech recognition system | |
CN102663681B (en) | Gray scale image segmentation method based on sequencing K-mean algorithm | |
CN115170868A (en) | Clustering-based small sample image classification two-stage meta-learning method | |
CN111640438B (en) | Audio data processing method and device, storage medium and electronic equipment | |
CN116561230B (en) | Distributed storage and retrieval system based on cloud computing | |
CN112651424A (en) | GIS insulation defect identification method and system based on LLE dimension reduction and chaos algorithm optimization | |
CN110032585B (en) | Time sequence double-layer symbolization method and device | |
US7797160B2 (en) | Signal compression method, device, program, and recording medium; and signal retrieval method, device, program, and recording medium | |
US20140343944A1 (en) | Method of visual voice recognition with selection of groups of most relevant points of interest | |
US20140343945A1 (en) | Method of visual voice recognition by following-up the local deformations of a set of points of interest of the speaker's mouth | |
CN115438727A (en) | Time sequence Gaussian segmentation method based on improved image group algorithm | |
CN114398991A (en) | Electroencephalogram emotion recognition method based on Transformer structure search | |
CN110826628A (en) | Characteristic subset selection and characteristic multivariate time sequence ordering system | |
Nishii et al. | Similar subsequence retrieval from two time series data using homology search | |
CN117453671A (en) | Method, system and medium for cleaning repeated or similar data | |
CN117829239A (en) | Model pruning method and system for embedded equipment | |
CN115618254A (en) | Three-branch clustering method based on sample similarity | |
CN117475443A (en) | Image segmentation and recombination system based on AIGC | |
CN117459187A (en) | High-speed data transmission method based on optical fiber network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||