CN115514376A - High-frequency time sequence data compression method and device based on improved symbol aggregation approximation - Google Patents

Publication number
CN115514376A
Authority
CN
China
Prior art keywords
time sequence
segments
clustering
segmentation
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211043071.0A
Other languages
Chinese (zh)
Inventor
石振锋
牛晓东
肖红彬
崔鲲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thinking Shichuang Technology Co ltd
Original Assignee
Beijing Thinking Shichuang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thinking Shichuang Technology Co ltd filed Critical Beijing Thinking Shichuang Technology Co ltd
Priority to CN202211043071.0A priority Critical patent/CN115514376A/en
Publication of CN115514376A publication Critical patent/CN115514376A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-frequency time series data compression method and device based on improved symbol aggregation approximation, belonging to the technical field of time series compression. The method comprises the following steps: dividing the time series using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments; clustering the time series segments using a Gaussian mixture model initialized by improved peak density to obtain a plurality of cluster centers, sub-model variances and cluster labels; re-segmenting each cluster center into equal-length pieces, with the number of pieces allocated in proportion to the sub-model variances; converting the mean of each piece of each cluster center into a symbolic representation using the SAX method, with the first character of each class written as a capital letter; and trimming repeated symbolic representations of the same class while retaining the leading capital letter, finally obtaining the compressed time series value data. The method uses Gaussian clustering of time series segments to realize feature extraction and dimension reduction for the SAX method.

Description

High-frequency time sequence data compression method and device based on improved symbol aggregation approximation
Technical Field
The invention relates to the technical field of time sequence compression, in particular to a high-frequency time sequence data compression method based on improved symbol aggregation approximation.
Background
Time series compression is an important topic in time series research. With the rapid development of science and technology, intelligent systems have permeated production, manufacturing, monitoring and other fields, and companies, platforms and systems generate data at every moment. Such data comes from a large number of acquisition devices, is sampled at high frequency, is complex and varied in type, and exhibits a certain correlation between earlier and later values. An efficient compression method is therefore needed to store time series data.
Research on time series compression has produced relatively mature results and continues to advance. It covers both lossless and lossy compression models, and most time series compression methods focus on lossy compression. Sequence representation is the main approach and includes the discrete Fourier transform, the discrete wavelet transform, singular value decomposition, piecewise linear representation, symbolization methods and the like.
To solve the similarity search problem in large time series databases, Keogh et al. introduced a dimension reduction technique, Piecewise Aggregate Approximation (PAA). Many extensions and improvements followed, including Adaptive Piecewise Constant Approximation (APCA). Particularly noteworthy is the Symbolic Aggregate approXimation (SAX) method, which builds on PAA by partitioning the Gaussian distribution into equiprobable intervals and mapping values to symbols; this discretization opened a new direction for data representation and compression. SAX is a symbolization method that is simple, fast and widely applicable, but it also has certain shortcomings. Moreover, few methods build a compression model that takes the correlation between earlier and later parts of a time series as the starting point. It is therefore desirable to combine the two to compress time series data.
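For orientation, the classic PAA and SAX transforms referred to above can be sketched as follows. This is a minimal illustration of the textbook methods, not of the improved method of the invention; the function names, the default alphabet size of 4 and the use of lowercase letters starting at "a" are assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each of n_segments nearly equal chunks."""
    chunks = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([c.mean() for c in chunks])

def sax_breakpoints(alphabet_size):
    """Cut points that split the standard normal distribution into equiprobable intervals."""
    nd = NormalDist()
    return np.array([nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)])

def sax(series, n_segments, alphabet_size=4):
    """Classic SAX: z-normalize, reduce with PAA, then map each mean to a letter."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)
    means = paa(z, n_segments)
    idx = np.searchsorted(sax_breakpoints(alphabet_size), means)
    return "".join(chr(ord("a") + int(i)) for i in idx)
```

Note that the segments here have equal length regardless of the shape of the data, which is the limitation the invention sets out to address.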
Disclosure of Invention
The present invention is directed to solving, at least in part, one of the technical problems in the related art.
To this end, a first objective of the present invention is to propose a high frequency time series data compression method based on improved symbol aggregation approximation, which can achieve lower compression rate and better data recovery capability.
The second purpose of the present invention is to provide a high frequency time series data compression device based on improved symbol aggregation approximation.
The third objective of the present invention is to provide a high frequency time series data decompression method based on improved symbol aggregation approximation.
The fourth purpose of the present invention is to provide a high frequency time series data decompression device based on the improved symbol aggregation approximation.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a high-frequency time series data compression method based on improved symbol aggregation approximation, including the following steps: step S101, dividing the time series using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments; step S102, clustering the plurality of time series segments using a Gaussian mixture model initialized by improved peak density to obtain a plurality of cluster centers, sub-model variances and cluster labels; step S103, re-segmenting each cluster center into equal-length pieces, with the number of pieces allocated in proportion to the sub-model variances; step S104, converting the mean of each piece of each cluster center into a symbolic representation using the SAX method, with the first character of each class written as a capital letter; and step S105, trimming repeated symbolic representations of the same class while retaining the leading capital letter, finally obtaining the compressed time series value data.
According to the high-frequency time series data compression method based on improved symbol aggregation approximation of the embodiment of the invention, segmenting the time series with the segmented Gaussian model improves on the arbitrary segmentation of the SAX method, and Gaussian clustering of the time series segments realizes feature extraction and dimension reduction for the SAX method, so that a lower compression rate and a better data restoration capability are obtained.
In addition, the high-frequency time series data compression method based on the improved symbol aggregation approximation according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, the Gaussian segmentation model based on the improved elephant herding optimization algorithm in step S101 is obtained as follows:
adding a disturbance to the clan updating operator of the original elephant herding optimization algorithm to obtain:
Figure RE-GDA0003941580120000021
where x_{i,j} denotes the position of the j-th elephant in clan i,
Figure RE-GDA0003941580120000022
denotes the position of the elephant with the best fitness function value among all elephants, which is called the elder;
Figure RE-GDA0003941580120000023
denotes the influence of the matriarch (clan leader) of clan i on the individual elephant, with α as the influence parameter;
Figure RE-GDA0003941580120000024
denotes the influence of the best individual in the whole herd on the individual elephant, with 1-α as the influence parameter; Levy(λ) denotes the mutation mechanism; round denotes rounding the value in parentheses;
adjusting the clan separating operator of the original elephant herding optimization algorithm to obtain:
Figure RE-GDA0003941580120000025
where
Figure RE-GDA0003941580120000026
is the updated position of the worst elephant in clan i, x_min is the minimum selectable position of an elephant, x_max is the maximum selectable position of an elephant, round is the rounding operation, T is the length of the time series, and rand is a random number.
Further, in an embodiment of the present invention, step S101 specifically includes: initializing the parameters and the population of the improved elephant herding optimization algorithm; calculating the fitness values of all time series segments in the time series using the segmented Gaussian model and starting the iteration; if the preset maximum number of iterations has been reached, outputting the optimal position and its fitness value; otherwise, sorting the fitness values of all time series segments and retaining several of the best time series segments; executing the clan updating operator of the improved elephant herding optimization algorithm, updating the positions of all time series segments and the positions of the matriarchs (clan leaders) until the time series segments in all clans have been updated; executing the clan separating operator of the improved elephant herding optimization algorithm, updating the positions and fitness values of the worst time series segments until the worst time series segments in all clans have been separated; and sorting the fitness values of all time series segments to obtain several of the worst time series segments to be separated, updating these to the positions of the retained time series segments, and calculating the fitness values to update the position of the elder, namely the optimal segmentation points among all time series segments.
Further, in an embodiment of the present invention, step S102 specifically includes: running the clustering initialization based on improved peak density to obtain the initial value of the mean of the time series Gaussian mixture model
Figure RE-GDA0003941580120000031
the initial value of each sub-model coefficient
Figure RE-GDA0003941580120000032
and the initial value of each sub-model variance
Figure RE-GDA0003941580120000033
;
inputting a preset maximum iteration number, a preset threshold value, the plurality of segmentation points and the plurality of time sequence segments;
iteratively executing the E-step of the EM algorithm to compute the probability that time series segment S_k belongs to the m-th sub-model
Figure RE-GDA0003941580120000034
iteratively executing the M-step of the EM algorithm to compute, for each cluster, the updated mean
Figure RE-GDA0003941580120000035
the updated variance
Figure RE-GDA0003941580120000036
and the updated coefficient
Figure RE-GDA0003941580120000037
determining whether the difference between two successive log-likelihood function values is smaller than the preset threshold and, if not, whether the variance of a sub-model of the Gaussian mixture model is 0; if either condition holds, ending the iteration and outputting the parameters of the optimal clustering, namely the mean
Figure RE-GDA0003941580120000038
the variance
Figure RE-GDA0003941580120000039
the coefficient
Figure RE-GDA00039415801200000310
and the probability
Figure RE-GDA00039415801200000311
otherwise, determining whether the number of iterations is less than the preset maximum number of iterations; if so, continuing the iteration, and if not, ending the iteration and outputting the parameters of the optimal clustering, namely the mean
Figure RE-GDA0003941580120000041
the variance
Figure RE-GDA0003941580120000042
the coefficient
Figure RE-GDA0003941580120000043
and the probability
Figure RE-GDA0003941580120000044
Further, in an embodiment of the present invention, the clustering initialization based on improved peak density proceeds as follows: calculating the local density of every time series segment using an adjusted local density formula; calculating the relative distance of each time series segment whose local density ρ(S_j) is not the maximum, and calculating the relative distance of the time series segment whose local density ρ(S_j) is the maximum; normalizing the local density and the relative distance, and calculating their product as the clustering criterion, sorted in descending order; selecting the time series segments whose criterion value is far from zero and taking their means as the initial mean of the time series Gaussian mixture model
Figure RE-GDA0003941580120000045
and taking their number as the number of clusters; calculating each sub-model coefficient according to the number of clusters
Figure RE-GDA0003941580120000046
and initializing the variance of each sub-model
Figure RE-GDA0003941580120000047
to an identity matrix.
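The initialization described above can be sketched as follows. The adjusted local density formula of the invention is given only as a figure, so this sketch substitutes the common Gaussian-kernel density from density-peak clustering; the cutoff `d_c`, the threshold `gamma_threshold`, the equal initial coefficients 1/M and the representation of each segment as a fixed-length feature vector are all assumptions made for illustration.

```python
import numpy as np

def _unit(v):
    """Min-max normalize to [0, 1]; a constant vector maps to all ones."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.ones_like(v)

def density_peak_init(features, d_c=1.0, gamma_threshold=0.5):
    """Pick GMM starting values from density peaks (assumed Gaussian-kernel density)."""
    X = np.asarray(features, dtype=float)
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0  # subtract the self term
    delta = np.empty(n)
    for j in range(n):
        higher = rho > rho[j]
        # Non-maximal density: distance to the nearest denser point.
        # Maximal density: largest distance to any other point.
        delta[j] = dist[j, higher].min() if higher.any() else dist[j].max()
    gamma = _unit(rho) * _unit(delta)          # clustering criterion
    order = np.argsort(gamma)[::-1]            # descending order
    centers = order[gamma[order] > gamma_threshold]
    m = len(centers)                           # number of clusters (at least 1 for thresholds below 1)
    means = X[centers]
    weights = np.full(m, 1.0 / m)              # assumed: equal initial coefficients
    covs = np.stack([np.eye(d)] * m)           # identity matrices, as in the text
    return centers, means, weights, covs
```

The densest point always has a criterion value of 1, so any threshold below 1 yields at least one cluster.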
Further, in an embodiment of the present invention, the update formulas used in the iterative EM algorithm are:
Figure RE-GDA0003941580120000048
Figure RE-GDA0003941580120000049
Figure RE-GDA00039415801200000410
Figure RE-GDA00039415801200000411
where
Figure RE-GDA00039415801200000412
is the probability of the m-th sub-model, α_m is the coefficient of the sub-model, x_t is the time series, S_k is a time series segment, φ is the simplified log-likelihood function of each time series segment, θ_m denotes the model parameters,
Figure RE-GDA00039415801200000413
is the updated mean of each cluster,
Figure RE-GDA00039415801200000414
is the updated variance of each cluster, μ_m is the original mean of each cluster, and
Figure RE-GDA00039415801200000415
is the updated coefficient of each cluster.
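Because the update formulas themselves survive only as figures, the following is a sketch of the textbook EM iteration for a one-dimensional Gaussian mixture, not the exact formulas of the invention (which operate on time series segments and, per the definitions above, may use the original mean μ_m in the variance update). The function names, the tolerance `tol` and the scalar data are assumptions.

```python
import numpy as np

def em_step(x, weights, means, variances):
    """One textbook EM iteration for a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # E-step: responsibility of each sub-model for each point, computed in log space.
    diff = x[:, None] - means[None, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
    log_joint = np.log(weights) + log_pdf
    log_norm = np.logaddexp.reduce(log_joint, axis=1)
    resp = np.exp(log_joint - log_norm[:, None])
    # M-step: updated mean, variance and coefficient of each sub-model.
    nk = resp.sum(axis=0)
    new_means = (resp * x[:, None]).sum(axis=0) / nk
    new_vars = (resp * (x[:, None] - new_means) ** 2).sum(axis=0) / nk
    new_weights = nk / len(x)
    return new_weights, new_means, new_vars, resp, log_norm.sum()

def fit_gmm(x, weights, means, variances, max_iter=100, tol=1e-6):
    """Iterate until the log-likelihood change is below tol, a variance hits 0, or max_iter."""
    prev = -np.inf
    for _ in range(max_iter):
        weights, means, variances, resp, ll = em_step(x, weights, means, variances)
        if abs(ll - prev) < tol or np.any(variances <= 0):
            break
        prev = ll
    return weights, means, variances, resp
```

The three exit conditions mirror the stopping logic described for step S102: a small change in log-likelihood, a degenerate sub-model variance, or the iteration cap.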
Further, in an embodiment of the present invention, the step S103 specifically includes:
given a preset total number of segments w, allocating the segments among the classes in proportion to the variance of each class to determine the number of segments of each class, according to the formula:
Figure RE-GDA0003941580120000051
where M is the number of clusters, c_j is the variance of sub-model j, and v_j is the number of segments of class j;
and carrying out equidistant segmentation on the clustering center of each class according to the segmentation number.
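A possible reading of step S103 is sketched below. The allocation formula is available only as a figure, so the proportional rule v_j = round(w · c_j / Σ c_i) with a floor of one segment per class is an assumption, as are the function names.

```python
import numpy as np

def allocate_segments(variances, w):
    """Assumed rule: share w segments among classes in proportion to their variances."""
    c = np.asarray(variances, dtype=float)
    v = np.round(w * c / c.sum()).astype(int)
    return np.maximum(v, 1)  # every class keeps at least one segment

def split_center(center, n_pieces):
    """Cut a cluster center into n_pieces nearly equal pieces and return each piece's mean."""
    center = np.asarray(center, dtype=float)
    n_pieces = min(int(n_pieces), len(center))  # never more pieces than points
    return np.array([p.mean() for p in np.array_split(center, n_pieces)])
```

Because of rounding, the allocated counts need not sum exactly to w; classes with larger variance receive more pieces and are therefore described in finer detail.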
In order to achieve the above object, an embodiment of a second aspect of the present invention provides a high-frequency time series data compression device based on improved symbol aggregation approximation, including: a dividing module, configured to divide the time series using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments; a clustering module, configured to cluster the plurality of time series segments using a Gaussian mixture model initialized by improved peak density to obtain a plurality of cluster centers, sub-model variances and cluster labels; an equidistant segmentation module, configured to re-segment each cluster center into equal-length pieces, with the number of pieces allocated in proportion to the sub-model variances; a conversion module, configured to convert the mean of each piece of each cluster center into a symbolic representation using the SAX method, with the first character of each class written as a capital letter; and a trimming module, configured to trim repeated symbolic representations of the same class while retaining the leading capital letter, finally obtaining the compressed time series value data.
The high-frequency time series data compression device based on improved symbol aggregation approximation improves on the arbitrary segmentation of the SAX method by segmenting the time series with the segmented Gaussian model, and realizes feature extraction and dimension reduction for the SAX method through Gaussian clustering of the time series segments, so that a lower compression rate and a better data restoration capability are obtained.
In order to achieve the above object, a third aspect of the present invention provides a high frequency time series data decompression method based on improved symbol aggregation approximation, including the following steps: step S201, scanning time sequence value compressed data, identifying capital characters, and obtaining symbolic representation of each cluster center; step S202, calculating the length of the sequence segment of each clustering center, and determining a segmentation point; step S203, restoring symbolic representation of each segment of sequence fragment according to the division points, and further obtaining symbolic representation of the whole time sequence; step S204, each symbol is inversely transformed by an SAX method to obtain a time sequence.
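Step S204 can be illustrated with a simple inverse of the classic SAX mapping. The source does not state which value represents each symbol, so this sketch assumes the median of the symbol's equiprobable interval under the standard normal distribution, and it assumes that the original mean, standard deviation and segment lengths are available to the decoder; all names are hypothetical.

```python
from statistics import NormalDist

def sax_inverse(symbols, seg_lengths, alphabet_size=4, mean=0.0, std=1.0):
    """Expand each symbol to a constant run at the median of its equiprobable interval."""
    nd = NormalDist()
    out = []
    # lower() lets the capitalized first character of a class decode like any other symbol.
    for s, n in zip(symbols.lower(), seg_lengths):
        k = ord(s) - ord("a")
        z = nd.inv_cdf((k + 0.5) / alphabet_size)
        out.extend([mean + std * z] * n)
    return out
```

The reconstruction is piecewise constant, so the decompression error depends on how well each segment is described by a single level, which is why segments with similar internal characteristics matter.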
According to the high-frequency time series data decompression method based on improved symbol aggregation approximation of the embodiment of the invention, segmenting the time series with the segmented Gaussian model improves on the arbitrary segmentation of the SAX method, and Gaussian clustering of the time series segments realizes feature extraction and dimension reduction for the SAX method, so that a lower compression rate and a better data restoration capability are obtained.
In order to achieve the above object, a fourth aspect of the present invention provides a high frequency time series data decompression apparatus based on improved symbol aggregation approximation, including: the scanning and identifying module is used for scanning the compressed data of the time sequence value, identifying capitalized characters and obtaining symbolic representation of each clustering center; a division point determining module, configured to calculate a sequence segment length of each clustering center, and determine a division point; the symbolic representation restoring module is used for restoring the symbolic representation of each segment of the sequence fragment according to the division points so as to obtain the symbolic representation of the whole time sequence; and the inverse transformation module is used for carrying out inverse transformation on each symbol by an SAX method to obtain a time sequence.
The high-frequency time series data decompression device based on improved symbol aggregation approximation improves on the arbitrary segmentation of the SAX method by segmenting the time series with the segmented Gaussian model, and realizes feature extraction and dimension reduction for the SAX method through Gaussian clustering of the time series segments, so that a lower compression rate and a better data restoration capability are obtained.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a high frequency time series data compression method based on improved symbol aggregation approximation according to one embodiment of the invention;
FIG. 2 is a flow block diagram of time series Gaussian segmentation based on an improved elephant herding optimization algorithm according to an embodiment of the present invention;
FIG. 3 is a block flow diagram of Gaussian mixture model time series segment clustering based on improved peak density initialization, according to one embodiment of the invention;
FIG. 4 is a block flow diagram of high frequency time series data compression based on improved symbol aggregation approximation according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a high frequency time series data compression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention;
FIG. 6 is a flow chart of a high frequency time series data decompression method based on improved symbol aggregation approximation according to an embodiment of the present invention;
FIG. 7 is a block diagram of the decompression flow of time series values based on an improved symbol aggregation approximation, in accordance with an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a high-frequency time series data decompression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and intended to explain the invention, and are not to be construed as limiting the invention.
It should be noted that although the SAX method represents and compresses time series data well, it has a significant drawback. The time series can be divided into different numbers of segments to obtain different compression rates, but there is no guarantee that the values within each resulting segment share similar characteristics. If the data vary over a wide range, the mean of a segment is not sufficient to describe that segment, and the error after decompression is large even when the compression rate is low. The SAX method therefore needs further improvement for time series representation and compression, and the present invention provides a high-frequency time series data compression method and apparatus based on improved symbol aggregation approximation, and a high-frequency time series data decompression method and apparatus based on improved symbol aggregation approximation.
The high-frequency time series data compression method and apparatus based on improved symbol aggregation approximation and the high-frequency time series data decompression method and apparatus based on improved symbol aggregation approximation proposed according to the embodiments of the present invention will be described below with reference to the accompanying drawings, and first, the high-frequency time series data compression method based on improved symbol aggregation approximation proposed according to the embodiments of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a flow chart of a high frequency time series data compression method based on improved symbol aggregation approximation according to an embodiment of the invention.
As shown in fig. 1, the high frequency time series data compression method based on improved symbol aggregation approximation comprises the following steps:
In step S101, the time series is divided using a Gaussian segmentation model based on an improved elephant herding optimization algorithm, and a plurality of segmentation points and a plurality of time series segments are obtained.
That is, the time series is divided using a segmented Gaussian model solved by an improved elephant herding algorithm, which ensures that each resulting sequence segment has consistent characteristics.
Specifically, the solution of the segmented gaussian model, i.e., the solution of equation (1), is first performed:
Figure RE-GDA0003941580120000071
where
Figure RE-GDA0003941580120000072
is the simplified log-likelihood function, K+1 is the number of segments, |S_k| is the length of segment S_k, i.e. the number of sequence values it contains, Σ_k is the covariance, λ is the regularization coefficient, and
Figure RE-GDA0003941580120000073
is the trace operation. This is an optimization problem and is therefore solved using the improved elephant herding algorithm.
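To make the role of the fitness function concrete, the sketch below scores a candidate segmentation with a covariance-regularized Gaussian log-likelihood. Equation (1) is available only as a figure, so the specific form used here (regularized covariance S + (λ/n)·I, a log-determinant term weighted by segment length, and a trace penalty λ·Tr of the inverse covariance) is an assumption in the spirit of the quantities defined above, not the exact expression of the invention; breakpoints are 0-based indices and the function names are hypothetical.

```python
import numpy as np

def segment_score(segment, lam=1.0):
    """Assumed per-segment score: regularized Gaussian log-likelihood, constants dropped."""
    x = np.asarray(segment, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a 1-D series as one feature column
    n, d = x.shape
    centered = x - x.mean(axis=0)
    emp_cov = centered.T @ centered / n     # empirical covariance of the segment
    cov = emp_cov + (lam / n) * np.eye(d)   # regularization keeps the matrix invertible
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * logdet - lam * np.trace(np.linalg.inv(cov))

def segmentation_fitness(series, breakpoints, lam=1.0):
    """Sum of segment scores; segments are left-closed and right-open."""
    x = np.asarray(series, dtype=float)
    bounds = [0, *sorted(breakpoints), len(x)]
    return sum(segment_score(x[a:b], lam) for a, b in zip(bounds[:-1], bounds[1:]))
```

For a series that jumps from one level to another, a breakpoint placed at the jump scores higher than one placed elsewhere, which is the behaviour the optimizer exploits.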
Further, regarding the update of individuals: the clan updating operator of the elephant herding optimization algorithm considers only the influence of the clan's matriarch (clan leader) on the elephants within the clan and ignores the influence of the best elephant in the whole herd, so its search capability can still be improved. The embodiment of the invention therefore improves the clan updating operator as follows:
adding a disturbance to the clan updating operator of the original elephant herding optimization algorithm to obtain:
Figure RE-GDA0003941580120000074
where x_{i,j} denotes the position of the j-th elephant in clan i,
Figure RE-GDA0003941580120000075
denotes the position of the elephant with the best fitness function value among all elephants, which is called the elder;
Figure RE-GDA0003941580120000076
denotes the influence of the matriarch of clan i on the individual elephant, with a certain disturbance added;
Figure RE-GDA0003941580120000077
denotes the influence of the best individual in the herd on the individual elephant, with a certain disturbance added. The influence on an individual elephant comes mainly from the matriarch (the optimal segmentation points within each clan for the time series) and the elder (the optimal segmentation points across all clans for the time series), so the influence parameters are split into α and 1-α; to account for other sudden factors, Levy(λ) is added as a mutation mechanism so that the algorithm escapes local extrema more easily; round denotes rounding the value in parentheses so that the time series segmentation points are integers;
The improved clan updating operator considers the influence of both the local optimum and the global optimum and makes full use of the Lévy flight's alternation of short and long steps. It moves each individual elephant toward the optimum while widening the search range, so the algorithm escapes local extrema more easily and converges faster.
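One way the improved clan update could look is sketched below. The actual operator is given only as a figure, so the additive form (a pull toward the matriarch weighted by α, a pull toward the elder weighted by 1-α, plus a Lévy step, all rounded) is an assumption; drawing the Lévy step with Mantegna's algorithm, the stability index `beta`, the scale `levy_scale` and the clipping to [x_min, x_max] are likewise illustrative choices.

```python
import math
import numpy as np

def levy_step(rng, beta=1.5, size=None):
    """Heavy-tailed random step via Mantegna's algorithm (assumed stand-in for Levy(lambda))."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def clan_update(x, matriarch, elder, alpha, rng, x_min, x_max, levy_scale=1.0):
    """Assumed update: move toward clan best and global best, perturb, round to integers."""
    x = np.asarray(x, dtype=float)
    pull_local = alpha * rng.random(x.shape) * (np.asarray(matriarch) - x)
    pull_global = (1 - alpha) * rng.random(x.shape) * (np.asarray(elder) - x)
    moved = x + pull_local + pull_global + levy_scale * levy_step(rng, size=x.shape)
    # Rounding keeps segmentation points integral; clipping keeps them in range.
    return np.clip(np.round(moved), x_min, x_max).astype(int)
```

Most Lévy steps are small and refine the current position, while the occasional large step moves an individual far away, which is what helps the search leave a local extremum.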
Note that the update of the matriarch (clan leader) is expressed as follows:
Figure RE-GDA0003941580120000081
wherein
Figure RE-GDA0003941580120000082
where n_i is the number of elephants in clan i, including the matriarch;
Figure RE-GDA0003941580120000083
denotes the center position, i.e. the mean position, of clan i;
Figure RE-GDA0003941580120000084
is the updated position of the best elephant in clan i, i.e. the new position of the matriarch of clan i; β ∈ [0,1] indicates the degree to which the new position of the matriarch
Figure RE-GDA0003941580120000085
is influenced by the center position of clan i; like α, it is an influence parameter.
Then, so that the optimization operates on integer positions, the clan separating operator of the original elephant herding optimization algorithm is adjusted as follows:
Figure RE-GDA0003941580120000086
wherein,
Figure RE-GDA0003941580120000087
is the updated position of the worst elephant in clan i, x_min is the minimum selectable position of an elephant, x_max is the maximum selectable position of an elephant, round is the rounding operation, T is the time series length, and rand is a random number.
Because segments in the segmented Gaussian model are left-closed and right-open, an interior segmentation point of a time series of length T takes an integer position chosen from 2, …, T.
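The separation step can be illustrated as follows. This is a sketch under an assumed form, `round(x_min + (x_max - x_min) * rand)`, since the equation image is not available here; `separate_worst` is a hypothetical name, and the final clipping is a safety measure added for the sketch.

```python
import random


def separate_worst(x_min, x_max, rng=random):
    """Draw a fresh integer position in [x_min, x_max] for the worst elephant."""
    pos = round(x_min + (x_max - x_min) * rng.random())
    return min(max(pos, x_min), x_max)


T = 400                      # time series length
rng = random.Random(0)       # seeded generator so the run is repeatable
samples = [separate_worst(2, T, rng) for _ in range(5)]
print(samples)               # five integer segmentation points between 2 and T
```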
Then, to speed up convergence toward the optimal solution, in addition to applying the separation operation to the individual with the worst fitness value in each clan, the whole herd can be sorted after the clan update operator and the clan separation operator have run; a certain number of individuals with the worst fitness values are selected and moved to the positions of the elite individuals retained from the previous iteration. This finally gives the improved elephant herd optimization algorithm.
As shown in Fig. 2 and Table 1 below, the basic steps of applying the improved elephant herd algorithm to the segmented Gaussian model, which gives time series Gaussian segmentation based on the improved elephant herd algorithm, are as follows:
initializing the parameters and the population of the improved elephant herd optimization algorithm;
computing the fitness values of all time series segments in the time series with the segmented Gaussian model and starting the iteration; if the preset maximum number of iterations is reached, outputting the optimal position and its fitness value; otherwise, sorting the fitness values of all time series segments and retaining several of the best time series segments;
executing the clan update operator of the improved elephant herd optimization algorithm, updating the positions of all time series segments and the position of the matriarch until the time series segments in all clans have been updated;
executing the clan separation operator of the improved elephant herd optimization algorithm, updating the positions and fitness values of the several worst time series segments until all of them have been separated;
sorting the fitness values of all time series segments to find the several worst time series segments that need to be separated, moving them to the positions of the retained time series segments, and computing the fitness values to update the optimal position of the herd leader, i.e. the optimal segmentation point over all time series segments.
TABLE 1 Basic procedure of time series Gaussian segmentation based on the improved elephant herd optimization algorithm
Figure RE-GDA0003941580120000091
Figure RE-GDA0003941580120000101
In step S102, a Gaussian mixture model initialized with the improved peak density is used to cluster the time series segments, yielding a plurality of cluster centers, sub-model variances, and cluster labels.
That is, the subsequences obtained by segmentation are taken as objects and clustered with a time series segment Gaussian clustering model initialized with the improved peak density.
Specifically, Gaussian clustering of time series segments assumes that the segments S_1, S_2, …, S_{K+1} obtained by Gaussian segmentation are mutually independent, that the time series variables in each segment obey that segment's own Gaussian distribution, and that a new class formed by clustering similar segments obeys a broader Gaussian distribution. The embodiment of the present invention therefore uses a Gaussian mixture model to describe all time series segments, in the following form:
Figure RE-GDA0003941580120000102
where α_m > 0 is the coefficient of the m-th sub-model and satisfies
Figure RE-GDA0003941580120000103
M represents the number of clusters; θ = (α_m; θ_m) = (α_m; μ_m, Σ_m | m = 1, 2, …, M) represents the parameters of model (6); P(S | θ) represents the probability distribution of the time series segments S_1, S_2, …, S_{K+1}; φ(S | θ_m) is the Gaussian distribution of the time series segments under the parameter θ_m = (μ_m, Σ_m), which is also the m-th sub-model of model (6) and has the following expression:
Figure RE-GDA0003941580120000104
the Gaussian mixture model of the time series has hidden variables, which are defined as follows:
Figure RE-GDA0003941580120000105
k = 1, 2, …, K+1; m = 1, 2, …, M
The parameters of the Gaussian mixture model of the time series segments are then estimated with the EM algorithm, which gives:
Figure RE-GDA0003941580120000106
Figure RE-GDA0003941580120000111
Figure RE-GDA0003941580120000112
Figure RE-GDA0003941580120000113
where
Figure RE-GDA0003941580120000114
is the probability that segment S_k belongs to the m-th sub-model, α_m is the coefficient of the sub-model, x_t is the time series, S_k is a segment of the time series, φ is the simplified log-likelihood function of each time series segment, θ_m are the parameters of the model,
Figure RE-GDA0003941580120000115
is the updated mean of each cluster,
Figure RE-GDA0003941580120000116
is the updated variance of each cluster, μ_m is the original mean of each cluster, and
Figure RE-GDA0003941580120000117
is the updated coefficient of each cluster.
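To make the E-step and M-step concrete, the following sketch runs EM on a one-dimensional Gaussian mixture. It is a deliberate simplification: the embodiment clusters whole time series segments, whereas this sketch clusters scalar observations, and the variance floor of 1e-9 is an added safeguard. The functions `gauss_pdf` and `em_step` are hypothetical helpers.

```python
import math


def gauss_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_step(data, alphas, mus, variances):
    """One EM iteration for a 1-D Gaussian mixture; returns updated parameters."""
    m_count = len(alphas)
    # E-step: posterior probability that each point belongs to each sub-model.
    resp = []
    for x in data:
        w = [alphas[m] * gauss_pdf(x, mus[m], variances[m]) for m in range(m_count)]
        total = sum(w)
        resp.append([wm / total for wm in w])
    # M-step: re-estimate coefficient, mean and variance of each sub-model.
    new_alphas, new_mus, new_vars = [], [], []
    for m in range(m_count):
        n_m = sum(r[m] for r in resp)
        mu = sum(r[m] * x for r, x in zip(resp, data)) / n_m
        var = sum(r[m] * (x - mu) ** 2 for r, x in zip(resp, data)) / n_m
        new_alphas.append(n_m / len(data))
        new_mus.append(mu)
        new_vars.append(max(var, 1e-9))  # floor avoids a zero variance
    return new_alphas, new_mus, new_vars


data = [0.1, -0.2, 0.05, 5.1, 4.9, 5.2]          # two obvious groups
params = ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0])    # uniform coefficients, unit variances
for _ in range(20):
    params = em_step(data, *params)
print(params[1])  # estimated means, one near each group
```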
The EM algorithm is sensitive to initial values and may not escape once it is trapped in a local extremum. For Gaussian clustering of time series, the mean
Figure RE-GDA0003941580120000118
the variance
Figure RE-GDA0003941580120000119
and the coefficient
Figure RE-GDA00039415801200001110
need to be initialized. The initialization of the sub-model coefficient
Figure RE-GDA00039415801200001111
does not affect the clustering result of the time series Gaussian mixture model, so the embodiment of the present invention initializes it with a uniform distribution, as shown in the following formula:
Figure RE-GDA00039415801200001112
where M represents the number of clusters.
To reduce the influence of the variance on clustering, the variance is initialized with the identity matrix. The most important step is initializing the mean; the peak density clustering algorithm is improved and used for this purpose.
For time series segments S_j and S_k, the Bhattacharyya distance is defined as
Figure RE-GDA00039415801200001113
The smaller the Bhattacharyya distance D_B(S_j, S_k), the more similar the time series segments S_j and S_k are. The Bhattacharyya distance between time series segment S_j and all time series segments is
Figure RE-GDA00039415801200001114
The smaller the Bhattacharyya distance D_B(S_j, S), the more similar S_j is to most time series segments and the more suitable it is as a cluster center. However, data points with higher local density are better suited as cluster centers, so D_B(S_j, S) cannot be used directly as the local density of a data point; the local density is therefore defined as follows:
Figure RE-GDA0003941580120000121
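For reference, the distance D_B between two segments can be sketched with the standard closed form of the Bhattacharyya distance between two univariate Gaussians. This is an illustration under assumptions: the patent's own expression is an image that is not reproduced here, each segment is summarized by its sample mean and population variance, and the distance from one segment to all segments is taken to be a plain sum. The helper names are hypothetical.

```python
import math
from statistics import mean, pvariance


def bhattacharyya_gauss(mu1, var1, mu2, var2):
    """Bhattacharyya distance between N(mu1, var1) and N(mu2, var2)."""
    term_mean = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    term_var = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return term_mean + term_var


def total_distance(j, segments):
    """Assumed aggregate: sum of distances from segment j to every segment."""
    stats = [(mean(s), pvariance(s)) for s in segments]
    mu_j, var_j = stats[j]
    return sum(bhattacharyya_gauss(mu_j, var_j, mu, var) for mu, var in stats)


# Hypothetical segments: the first two are similar, the third is far away.
segments = [[1.0, 1.2, 0.8, 1.1], [0.9, 1.3, 1.0, 0.7], [5.0, 5.5, 4.5, 5.2]]
print([round(total_distance(j, segments), 3) for j in range(3)])
```

Segments with a small total distance resemble most other segments, which is the property the text uses when it builds the local density.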
The relative distance δ(S_j) of time series segment S_j can be expressed as
Figure RE-GDA0003941580120000122
Figure RE-GDA0003941580120000123
Figure RE-GDA0003941580120000124
where equation (13) gives the relative distance of a time series segment S_j whose local density ρ(S_j) is not the maximum; equation (14) gives the relative distance of the time series segment S_j that has the maximum local density ρ(S_j); d_ji represents the Euclidean distance between the means of the Gaussian distributions obeyed by time series segments S_j and S_i.
Min-max normalization is applied, and the product of the normalized local density and the normalized relative distance is used as the criterion for choosing cluster centers, as follows:
Figure RE-GDA0003941580120000125
η′(S_j) = ρ′(S_j) × δ′(S_j), j = 1, 2, …, K+1 (16)
The calculated η′(S_j) values are sorted in descending order and plotted, and the time series segments S_m (m = 1, 2, …, M) whose values lie far from 0 are selected. If M segments are selected, the segments are grouped into M classes.
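The selection rule in equation (16) can be sketched as below. The density and distance values are made-up inputs, `min_max` and `pick_centers` are hypothetical helpers, and the number of centers m is passed in explicitly, whereas the text chooses it by inspecting the plot of sorted values.

```python
def min_max(values):
    """Min-max normalisation to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_centers(rho, delta, m):
    """Return the indices of the m segments with the largest rho' * delta'."""
    eta = [r * d for r, d in zip(min_max(rho), min_max(delta))]
    order = sorted(range(len(eta)), key=lambda j: eta[j], reverse=True)
    return order[:m], eta


rho = [0.9, 0.8, 0.2, 0.85, 0.1]     # hypothetical local densities
delta = [3.0, 0.2, 0.3, 2.5, 0.4]    # hypothetical relative distances
centers, eta = pick_centers(rho, delta, m=2)
print(centers)  # indices of the segments chosen as initial cluster centers
```

Only segments that are both dense and far from any denser segment get a large product, so dense points sitting next to an existing center are not selected twice.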
The time series segment clustering process based on the Gaussian mixture model yields the cluster center of each class, the variance of each sub-model of the Gaussian mixture model, and the label of the class to which each sequence segment belongs. The cluster labels require additional storage and play an important role in representing the whole time series and in decompression. Since each subsequence is represented only by its cluster center, only the cluster center data need to be retained during compression.
As shown in Fig. 3 and Table 2 below, the basic flow of time series segment clustering based on the Gaussian mixture model is:
running the initialization clustering algorithm based on the improved peak density to obtain the initial value of the mean
Figure RE-GDA0003941580120000126
of the time series Gaussian mixture model, the initial value of each sub-model coefficient
Figure RE-GDA0003941580120000127
and the initial value of each sub-model variance
Figure RE-GDA0003941580120000128
;
inputting a preset maximum iteration number, a preset threshold, a plurality of segmentation points and a plurality of time sequence segments;
iteratively executing the E-step of the EM algorithm to compute the probability that time series segment S_k belongs to the m-th sub-model
Figure RE-GDA0003941580120000131
iteratively executing the M-step of the EM algorithm to compute, for each cluster, the updated mean
Figure RE-GDA0003941580120000132
the updated variance
Figure RE-GDA0003941580120000133
and the updated coefficient
Figure RE-GDA0003941580120000134
judging whether the difference between two successive log-likelihood function values is smaller than the preset threshold, or whether the variance of a sub-model of the Gaussian mixture model is 0; if so, ending the iteration and outputting the parameters of the optimal clustering, namely the mean
Figure RE-GDA0003941580120000135
the variance
Figure RE-GDA0003941580120000136
the coefficient
Figure RE-GDA0003941580120000137
and the probability
Figure RE-GDA0003941580120000138
otherwise, judging whether the number of iterations is smaller than the preset maximum number of iterations; if so, continuing the iteration; otherwise, ending the iteration and outputting the parameters of the optimal clustering, namely the mean
Figure RE-GDA0003941580120000139
the variance
Figure RE-GDA00039415801200001310
the coefficient
Figure RE-GDA00039415801200001311
and the probability
Figure RE-GDA00039415801200001312
TABLE 2 basic procedure for time series segment clustering based on Gaussian mixture model
Figure RE-GDA0003941580120000141
In step S103, each cluster center is further divided into equal-length segments, with the number of segments allocated in proportion to the sub-model variance.
Specifically, the clustering quality of each class is judged from the variance of each cluster: the greater the variance, the stronger the fluctuation and the more segments need to be allocated. Given a preset number of segments w, the segments are distributed in proportion to the variance of each class to determine the number of segments of each class, as follows:
Figure RE-GDA0003941580120000142
where M is the number of clusters, c_j is the variance of sub-model j, and v_j is the number of segments of class j;
the cluster center of each class is then divided into equal-length segments according to its number of segments, as follows:
Figure RE-GDA0003941580120000151
wherein,
Figure RE-GDA0003941580120000152
represents the mean of the center of class j over segment v_j.
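The allocation and the equal-length division can be sketched as follows. The rounding rule `round(w * c_j / sum(c))` with a minimum of one segment per class is an assumption, because the allocation formula is an image that is not reproduced here; with rounding, the allocated counts need not sum exactly to w. `allocate_segments` and `paa` are hypothetical names.

```python
def allocate_segments(variances, w):
    """Share w segments among classes in proportion to variance (at least 1 each)."""
    total = sum(variances)
    return [max(1, round(w * c / total)) for c in variances]


def paa(center, v):
    """Means of `center` over v near-equal-length pieces (requires v <= len(center))."""
    n = len(center)
    bounds = [round(i * n / v) for i in range(v + 1)]
    return [sum(center[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


variances = [1.0, 3.0]                  # hypothetical sub-model variances c_j
counts = allocate_segments(variances, w=8)
print(counts)                           # the more volatile class receives more segments
print(paa([1, 2, 3, 4, 5, 6], 3))       # segment means of one cluster center
```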
In step S104, the segment means of each class center are converted into a symbolic representation with the SAX method, and the first character of each class is a capital letter.
Specifically, since the symbolic representations of the classes are concatenated, the first symbol of each class center is converted to a capital letter to mark where each class begins. Because the probability interval of the standard Gaussian distribution is usually divided into 3 to 8 parts, lowercase letters are sufficient for the SAX symbols, so capital letters are suitable for marking the start of a cluster center in the embodiment of the present invention.
In step S105, identical symbols within the same class are clipped and the leading capital letter is retained, giving the compressed time series data.
Specifically, since each class center is subdivided, when the data change slowly the SAX symbols are likely to fall within the same interval, so clipping is required. In this case, only the capital-letter symbol at the start of the class is retained.
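Steps S104 and S105 can be sketched together. The breakpoints are the usual equiprobable cut points of the standard normal distribution used by SAX; the input means are assumed to be already z-normalized. The clipping rule implemented here is one reading of the text: a class word collapses to its capital letter only when all of its symbols are identical. `sax_symbols` and `encode_class` are hypothetical names.

```python
from bisect import bisect_right
from statistics import NormalDist


def sax_symbols(means, alphabet_size):
    """Map z-normalised segment means to lowercase SAX letters."""
    nd = NormalDist()
    cuts = [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]
    return "".join(chr(ord("a") + bisect_right(cuts, m)) for m in means)


def encode_class(means, alphabet_size):
    """SAX word of one cluster center with a capital first letter, then clipping."""
    word = sax_symbols(means, alphabet_size)
    word = word[0].upper() + word[1:]
    # Clipping (assumed rule): keep only the capital when every symbol agrees.
    if len(set(word.lower())) == 1:
        return word[0]
    return word


# Two hypothetical cluster centers, alphabet size 3.
compressed = encode_class([-1.2, 0.1, 0.9], 3) + encode_class([0.0, 0.1, -0.1], 3)
print(compressed)
```

Because every class word starts with a capital, the concatenated string can be split again without storing separators.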
The basic flow of high-frequency time series data compression based on the improved symbol aggregation approximation is therefore as shown in Fig. 4 and Table 3 below.
TABLE 3 basic flow of high frequency time series data compression based on improved symbol aggregation approximation
Figure RE-GDA0003941580120000153
In addition, based on the above compression method, an embodiment of the present invention provides a high-frequency time series data decompression method based on the improved symbol aggregation approximation. As shown in Figs. 5 and 6 and Table 4 below, the method includes the following steps: step S201, scanning the compressed time series data and identifying the capital letters to obtain the symbolic representation of each cluster center; step S202, computing the sequence segment lengths for each cluster center and determining the segmentation points; step S203, restoring the symbolic representation of each sequence segment according to the segmentation points, thereby obtaining the symbolic representation of the whole time series; step S204, inverse-transforming each symbol with the SAX method to obtain the time series.
TABLE 4 basic flow of high frequency time series data decompression based on improved symbol aggregation approximation
Figure RE-GDA0003941580120000161
The above steps implement the decompression of the entire time series. Note that Step1 and Step2 have no required order and may be performed simultaneously or one after the other. Step3.3 uses the length of each sequence segment and the number of symbols in its cluster center: if the segment length is l and the number of symbols is n, each symbol covers l/n points, so each symbol is expanded into l/n consecutive symbols, which gives the complete symbolic representation of each segment.
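The symbol-restoration part of decompression (steps S201 to S203) can be sketched as follows. The sketch assumes that the cluster label and length of every segment are stored alongside the compressed string, as the description indicates, and it stops before the final SAX inverse transform that maps symbols back to values. When l is not divisible by n, the rounding of boundaries is a choice made for this sketch. All function names are hypothetical.

```python
import re


def split_classes(compressed):
    """Split at capital letters: one lowercase word per cluster center."""
    return [w.lower() for w in re.findall(r"[A-Z][a-z]*", compressed)]


def expand_segment(word, length):
    """Stretch an n-symbol word to `length` symbols, about length/n per symbol."""
    n = len(word)
    bounds = [round(i * length / n) for i in range(n + 1)]
    return "".join(s * (b - a) for s, a, b in zip(word, bounds, bounds[1:]))


def decompress_symbols(compressed, labels, seg_lengths):
    """Rebuild the symbol string of the whole series from labels and lengths."""
    words = split_classes(compressed)
    return "".join(expand_segment(words[lab], l) for lab, l in zip(labels, seg_lengths))


# Hypothetical input: two cluster centers, three segments labelled 0, 1, 0.
print(decompress_symbols("AbcB", labels=[0, 1, 0], seg_lengths=[6, 4, 3]))
```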
The high-frequency time series data compression method based on the improved symbol aggregation approximation provided by the embodiment of the present invention is evaluated below by experimental simulation and result analysis.
(I) First, the evaluation indexes for the time series compression experiments are determined. Compression ratio, mean square error, root mean square error and mean absolute error are used to describe the compression effect: the compression ratio evaluates compression, and the other three evaluate decompression. They are expressed as follows:
Figure RE-GDA0003941580120000162
where CR is the ratio of the compressed size
Figure RE-GDA0003941580120000163
to the size w before compression. The smaller the compression ratio, the more the space occupied by the compressed data is reduced and the better the compression effect.
Figure RE-GDA0003941580120000164
where MSE denotes the mean square error, x_t represents the true time series data,
Figure RE-GDA0003941580120000165
represents the decompressed data, and T represents the number of values in the current time series. The mean square error describes the quality of fit: the smaller it is, the better the decompressed data restore the original data and the better the compression effect.
Figure RE-GDA0003941580120000166
RMSE denotes the root mean square error, i.e. the square root of the mean square error MSE, so the smaller this index is, the better the compression effect on the data.
Figure RE-GDA0003941580120000171
MAE denotes the mean absolute error. This index likewise describes the difference between the decompressed data and the original data: the smaller the value, the smaller the difference and the better the compression effect.
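The four indexes can be computed as in the sketch below. MSE, RMSE and MAE follow their standard definitions over the T values of the series; the compression ratio is taken as compressed size divided by original size, as stated in the text, with both sizes supplied by the caller in whatever unit the experiment uses. The function names are hypothetical.

```python
import math


def mse(x, x_hat):
    """Mean square error between original and decompressed series."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def rmse(x, x_hat):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(x, x_hat))


def mae(x, x_hat):
    """Mean absolute error between original and decompressed series."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def compression_ratio(compressed_size, original_size):
    """Compressed size divided by original size; smaller means stronger compression."""
    return compressed_size / original_size


x = [1.0, 2.0, 3.0]          # hypothetical original values
x_hat = [1.0, 2.0, 5.0]      # hypothetical decompressed values
print(mse(x, x_hat), rmse(x, x_hat), mae(x, x_hat), compression_ratio(5, 400))
```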
(II) To assess the practicality of the compression model, 3 comparison methods are selected: the SAX method, the SAX + segmentation method and the SAX + segmentation + clustering method, where:
The SAX method: the model proposed in the embodiment of the present invention improves on two defects of the SAX method, namely that it cannot identify data features and that its arbitrary segmentation leads to poor decompression. Comparing against the SAX method therefore shows the size of the improvement and further illustrates the applicability of the proposed method.
The SAX + segmentation method segments the time series by Gaussian segmentation based on the improved elephant herd optimization algorithm provided by the embodiment of the present invention, which extracts some features of the data. Comparing against it shows whether the proposed method achieves a better compression effect.
The SAX + segmentation + clustering method is similar to the SAX + segmentation method: clustering is carried out on top of segmentation, and the cluster center of each class represents every sequence segment in that class. Compared with the SAX + segmentation method, which represents each sequence segment by its own mean, it has a smaller compression ratio but produces a larger error after decompression. This method is therefore needed to show how the cluster-center re-segmentation and clipping operations of the embodiment of the present invention affect compression.
(III) Validation and analysis using the Gesture dataset
The first-dimension data of the Gesture dataset, with lengths 400 and 1300, are used to illustrate the proposed compression strategy.
To illustrate the compression effect of the proposed method, 3 numbers of segments are selected for each data length: one is the number of segments obtained by Gaussian segmentation of the time series, and the other two are values greater than and less than it. Thus, for the first-dimension Gesture data of length 400, the numbers of segments are 5, 20 and 3, and for the long series of length 1300, the numbers of segments are 15, 50 and 10; the α value takes the 3 values 3, 5 and 8 within the usual range.
Tables 5 to 8 give the compression and decompression evaluation of the first-dimension Gesture data of length 400, where Tables 5 to 7 show the results of the proposed method and the SAX method when the number of segments is 5, 20 and 3 respectively. Tables 9 to 12 give the compression and decompression evaluation of the first-dimension Gesture data of length 1300, where Tables 9 to 11 show the results of the proposed method and the SAX method when the number of segments is 15, 50 and 10 respectively. Tables 8 and 12 show the evaluation results of the SAX + segmentation method and the SAX + segmentation + clustering method; these two methods are not affected by the number of segments but depend on the α value. All 8 tables show the compression and decompression effects for α values of 3, 5 and 8.
TABLE 5 compression of Gesture first dimension data time series value of length 400 at number of segments of 5
Figure RE-GDA0003941580120000181
TABLE 6 compression of 400-Length Gesture first-dimension data time-series values at a number of segments of 20
Figure RE-GDA0003941580120000182
TABLE 7 compression results for Gesture first dimension data time series value of length 400 at number of segments of 3
Figure RE-GDA0003941580120000183
TABLE 8 compression results of Gesture first dimension data time series values of length 400
Figure RE-GDA0003941580120000184
For short sequences, whether the number of segments is 5, 20 or 3 and whether the α value is 3, 5 or 8, as shown in Tables 5 to 7, the method proposed by the embodiment of the present invention keeps the same compression ratio as the SAX method while giving smaller MSE, RMSE and MAE values after decompression, which indicates that it fits the original data more closely after decompression and restores them to a greater extent. For long sequences, whether the number of segments is 15, 50 or 10 and whether the α value is 3, 5 or 8, as shown in Tables 9 to 11, the proposed method has both a lower compression ratio than the SAX method and smaller MSE, RMSE and MAE values after decompression, which indicates that it compresses more and also restores the original data better. The proposed method therefore has a better compression effect than the SAX method.
Comparing the SAX + segmentation method with the SAX + segmentation + clustering method: the former compresses the segments obtained by time series segmentation, and the latter additionally clusters them, so the latter has a lower compression ratio, as shown in Tables 8 and 12. For short sequences, when α = 3 the two methods have the same evaluation index values after decompression, and when α = 5 and 8 the three index values of the SAX + segmentation method are smaller; for long sequences, the MSE, RMSE and MAE values of the SAX + segmentation method are smaller than those of the SAX + segmentation + clustering method at all 3 α values. This shows that, for both short and long sequences, the SAX + segmentation + clustering method achieves higher compression by sacrificing data restoration capability, and whether it matches the restoration capability of the SAX + segmentation method depends on the data and the α value. The two methods therefore have different emphases, but the compression ratio of each is fixed.
The method proposed by the embodiment of the present invention is compared with the SAX + segmentation method. For short sequences with the same compression ratio, as shown in Tables 6 and 8, the proposed method has smaller MSE, RMSE and MAE values, which indicates that it fits the original time series data better and has a better compression effect. When the number of segments is 3, the compression ratio of the proposed method is lower; its 3 index values match those of the SAX + segmentation method when α = 3 but are larger when α = 5 and 8. Since the proposed method fits better at the same compression ratio and reaches the same fit at some α values with a lower compression ratio, it is stronger than the SAX + segmentation method.
For long sequences, as shown in Tables 9 to 12, when the number of segments is 15, the compression ratio and the 3 evaluation indexes of the proposed method are all lower than those of the SAX + segmentation method at α = 5; at α = 3 and 8 the index values are higher but the compression ratio is lower. When the number of segments is 50, all 4 index values of the proposed method are higher than those of the SAX + segmentation method at α = 3; at α = 5 and 8 the compression ratio is higher but the other 3 index values are lower. When the number of segments is 10, the MSE, RMSE and MAE values of the proposed method are all larger than those of the SAX + segmentation method, but the compression ratio is lower.
In summary, compared with the SAX + segmentation method, the method provided by the embodiment of the present invention can maintain good data restoration capability while having a lower compression ratio.
The method proposed by the embodiment of the present invention is compared with the SAX + segmentation + clustering method. For short sequences, when the compression ratios are the same, the other 3 index values of the two methods are the same, as in tables 8 and 9; when the compression ratio of the proposed method is higher, its other 3 index values are lower. For long sequences, the two methods have the same index values when the number of segments is 15 and α = 3, and when the number of segments is 10 and α = 3 and 5. For the other numbers of segments and α values, the proposed method has a higher compression ratio and lower values for the other 3 indexes. Therefore, compared with the SAX + segmentation + clustering method, the method provided by the embodiment of the invention has lower compression ratio and better data restoration capability.
TABLE 9 compression of Gesture first dimension data time series values of length 1300 at 15 segments
Figure RE-GDA0003941580120000201
TABLE 10 compression of Gesture first dimension data time series values of length 1300 at a segment number of 50
Figure RE-GDA0003941580120000202
TABLE 11 compression results of Gesture first dimension data time series values of length 1300 at a segment number of 10
Figure RE-GDA0003941580120000203
TABLE 12 compression results of Gesture first dimension data time series values of length 1300
Figure RE-GDA0003941580120000204
Figure RE-GDA0003941580120000211
In addition, whereas the compression ratios of those two methods are fixed, the method provided by the embodiment of the present invention can adjust its compression ratio: for short sequences, the compression ratio obtained with 20 or 3 segments differs from that obtained with 5 segments, and the same holds for long sequences. The larger the number of segments, the higher the compression ratio and the smaller the fitting error; the smaller the number of segments, the lower the compression ratio. A strategy with stronger compression can therefore be selected as long as the fitting error remains tolerable.
In summary, the method provided in the embodiment of the present invention is suitable for compressing the Gesture data. It combines the high data restoration capability of the SAX + segmentation method with the low compression ratio of the SAX + segmentation + clustering method, and if the lowest compression ratio is desired, the number of segments only needs to be set to the number of clusters. The method is therefore adaptable and has a certain applicability.
(IV) Validation and analysis using the PSCADA dataset
The la data in the PSCADA dataset are used, with lengths 100 and 1500 respectively.
Similarly, since the SAX method depends on the number of segments and the α value, 3 numbers of segments are selected for each data length in order to illustrate the compression effect of the proposed method on data with a certain periodicity. For the short sequence of length 100, one number of segments is 6, obtained based on chapter ii, and the other two are the values 10 and 4 on either side of it; for the long sequence, 109, 50 and 150 are selected, where 109 is the number of segments from chapter three; the α value takes the 3 values 3, 5 and 8 within the usual range.
Tables 13 to 15 show the compression and decompression evaluation of the la data of length 100, and Tables 16 to 18 show the compression and decompression evaluation of the la data of length 1500; these 6 tables show the evaluation results for different numbers of segments, different α values and different methods. The proposed method gives the same compression effect for the short sequence when the number of segments is 6, 4 and 10, and the same compression effect for the long sequence when the number of segments is 109, 50 and 150, so its results are shown only in Tables 13 and 16 respectively. The SAX + segmentation method and the SAX + segmentation + clustering method are likewise not affected by the number of segments but depend on the α value, as shown in Tables 15 and 18.
TABLE 13 compression results of la data time series values of length 100 under the proposed method and SAX method
Figure RE-GDA0003941580120000212
TABLE 14 compression results of la data time series values of length 100 under SAX method
Figure RE-GDA0003941580120000221
TABLE 15 compression results of la data time series values of length 100
Figure RE-GDA0003941580120000222
For a short sequence, when the number of segments is 6, no matter the α value is 3, 5 or 8, as shown in table 13, the method proposed by the embodiment of the present invention can achieve a smaller compression rate while maintaining the same MSE value, RMSE value and MAE value as the SAX method, which indicates that the method can store data in a smaller space. In addition, when the number of segments is 10 and 4, the compression rate of the method provided by the embodiment of the invention is lower than that of the SAX method under the corresponding alpha value, and the index values of MSE, RMSE and MAE are smaller, which can obviously show that the method provided by the embodiment of the invention has better compression effect than that of the SAX method. The SAX method has a low compression rate when the number of segments is 6 and has the minimum other evaluation values because la data with the length of 100 has certain periodicity, the periodicity of the data is just divided by the number of segments at the time, and the characteristics of the data are damaged by the other two numbers of segments, so that the compression effect is poor.
For the long sequence, whether the number of segments is 109, 50 or 150 and whether the α value is 3, 5 or 8, the compression rate and the MSE, RMSE and MAE values obtained by the method provided by the embodiment of the present invention are all lower than those obtained by the SAX method, which shows that for longer la data the proposed method still achieves a lower compression rate and a higher data restoration capability. The SAX method does not reach its lowest compression rate or its best values of the other three indices even at the optimal number of segments, 109, because although the time series has a certain periodicity, the periodicity is not strong, and there may be data between periods that belongs to no period. When averaging the sequence segments, the SAX method therefore places data from different periods in the same segment, which weakens its data restoration capability.
In summary, the compression rate and the decompression effect of the SAX method depend strongly on the number of segments, the α value, and the characteristics of the data itself, whereas the method provided by the embodiment of the present invention extracts the characteristics of the data better and uses them to compress the data, so that a good compression effect is obtained.
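The comparison above rests on four quantities: the compression rate and the MSE, RMSE and MAE between the original and the decompressed series. A minimal sketch of how these could be computed is given below. The function names are illustrative, and the compression rate is assumed here to be the ratio of compressed size to original size (lower is better), which matches how the text uses the term but is not defined in this section.

```python
import math


def reconstruction_errors(original, restored):
    """Return (MSE, RMSE, MAE) between two equally long numeric sequences."""
    if len(original) != len(restored) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = [o - r for o, r in zip(original, restored)]
    mse = sum(d * d for d in diffs) / len(diffs)
    rmse = math.sqrt(mse)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, rmse, mae


def compression_rate(compressed_size, original_size):
    """Assumed definition: compressed size divided by original size."""
    return compressed_size / original_size
```

Under this assumed definition, a smaller compression rate means less storage, and smaller error indices mean the decompressed series is closer to the original.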
TABLE 16 compression results of la data time series values of length 1500 under the proposed method and SAX method
Figure RE-GDA0003941580120000231
TABLE 17 compression results of la data time series values of length 1500 under SAX method
Figure RE-GDA0003941580120000232
TABLE 18 compression results for la data time-series values of length 1500
Figure RE-GDA0003941580120000233
Comparing the SAX + segmentation method with the SAX + segmentation + clustering method, the compression rate of the former is higher than that of the latter, which is consistent with theory, as shown in Tables 15 and 18. For the short sequence, the three index values of the two methods are the same at α = 3 and α = 5, while at α = 8 those of the SAX + segmentation method are slightly smaller; for the long sequence, the MSE, RMSE and MAE values are the same at all three α values. These results indicate that the SAX + segmentation method places more emphasis on data restoration capability, whereas the SAX + segmentation + clustering method places more emphasis on a lower compression rate and depends on the characteristics of the data; when the data is relatively stable, the SAX + segmentation + clustering method outperforms the SAX + segmentation method.
The method proposed by the embodiment of the present invention is next compared with the SAX + segmentation method. For the short sequence, as shown in Tables 13 and 15, at α = 3 and α = 5 the proposed method reaches the same MSE, RMSE and MAE values as the SAX + segmentation method with a smaller compression rate; at α = 8 its three index values are slightly larger, but its compression rate is much lower. For the long sequence, as shown in Tables 16 and 18, the two methods have the same MSE, RMSE and MAE values, but the proposed method has the smaller compression rate. The proposed method therefore achieves a lower compression rate at the same data restoration capability, and still a very low compression rate where its data restoration capability is slightly worse.
The method proposed by the embodiment of the present invention is then compared with the SAX + segmentation + clustering method. For both short and long sequences, the two methods yield the same compression rate and the same three index values. This is because there is substantially no change in the data.
Based on the above, the method provided by the embodiment of the invention combines the high data restoration capability of the SAX + segmentation method with the low compression rate of the SAX + segmentation + clustering method.
In summary, according to the high-frequency time series data compression method based on the improved symbol aggregation approximation provided by the embodiment of the invention, segmentation separates the features of the time series well, and clustering groups time series segments with similar features into one class, thereby achieving dimension reduction and compression of the time series data; combining segmentation and clustering with the SAX method compresses the data again. Experiments show that the method provided by the embodiment of the invention can achieve a lower compression rate and a higher data restoration capability simultaneously, or at least one of the two.
Next, a high-frequency time-series data compression apparatus based on an improved symbol aggregation approximation proposed according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 7 is a schematic structural diagram of a high-frequency time series data compression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention.
As shown in Fig. 7, the apparatus 10 includes: a division module 101, a clustering module 102, an equidistant segmentation module 103, a conversion module 104, and a clipping module 105.
The division module 101 is configured to divide the time series by using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments. The clustering module 102 is configured to cluster the plurality of time series segments using a Gaussian mixture model initialized based on the improved density peak to obtain a plurality of cluster centers, a plurality of sub-model variances, and a plurality of cluster labels. The equidistant segmentation module 103 is configured to segment each cluster center again at equal intervals according to the proportion of the sub-model variances. The conversion module 104 is configured to convert the mean value of each segment of each class center into a symbolic representation by using the SAX method, where the first character of each class is a capital letter. The clipping module 105 is configured to clip the identical symbolic representations of the same class and retain the first capital letter, thereby obtaining the compressed time series value data.
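The last two modules can be illustrated with a small sketch. It assumes that the segment means are z-normalized, that the α value of the experiments is the SAX alphabet size, and that clipping means writing a class's full word (with a capitalized first letter) at its first occurrence and only the capital letter at later occurrences. That reading of the clipping step, and all function names, are assumptions rather than details confirmed by the text.

```python
from bisect import bisect_right
from statistics import NormalDist


def sax_breakpoints(alphabet_size):
    """Breakpoints that split the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax_word(segment_means, alphabet_size):
    """Map each (z-normalized) segment mean to a lowercase letter."""
    bps = sax_breakpoints(alphabet_size)
    return "".join(chr(ord("a") + bisect_right(bps, m)) for m in segment_means)


def clip(labels, class_words):
    """Assumed clipping rule: full word on first occurrence of a class,
    only its capitalized first letter on every later occurrence."""
    seen = set()
    pieces = []
    for label in labels:
        word = class_words[label]
        if label in seen:
            pieces.append(word[0].upper())
        else:
            pieces.append(word[0].upper() + word[1:])
            seen.add(label)
    return "".join(pieces)
```

For example, with class words `{0: "abc", 1: "cb"}` and cluster labels `[0, 1, 0, 1]`, the sketch produces `AbcCbAC`. It only decodes unambiguously if the class words start with different letters, which is a limitation of this illustration.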
It should be noted that the above explanation of the embodiment of the high frequency time series data compression method based on the improved symbol aggregation approximation is also applicable to the apparatus of the embodiment, and is not repeated herein.
According to the high-frequency time series data compression apparatus based on the improved symbol aggregation approximation provided by the embodiment of the invention, segmentation separates the features of the time series well, and clustering groups time series segments with similar features into one class, thereby achieving dimension reduction and compression of the time series data; combining segmentation and clustering with the SAX method compresses the data again. Experiments show that the method provided by the embodiment of the invention can achieve a lower compression rate and a higher data restoration capability simultaneously, or at least one of the two.
The high-frequency time series data decompression device based on the improved symbol aggregation approximation proposed according to the embodiment of the invention is also described with reference to the attached drawings.
Fig. 8 is a schematic structural diagram of a high-frequency time-series data decompression apparatus based on improved symbol aggregation approximation according to an embodiment of the present invention.
As shown in Fig. 8, the apparatus 20 includes: a scanning and identification module 201, a division point determination module 202, a symbolic representation restoring module 203 and an inverse transformation module 204.
The scanning and identification module 201 is configured to scan the compressed time series value data, identify the capital characters, and obtain the symbolic representation of each cluster center. The division point determination module 202 is configured to calculate the sequence segment length of each cluster center and determine the division points. The symbolic representation restoring module 203 is configured to restore the symbolic representation of each sequence segment according to the division points, so as to obtain the symbolic representation of the entire time series. The inverse transformation module 204 is configured to apply the inverse transformation of the SAX method to each symbol to obtain a time series.
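The scanning and inverse transformation steps can be sketched as follows. The sketch assumes a compressed string in which each capital letter starts one occurrence of a class, the first occurrence carries the full word, and later occurrences carry only the capital letter; it also assumes that class words start with different letters. The representative value chosen for each symbol (the median of its equiprobable region under the standard normal) is likewise an assumption, since this section does not state which value the inverse SAX transformation uses.

```python
from statistics import NormalDist


def restore_words(compressed):
    """Expand a clipped string back into one lowercase word per segment."""
    chunks = []
    for ch in compressed:
        if ch.isupper():
            chunks.append(ch)          # a capital starts a new class occurrence
        elif chunks:
            chunks[-1] += ch           # lowercase letters extend the current word
        else:
            raise ValueError("compressed data must start with a capital letter")
    known = {}
    words = []
    for chunk in chunks:
        key = chunk[0]
        if key not in known:           # first occurrence defines the class word
            known[key] = chunk.lower()
        words.append(known[key])
    return words


def inverse_sax(word, alphabet_size):
    """Map each letter to the median of its region (assumed representative)."""
    nd = NormalDist()
    return [nd.inv_cdf((ord(ch) - ord("a") + 0.5) / alphabet_size) for ch in word]
```

Determining the division points from the segment length of each cluster center is not shown; each restored value would be repeated over the length of the segment it represents.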
It should be noted that the foregoing explanation of the embodiment of the high-frequency time series data decompression method based on the improved symbol aggregation approximation is also applicable to the apparatus of the embodiment, and is not repeated here.
According to the high-frequency time series data decompression apparatus based on the improved symbol aggregation approximation provided by the embodiment of the invention, segmentation separates the features of the time series well, and clustering groups time series segments with similar features into one class, thereby achieving dimension reduction and compression of the time series data; combining segmentation and clustering with the SAX method compresses the data again. Experiments show that the method provided by the embodiment of the invention can achieve a lower compression rate and a higher data restoration capability simultaneously, or at least one of the two.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A high-frequency time sequence data compression method based on improved symbol aggregation approximation is characterized by comprising the following steps:
step S101, dividing a time series by using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments;
step S102, clustering the plurality of time series segments by using a Gaussian mixture model initialized based on the improved density peak to obtain a plurality of clustering centers, sub-model variances and clustering labels;
step S103, segmenting each clustering center again at equal intervals according to the proportion of the sub-model variances;
step S104, converting the mean value of each segment of each class center into a symbolic representation by using the SAX method, wherein the first character of each class is a capital letter;
and step S105, clipping the identical symbolic representations of the same class, retaining the first capital letter, and finally obtaining the compressed time series value data.
2. The high-frequency time series data compression method based on improved symbol aggregation approximation according to claim 1, wherein the Gaussian segmentation model based on the improved elephant herding optimization algorithm in step S101 is obtained as follows:
adding a perturbation to the clan update operator of the original elephant herding optimization algorithm to obtain:
Figure FDA0003821552660000011
wherein x_{i,j} denotes the position of the j-th elephant of clan i;
Figure FDA0003821552660000012
denotes the position of the elephant with the best fitness function value among all the elephants, called the elder;
Figure FDA0003821552660000013
denotes the influence of the clan leader of clan i on the individual elephant, where α is an influence parameter;
Figure FDA0003821552660000014
denotes the influence of the best individual in the herd on the individual elephant, where 1-α is an influence parameter; Levy(λ) denotes the mutation mechanism; round denotes rounding the value in parentheses;
adjusting the clan separation operator of the original elephant herding optimization algorithm to obtain:
Figure FDA0003821552660000015
wherein
Figure FDA0003821552660000016
is the updated position of the i-th worst elephant, x_min is the minimum of the positions selectable by an elephant, x_max is the maximum of the positions selectable by an elephant, round is the rounding operation, T is the length of the time series, and rand is a random number.
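The adjusted operators of this claim are given only in the figures and are not reproduced here. For orientation, the sketch below shows the separating operator of the original elephant herding optimization algorithm, x_worst = x_min + (x_max - x_min + 1) · rand, with rounding to an integer position and clamping to the allowed range added as assumptions. It is not the patent's adjusted operator, which additionally involves the series length T, and the function name is hypothetical.

```python
import random


def separate_worst(x_min, x_max, rng=random):
    """Re-draw the position of a worst individual (original EHO separating
    operator), rounded and clamped to [x_min, x_max] as an assumption."""
    raw = x_min + (x_max - x_min + 1) * rng.random()
    return min(max(round(raw), x_min), x_max)
```

Rounding keeps the new position on an integer index, which is what a segmentation point of a sampled time series requires.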
3. The high-frequency time series data compression method based on improved symbol aggregation approximation as claimed in claim 1, wherein said step S101 specifically comprises:
initializing the parameters and the population of the improved elephant herding optimization algorithm;
calculating the fitness values of all time series segments in the time series by using the Gaussian segmentation model and starting the iteration; if the preset maximum number of iterations is reached, outputting the optimal position and the fitness value; otherwise, sorting the fitness values of all time series segments and retaining several of the best time series segments;
executing the clan update operator of the improved elephant herding optimization algorithm, and updating the positions of all time series segments and the position of the clan leader until the time series segments in all clans have been updated;
executing the clan separation operator of the improved elephant herding optimization algorithm, and updating the positions and fitness values of the poor time series segments until the poor time series segments in all clans have been separated;
and sorting the fitness values of all the time series segments to obtain several poor time series segments that need to be separated, assigning the positions of the retained time series segments to the time series segments that need to be separated, and calculating the fitness values to update the optimal position of the time series segments, namely the optimal segmentation points among all the time series segments.
4. The high-frequency time series data compression method based on improved symbol aggregation approximation as claimed in claim 1, wherein said step S102 specifically comprises:
initializing with the clustering algorithm based on the improved density peak to obtain the initial value of the mean of the time series Gaussian mixture model
Figure FDA0003821552660000021
the initial value of each sub-model coefficient
Figure FDA0003821552660000022
and the initial value of each sub-model variance
Figure FDA0003821552660000023
inputting a preset maximum number of iterations, a preset threshold, the plurality of segmentation points and the plurality of time series segments;
iteratively executing the E-step of the EM algorithm to calculate the probability that the time series segment S_k belongs to the m-th sub-model
Figure FDA0003821552660000024
iteratively executing the M-step of the EM algorithm to calculate, for each cluster, the updated mean
Figure FDA0003821552660000025
the updated variance
Figure FDA0003821552660000026
and the updated coefficient
Figure FDA0003821552660000027
judging whether the difference between two successive log-likelihood function values is smaller than the preset threshold, or whether the variance of a sub-model of the Gaussian mixture model is 0; if so, ending the iteration and outputting the parameters corresponding to the optimal clustering, namely the mean
Figure FDA0003821552660000028
the variance
Figure FDA0003821552660000029
the coefficient
Figure FDA00038215526600000210
and the probability
Figure FDA00038215526600000211
otherwise, judging whether the number of iterations is less than the preset maximum number of iterations; if so, continuing the iteration; otherwise, ending the iteration and outputting the parameters corresponding to the optimal clustering, namely the mean
Figure FDA00038215526600000212
the variance
Figure FDA00038215526600000213
the coefficient
Figure FDA00038215526600000214
and the probability
Figure FDA00038215526600000215
5. The high-frequency time series data compression method based on the improved symbol aggregation approximation as claimed in claim 4, wherein the specific process of initializing with the clustering algorithm based on the improved density peak is as follows:
calculating the local density of all the time series segments according to the adjusted local density formula;
calculating the relative distance of each time series segment that does not have the maximum local density ρ(S_j), and calculating the relative distance of the time series segment that has the maximum local density ρ(S_j);
normalizing the local density and the relative distance, and calculating the product of the two as the clustering criterion, sorted in descending order;
selecting the time series segments whose criterion values are far from zero, taking their mean values as the means of the time series Gaussian mixture model
Figure FDA00038215526600000216
and taking their number as the number of clusters;
calculating each sub-model coefficient according to the number of clusters
Figure FDA0003821552660000031
and initializing the variance of each sub-model
Figure FDA0003821552660000032
to an identity matrix.
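A density peak initialization of this kind can be sketched on scalar features as follows. The sketch uses the standard Gaussian-kernel local density and the usual relative distance of density peak clustering, not the patent's adjusted local density formula, which is not given in the text. The cutoff distance `d_c`, the `threshold` that decides what counts as far from zero, the equal initial coefficients 1/M and the scalar variance 1.0 (standing in for the identity matrix) are all assumptions of this illustration.

```python
import math


def density_peak_init(points, d_c, threshold):
    """Pick initial GMM means from scalar points by the product of
    normalized local density and normalized relative distance."""
    n = len(points)
    dist = [[abs(a - b) for b in points] for a in points]
    # Local density with a Gaussian kernel (standard form, assumed here).
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    # Relative distance: nearest point of higher density, or the largest
    # distance for the point(s) with maximum density.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))

    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    gamma = [r * d for r, d in zip(normalize(rho), normalize(delta))]
    order = sorted(range(n), key=lambda i: gamma[i], reverse=True)
    means = [points[i] for i in order if gamma[i] > threshold]
    m = len(means)
    weights = [1.0 / m] * m      # assumed equal initial coefficients
    variances = [1.0] * m        # scalar analogue of the identity matrix
    return means, weights, variances
```

Points with both a high density and a large distance to any denser point get a criterion value near 1 and become initial means; the number of such points gives the number of clusters.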
6. The high-frequency time series data compression method based on improved symbol aggregation approximation according to claim 4, wherein the specific formulas solved in the iterative EM algorithm are as follows:
Figure FDA0003821552660000033
Figure FDA0003821552660000034
Figure FDA0003821552660000035
Figure FDA0003821552660000036
wherein
Figure FDA0003821552660000037
is the probability of the m-th sub-model, α_m is the coefficient of the sub-model, x_t is the time series, S_k is a time series segment, φ is the simplified log-likelihood function of each time series segment, θ_m is the model parameter,
Figure FDA0003821552660000038
is the updated mean of each cluster,
Figure FDA0003821552660000039
is the updated variance of each cluster, μ_m is the original mean of each cluster, and
Figure FDA00038215526600000310
is the updated coefficient of each cluster.
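The iteration described in claims 4 and 6 follows the usual EM scheme for a Gaussian mixture. The sketch below applies the textbook E-step and M-step to scalar observations, with the stopping rules named in claim 4 (log-likelihood change below a threshold, a sub-model variance of 0, or the maximum number of iterations). It is not the patent's segment-level formulation, whose formulas appear only in the figures; the function names and the scalar setting are assumptions.

```python
import math


def gauss_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, mus, variances, weights, max_iter=100, tol=1e-6):
    """Textbook EM for a 1-D Gaussian mixture; returns the fitted means,
    variances, coefficients and the labels from the last E-step."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    prev_ll = None
    for _ in range(max_iter):
        # E-step: probability that each observation belongs to each sub-model.
        resp, ll = [], 0.0
        for x in xs:
            joint = [w * gauss_pdf(x, m, v)
                     for w, m, v in zip(weights, mus, variances)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # Stop when the log-likelihood no longer changes noticeably.
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: update mean, variance and coefficient of every sub-model.
        for m in range(len(mus)):
            nk = sum(r[m] for r in resp)
            mus[m] = sum(r[m] * x for r, x in zip(resp, xs)) / nk
            variances[m] = sum(r[m] * (x - mus[m]) ** 2
                               for r, x in zip(resp, xs)) / nk
            weights[m] = nk / len(xs)
        # Stop if a sub-model has collapsed to zero variance.
        if any(v == 0 for v in variances):
            break
    labels = [max(range(len(mus)), key=lambda m: r[m]) for r in resp]
    return mus, variances, weights, labels
```

The sketch omits numerical safeguards such as log-domain sums, so it is suited to small, well-separated examples only.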
7. The high-frequency time series data compression method based on improved symbol aggregation approximation as claimed in claim 1, wherein said step S103 specifically comprises:
when the number of segments w is preset, allocating the segments according to the proportion of the variance of each class and determining the number of segments of each class, with the formula:
Figure FDA00038215526600000311
wherein M is the number of clusters, c_j is the variance of sub-model j, and v_j is the number of segments of class j;
and carrying out equidistant segmentation on the clustering center of each class according to the segmentation number.
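The allocation in this claim can be sketched as follows. The exact formula is in the figure; the sketch assumes that class j receives a share of the w segments proportional to c_j divided by the sum of all variances, and it uses largest-remainder rounding so that the counts sum to w. The rounding rule and the function names are assumptions.

```python
def allocate_segments(variances, w):
    """Split w segments among classes in proportion to their variances,
    using largest-remainder rounding (an assumed rounding rule)."""
    total = sum(variances)
    exact = [w * c / total for c in variances]
    counts = [int(e) for e in exact]
    leftover = w - sum(counts)
    order = sorted(range(len(exact)),
                   key=lambda j: exact[j] - counts[j], reverse=True)
    for j in order[:leftover]:
        counts[j] += 1
    return counts


def equal_segment_means(center, v):
    """Cut a cluster center into v near-equal pieces and average each one.
    Assumes 0 <= v <= len(center); v == 0 yields an empty list."""
    n = len(center)
    bounds = [k * n // v for k in range(v + 1)] if v else [0]
    return [sum(center[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
```

With variances 1.0 and 3.0 and w = 8, the classes receive 2 and 6 segments, so the class with the larger variance is described in finer detail.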
8. A high frequency time series data compression apparatus based on improved symbol aggregation approximation, comprising:
the division module is used for dividing the time series by using a Gaussian segmentation model based on an improved elephant herding optimization algorithm to obtain a plurality of segmentation points and a plurality of time series segments;
the clustering module is used for clustering the plurality of time series segments by using a Gaussian mixture model initialized based on the improved density peak to obtain a plurality of clustering centers, sub-model variances and clustering labels;
the equidistant segmentation module is used for segmenting each clustering center again at equal intervals according to the proportion of the sub-model variances;
the conversion module is used for converting the mean value of each segment of each class center into a symbolic representation by using the SAX method, wherein the first character of each class is a capital letter;
and the clipping module is used for clipping the identical symbolic representations of the same class, retaining the first capital letter, and finally obtaining the compressed time series value data.
9. A high-frequency time sequence data decompression method based on improved symbol aggregation approximation is characterized by comprising the following steps:
step S201, scanning time sequence value compressed data, identifying capital characters, and obtaining symbolic representation of each cluster center;
step S202, calculating the length of the sequence segment of each clustering center, and determining a segmentation point;
step S203, restoring symbolic representation of each segment of sequence fragment according to the division points, and further obtaining symbolic representation of the whole time sequence;
step S204, each symbol is inversely transformed by an SAX method to obtain a time sequence.
10. A high frequency time series data decompression apparatus based on improved symbol aggregation approximation, comprising:
the scanning and identifying module is used for scanning the compressed data of the time sequence value, identifying capitalized characters and obtaining symbolic representation of each clustering center;
a division point determining module, configured to calculate a sequence segment length of each clustering center, and determine a division point;
the symbolic representation restoring module is used for restoring the symbolic representation of each segment of the sequence fragment according to the division points so as to obtain the symbolic representation of the whole time sequence;
and the inverse transformation module is used for carrying out inverse transformation on each symbol by an SAX method to obtain a time sequence.
CN202211043071.0A 2022-08-29 2022-08-29 High-frequency time sequence data compression method and device based on improved symbol aggregation approximation Pending CN115514376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211043071.0A CN115514376A (en) 2022-08-29 2022-08-29 High-frequency time sequence data compression method and device based on improved symbol aggregation approximation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211043071.0A CN115514376A (en) 2022-08-29 2022-08-29 High-frequency time sequence data compression method and device based on improved symbol aggregation approximation

Publications (1)

Publication Number Publication Date
CN115514376A true CN115514376A (en) 2022-12-23

Family

ID=84501827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211043071.0A Pending CN115514376A (en) 2022-08-29 2022-08-29 High-frequency time sequence data compression method and device based on improved symbol aggregation approximation

Country Status (1)

Country Link
CN (1) CN115514376A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115733498A (en) * 2023-01-10 2023-03-03 北京四维纵横数据技术有限公司 Compression method and device of time sequence data, computer equipment and medium
CN116166978A (en) * 2023-04-23 2023-05-26 山东民生集团有限公司 Logistics data compression storage method for supply chain management
CN116760908A (en) * 2023-08-18 2023-09-15 浙江大学山东(临沂)现代农业研究院 Agricultural information optimization management method and system based on digital twin
CN116775692A (en) * 2023-04-21 2023-09-19 清华大学 Segmented aggregation query method and system for time sequence database

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115733498A (en) * 2023-01-10 2023-03-03 北京四维纵横数据技术有限公司 Compression method and device of time sequence data, computer equipment and medium
CN116775692A (en) * 2023-04-21 2023-09-19 清华大学 Segmented aggregation query method and system for time sequence database
CN116775692B (en) * 2023-04-21 2024-01-30 清华大学 Segmented aggregation query method and system for time sequence database
CN116166978A (en) * 2023-04-23 2023-05-26 山东民生集团有限公司 Logistics data compression storage method for supply chain management
CN116166978B (en) * 2023-04-23 2023-07-25 山东民生集团有限公司 Logistics data compression storage method for supply chain management
CN116760908A (en) * 2023-08-18 2023-09-15 浙江大学山东(临沂)现代农业研究院 Agricultural information optimization management method and system based on digital twin
CN116760908B (en) * 2023-08-18 2023-11-10 浙江大学山东(临沂)现代农业研究院 Agricultural information optimization management method and system based on digital twin

Similar Documents

Publication Publication Date Title
CN115514376A (en) High-frequency time sequence data compression method and device based on improved symbol aggregation approximation
CN115459782A (en) Industrial Internet of things high-frequency data compression method based on time sequence segmentation and clustering
AU2020200997B2 (en) Optimization of audio fingerprint search
US7930281B2 (en) Method, apparatus and computer program for information retrieval
CN108804731B (en) Time series trend feature extraction method based on important point dual evaluation factors
WO2006004797A2 (en) Methods and systems for feature selection
US6023673A (en) Hierarchical labeler in a speech recognition system
CN102663681B (en) Gray scale image segmentation method based on sequencing K-mean algorithm
CN115170868A (en) Clustering-based small sample image classification two-stage meta-learning method
CN111640438B (en) Audio data processing method and device, storage medium and electronic equipment
CN116561230B (en) Distributed storage and retrieval system based on cloud computing
CN112651424A (en) GIS insulation defect identification method and system based on LLE dimension reduction and chaos algorithm optimization
CN110032585B (en) Time sequence double-layer symbolization method and device
US7797160B2 (en) Signal compression method, device, program, and recording medium; and signal retrieval method, device, program, and recording medium
US20140343944A1 (en) Method of visual voice recognition with selection of groups of most relevant points of interest
US20140343945A1 (en) Method of visual voice recognition by following-up the local deformations of a set of points of interest of the speaker's mouth
CN115438727A (en) Time sequence Gaussian segmentation method based on improved image group algorithm
CN114398991A (en) Electroencephalogram emotion recognition method based on Transformer structure search
CN110826628A (en) Characteristic subset selection and characteristic multivariate time sequence ordering system
Nishii et al. Similar subsequence retrieval from two time series data using homology search
CN117453671A (en) Method, system and medium for cleaning repeated or similar data
CN117829239A (en) Model pruning method and system for embedded equipment
CN115618254A (en) Three-branch clustering method based on sample similarity
CN117475443A (en) Image segmentation and recombination system based on AIGC
CN117459187A (en) High-speed data transmission method based on optical fiber network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination