CN110032585B - Time sequence double-layer symbolization method and device - Google Patents

Info

Publication number
CN110032585B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910261214.7A
Other languages
Chinese (zh)
Other versions
CN110032585A (en)
Inventor
王玲
李俊飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Application filed by University of Science and Technology Beijing USTB
Priority to CN201910261214.7A
Publication of CN110032585A
Application granted
Publication of CN110032585B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a time-series double-layer symbolization method and device for air quality data that retain the specific time interval over which each subsequence lasts. The method comprises: grouping the time series according to the magnitude of its observed values; for the grouped time series, determining the size of the symbol set and the value range corresponding to each symbol in the symbol set by Shannon-entropy adaptive clustering; obtaining a feature point sequence of the time series by the minimum description length criterion and the slope gradient method; determining the division points of the time series on the time axis according to the jump relation of adjacent feature points across value ranges; and, according to the division points, determining the start and stop time values of each subsequence and the symbol corresponding to the value range in which it lies, thereby converting the time series into a symbolized sequence containing temporal relations. The invention relates to the field of data processing.

Description

Time sequence double-layer symbolization method and device
Technical Field
The invention relates to the field of data processing, in particular to a time-series double-layer symbolization method and device for air quality data.
Background
A time series consists of a set of observations recorded at particular times, often at fixed intervals. A time series of continuous numerical values is, however, not easy to analyze in practice. Symbolization is a suitable discretization technique for capturing the internal structure of a time series, and it is widely applied in fields such as engineering, science, sociology and economics. In the prior art, most methods either simply cluster the data or directly fix the size of the symbol set before symbolizing, which easily loses data information and cannot reflect how long the data remains in each state. For example:
In the first prior art, Symbolic Aggregate Approximation (SAX), the size of the symbol set is specified manually, the value range of the time series is divided equally according to the number of symbols, and the mean of each subsequence falling within a divided region is used as the representative symbol of that segment, thereby converting the time series into a symbolic sequence.
In the second prior art, symbolic conversion is combined with a clustering algorithm. For example, the K-Means clustering algorithm sets K initial cluster centers and obtains K clusters by iteratively updating the centers; each cluster corresponds to a different symbol, so the time series is converted into a corresponding symbolic sequence.
Although both approaches can discretize a time series into the required symbolized sequence, the discretization requires repeated parameter tuning to reach a good result. Symbolization is an important data preprocessing step: apart from the initial data cleaning, it should preserve as much of the information contained in the data as possible, and it should also be generally applicable. The prior art achieves its purpose only by continually adjusting parameters, and the parameters must be adjusted again for each different time series. More importantly, the resulting symbolized sequence shows only the order of the different states and does not reflect the duration of each state. In summary, existing time-series symbolization methods depend too heavily on manually set parameters and easily lose data information.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a time-series double-layer symbolization method and device for air quality data, so as to solve the problems in the prior art that parameters must be set manually during time-series symbolization and that the duration of each state cannot be preserved.
To solve the above technical problem, an embodiment of the present invention provides a time-series double-layer symbolization method for air quality data, including:
grouping the time series according to the size of the observed value in the time series;
the time series is an air quality data time series of any one of PM2.5, PM10, NO2, O3 and SO2;
for the grouped time sequences, the size of a symbol set and a value range corresponding to symbols in the symbol set are determined through Shannon entropy self-adaptive clustering;
obtaining a feature point sequence of the time series by the minimum description length criterion and the slope gradient method;
determining the division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series across value ranges;
and according to the division points of the time sequence, determining the starting and stopping time value of each subsequence in the time sequence and the symbols corresponding to the value range in which the starting and stopping time value is positioned, and converting the time sequence into a symbolized sequence containing a temporal relation.
Further, the grouping the time series according to the size of the observation values in the time series includes:
sorting the time series in ascending order of its observed values;
and grouping the sorted time series so that each interval contains the same kind of data, to obtain a plurality of initial intervals.
Further, the determining the size of the symbol set and the value range corresponding to the symbols in the symbol set by shannon entropy adaptive clustering on the grouped time series includes:
S21, determining the entropy value of the whole original time series by the Shannon entropy formula;
S22, merging any two adjacent intervals, determining the sum of the entropy values of all intervals after merging, and determining the difference between this sum and the entropy value of the whole original time series;
S23, iteratively executing S22 with one merge per iteration and, after the iteration is finished, merging the two adjacent intervals whose merge gives the largest difference;
and S24, returning to S22 and S23 until the difference no longer changes, then ending the iteration, determining the size of the symbol set from the number of current intervals, and determining the value range corresponding to each symbol in the symbol set from the range of observed values contained in each interval.
Further, the obtaining of the feature point sequence of the time series by the minimum description length criterion and the slope gradient method includes:
compressing the time sequence by a minimum description length criterion, and extracting candidate characteristic points of the time sequence;
analyzing the gradient change of the slope between observation points in the time sequence by a slope gradient method, and taking the observation points larger than a gradient threshold value as the trend change points of the time sequence;
and supplementing the obtained trend change point of the time series into the candidate characteristic points of the time series to obtain the characteristic point series of the time series.
Further, analyzing the slope gradient change between observation points in the time series by a slope gradient method, and taking an observation point larger than a gradient threshold value as a trend change point of the time series includes:
determining the slope $k_{ij}$ between observation point $x_i$ and observation point $x_j$ in the time series, and the slope $k_{i(j+1)}$ between observation point $x_i$ and observation point $x_{j+1}$;
from the obtained slopes $k_{ij}$ and $k_{i(j+1)}$, determining the gradient $\Delta_{ij}$ corresponding to observation points i and j by the formula $\Delta_{ij} = |k_{ij} - k_{i(j+1)}|$, $1 \le j \le n-1$, where n denotes the length of the time series;
for observation point i, by the formula
Figure GDA0003200408410000031
determining the gradient threshold $\lambda_i$ of observation point i;
judging whether the gradient at observation point i is larger than the gradient threshold $\lambda_i$, and if so, taking observation point i as a trend change point of the time series.
Further, the determining the division point of the time sequence on the time axis according to the jump relationship of the adjacent feature points of the time sequence in the range of the value range includes:
determining three trend states from the three jump relations (rising, falling and stable) of two adjacent feature points of the time series across value ranges, wherein:
if two adjacent feature points are located in the same value range, the corresponding subsequence is in a first trend state;
if the value domain grade of two adjacent characteristic points is increased, the corresponding subsequence is in a second trend state;
if the value domain grade of two adjacent characteristic points is reduced, the corresponding subsequence is in a third trend state;
and acquiring the characteristic point of the jump of the value range grade as a division point of the time sequence on a time axis.
Further, the determining the starting and ending time value of each sub-sequence and the symbolic representation of the value range corresponding to the starting and ending time value according to the division point of the time sequence, and the converting the time sequence into the symbolic sequence containing the temporal relationship includes:
acquiring intermediate time values of the second trend state subsequence and the third trend state subsequence, and updating the division points of the time sequence on the time axis according to the relation between the observation value corresponding to the intermediate time value and the adjacent value range to obtain the final division points of the time sequence;
merging the observation points in the same value range according to the updated division points of the time sequence, and acquiring the leftmost time and the rightmost time of all the observation points in the same value range as the starting and stopping time values of the corresponding subsequences;
and obtaining a time sequence symbolized sequence containing a temporal relation according to the starting and stopping time value of each subsequence and the symbol corresponding to the value domain range of each subsequence.
Further, the determining, according to the division point of the time series, the start-stop time value of each subsequence in the time series and the symbol corresponding to the value range in which the start-stop time value is located, and the converting the time series into the symbolized sequence including the temporal relationship includes:
if the observation value corresponding to the middle moment is in the range of the previous value range, updating the right time axis division point of the previous subsequence by the middle moment value to be used as a new right time axis division point of the previous subsequence;
if the observed value corresponding to the middle moment is in the range of the next value range, updating the left time axis division point of the next subsequence by using the middle moment value as a new left time axis division point of the next subsequence;
and if the observation value corresponding to the intermediate time does not change the value range, not updating the time axis division points of the adjacent subsequences.
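The three update rules above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `refine_division`, its parameters, and the use of integer value-range levels are hypothetical, and the third rule is read here as "any other case leaves both division points unchanged".

```python
def refine_division(prev_end, next_start, mid_time, mid_level,
                    prev_level, next_level):
    """Apply the midpoint rules to the boundary between two subsequences.

    prev_end / next_start: current right division point of the previous
    subsequence and left division point of the next one (time values).
    mid_level: value-range level of the observation at the middle time.
    All names here are hypothetical and chosen for illustration.
    """
    if mid_level == prev_level:
        prev_end = mid_time      # midpoint still lies in the previous range
    elif mid_level == next_level:
        next_start = mid_time    # midpoint already lies in the next range
    # Any other case: no update (our reading of the third rule).
    return prev_end, next_start


# A rising subsequence between time 4 (level 0) and time 8 (level 1),
# whose middle observation at time 6 is still in level 0.
updated = refine_division(4, 8, 6, 0, 0, 1)
```

In this example the previous subsequence is extended to time 6 while the next one still starts at time 8.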
The embodiment of the present invention further provides a time-series double-layer symbolization apparatus for air quality data, including:
the grouping module is used for grouping the time sequence according to the size of the observed value in the time sequence;
the time series is an air quality data time series of any one of PM2.5, PM10, NO2, O3 and SO2;
the first determining module is used for determining, for the grouped time series, the size of the symbol set and the value range corresponding to each symbol in the symbol set by Shannon-entropy adaptive clustering;
the acquisition module is used for obtaining a feature point sequence of the time series by the minimum description length criterion and the slope gradient method;
the second determining module is used for determining the division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series across value ranges;
and the symbolization module is used for determining the starting and stopping time value of each subsequence in the time sequence and the symbol corresponding to the value domain range in which the starting and stopping time value is positioned according to the division point of the time sequence, and converting the time sequence into a symbolized sequence containing a temporal relation.
The technical scheme of the invention has the following beneficial effects:
In the above scheme, the time series is grouped according to the magnitude of its observed values; for the grouped time series, the size of the symbol set and the value range corresponding to each symbol are determined by Shannon-entropy adaptive clustering; a feature point sequence of the time series is obtained by the minimum description length criterion and the slope gradient method; the division points of the time series on the time axis are determined according to the jump relation of adjacent feature points across value ranges; and, according to the division points, the start and stop time values of each subsequence and the symbol corresponding to its value range are determined, converting the time series into a symbolized sequence containing temporal relations. Because the start and stop times of each subsequence are preserved, the specific time interval over which each subsequence lasts is retained.
Drawings
FIG. 1 is a schematic flow chart of a time-series double-layer symbolization method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a time series provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the calculation process of L(H) and L(D|H) according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the working principle of distance measurement according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of candidate feature points extracted according to the MDL criterion according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of trend change points extracted by the slope gradient method according to an embodiment of the present invention;
FIG. 7 is a detailed diagram of a double-layer symbolization process according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a symbolization process based on a conventional method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a double-layer symbolization process provided by an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a time-series double-layer symbolization apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a time-series double-layer symbolization method and device for air quality data, addressing the problems that existing time-series symbolization requires manually set parameters and cannot preserve the duration of each state.
Example one
As shown in fig. 1, a time-series two-layer symbolization method for air quality data according to an embodiment of the present invention includes:
S1, grouping the time series according to the magnitude of the observed values in the time series;
S2, for the grouped time series, determining the size of the symbol set and the value range corresponding to each symbol in the symbol set by Shannon-entropy adaptive clustering;
S3, obtaining a feature point sequence of the time series by the minimum description length criterion and the slope gradient method;
S4, determining the division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series across value ranges;
and S5, according to the division points of the time series, determining the start and stop time values of each subsequence and the symbol corresponding to the value range in which it lies, and converting the time series into a symbolized sequence containing temporal relations.
In the time-series double-layer symbolization method for air quality data according to the embodiment of the invention, the time series is grouped according to the magnitude of its observed values; for the grouped time series, the size of the symbol set and the value range corresponding to each symbol are determined by Shannon-entropy adaptive clustering; a feature point sequence of the time series is obtained by the minimum description length criterion and the slope gradient method; the division points of the time series on the time axis are determined according to the jump relation of adjacent feature points across value ranges; and, according to the division points, the start and stop time values of each subsequence and the symbol corresponding to its value range are determined, converting the time series into a symbolized sequence containing temporal relations. Because the start and stop times of each subsequence are preserved, the specific time interval over which each subsequence lasts is retained.
In an embodiment of the foregoing time-series two-layer symbolization method, further grouping the time series according to the size of the observed value in the time series includes:
sorting the time series in ascending order of its observed values;
and grouping the sorted time series so that each interval contains the same kind of data, to obtain a plurality of initial intervals.
In the present embodiment, for example, as shown in FIG. 2, the observed values of a time series X of continuous numerical values are arranged in ascending order, and the time series is preliminarily divided into p intervals $I_1, I_2, \ldots, I_j, \ldots, I_p$, where I denotes a preliminarily divided interval of the time series and $j \in [1, p]$. This may specifically include the following steps:
S11, let $X = \langle (x_1,t_1), (x_2,t_2), \ldots, (x_i,t_i), \ldots, (x_n,t_n) \rangle$ be a time series of continuous numerical values, where $x_i$ denotes the observed value at time $t_i$ and n is the length of the time series. The time series X is sorted by observed value $x_i$ in ascending order; assuming $x_1 < \ldots < x_i < \ldots < x_n$, the sorted time series is $\langle (x_1,t_1), (x_2,t_2), \ldots, (x_i,t_i), \ldots, (x_n,t_n) \rangle$.
S12, the sorted time series is divided into p intervals, each containing the same kind of data; assuming for convenience of expression that each interval has length n/p, the data of the j-th interval is
Figure GDA0003200408410000071
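Steps S11 and S12 can be illustrated with a minimal Python sketch. It assumes that "each interval contains the same kind of data" means each initial interval holds observations with an identical value; the function name `initial_intervals` and the sample readings are hypothetical.

```python
from itertools import groupby


def initial_intervals(series):
    """Sort (value, time) observations by value and group equal values.

    Assumption: an initial interval is the set of observations that share
    one observed value. This is one reading of S12, not a confirmed one.
    """
    ordered = sorted(series, key=lambda obs: obs[0])  # ascending by value
    return [list(group) for _, group in groupby(ordered, key=lambda obs: obs[0])]


# Hypothetical PM2.5-like readings as (value, time) pairs.
series = [(35, 0), (12, 1), (35, 2), (80, 3), (12, 4)]
intervals = initial_intervals(series)
```

Each observation keeps its time stamp, so the time information needed by the later steps is not lost by sorting.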
In a specific implementation manner of the foregoing time-series double-layer symbolization method, further, the determining, by shannon entropy adaptive clustering, a size of a symbol set and a value range corresponding to a symbol in the symbol set for the grouped time series includes:
S21, determining the entropy value of the whole original time series by the Shannon entropy formula;
S22, merging any two adjacent intervals, determining the sum of the entropy values of all intervals after merging, and determining the difference between this sum and the entropy value of the whole original time series;
S23, iteratively executing S22 with one merge per iteration and, after the iteration is finished, merging the two adjacent intervals whose merge gives the largest difference;
and S24, returning to S22 and S23 until the difference no longer changes, then ending the iteration, determining the size of the symbol set from the number of current intervals, and determining the value range corresponding to each symbol in the symbol set from the range of observed values contained in each interval. The number of symbol types in the symbol set is determined by the value ranges, so it equals the number of intervals.
In this embodiment, the preliminarily divided intervals I are merged iteratively, and the change between the entropy of the whole original time series and the entropy after merging is calculated to obtain the clustering intervals that determine the size of the symbol set; the value range corresponding to each symbol in the symbol set is then determined from the range of observed values contained in each interval. This may specifically include the following steps:
B1, according to the information entropy formula given by Shannon, for any variable x the information entropy H(x) can be expressed as:
Figure GDA0003200408410000081
where P(x) denotes the probability of occurrence of the variable x. For example, given a sequence Y = [1, 2, 3, 4, 5, 6, 1, 2], the probability P(1) of the value 1 occurring in Y is 2/8, i.e. 0.25.
Thus, by the formula
Figure GDA0003200408410000082
the entropy value $H(I_j)$ of each interval is obtained as:
Figure GDA0003200408410000083
where $I_j$ denotes a preliminarily divided interval; n denotes the length of the time series; $m_j$ denotes the number of classes of distinct values contained in interval $I_j$; and $n_{ji}$ denotes the number of data items of the i-th class in interval $I_j$.
From the obtained $H(I_j)$, the entropy value $H(H)$ of the whole original time series is determined as:
Figure GDA0003200408410000084
B2, merge any two adjacent intervals, for example interval $I_j$ and interval $I_{j+1}$, to obtain the sum of the entropy values of all intervals after merging, $H'(H)$, expressed as:
Figure GDA0003200408410000085
where $I'_j = I_j + I_{j+1}$.
The difference between $H'(H)$ and the entropy value $H(H)$ of the whole original time series is then determined.
B3, iteratively execute B2, performing one merge per iteration; after the iteration is finished, determine the difference for every candidate merge and merge the two adjacent intervals for which the difference is largest, that is:
Figure GDA0003200408410000086
B4, return to B2 and B3 until the difference no longer changes, then end the iteration; determine the size of the symbol set from the number of current intervals (also called clustering intervals), and determine the value range corresponding to each symbol from the range of observed values contained in each interval.
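As a rough illustration of B1 to B3, the sketch below computes the Shannon entropy of the example sequence Y and evaluates one round of candidate merges. Several things are assumptions, because the formulas above survive only as figure references: the entropy is taken in the standard form $-\sum P \log_2 P$, each interval is assumed to contribute $-(|I_j|/n)\log_2(|I_j|/n)$ to the sum, and the stopping rule of B4 is not implemented. The names `shannon_entropy`, `interval_term` and `best_adjacent_merge` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def interval_term(size, n):
    """Assumed entropy contribution of one interval holding `size` of n points."""
    p = size / n
    return -p * math.log2(p)


def best_adjacent_merge(sizes):
    """Return (index, difference) for the adjacent merge that changes the
    summed entropy the most; index j means intervals j and j+1 are merged."""
    n = sum(sizes)
    total = sum(interval_term(s, n) for s in sizes)
    best_j, best_diff = None, -1.0
    for j in range(len(sizes) - 1):
        merged = (total
                  - interval_term(sizes[j], n)
                  - interval_term(sizes[j + 1], n)
                  + interval_term(sizes[j] + sizes[j + 1], n))
        diff = abs(total - merged)
        if diff > best_diff:
            best_j, best_diff = j, diff
    return best_j, best_diff


Y = [1, 2, 3, 4, 5, 6, 1, 2]
entropy_Y = shannon_entropy(Y)   # uses P(1) = 2/8 = 0.25, as in the text
sizes = [2, 2, 1, 1, 1, 1]       # one initial interval per distinct value of Y
merge_index, merge_diff = best_adjacent_merge(sizes)
```

Repeating `best_adjacent_merge` and applying the chosen merge would give the iterative loop of B4; the termination condition is left out because the text does not state it precisely enough to implement.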
In an embodiment of the foregoing time-series double-layer symbolization method, further, the obtaining of a feature point sequence of the time series by the minimum description length criterion and the slope gradient method includes:
compressing the time sequence by a minimum description length criterion, and extracting candidate characteristic points of the time sequence;
analyzing the gradient change of the slope between observation points in the time sequence by a slope gradient method, and taking the observation points larger than a gradient threshold value as the trend change points of the time sequence;
and supplementing the obtained trend change point of the time series into the candidate characteristic points of the time series to obtain the characteristic point series of the time series.
In this embodiment, compressing the continuous time series by using a Minimum Description Length (MDL) criterion, and extracting candidate feature points of the time series may specifically include the following steps:
C1, according to the MDL criterion, calculate the distance description between two adjacent observation points in the time series (each observation point comprises a time and an observed value):
Figure GDA0003200408410000091
Figure GDA0003200408410000092
Referring to FIGS. 3 and 4, L(H) is the description length of the hypothesis; L(D|H) is the description length of the data when the hypothesis holds; $x_{ck}$ denotes the ck-th candidate feature point in the time series; $len(x_{ck}x_{c(k+1)})$ denotes the length of line segment $x_{ck}x_{c(k+1)}$; $d_{\perp}(x_{ck}x_{c(k+1)}, x_j x_{j+1})$ denotes the perpendicular distance between line segment $x_{ck}x_{c(k+1)}$ and line segment $x_j x_{j+1}$ ($ck \le j \le c(k+1)$); and $d_{\theta}(x_{ck}x_{c(k+1)}, x_j x_{j+1})$ denotes the angular distance between them.
The process can be analyzed with reference to FIG. 4, namely:
Figure GDA0003200408410000093
Figure GDA0003200408410000094
Referring to FIG. 4, $\theta$ denotes the angle, during feature point extraction, between the vector $L_j$ connecting two adjacent observation points $x_j$ and $x_{j+1}$ and the vector $L_i$ connecting two adjacent candidate feature points $x_{ck}$ and $x_{c(k+1)}$; $x'_j$ is the projection of point $x_j$ onto $L_i$; $x'_{j+1}$ is the projection of $x_{j+1}$ onto $L_i$; $l_{\perp 1}$ denotes the Euclidean distance between $x_j$ and $x'_j$; and $l_{\perp 2}$ denotes the Euclidean distance between $x_{j+1}$ and $x'_{j+1}$.
C2, from the results of C1, calculate the metric cost $MDL_{par}(x_{ck}, x_{c(k+1)})$ of taking point $x_{c(k+1)}$ as a candidate feature point and the metric cost $MDL_{nopar}(x_{ck}, x_{c(k+1)})$ of taking point $x_{c(k+1)}$ as a non-candidate feature point. When the former is smaller than the latter, the point satisfies the candidate feature point requirement and can be regarded as a candidate feature point of the time series, as shown in FIG. 3, where $MDL_{par}(x_{ck}, x_{c(k+1)})$ and $MDL_{nopar}(x_{ck}, x_{c(k+1)})$ are expressed as:
$MDL_{par}(x_{ck}, x_{c(k+1)}) = L(H) + L(D|H)$
Figure GDA0003200408410000101
When $MDL_{par}(x_{ck}, x_{c(k+1)}) < MDL_{nopar}(x_{ck}, x_{c(k+1)})$ is satisfied, the point is taken as a candidate feature point of the time series; otherwise it is taken as a non-candidate feature point, and the next point is then considered as a possible candidate feature point.
In this embodiment, candidate feature points extracted according to the MDL criterion are shown in fig. 5.
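The two distance terms defined in step C1 can be sketched as follows. The patent's own expressions appear only as figures, so the forms used here are assumptions borrowed from the common MDL-based segment partitioning formulation: $d_{\perp} = (l_{\perp 1}^2 + l_{\perp 2}^2)/(l_{\perp 1} + l_{\perp 2})$ and $d_{\theta} = \|L_j\| \sin\theta$. The function name `segment_distances` and the sample points are hypothetical.

```python
import math


def segment_distances(a, b, p, q):
    """Perpendicular and angular distance between segment a->b (adjacent
    candidate feature points) and segment p->q (adjacent observations).

    Points are (time, value) pairs and a must differ from b. The formulas
    are assumed forms, not taken from the patent figures.
    """
    ax, ay = a
    ux, uy = b[0] - ax, b[1] - ay          # vector L_i
    vx, vy = q[0] - p[0], q[1] - p[1]      # vector L_j
    len_u = math.hypot(ux, uy)
    len_v = math.hypot(vx, vy)

    def offset(pt):
        # Distance from pt to its projection on the line through a and b.
        return abs(ux * (pt[1] - ay) - uy * (pt[0] - ax)) / len_u

    l1, l2 = offset(p), offset(q)          # l_perp1 and l_perp2
    d_perp = 0.0 if l1 + l2 == 0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)

    # |L_j| * sin(theta) via the cross product; for theta >= 90 degrees
    # the sketch falls back to |L_j|.
    dot = ux * vx + uy * vy
    d_theta = abs(ux * vy - uy * vx) / len_u if dot >= 0 else len_v
    return d_perp, d_theta


d_perp, d_theta = segment_distances((0, 0), (4, 0), (1, 1), (3, 3))
```

For the sample points the projections give $l_{\perp 1} = 1$ and $l_{\perp 2} = 3$, and the angle between the two vectors is 45 degrees.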
In this embodiment, the slope gradient method calculates gradient values from adjacent observation points in the time series and, for a given observation point $x_i$, determines a gradient threshold from the mean and variance of the gradients over the whole time series, so as to screen the trend change points of the time series. The method specifically includes the following steps:
D1, determine the slope $k_{ij}$ between observation point $x_i$ and observation point $x_j$ in the time series, and the slope $k_{i(j+1)}$ between observation point $x_i$ and observation point $x_{j+1}$.
D2, from the obtained slopes $k_{ij}$ and $k_{i(j+1)}$, determine the gradient $\Delta_{ij}$ corresponding to observation points i and j by the formula $\Delta_{ij} = |k_{ij} - k_{i(j+1)}|$, $1 \le j \le n-1$, where n denotes the length of the time series.
D3, for observation point i, by the formula
Figure GDA0003200408410000102
determine the gradient threshold $\lambda_i$ of observation point i.
D4, judge whether the gradient at observation point i is larger than the gradient threshold $\lambda_i$, and if so, take observation point i as a trend change point of the time series.
In this embodiment, the trend change point extracted by the slope gradient method is shown in fig. 6.
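A simplified version of steps D1 to D4 might look like the sketch below. It departs from the text in two labeled ways: slopes are taken only between consecutive observation points, and the threshold is assumed to be the mean plus k standard deviations of all gradients, standing in for the threshold formula that is available only as a figure. The function name `trend_change_points` and the parameter `k` are hypothetical.

```python
from statistics import mean, pstdev


def trend_change_points(times, values, k=1.0):
    """Indices of interior points whose slope gradient exceeds a threshold.

    Assumptions: consecutive-point slopes only, and a single global
    threshold mean + k * std of the gradients (not the patent's formula).
    """
    slopes = [(values[i + 1] - values[i]) / (times[i + 1] - times[i])
              for i in range(len(values) - 1)]
    # gradients[i] is the absolute slope change at interior point i + 1.
    gradients = [abs(slopes[i] - slopes[i - 1]) for i in range(1, len(slopes))]
    if not gradients:
        return []
    threshold = mean(gradients) + k * pstdev(gradients)
    return [i + 1 for i, g in enumerate(gradients) if g > threshold]


times = [0, 1, 2, 3, 4, 5, 6]
values = [0, 1, 2, 3, 10, 11, 12]
change_points = trend_change_points(times, values)
```

Here the sharp step between times 3 and 4 produces the only large slope changes, so the points on either side of it are flagged.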
In this embodiment, the obtained trend change point of the time series is supplemented to the candidate feature point of the time series obtained by the MDL criterion, so as to obtain the feature point sequence of the time series.
In the foregoing embodiment of the time-series double-layer symbolization method, further, as shown in FIG. 7, determining the division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series across value ranges completes the first-layer symbolization, and may specifically include the following steps:
determining three trend states from the three jump relations (rising, falling and stable) of two adjacent feature points of the time series across value ranges, wherein:
if two adjacent feature points are located in the same value range, the corresponding subsequence is in a first trend state;
if the value domain grade of two adjacent characteristic points is increased, the corresponding subsequence is in a second trend state;
if the value domain grade of two adjacent characteristic points is reduced, the corresponding subsequence is in a third trend state;
and acquiring the characteristic point of the jump of the value range grade as a division point of the time sequence on a time axis.
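The first-layer step can be sketched in Python as below. This is a minimal sketch under stated assumptions: the value range level of each feature point is already known, the state names `"stable"`, `"rising"` and `"falling"` stand for the first, second and third trend states, and the later feature point of a jumping pair is recorded as the division point (the text does not say which of the two points is meant). The function name `first_layer` is hypothetical.

```python
def first_layer(feature_times, feature_levels):
    """Classify adjacent feature-point pairs and collect division points.

    feature_times  -- times of the feature points, in increasing order
    feature_levels -- value range level of each feature point (integers)
    Returns (states, division_times).
    """
    states, division_times = [], []
    for k in range(len(feature_levels) - 1):
        prev_level, next_level = feature_levels[k], feature_levels[k + 1]
        if next_level == prev_level:
            states.append("stable")  # first trend state
            continue
        # second trend state (rising) or third trend state (falling)
        states.append("rising" if next_level > prev_level else "falling")
        # Assumption: the later point of the pair marks the division.
        division_times.append(feature_times[k + 1])
    return states, division_times
```

With feature points at times 0, 3, 5, 8, 10 and levels 0, 0, 1, 1, 0, the pairs are classified as stable, rising, stable, falling, and the division points fall at times 5 and 10.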
In the foregoing embodiment of the time-series double-layer symbolization method, further, as shown in fig. 7, determining, according to the division points of the time series, the start-stop time value of each subsequence in the time series and the symbol corresponding to the value range in which it lies, and converting the time series into a symbolized sequence containing a temporal relationship, which completes the second-layer symbolization, may specifically include the following steps:
acquiring intermediate time values of the second trend state subsequence and the third trend state subsequence, and updating the division points of the time sequence on the time axis according to the relation between the observation value corresponding to the intermediate time value and the adjacent value range to obtain the final division points of the time sequence;
merging the observation points in the same value range according to the updated division points of the time sequence, and acquiring the leftmost time and the rightmost time of all the observation points in the same value range as the starting and stopping time values of the corresponding subsequences;
and obtaining a time sequence symbolized sequence containing a temporal relation according to the starting and stopping time value of each subsequence and the symbol corresponding to the value domain range of each subsequence.
In this embodiment, the obtained symbolized sequence may be expressed as:
$[a,(t_1,t_2)],\ [b,(t_3,t_4)],\ \ldots \quad (t_1 < t_2 < t_3 < t_4 < \ldots)$
wherein a, b, ... represent symbols; $(t_1,t_2)$ is the start and stop time corresponding to symbol state a; $(t_3,t_4)$ is the start and stop time corresponding to symbol state b.
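The merging of observation points that lie in the same value range into entries of the form [symbol, (start, stop)] can be sketched as follows. This is an illustrative sketch, not the patented implementation: it assumes every observation already carries a value range level, and the function name `symbolize` and the `symbols` lookup are hypothetical.

```python
from itertools import groupby


def symbolize(times, levels, symbols):
    """Merge consecutive observations that share a level.

    times   -- observation times, in increasing order
    levels  -- value range level (integer index) of each observation
    symbols -- sequence mapping a level index to its symbol
    Returns a list of (symbol, (start_time, stop_time)) tuples.
    """
    result = []
    # groupby only groups *consecutive* equal levels, which is what is needed:
    # a level that recurs later produces a separate entry.
    for level, run in groupby(zip(times, levels), key=lambda pair: pair[1]):
        run_times = [t for t, _ in run]
        # Leftmost and rightmost time of the run are its start and stop times.
        result.append((symbols[level], (run_times[0], run_times[-1])))
    return result
```

For observations at times 1 to 6 with levels 0, 0, 1, 1, 1, 0, the result is `a` over (1, 2), `b` over (3, 5) and `a` over (6, 6); a run containing a single observation has equal start and stop times in this sketch.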
In an embodiment of the foregoing time-series double-layer symbolization method, further, determining, according to the division points of the time series, the start-stop time value of each subsequence in the time series and the symbol corresponding to the value range in which it lies, and converting the time series into a symbolized sequence containing a temporal relationship, includes:
if the observed value corresponding to the intermediate time lies in the previous value range, taking the intermediate time value as the new right time-axis division point of the previous subsequence;
if the observed value corresponding to the intermediate time lies in the next value range, taking the intermediate time value as the new left time-axis division point of the next subsequence;
and if the observed value corresponding to the intermediate time does not change the value range, not updating the time-axis division points of the adjacent subsequences.
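The three update rules above can be written as one small function. This is a sketch under one reading of the text: the third rule is taken to mean that the intermediate observation lies in neither the previous nor the next value range, in which case both division points are kept. All names are hypothetical.

```python
def update_division_points(prev_right, next_left, mid_time,
                           mid_level, prev_level, next_level):
    """Refine the boundaries around a rising or falling subsequence.

    prev_right -- current right division point of the previous subsequence
    next_left  -- current left division point of the next subsequence
    mid_time   -- intermediate time of the rising/falling subsequence
    mid_level  -- value range level of the observation at mid_time
    Returns the (possibly updated) pair (prev_right, next_left).
    """
    if mid_level == prev_level:
        prev_right = mid_time  # previous subsequence extends to the midpoint
    elif mid_level == next_level:
        next_left = mid_time   # next subsequence starts at the midpoint
    # Otherwise (assumed reading of the third rule): no update.
    return prev_right, next_left
```

For example, with boundaries 4 and 8 and an intermediate time of 6, a midpoint observation in the previous level moves the right boundary to 6, while one in the next level moves the left boundary to 6.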
In this embodiment, as shown in figs. 8 and 9, the time series is symbolized in two layers and the start-stop time of each subsequence is preserved, so that the specific time interval over which each subsequence lasts is retained.
For a better understanding of the time-series double-layer symbolization method according to the embodiment of the present invention, it is described in detail below with reference to a specific application:
the time-series double-layer symbolization method provided by the embodiment of the invention can be used for air quality data. For example, the air quality data to be symbolized comprise 5 attributes, namely PM2.5, PM10, NO2, O3 and SO2, and the data of each attribute are acquired once every hour. Part of the data is selected for description, and the data set has the following form:
TABLE 1 air quality part data set
Figure GDA0003200408410000121
Figure GDA0003200408410000131
Sorting the observed values of each attribute respectively, and preliminarily grouping the time series of each attribute;
for the grouped time series of each attribute, the value ranges (value range levels) corresponding to each attribute are obtained by Shannon entropy adaptive clustering, and a corresponding symbol is allocated to each value range, that is, the number of symbols equals the number of value range levels. The value ranges obtained by the adaptive clustering are shown in table 2:
TABLE 2 air quality data value range partitioning
Figure GDA0003200408410000132
As table 2 shows, the adaptive clustering method divides, for example, PM2.5 into 7 levels, each with its corresponding value range.
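The two ends of this stage, the preliminary grouping before clustering and the allocation of one symbol per resulting level, can be sketched as follows. The Shannon entropy clustering itself is not reproduced here because its formulas are available only as figures. The level boundaries in the example are hypothetical placeholders and are NOT the values of table 2; the function names are likewise hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from string import ascii_lowercase


def initial_intervals(observations):
    """Sort observations in increasing order and group identical values.

    Returns a list of (value, count) pairs, one per preliminary interval.
    """
    return sorted(Counter(observations).items())


def make_symbolizer(upper_bounds):
    """Build a value -> symbol lookup from level boundaries.

    upper_bounds holds the sorted, exclusive upper limit of every level
    except the last, so len(upper_bounds) + 1 symbols are allocated
    (this sketch supports at most 26 levels).
    """
    symbols = ascii_lowercase[: len(upper_bounds) + 1]

    def to_symbol(value):
        # bisect_right counts how many boundaries are <= value,
        # which is exactly the index of the level containing value.
        return symbols[bisect_right(upper_bounds, value)]

    return symbols, to_symbol


# Hypothetical boundaries giving 7 levels; not taken from table 2.
symbols, to_symbol = make_symbolizer([35, 75, 115, 150, 250, 350])
```

With these placeholder boundaries the symbol set is `abcdefg`, so the number of symbols equals the number of levels, and a value equal to a boundary is assigned to the higher level.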
Then, feature points are extracted from the time series corresponding to each attribute of the air quality data, and the time series of continuous values is represented by the feature point sequence;
the trend change state of the time series is determined according to the jump relationship of adjacent feature points of the air quality data between value range levels (specifically, the time series is temporarily converted into three trend states), the positions of the division points on the time axis are adaptively determined according to the intermediate time values of the second-trend-state and third-trend-state subsequences, and the specific time interval over which each subsequence lasts is thereby acquired;
finally, each subsequence is converted into the corresponding symbol according to the value range in which it lies, yielding the symbolized sequence. This is the application of the time-series double-layer symbolization method to air quality data; following the above description, the symbolization result for the data in table 1 is obtained, as shown in table 3:
TABLE 3 air quality data symbolization process
Figure GDA0003200408410000141
Figure GDA0003200408410000151
The characteristics of the method proposed by the invention can be clearly seen from table 3: the time series can be accurately discretized into an analyzable symbolized sequence while preserving the specific time range over which each state lasts. In $x_{14}$, the first digit of the subscript represents the attribute and the second digit represents the value range level of the measured value at that time. More importantly, the whole symbolization process requires no manually set parameters, which avoids information loss and similar problems caused by unreasonable parameters. In terms of applications, like traditional symbolization methods, the symbolized time series can be fed to association rule mining algorithms such as Apriori and FP-growth to obtain association rules that can guide decision making; in addition, the specific temporal relationships among the item sets in the rules can be obtained, yielding more detailed rules. When the symbolized time series is used for sequential pattern mining, a series of continuous state transitions can be obtained, which is well suited to applications such as air quality prediction and public health guidance.
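To illustrate how the preserved start-stop times allow temporal relationships between states of different attributes to be read off, the following sketch finds the time spans over which states of two symbolized attributes hold simultaneously. It is an illustration only and is not part of the described method; the function name `overlapping_states`, the item labels and the intervals are hypothetical.

```python
def overlapping_states(seq_a, seq_b):
    """Find co-occurring states of two symbolized sequences.

    Each sequence is a list of (label, (start, stop)) entries.
    Returns a list of (label_a, label_b, (overlap_start, overlap_stop)).
    """
    overlaps = []
    for label_a, (start_a, stop_a) in seq_a:
        for label_b, (start_b, stop_b) in seq_b:
            start, stop = max(start_a, start_b), min(stop_a, stop_b)
            if start <= stop:  # the two closed intervals intersect
                overlaps.append((label_a, label_b, (start, stop)))
    return overlaps


# Hypothetical symbolized sequences for attribute 1 and attribute 2.
attr1 = [("x11", (0, 4)), ("x12", (5, 9))]
attr2 = [("x21", (0, 6)), ("x23", (7, 9))]
pairs = overlapping_states(attr1, attr2)
```

Each resulting triple names a pair of states and the interval they share, which is the kind of temporal detail that a symbolization without start-stop times cannot provide.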
Example two
The present invention further provides a specific embodiment of a time-series double-layer symbolization apparatus for air quality data, which corresponds to the specific embodiment of the time-series double-layer symbolization method for air quality data, and the time-series double-layer symbolization apparatus provided by the present invention can achieve the purpose of the present invention by executing the flow steps in the specific embodiment of the method.
As shown in fig. 10, an embodiment of the present invention further provides a time-series double-layer symbolizing apparatus for air quality data, including:
a grouping module 11, configured to group the time series according to the size of the observed value in the time series;
the first determining module 12 is configured to determine, for the grouped time sequences, a size of a symbol set and a value range corresponding to symbols in the symbol set through shannon entropy adaptive clustering;
an obtaining module 13, configured to obtain a feature point sequence of a time sequence according to a minimum description length criterion and a gradient slope method;
the second determining module 14 is configured to determine a dividing point of the time sequence on the time axis according to a hopping relation of adjacent feature points of the time sequence in the range of the value range;
and the symbolization module 15 is configured to determine, according to the division point of the time sequence, a start-stop time value of each subsequence in the time sequence and a symbol corresponding to the value range where the start-stop time value is located in the time sequence, and convert the time sequence into a symbolized sequence including a temporal relationship.
The time-series double-layer symbolization device for air quality data according to the embodiment of the invention groups the time series according to the size of the observed values in the time series; for the grouped time series, determines the size of the symbol set and the value range corresponding to each symbol in the symbol set through Shannon entropy adaptive clustering; acquires the feature point sequence of the time series by the minimum description length criterion and the slope gradient method; determines the division points of the time series on the time axis according to the jump relationship of adjacent feature points of the time series between value ranges; and, according to the division points of the time series, determines the start-stop time value of each subsequence in the time series and the symbol corresponding to the value range in which it lies, and converts the time series into a symbolized sequence containing a temporal relationship. The start-stop time of each subsequence is thereby preserved, so that the specific time interval over which each subsequence lasts is retained.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method for time-series two-level symbolization of air quality data, comprising:
grouping the time series according to the size of the observed value in the time series;
the time series is a time series of any one of the air quality data PM2.5, PM10, NO2, O3 and SO2;
for the grouped time sequence, the size of the symbol set and the value range corresponding to the symbols in the symbol set are determined by Shannon entropy self-adaptive clustering, and the method comprises the following steps:
S21, determining the entropy value of the whole original time series by the Shannon entropy calculation formula;
S22, merging any two adjacent intervals, determining the sum of the entropy values of all intervals after merging, and determining the difference between the sum and the entropy value of the whole original time series;
S23, iteratively executing S22, with only one merge performed in each iteration, and after the iterations are finished, merging the two adjacent intervals whose merging gives the maximum difference;
S24, returning to execute S22 and S23 until the difference no longer changes, ending the iteration, determining the size of the symbol set according to the current number of intervals, and determining the value range corresponding to each symbol in the symbol set according to the range of the observed values contained in each interval;
wherein iteratively merging the time series intervals I obtained by the preliminary division, calculating the change between the entropy of the whole original time series and that of the merged data to obtain the clustering intervals and determine the size of the symbol set, and determining the value range corresponding to each symbol in the symbol set according to the range of the observed values contained in each interval, specifically includes:
B1, according to the formula for calculating information entropy given by Shannon, for any variable x, the information entropy $H(I)$ is expressed as:
Figure FDA0003200408400000011
wherein p(x) represents the probability of occurrence of the variable x;
by the formula
Figure FDA0003200408400000012
the entropy value $H(I_j)$ of each interval is obtained as:
Figure FDA0003200408400000013
wherein $I_j$ represents a preliminarily divided interval; N represents the length of the time series; $m_j$ represents the number of classes with different values contained in interval $I_j$; $n_{ji}$ represents the number of data of the i-th class in interval $I_j$;
according to the obtained $H(I_j)$, the entropy value $H(H)$ of the whole original time series is determined as:
Figure FDA0003200408400000021
B2, merging any two adjacent intervals to obtain the sum $H'(H)$ of the entropy values of all intervals after merging, expressed as:
Figure FDA0003200408400000022
wherein $I'_j = I_j + I_{j+1}$;
determining the difference between $H'(H)$ and the entropy value $H(H)$ of the whole original time series;
B3, iteratively executing B2, with one merge performed in each iteration; after the iterations are finished, determining the difference for every merging case, and when the difference is maximum, namely:
Figure FDA0003200408400000023
merging the two corresponding adjacent intervals;
B4, returning to execute B2 and B3 until the difference no longer changes, ending the iteration, determining the size of the symbol set according to the current number of intervals, and determining the value range corresponding to each symbol according to the range of the observed values contained in each interval;
acquiring a characteristic point sequence of a time sequence by a minimum description length criterion and a gradient slope method;
determining a division point of the time sequence in a time axis according to the hopping relation of adjacent characteristic points of the time sequence in a value range;
and according to the division points of the time sequence, determining the starting and stopping time value of each subsequence in the time sequence and the symbols corresponding to the value range in which the starting and stopping time value is positioned, and converting the time sequence into a symbolized sequence containing a temporal relation.
2. The method of time-series two-layer symbolization of air quality data according to claim 1, wherein said grouping a time-series according to the magnitude of observations in the time-series comprises:
sequencing the time sequence according to the principle that the observed value in the time sequence is increased progressively;
and grouping the sorted time sequences according to the principle that each interval contains the same kind of data to obtain a plurality of initial intervals.
3. The method of claim 1, wherein the obtaining the time-series feature point sequence by the minimum description length criterion and the gradient slope method comprises:
compressing the time sequence by a minimum description length criterion, and extracting candidate characteristic points of the time sequence;
analyzing the gradient change of the slope between observation points in the time sequence by a slope gradient method, and taking the observation points larger than a gradient threshold value as the trend change points of the time sequence;
and supplementing the obtained trend change point of the time series into the candidate characteristic points of the time series to obtain the characteristic point series of the time series.
4. The method of time-series double-layer symbolization of air quality data according to claim 3, wherein said analyzing the change of slope gradient between observation points in the time series by a slope gradient method, and wherein said regarding observation points larger than a gradient threshold as trend change points of the time series comprises:
determining the slope $k_{ij}$ between observation point $x_i$ and observation point $x_j$ in the time series, and determining the slope $k_{i(j+1)}$ between observation point $x_i$ and observation point $x_{j+1}$;
according to the obtained slopes $k_{ij}$ and $k_{i(j+1)}$, determining the gradient $\Delta_{ij}$ corresponding to observation points i and j by the formula $\Delta_{ij} = |k_{ij} - k_{i(j+1)}|$, $1 \le j \le n-1$, wherein n represents the length of the time series;
for observation point i, by the formula
Figure FDA0003200408400000031
determining the gradient threshold $\lambda_i$ of observation point i;
judging whether the gradient at observation point i is greater than the gradient threshold $\lambda_i$, and if so, taking observation point i as a trend change point of the time series.
5. The method of claim 1, wherein the determining the time-series division point on the time axis according to the jump relationship of the adjacent characteristic points of the time series in the range of the value range comprises:
determining three trend states from the three jump relationships (rising, falling and stable) of two adjacent feature points of the time series between value ranges, wherein:
if two adjacent feature points are located in the same value range, the corresponding subsequence is in a first trend state;
if the value range level rises between two adjacent feature points, the corresponding subsequence is in a second trend state;
if the value range level falls between two adjacent feature points, the corresponding subsequence is in a third trend state;
and acquiring the feature points at which the value range level jumps as the division points of the time series on the time axis.
6. The method of claim 5, wherein the determining the symbolic representation of the start-stop time value and the corresponding value range of each sub-sequence according to the division point of the time sequence, and the converting the time sequence into the symbolic sequence containing temporal relationship comprises:
acquiring intermediate time values of the second trend state subsequence and the third trend state subsequence, and updating the division points of the time sequence on the time axis according to the relation between the observation value corresponding to the intermediate time value and the adjacent value range to obtain the final division points of the time sequence;
merging the observation points in the same value range according to the updated division points of the time sequence, and acquiring the leftmost time and the rightmost time of all the observation points in the same value range as the starting and stopping time values of the corresponding subsequences;
and obtaining a time sequence symbolized sequence containing a temporal relation according to the starting and stopping time value of each subsequence and the symbol corresponding to the value domain range of each subsequence.
7. The method according to claim 6, wherein the determining the start-stop time value of each sub-sequence in the time sequence and the corresponding symbol of the value range thereof according to the division point of the time sequence, and the converting the time sequence into the symbolized sequence containing temporal relationship comprises:
if the observed value corresponding to the intermediate time lies in the previous value range, taking the intermediate time value as the new right time-axis division point of the previous subsequence;
if the observed value corresponding to the intermediate time lies in the next value range, taking the intermediate time value as the new left time-axis division point of the next subsequence;
and if the observed value corresponding to the intermediate time does not change the value range, not updating the time-axis division points of the adjacent subsequences.
8. A time-series two-layer symbolization apparatus for air quality data, comprising:
the grouping module is used for grouping the time sequence according to the size of the observed value in the time sequence;
the time series is a time series of any one of the air quality data PM2.5, PM10, NO2, O3 and SO2;
the first determining module is used for determining the size of the symbol set and the value range corresponding to the symbols in the symbol set by Shannon entropy self-adaptive clustering on the grouped time sequences;
the acquisition module is used for acquiring a characteristic point sequence of the time sequence by a minimum description length criterion and a gradient slope method;
the second determining module is used for determining the division point of the time sequence in the time axis according to the hopping relation of the adjacent characteristic points of the time sequence in the range of the value range;
the symbolization module is used for determining a start-stop time value of each subsequence in the time sequence and a symbol corresponding to the value domain range in which the start-stop time value is positioned according to the division point of the time sequence, and converting the time sequence into a symbolized sequence containing a temporal relation;
wherein the first determining module is specifically configured to:
S21, determining the entropy value of the whole original time series by the Shannon entropy calculation formula;
S22, merging any two adjacent intervals, determining the sum of the entropy values of all intervals after merging, and determining the difference between the sum and the entropy value of the whole original time series;
S23, iteratively executing S22, with only one merge performed in each iteration, and after the iterations are finished, merging the two adjacent intervals whose merging gives the maximum difference;
S24, returning to execute S22 and S23 until the difference no longer changes, ending the iteration, determining the size of the symbol set according to the current number of intervals, and determining the value range corresponding to each symbol in the symbol set according to the range of the observed values contained in each interval;
wherein iteratively merging the time series intervals I obtained by the preliminary division, calculating the change between the entropy of the whole original time series and that of the merged data to obtain the clustering intervals and determine the size of the symbol set, and determining the value range corresponding to each symbol in the symbol set according to the range of the observed values contained in each interval, specifically includes:
B1, according to the formula for calculating information entropy given by Shannon, for any variable x, the information entropy $H(I)$ is expressed as:
Figure FDA0003200408400000051
wherein p(x) represents the probability of occurrence of the variable x;
by the formula
Figure FDA0003200408400000052
the entropy value $H(I_j)$ of each interval is obtained as:
Figure FDA0003200408400000053
wherein $I_j$ represents a preliminarily divided interval; N represents the length of the time series; $m_j$ represents the number of classes with different values contained in interval $I_j$; $n_{ji}$ represents the number of data of the i-th class in interval $I_j$;
according to the obtained $H(I_j)$, the entropy value $H(H)$ of the whole original time series is determined as:
Figure FDA0003200408400000061
B2, merging any two adjacent intervals to obtain the sum $H'(H)$ of the entropy values of all intervals after merging, expressed as:
Figure FDA0003200408400000062
wherein $I'_j = I_j + I_{j+1}$;
determining the difference between $H'(H)$ and the entropy value $H(H)$ of the whole original time series;
B3, iteratively executing B2, with one merge performed in each iteration; after the iterations are finished, determining the difference for every merging case, and when the difference is maximum, namely:
Figure FDA0003200408400000063
merging the two corresponding adjacent intervals;
B4, returning to execute B2 and B3 until the difference no longer changes, ending the iteration, determining the size of the symbol set according to the current number of intervals, and determining the value range corresponding to each symbol according to the range of the observed values contained in each interval.
CN201910261214.7A 2019-04-02 2019-04-02 Time sequence double-layer symbolization method and device Active CN110032585B (en)


Publications (2)

Publication Number Publication Date
CN110032585A CN110032585A (en) 2019-07-19
CN110032585B true CN110032585B (en) 2021-11-30




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant