CN110032585B - Time sequence double-layer symbolization method and device

Info

Publication number: CN110032585B
Application number: CN201910261214.7A
Authority: CN
Inventors: 王玲; 李俊飞
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2019-04-02
Filing date: 2019-04-02
Publication date: 2021-11-30
Anticipated expiration: 2039-04-02
Also published as: CN110032585A

Abstract

The invention provides a time-series double-layer symbolization method and device for air quality data, which preserve the specific time interval over which each subsequence lasts. The method comprises the following steps: grouping the time series according to the magnitude of the observed values in the time series; for the grouped time series, determining the size of a symbol set and the value range corresponding to each symbol in the symbol set through Shannon entropy adaptive clustering; acquiring a feature point sequence of the time series by the minimum description length criterion and the slope gradient method; determining division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series between value ranges; and, according to the division points of the time series, determining the start-stop time values of each subsequence in the time series and the symbol corresponding to the value range in which the subsequence lies, and converting the time series into a symbolized sequence containing a temporal relation. The invention relates to the field of data processing.

Description

Time sequence double-layer symbolization method and device

Technical Field

The invention relates to the field of data processing, in particular to a time-series double-layer symbolization method and device for air quality data.

Background

A time series consists of a set of observations recorded at particular times, usually at fixed time intervals. However, a time series of continuous numerical values is not easy to analyze in practical applications. Symbolization is a suitable discretization means for effectively capturing the internal structure of a time series, and it is widely applied in fields such as engineering, science, sociology and economics. In the prior art, however, most methods simply perform clustering or directly fix the size of the symbol set before symbolizing, which easily loses data information and cannot reflect the duration of the different states of the data, for example:

In the first prior art, Symbolic Aggregate Approximation (SAX), the size of the symbol set of the time series is defined manually, the value domain of the time series is divided equally according to the number of symbols, and the average value of the subsequence falling within each divided region is used as the representative symbol of that segment, so as to convert the time series into a symbolic sequence.

In the second prior art, symbolic conversion is performed in combination with a clustering algorithm. For example, the K-Means clustering algorithm sets K initial cluster centers and obtains K clusters by iteratively updating the cluster centers, each cluster corresponding to a different symbol, so as to convert the time series into a corresponding symbolic sequence.

Although these methods can discretize the time series into the required symbolized sequence, the discretization process requires continual parameter adjustment to reach the optimal result. Symbolization of a time series is an important data preprocessing step: apart from initial data cleaning, the information contained in the data should be preserved as far as possible, and the method should be generally applicable, which shows the importance of the time-series symbolization method. In the prior art, the required result is reached only by continually adjusting parameters, and the parameters must be adjusted again for each different time series. More importantly, the final symbolized sequence does not reflect the duration of each state; it shows only the order in which the different states occur. In summary, the existing time-series symbolization methods depend too heavily on manually set parameters throughout, and data information is easily lost.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a time-series double-layer symbolization method and device for air quality data, so as to solve the problems in the prior art that parameters need to be set manually in the time-series symbolization process and that the duration of each state cannot be preserved.

To solve the above technical problem, an embodiment of the present invention provides a time-series double-layer symbolization method for air quality data, including:

grouping the time series according to the size of the observed value in the time series;

the time series is a time series of any one of the air quality data PM2.5, PM10, NO₂, O₃ and SO₂;

for the grouped time sequences, the size of a symbol set and a value range corresponding to symbols in the symbol set are determined through Shannon entropy self-adaptive clustering;

acquiring a characteristic point sequence of a time sequence by a minimum description length criterion and a gradient slope method;

determining a division point of the time sequence in a time axis according to the hopping relation of adjacent characteristic points of the time sequence in a value range;

and according to the division points of the time sequence, determining the starting and stopping time value of each subsequence in the time sequence and the symbols corresponding to the value range in which the starting and stopping time value is positioned, and converting the time sequence into a symbolized sequence containing a temporal relation.

Further, the grouping the time series according to the size of the observation values in the time series includes:

sequencing the time sequence according to the principle that the observed value in the time sequence is increased progressively;

and grouping the sorted time sequences according to the principle that each interval contains the same kind of data to obtain a plurality of initial intervals.
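
By way of illustration only, the grouping step may be sketched in Python as follows. The sketch assumes that "the same kind of data" means identical observed values, and the function name `initial_intervals` and the sample readings are hypothetical.

```python
from itertools import groupby


def initial_intervals(series):
    """Sort (value, time) pairs by value and group equal values together.

    Assumption: "the same kind of data" is read as "identical observed value".
    """
    ordered = sorted(series, key=lambda pair: pair[0])
    return [list(group) for _, group in groupby(ordered, key=lambda pair: pair[0])]


# Hypothetical hourly readings as (observed value, time index) pairs.
readings = [(35, 0), (12, 1), (35, 2), (20, 3), (12, 4)]
intervals = initial_intervals(readings)
```

Each initial interval keeps its time indices, so the temporal information remains available to the later steps.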

Further, the determining the size of the symbol set and the value range corresponding to the symbols in the symbol set by shannon entropy adaptive clustering on the grouped time series includes:

S21, determining the entropy value of the whole original time series through the Shannon entropy calculation formula;

S22, merging any two adjacent intervals, determining the sum of the entropy values of all intervals after the merge, and determining the difference between this sum and the entropy value of the whole original time series;

S23, iteratively executing S22, performing only one merge in each iteration, and, after the iteration is finished, merging the two adjacent intervals whose merge gives the maximum difference;

and S24, returning to execute S22 and S23 until the difference no longer changes, then ending the iteration, determining the size of the symbol set according to the number of current intervals, and determining the value range corresponding to each symbol in the symbol set according to the range of observed values contained in each interval.
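
A single pass of S22 and S23 may be sketched as follows, purely as an illustration. The interval entropy formula is not reproduced in this text, so the sketch assumes that each interval is scored as one symbol with probability equal to its share of the observations and that logarithms are base 2. The function names and the interval sizes are hypothetical, and the stopping rule of S24 is not modelled.

```python
import math


def symbol_entropy(sizes):
    """Shannon entropy (bits) when each interval is treated as one symbol."""
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s)


def best_adjacent_merge(sizes):
    """Try every adjacent merge once (S22) and return the one whose entropy
    differs most from the unmerged series (S23), as (index, difference)."""
    base = symbol_entropy(sizes)
    best_index, best_diff = None, -1.0
    for j in range(len(sizes) - 1):
        merged = sizes[:j] + [sizes[j] + sizes[j + 1]] + sizes[j + 2:]
        diff = abs(base - symbol_entropy(merged))
        if diff > best_diff:
            best_index, best_diff = j, diff
    return best_index, best_diff


# Hypothetical interval sizes after the initial grouping.
index, diff = best_adjacent_merge([4, 1, 1, 2])
```

Under a different entropy definition the selected pair may differ; only the loop structure (evaluate every adjacent merge, keep the one with the maximum difference) is taken from the steps above.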

Further, the obtaining of the feature point sequence of the time sequence by the minimum description length criterion and the gradient slope method includes:

compressing the time sequence by a minimum description length criterion, and extracting candidate characteristic points of the time sequence;

analyzing the gradient change of the slope between observation points in the time sequence by a slope gradient method, and taking the observation points larger than a gradient threshold value as the trend change points of the time sequence;

and supplementing the obtained trend change point of the time series into the candidate characteristic points of the time series to obtain the characteristic point series of the time series.

Further, analyzing the slope gradient change between observation points in the time series by a slope gradient method, and taking an observation point larger than a gradient threshold value as a trend change point of the time series includes:

determining the slope $k_{ij}$ between observation point $x_i$ and observation point $x_j$ in the time series, and determining the slope $k_{i(j+1)}$ between observation point $x_i$ and observation point $x_{j+1}$;

according to the obtained slopes $k_{ij}$ and $k_{i(j+1)}$, determining the gradient $\Delta_{ij}$ corresponding to observation points i and j by the formula $\Delta_{ij} = |k_{ij} - k_{i(j+1)}|$, $1 \le j \le n-1$, wherein n represents the length of the time series;

for observation point i, by the formula

determining the gradient threshold $\lambda_i$ of observation point i;

judging whether the gradient of observation point i is larger than the gradient threshold $\lambda_i$, and if so, taking observation point i as a trend change point of the time series.

Further, the determining the division point of the time sequence on the time axis according to the jump relationship of the adjacent feature points of the time sequence in the range of the value range includes:

determining three trend states through three jump relations, namely rising, falling and stable, of two adjacent feature points of the time series between value ranges; wherein,

if two adjacent feature points are located in the same value range, the corresponding subsequence is in a first trend state;

if the value domain grade of two adjacent characteristic points is increased, the corresponding subsequence is in a second trend state;

if the value domain grade of two adjacent characteristic points is reduced, the corresponding subsequence is in a third trend state;

and acquiring the characteristic point of the jump of the value range grade as a division point of the time sequence on a time axis.
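
The three trend states and the resulting division points may be sketched as follows, for illustration only. The sketch assumes that the value ranges are given by their sorted inner boundaries, that the later feature point of a jumping pair is recorded as the division point, and that the labels "stable", "rising" and "falling" stand for the first, second and third trend states; all names and sample values are hypothetical.

```python
from bisect import bisect_right


def level(value, boundaries):
    """Index of the value range containing `value`, given the sorted inner
    boundaries between ranges (each range is assumed left-closed)."""
    return bisect_right(boundaries, value)


def trend_states(feature_points, boundaries):
    """Label each pair of adjacent feature points and collect division points.

    feature_points: list of (time, value) pairs in time order.
    Returns (states, division_times).
    """
    states, divisions = [], []
    for (t0, x0), (t1, x1) in zip(feature_points, feature_points[1:]):
        a, b = level(x0, boundaries), level(x1, boundaries)
        if a == b:
            states.append("stable")
        else:
            states.append("rising" if b > a else "falling")
            divisions.append(t1)  # feature point at which the level jumps
    return states, divisions


points = [(0, 10), (3, 12), (5, 40), (9, 45), (12, 20)]
states, divisions = trend_states(points, boundaries=[15, 35])
```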

Further, the determining the starting and ending time value of each sub-sequence and the symbolic representation of the value range corresponding to the starting and ending time value according to the division point of the time sequence, and the converting the time sequence into the symbolic sequence containing the temporal relationship includes:

acquiring intermediate time values of the second trend state subsequence and the third trend state subsequence, and updating the division points of the time sequence on the time axis according to the relation between the observation value corresponding to the intermediate time value and the adjacent value range to obtain the final division points of the time sequence;

merging the observation points in the same value range according to the updated division points of the time sequence, and acquiring the leftmost time and the rightmost time of all the observation points in the same value range as the starting and stopping time values of the corresponding subsequences;

and obtaining a time sequence symbolized sequence containing a temporal relation according to the starting and stopping time value of each subsequence and the symbol corresponding to the value domain range of each subsequence.
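
As a simplified illustration of the merging step, the following sketch collapses consecutive observations lying in the same value range into entries of the form [symbol, (start time, stop time)]. It works directly on the observations rather than on updated division points, assumes the value ranges are given by sorted inner boundaries, and assigns letters a, b, c, ... to the ranges in ascending order; these choices and all names are assumptions.

```python
from bisect import bisect_right
from string import ascii_lowercase


def symbolize(series, boundaries):
    """Collapse consecutive observations in the same value range into
    [symbol, (start_time, end_time)] entries."""
    result = []
    for t, x in series:
        symbol = ascii_lowercase[bisect_right(boundaries, x)]
        if result and result[-1][0] == symbol:
            # Same range as the previous observation: extend the right edge.
            result[-1][1] = (result[-1][1][0], t)
        else:
            # New range: open a subsequence starting and ending at time t.
            result.append([symbol, (t, t)])
    return result


series = [(0, 10), (1, 12), (2, 40), (3, 45), (4, 41), (5, 20)]
encoded = symbolize(series, boundaries=[15, 35])
```

The leftmost and rightmost times of each run play the role of the start-stop time values of the corresponding subsequence.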

Further, the determining, according to the division point of the time series, the start-stop time value of each subsequence in the time series and the symbol corresponding to the value range in which the start-stop time value is located, and the converting the time series into the symbolized sequence including the temporal relationship includes:

if the observation value corresponding to the middle moment is in the range of the previous value range, updating the right time axis division point of the previous subsequence by the middle moment value to be used as a new right time axis division point of the previous subsequence;

if the observed value corresponding to the middle moment is in the range of the next value range, updating the left time axis division point of the next subsequence by using the middle moment value as a new left time axis division point of the next subsequence;

and if the observation value corresponding to the intermediate time does not change the value range, not updating the time axis division points of the adjacent subsequences.
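
The update of the division points by the intermediate time value may be sketched as follows, for illustration only. The sketch assumes integer time indices, takes the intermediate time as the integer midpoint of the two bounding feature points, and reads the third case as "the intermediate observation lies in neither adjacent value range"; the function name and sample data are hypothetical.

```python
from bisect import bisect_right


def refine_boundary(left_end, right_start, observe, boundaries):
    """Move one division point to the midpoint of a rising/falling gap.

    left_end / right_start: (time, value) feature points bounding the gap.
    observe: callable returning the observed value at a given time.
    Returns (right edge of previous subsequence, left edge of next one).
    """
    (t0, x0), (t1, x1) = left_end, right_start
    mid = (t0 + t1) // 2
    mid_level = bisect_right(boundaries, observe(mid))
    if mid_level == bisect_right(boundaries, x0):
        return mid, t1   # midpoint still belongs to the previous range
    if mid_level == bisect_right(boundaries, x1):
        return t0, mid   # midpoint already belongs to the next range
    return t0, t1        # neither: keep both division points


values = {3: 12, 4: 14, 5: 40}  # hypothetical observations by time index
edges = refine_boundary((3, 12), (5, 40), values.__getitem__, [15, 35])
```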

The embodiment of the present invention further provides a time-series double-layer symbolization apparatus for air quality data, including:

the grouping module is used for grouping the time sequence according to the size of the observed value in the time sequence;

the first determining module is used for determining the size of the symbol set and the value range corresponding to the symbols in the symbol set by Shannon entropy self-adaptive clustering on the grouped time sequences;

the acquisition module is used for acquiring a characteristic point sequence of the time sequence by a minimum description length criterion and a gradient slope method;

the second determining module is used for determining the division point of the time sequence in the time axis according to the hopping relation of the adjacent characteristic points of the time sequence in the range of the value range;

and the symbolization module is used for determining the starting and stopping time value of each subsequence in the time sequence and the symbol corresponding to the value domain range in which the starting and stopping time value is positioned according to the division point of the time sequence, and converting the time sequence into a symbolized sequence containing a temporal relation.

The technical scheme of the invention has the following beneficial effects:

in the above scheme, the time series is grouped according to the magnitude of the observed values in the time series; for the grouped time series, the size of the symbol set and the value range corresponding to each symbol in the symbol set are determined through Shannon entropy adaptive clustering; a feature point sequence of the time series is acquired by the minimum description length criterion and the slope gradient method; the division points of the time series on the time axis are determined according to the jump relation of adjacent feature points of the time series between value ranges; and, according to the division points, the start-stop time values of each subsequence in the time series and the symbol corresponding to the value range in which the subsequence lies are determined, and the time series is converted into a symbolized sequence containing a temporal relation. The start-stop time of each subsequence is thereby preserved, and with it the specific time interval over which each subsequence lasts.

Drawings

FIG. 1 is a schematic flow chart of a time-series double-layer symbolization method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a time series provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the calculation process of L(H) and L(D|H) according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating the working principle of distance measurement according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of candidate feature points extracted according to the MDL criterion according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of trend change points extracted by the slope gradient method according to an embodiment of the present invention;

FIG. 7 is a detailed diagram of a double-layer symbolization process according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a symbolization process based on a conventional method according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a double-layer symbolization process provided by an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a time-series double-layer symbolization apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The invention provides a time-series double-layer symbolization method and device for air quality data, aimed at the problems that, in existing time-series symbolization processes, parameters need to be set manually and the duration of each state cannot be preserved.

Example one

As shown in fig. 1, a time-series two-layer symbolization method for air quality data according to an embodiment of the present invention includes:

S1, grouping the time series according to the magnitude of the observed values in the time series;

S2, for the grouped time series, determining the size of the symbol set and the value range corresponding to each symbol in the symbol set through Shannon entropy adaptive clustering;

S3, acquiring a feature point sequence of the time series by the minimum description length criterion and the slope gradient method;

S4, determining the division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series between value ranges;

and S5, according to the division points of the time series, determining the start-stop time values of each subsequence in the time series and the symbol corresponding to the value range in which the subsequence lies, and converting the time series into a symbolized sequence containing a temporal relation.

According to the time-series double-layer symbolization method for air quality data of this embodiment, the time series is grouped according to the magnitude of the observed values in the time series; for the grouped time series, the size of the symbol set and the value range corresponding to each symbol in the symbol set are determined through Shannon entropy adaptive clustering; a feature point sequence of the time series is acquired by the minimum description length criterion and the slope gradient method; the division points of the time series on the time axis are determined according to the jump relation of adjacent feature points of the time series between value ranges; and, according to the division points, the start-stop time values of each subsequence in the time series and the symbol corresponding to the value range in which the subsequence lies are determined, and the time series is converted into a symbolized sequence containing a temporal relation. The start-stop time of each subsequence is thereby preserved, and with it the specific time interval over which each subsequence lasts.

In an embodiment of the foregoing time-series two-layer symbolization method, further grouping the time series according to the size of the observed value in the time series includes:

In the present embodiment, for example, as shown in fig. 2, the observed values in the time series X of continuous numerical values are arranged in ascending order, and the time series is preliminarily divided into p intervals $I_1, I_2, \ldots, I_j, \ldots, I_p$, where I denotes an interval of the preliminary division and $j \in [1, p]$. This may specifically comprise the following steps:

S11, let $X = \langle (x_1,t_1), (x_2,t_2), \ldots, (x_i,t_i), \ldots, (x_n,t_n) \rangle$ be a time series of continuous numerical values, wherein $x_i$ represents the observed value at time $t_i$ and n is the length of the time series. The time series X is arranged from small to large according to the observed value $x_i$; assuming $x_1 < \cdots < x_i < \cdots < x_n$, the sorted time series is $\langle (x_1,t_1), (x_2,t_2), \ldots, (x_i,t_i), \ldots, (x_n,t_n) \rangle$.

S12, dividing the sorted time series into p intervals such that each interval contains data of the same kind; assuming, for convenience of expression, that each interval has length n/p, the data of the j-th interval is

In a specific implementation manner of the foregoing time-series double-layer symbolization method, further, the determining, by shannon entropy adaptive clustering, a size of a symbol set and a value range corresponding to a symbol in the symbol set for the grouped time series includes:

and S24, returning to execute S22 and S23 until the difference no longer changes, then ending the iteration, determining the size of the symbol set according to the number of current intervals, and determining the value range corresponding to each symbol in the symbol set according to the range of observed values contained in each interval, wherein the number of symbol types contained in the symbol set is determined by the value ranges, so the number of symbol types is equal to the number of intervals.

In this embodiment, the intervals I obtained by the preliminary division are merged iteratively, and the change between the entropy of the whole original time series and the entropy of the merged data is calculated to obtain the clustering intervals; the size of the symbol set is determined from the clustering intervals, and the value range corresponding to each symbol in the symbol set is determined from the range of observed values contained in each interval. This may specifically include the following steps:

B1, according to the information entropy formula given by Shannon, for any variable x, the information entropy H(I) can be expressed as:

$$H(I) = -\sum_{x} P(x)\,\log P(x)$$

where P(x) denotes the probability of occurrence of the variable x; for example, for the sequence Y = [1,2,3,4,5,6,1,2], the probability P(1) of the value 1 occurring in Y is 2/8, i.e., 0.25.
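
The example sequence Y may be checked with a short Python sketch. The sketch computes the empirical Shannon entropy and assumes logarithms to base 2 (the base is not stated in this text); the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


Y = [1, 2, 3, 4, 5, 6, 1, 2]
p_one = Y.count(1) / len(Y)  # probability of the value 1 in Y
h = shannon_entropy(Y)       # entropy of Y in bits
```

For this sequence the values 1 and 2 each have probability 0.25 and the other four values 0.125, which gives an entropy of 2.5 bits under the base-2 assumption.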

Thus, by the formula

the entropy value $H(I_j)$ of each interval is obtained as:

wherein $I_j$ represents a preliminarily divided interval; n represents the length of the time series; $m_j$ represents the number of classes with different values contained in interval $I_j$; $n_{ji}$ represents the number of data of the i-th class in interval $I_j$;

according to the obtained $H(I_j)$, the entropy value H(H) of the whole original time series is determined as:

B2, merging any two adjacent intervals, for example the interval $I_j$ and the interval $I_{j+1}$, and obtaining the sum of entropy values H'(H) of all intervals after merging, expressed as:

wherein $I'_j = I_j + I_{j+1}$;

the difference between H'(H) and the entropy value H(H) of the whole original time series is then determined.

B3, iteratively executing B2, performing only one merge in each iteration; after the iteration is finished, determining the difference for every candidate merge, and when the difference is maximum (namely:

) merging the two adjacent intervals that were merged in that case;

B4, returning to execute B2 and B3 until the difference no longer changes, then ending the iteration, determining the size of the symbol set according to the number of current intervals (also called clustering intervals), and determining the value range corresponding to each symbol according to the range of observed values contained in each interval.

In an embodiment of the foregoing time-series two-layer symbolization method, further, the obtaining a feature point sequence of a time series by using a minimum description length criterion and a gradient slope method includes:

In this embodiment, compressing the continuous time series by using a Minimum Description Length (MDL) criterion, and extracting candidate feature points of the time series may specifically include the following steps:

C1, calculating, according to the MDL criterion, the distance description between two adjacent observation points in the time series (each observation point comprises a time and an observed value):

referring to fig. 3 and 4, L(H) is the description length of the hypothesis; L(D|H) is the description length of the data when the hypothesis holds; $x_{ck}$ represents the ck-th candidate feature point in the time series; $len(x_{ck}x_{c(k+1)})$ represents the length of line segment $x_{ck}x_{c(k+1)}$; $d_\perp(x_{ck}x_{c(k+1)}, x_jx_{j+1})$ represents the perpendicular distance between line segment $x_{ck}x_{c(k+1)}$ and line segment $x_jx_{j+1}$ ($ck \le j \le c(k+1)$); $d_\theta(x_{ck}x_{c(k+1)}, x_jx_{j+1})$ represents the angular distance between them;

the process can be analyzed with reference to fig. 4, namely:

referring to fig. 4, θ represents the angle between the vector $L_j$ connecting two adjacent observation points $x_j$ and $x_{j+1}$ in the feature point extraction process and the vector $L_i$ connecting two adjacent candidate feature points $x_{ck}$ and $x_{c(k+1)}$; $x'_j$ is the projection of point $x_j$ onto $L_i$; $x'_{j+1}$ is the projection of $x_{j+1}$; $l_{\perp 1}$ denotes the Euclidean distance between $x_j$ and $x'_j$; $l_{\perp 2}$ denotes the Euclidean distance between $x_{j+1}$ and $x'_{j+1}$.
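
The two distances may be sketched as follows, for illustration only. The distance formulas are not reproduced in this text, so the sketch assumes the definitions commonly used in MDL-based trajectory partitioning: the perpendicular distance is $(l_{\perp 1}^2 + l_{\perp 2}^2)/(l_{\perp 1} + l_{\perp 2})$ and the angular distance is $\|L_j\| \sin\theta$ for θ below 90° and $\|L_j\|$ otherwise. The function name and sample points are hypothetical.

```python
import math


def segment_distances(a, b, p, q):
    """Perpendicular and angular distance between segment p-q (L_j) and the
    line through candidate feature points a-b (L_i), all as (t, x) points.
    Both segments are assumed to have non-zero length."""
    ax, ay = a
    ux, uy = b[0] - ax, b[1] - ay  # direction of L_i
    norm_u = math.hypot(ux, uy)

    def offset(point):
        """Euclidean distance from a point to its projection on L_i."""
        px, py = point
        return abs(ux * (py - ay) - uy * (px - ax)) / norm_u

    l1, l2 = offset(p), offset(q)  # l_perp1 and l_perp2
    d_perp = 0.0 if l1 + l2 == 0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)

    vx, vy = q[0] - p[0], q[1] - p[1]  # direction of L_j
    norm_v = math.hypot(vx, vy)
    cos_theta = (ux * vx + uy * vy) / (norm_u * norm_v)
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta ** 2))
    d_theta = norm_v * sin_theta if cos_theta >= 0 else norm_v
    return d_perp, d_theta


d_perp, d_theta = segment_distances((0, 0), (4, 0), (1, 1), (2, 3))
```

With the sample points, the projections lie at distances 1 and 3 from the line, so the perpendicular distance is (1 + 9) / 4 = 2.5 and the angular distance is 2.0.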

C2, according to the calculation result in C1, calculating the metric cost $\mathrm{MDL}_{par}(x_{ck}, x_{c(k+1)})$ of taking point $x_{c(k+1)}$ as a candidate feature point and the metric cost $\mathrm{MDL}_{nopar}(x_{ck}, x_{c(k+1)})$ of taking point $x_{c(k+1)}$ as a non-candidate feature point; when the former is smaller than the latter, the candidate feature point requirement is satisfied and the point can be regarded as a candidate feature point of the time series, as shown in fig. 3, wherein $\mathrm{MDL}_{par}(x_{ck}, x_{c(k+1)})$ and $\mathrm{MDL}_{nopar}(x_{ck}, x_{c(k+1)})$ are expressed as:

$\mathrm{MDL}_{par}(x_{ck}, x_{c(k+1)}) = L(H) + L(D|H)$

when $\mathrm{MDL}_{par}(x_{ck}, x_{c(k+1)}) < \mathrm{MDL}_{nopar}(x_{ck}, x_{c(k+1)})$ is satisfied, the point is taken as a candidate feature point of the time series; otherwise it is taken as a non-candidate feature point, and the next point is then considered as a candidate feature point.

In this embodiment, candidate feature points extracted according to the MDL criterion are shown in fig. 5.

In this embodiment, gradient values are calculated from adjacent observation points in the time series by the slope gradient method and, for a given observation point $x_i$, a gradient threshold is determined using the mean and variance of the gradients of the whole time series, so as to screen out the trend change points of the time series. This may specifically comprise the following steps:

D1, determining the slope $k_{ij}$ between observation point $x_i$ and observation point $x_j$ in the time series, and determining the slope $k_{i(j+1)}$ between observation point $x_i$ and observation point $x_{j+1}$;

D2, according to the obtained slopes $k_{ij}$ and $k_{i(j+1)}$, determining the gradient $\Delta_{ij}$ corresponding to observation points i and j by the formula $\Delta_{ij} = |k_{ij} - k_{i(j+1)}|$, $1 \le j \le n-1$, wherein n represents the length of the time series;

D3, for observation point i, by the formula

determining the gradient threshold $\lambda_i$ of observation point i;

D4, judging whether the gradient of observation point i is larger than the gradient threshold $\lambda_i$, and if so, taking observation point i as a trend change point of the time series.

In this embodiment, the trend change point extracted by the slope gradient method is shown in fig. 6.
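
A simplified version of the slope gradient screening may be sketched as follows, for illustration only. The threshold formula is not reproduced in this text, so the sketch assumes a single threshold equal to the mean plus the population standard deviation of all gradients, and it uses only slopes between consecutive observation points; the function name and sample series are hypothetical.

```python
import statistics


def trend_change_points(series):
    """Return the times of interior points whose change of slope exceeds
    mean + standard deviation of all slope changes (assumed threshold).

    series: list of (time, value) pairs with strictly increasing times.
    """
    slopes = [(x1 - x0) / (t1 - t0)
              for (t0, x0), (t1, x1) in zip(series, series[1:])]
    gradients = [abs(k1 - k0) for k0, k1 in zip(slopes, slopes[1:])]
    threshold = statistics.mean(gradients) + statistics.pstdev(gradients)
    # gradients[i] compares the segments meeting at series[i + 1].
    return [series[i + 1][0] for i, g in enumerate(gradients) if g > threshold]


series = [(0, 1), (1, 2), (2, 3), (3, 10), (4, 17), (5, 18), (6, 19)]
changes = trend_change_points(series)
```

In the sample series the slope changes from 1 to 7 at time 2 and back from 7 to 1 at time 4, and these two points are flagged.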

In this embodiment, the obtained trend change point of the time series is supplemented to the candidate feature point of the time series obtained by the MDL criterion, so as to obtain the feature point sequence of the time series.

In the foregoing embodiment of the time-series double-layer symbolization method, further, as shown in fig. 7, determining the division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series between value ranges, so as to complete the first-layer symbolization, may specifically include the following steps:

In the foregoing embodiment of the time-series double-layer symbolization method, further, as shown in fig. 7, determining, according to the division points of the time series, the start-stop time values of each subsequence in the time series and the symbol corresponding to the value range in which the subsequence lies, and converting the time series into a symbolized sequence containing a temporal relation, so as to complete the second-layer symbolization, may specifically include the following steps:

In this embodiment, the obtained symbolized sequence may be expressed as:

[a,(t₁,t₂)],[b,(t₃,t₄)],……(t₁＜t₂＜t₃＜t₄＜…)

wherein a, b, … represent symbols; (t₁, t₂) is the start-stop time corresponding to symbol state a; (t₃, t₄) is the start-stop time corresponding to symbol state b.

In an embodiment of the foregoing time-series double-layer symbolization method, further, the determining, according to the division point of the time series, a start-stop time value of each subsequence in the time series and a symbol corresponding to a value range in which the start-stop time value and the symbol correspond, and converting the time series into a symbolized series including a temporal relationship includes:

In this embodiment, as shown in fig. 8 and 9, double-layer symbolization of the time series preserves the start-stop time of each subsequence, and therefore the specific time interval over which each subsequence lasts.

For better understanding of the time-series two-layer symbolization method according to the embodiment of the present invention, it is described in detail with reference to a specific application:

the time-series double-layer symbolization method provided by the embodiment of the invention can be used for air quality data, for example for symbolizing air quality data comprising 5 attributes, namely PM2.5, PM10, NO₂, O₃ and SO₂, wherein the data of each attribute is acquired once every hour; part of the data is selected for description, and the form of the data set is as follows:

TABLE 1 Partial air quality data set

Sorting the observed values of each attribute respectively, and primarily grouping the time sequence of each attribute;

for the grouped time series of each attribute, the value ranges (value range level division) corresponding to each attribute are obtained by Shannon entropy adaptive clustering, and a corresponding symbol is assigned to each value range, so that the number of symbols equals the number of value range levels; the value ranges obtained by the adaptive clustering are shown in Table 2:

TABLE 2 Air quality data value range division

The results in Table 2 show that the adaptive clustering method divides PM2.5, for example, into 7 levels, each with its corresponding value range.

Then, extracting characteristic points of the time sequence corresponding to each attribute of the air quality data, and representing the time sequence of continuous numerical values by using the characteristic point sequence;

determining the trend change state of the time series (specifically, temporarily converting the time series into the three trend states) according to the jump relation of adjacent feature points of the air quality data between value range levels, adaptively determining the positions of the division points on the time axis according to the intermediate time values of the second trend state subsequences and the third trend state subsequences, and thereby acquiring the specific time interval over which each subsequence lasts;

finally, according to the range of the value range in which each sub-sequence number is located, converting each sub-sequence number into a corresponding symbolized sequence, which is the application of the time-series double-layer symbolization method in the air quality data, the symbolization result in table 1 can be obtained according to the above description, as shown in table 3:

TABLE 3 air quality data symbolization process

The characteristics of the method proposed by the invention can be clearly seen from Table 3: the time series can be accurately discretized into an analyzable symbolized series while preserving the specific time range over which each state lasts. In a symbol such as x₁₄, the first digit of the subscript represents the attribute and the second digit represents the value range level of the measured value at that time. More importantly, the whole symbolization process requires no manually set parameters, which avoids information loss and similar problems caused by unreasonable parameters. In terms of application, the symbolized time series can be used, as with traditional symbolization methods, with association rule mining algorithms such as Apriori and FP-growth to obtain association rules that guide decision making; in addition, the specific temporal relations among the item sets in the rules can be obtained, so that more detailed rules are available. When the symbolized time series is used for sequential pattern mining, a series of continuous state transitions can be obtained, which is well suited to air quality prediction, health guidance for the public, and the like.
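The final conversion step, in which each subsequence keeps its start and stop time together with its symbol, might be sketched as follows. The tuple layout `(symbol, start_time, end_time)`, the function name `symbolize`, and the rule that a subsequence takes the level of the first feature point inside it are assumptions for illustration; the division points are taken as given inputs.

```python
def symbolize(feature_points, cuts, attribute_index):
    """Convert feature points and division points into a symbolized sequence.

    feature_points: (time, level) pairs sorted by time.
    cuts: sorted division times lying strictly between the first and last time.
    Returns one (symbol, start_time, end_time) tuple per subsequence.
    """
    start, end = feature_points[0][0], feature_points[-1][0]
    edges = [start, *cuts, end]
    result = []
    for seg_start, seg_end in zip(edges, edges[1:]):
        # Assumed rule: the subsequence takes the level of the first
        # feature point that falls inside it.
        level = next(lv for t, lv in feature_points if seg_start <= t <= seg_end)
        result.append((f"x{attribute_index}{level}", seg_start, seg_end))
    return result


if __name__ == "__main__":
    points = [(0, 2), (3, 2), (5, 4), (9, 4), (10, 3)]
    for item in symbolize(points, cuts=[4.0, 9.5], attribute_index=1):
        print(item)
```

Keeping the start and stop times next to each symbol is what allows later rule mining to recover how long each state lasted.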

Example two

The present invention further provides a specific embodiment of a time-series double-layer symbolization apparatus for air quality data, which corresponds to the specific embodiment of the time-series double-layer symbolization method for air quality data, and the time-series double-layer symbolization apparatus provided by the present invention can achieve the purpose of the present invention by executing the flow steps in the specific embodiment of the method.

As shown in fig. 10, an embodiment of the present invention further provides a time-series double-layer symbolizing apparatus for air quality data, including:

a grouping module 11, configured to group the time series according to the size of the observed value in the time series;

the first determining module 12 is configured to determine, for the grouped time sequences, a size of a symbol set and a value range corresponding to symbols in the symbol set through shannon entropy adaptive clustering;

an obtaining module 13, configured to obtain a feature point sequence of a time sequence according to a minimum description length criterion and a gradient slope method;

the second determining module 14 is configured to determine a dividing point of the time sequence on the time axis according to a hopping relation of adjacent feature points of the time sequence in the range of the value range;

and the symbolization module 15 is configured to determine, according to the division point of the time sequence, a start-stop time value of each subsequence in the time sequence and a symbol corresponding to the value range where the start-stop time value is located in the time sequence, and convert the time sequence into a symbolized sequence including a temporal relationship.
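The data flow between the five modules can be pictured with a small sketch in which each module is supplied as a function. The class name `DoubleLayerSymbolizer`, the field names, and the argument lists of the individual stages are hypothetical; only the order of the stages follows the module list above.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DoubleLayerSymbolizer:
    """Hypothetical wiring of modules 11 to 15; each stage is injected."""

    group: Callable[[list], Any]                # grouping module 11
    cluster: Callable[[Any], Any]               # first determining module 12
    extract_features: Callable[[list], Any]     # obtaining module 13
    find_divisions: Callable[[Any, Any], Any]   # second determining module 14
    symbolize: Callable[[Any, Any, Any], Any]   # symbolization module 15

    def run(self, series):
        groups = self.group(series)
        value_ranges = self.cluster(groups)
        features = self.extract_features(series)
        cuts = self.find_divisions(features, value_ranges)
        return self.symbolize(features, cuts, value_ranges)
```

Injecting the stages keeps the sketch independent of how any single module is implemented.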

The time-series double-layer symbolization device for air quality data according to the embodiment of the invention groups the time series according to the size of the observed values in the time series; for the grouped time series, determines the size of the symbol set and the value range corresponding to each symbol in the symbol set through Shannon entropy adaptive clustering; obtains the feature point sequence of the time series by the minimum description length criterion and the gradient slope method; determines the division points of the time series on the time axis according to the jump relation of adjacent feature points of the time series across the value ranges; and, according to the division points of the time series, determines the start-stop time values of each subsequence in the time series and the symbol corresponding to the value range in which it lies, converting the time series into a symbolized sequence containing temporal relations. In this way the start and stop time of each subsequence is preserved, and therefore the specific time interval over which each subsequence lasts is preserved.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for time-series two-level symbolization of air quality data, comprising:

for the grouped time sequence, the size of the symbol set and the value range corresponding to the symbols in the symbol set are determined by Shannon entropy self-adaptive clustering, and the method comprises the following steps:

S24, returning to execute S22 and S23 until the difference value no longer changes, ending the iteration, determining the size of the symbol set according to the number of current intervals, and determining the value range corresponding to each symbol in the symbol set according to the range of the observed values contained in each interval;

the iteratively merging the time series intervals I obtained by the preliminary division, calculating the change between the entropy of the whole original time series and the entropy of the merged data to obtain the clustering intervals and thereby determine the size of the symbol set, and determining the value range corresponding to each symbol in the symbol set according to the range of the observed values contained in each interval, specifically comprises:

B1, according to the formula for calculating information entropy given by Shannon, for any variable x, the information entropy H(I) is expressed as:

wherein p (x) represents the probability of occurrence of the variable x;

by the formula

B2, merging any two adjacent intervals to obtain the sum H'(H) of the entropy values of all intervals after merging, expressed as:

wherein I'_j = I_j + I_{j+1};

determining the difference between H'(H) and the entropy value H(H) of the whole original time series;

B3, iteratively executing B2, with each iteration performing one merge; after the iterations are completed, determining the difference obtained for every merge case, and when the difference is maximum, namely:

merging the two merged adjacent intervals;

B4, returning to execute B2 and B3 until the difference value no longer changes, ending the iteration, determining the size of the symbol set according to the number of current intervals, and determining the value range corresponding to each symbol according to the range of the observed values contained in each interval;
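One pass of steps B2 and B3 can be sketched as follows. The formulas of the claim are not reproduced in this text, so the sketch rests on stated assumptions: the entropy of a partition is computed as the Shannon entropy of the fraction of observations falling in each interval, and the "difference" is the merged entropy minus the entropy before merging. The function names are hypothetical, and the stopping rule of B4 is intentionally left out.

```python
import math


def partition_entropy(counts):
    """Shannon entropy in bits of a partition, given observation counts per interval."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def best_merge(counts):
    """Try every merge of two adjacent intervals (B2) and keep the best one (B3).

    Returns (index, merged_counts, difference) for the merge whose
    difference is largest, or None if fewer than two intervals remain.
    """
    h_before = partition_entropy(counts)
    best = None
    for j in range(len(counts) - 1):
        merged = counts[:j] + [counts[j] + counts[j + 1]] + counts[j + 2:]
        diff = partition_entropy(merged) - h_before  # assumed sign convention
        if best is None or diff > best[2]:
            best = (j, merged, diff)
    return best


if __name__ == "__main__":
    print(best_merge([1, 1, 10, 10]))
```

Under these assumptions the selected merge is the one that loses the least entropy, which tends to combine sparsely populated neighboring intervals first.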

2. The method of time-series two-layer symbolization of air quality data according to claim 1, wherein said grouping a time-series according to the magnitude of observations in the time-series comprises:

3. The method of claim 1, wherein the obtaining the time-series feature point sequence by the minimum description length criterion and the gradient slope method comprises:

4. The method of time-series double-layer symbolization of air quality data according to claim 3, wherein said analyzing the change of slope gradient between observation points in the time series by the gradient slope method, and regarding observation points whose gradient exceeds a gradient threshold as trend change points of the time series, comprises:

for observation point i, by formula

determining a gradient threshold λ_i for observation point i;
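Detecting trend change points from slope changes can be illustrated with a short sketch. The formula for the per-point threshold λ_i is not reproduced in this text, so a single constant `threshold` is used as a placeholder, and the gradient at a point is assumed to be the absolute difference between the slopes of the two segments that meet there. The function name is hypothetical.

```python
def trend_change_points(times, values, threshold):
    """Return indices of interior observation points whose slope change exceeds `threshold`.

    Assumes strictly increasing `times`. `threshold` is a constant
    placeholder for the per-point threshold of the claim.
    """
    slopes = [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]
    changes = []
    for i in range(1, len(values) - 1):
        gradient = abs(slopes[i] - slopes[i - 1])  # slope change at point i
        if gradient > threshold:
            changes.append(i)
    return changes


if __name__ == "__main__":
    # The series rises steadily and then flattens at index 2.
    print(trend_change_points([0, 1, 2, 3, 4], [0, 1, 2, 2, 2], threshold=0.5))
```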

5. The method of claim 1, wherein the determining the time-series division point on the time axis according to the jump relationship of the adjacent characteristic points of the time series in the range of the value range comprises:

6. The method of claim 5, wherein the determining the symbolic representation of the start-stop time value and the corresponding value range of each sub-sequence according to the division point of the time sequence, and the converting the time sequence into the symbolic sequence containing temporal relationship comprises:

7. The method according to claim 6, wherein the determining the start-stop time value of each sub-sequence in the time sequence and the corresponding symbol of the value range thereof according to the division point of the time sequence, and the converting the time sequence into the symbolized sequence containing temporal relationship comprises:

8. A time-series two-layer symbolization apparatus for air quality data, comprising:

the symbolization module is used for determining a start-stop time value of each subsequence in the time sequence and a symbol corresponding to the value domain range in which the start-stop time value is positioned according to the division point of the time sequence, and converting the time sequence into a symbolized sequence containing a temporal relation;

wherein the first determining module is specifically configured to:

wherein p (x) represents the probability of occurrence of the variable x;

by the formula

wherein I'_j = I_j + I_{j+1};

merging the two merged adjacent intervals;

B4, returning to execute B2 and B3 until the difference value no longer changes, ending the iteration, determining the size of the symbol set according to the number of current intervals, and determining the value range corresponding to each symbol according to the range of the observed values contained in each interval.