CN106384128A

CN106384128A - Method for mining time series data state correlation

Info

Publication number: CN106384128A
Application number: CN201610814387.3A
Authority: CN
Inventors: 王文青; 王徐华; 杨天社; 鲍军鹏; 赵静; 李辉; 张海龙; 齐勇
Original assignee: Xian Jiaotong University; China Xian Satellite Control Center
Current assignee: Xian Jiaotong University; China Xian Satellite Control Center
Priority date: 2016-09-09
Filing date: 2016-09-09
Publication date: 2017-02-08

Abstract

A method for mining state associations in time-series data. First, each time-series variable is preprocessed by outlier removal, equal-interval interpolation, and normalization. The comprehensive feature vectors of all windows are then clustered, with the windows of different clusters representing different states. All clusters are sorted by size and each window is represented by the character assigned to its cluster, so that the original numerical data is converted into a string, namely the state string of the variable. The state strings of all variables are then aligned to obtain a multivariate state matrix, and the Apriori algorithm is used to mine association rules between the states of different variables, giving their formal expression and association strength. Finally, the association rules are reduced to remove redundant information. The invention is robust to noise and is suited to fine-grained analysis of state-value correlations over small parameter sets, yielding the mapping between state values.

Description

A Method for Mining State Association of Time Series Data

Technical field

The invention belongs to the field of intelligent information processing and computer technology, and in particular relates to a method for mining state associations in time-series data.

Background

In a large, complex system, the states of its variables are correlated. These correlations are governed by the internal laws of the system and show up in abnormal data. In time and space they can take the form of co-occurrence, causality, precursor relationships, correlation, and so on. When the system state changes, different variables change accordingly. The system behaves differently in normal and abnormal states, which appears in abnormal data as different patterns of variation in the variables. Analyzing how the abnormal data of multiple variables vary, and mining the associations between the states of different variables, is important for summarizing how the system operates and for discovering latent fault knowledge.

Summary of the invention

The object of the invention is to provide a method for mining state associations in time-series data. The method combines feature extraction and clustering to mine the states of each individual variable, then uses the Apriori algorithm to mine state association rules between different variables, giving their formal representation and association strength, and finally reduces the association rules to eliminate redundant information. The invention takes the fuzziness or uncertainty of variable values into account, is robust to noise, and is suited to fine-grained analysis of state-value correlations over small parameter sets, yielding the mapping between state values.

To achieve the above object, the invention adopts the following technical scheme:

A method for mining state associations in time-series data, in which the system implementing the method comprises a data preprocessing module, a feature extraction module, a dynamic partition clustering module, a multivariate state matrix generation module, an Apriori state association mining module, and an association rule reduction module. The specific steps are:

1) First, the data preprocessing module performs outlier removal, equal-interval interpolation, and normalization on the raw time-series data to obtain valid data;

2) Second, the feature extraction module divides the valid data of each time-series variable into windows of equal length and extracts features from each window, including Fourier features, statistical features, and wavelet features, which together form a feature vector;

3) Then, the dynamic partition clustering module clusters the feature vectors of all windows of a single variable. The resulting clusters are sorted by size: the largest cluster is represented by the character 'a', the second largest by 'b', and so on, while clusters smaller than the given threshold (2) are treated as noise and represented by '?'. Each window is represented by the character of its cluster, so that the original numerical data is converted into a string, namely the state string of the variable;

4) The multivariate state matrix generation module aligns the state strings of all variables in time to form a state matrix;

5) The Apriori state association mining module applies the Apriori algorithm to the multivariate state matrix to mine frequent itemsets and association rules;

6) Finally, the association rule reduction module reduces the mined association rules, eliminating redundant information, to obtain the final multivariate state association rules.

The data preprocessing module removes outliers as follows: compute the mean and standard deviation of each window, and check whether the difference between each data point and the mean of its observation window exceeds 5 times the standard deviation of that window; if so, the data point is an outlier and is removed. The outlier-free time series is then interpolated at equal intervals: with interval Δt and start time T, the set of time points after interpolation is {T + n·Δt | n = 0, 1, 2, 3, …}. The value assigned to time T + i·Δt is that of the nearest earlier observation in the original series, that is, the observation immediately preceding the first original time point greater than T + i·Δt. The interpolated data is then linearly normalized: the time series is scanned once to obtain the maximum (max) and minimum (min) of the observations, and the normalized value of each observation is computed by the following formula, which maps the range of the original time series onto the interval [0, 1];

$x_i' = \frac{x_i - \min}{\Delta}$

where x_i is the value of the i-th observation, x_i' is its normalized value, and Δ = max - min.
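The three preprocessing operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the outlier window length of 50 samples, the use of the population standard deviation, and the handling of a constant series in `normalize` are all choices made here for illustration. Note that with the population standard deviation a window needs more than 26 points before any single point can exceed 5 standard deviations.

```python
from bisect import bisect_right
from statistics import mean, pstdev

def remove_outliers(times, values, window=50, k=5.0):
    """Drop points farther than k standard deviations from their window mean."""
    kept_t, kept_v = [], []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        mu, sigma = mean(chunk), pstdev(chunk)
        for t, x in zip(times[start:start + window], chunk):
            if abs(x - mu) <= k * sigma:
                kept_t.append(t)
                kept_v.append(x)
    return kept_t, kept_v

def resample(times, values, dt):
    """Sample at T + n*dt, taking the observation just before the first later one."""
    grid, out = [], []
    n = 0
    while times[0] + n * dt <= times[-1]:
        t = times[0] + n * dt
        idx = bisect_right(times, t) - 1  # last observation with time <= t
        grid.append(t)
        out.append(values[idx])
        n += 1
    return grid, out

def normalize(values):
    """Linear min-max scaling onto [0, 1]; a constant series maps to 0.0 (assumption)."""
    lo, hi = min(values), max(values)
    delta = hi - lo
    return [(x - lo) / delta if delta else 0.0 for x in values]
```

Each function returns new lists, so the three stages can be chained in the order the text gives: outlier removal, then interpolation, then normalization.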

Feature extraction module: first, the univariate data is cut into windows of the configured length; second, features are extracted from the data in each window, including statistical, Fourier, and wavelet features. v1 = [mean, variance] is the statistical feature: the mean reflects the average level of the data in a window, and the variance describes how much the data in the window fluctuates. v2 = [Fourier coefficient 1, Fourier frequency 1, Fourier coefficient 2, Fourier frequency 2] is the frequency-domain feature: a Fourier transform yields a series of Fourier coefficients, which are sorted by absolute value in descending order, and the two largest coefficients and their corresponding frequencies are selected. v3 = [wavelet detail coefficient 1, …, wavelet detail coefficient n] is the time-frequency feature: a discrete wavelet transform of each window yields n wavelet detail coefficients. These three groups of features are combined into the window feature vector v = [v1, v2, v3].
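A window feature vector of this shape can be sketched with the standard library alone. Several details below are assumptions, since the text does not fix them: the DFT is evaluated naively over non-negative frequencies only (the DC term included), coefficient magnitudes stand in for "absolute value", frequencies are expressed in cycles per sample, the wavelet is a single-level Haar transform, and a trailing partial window is dropped. All function names are hypothetical.

```python
import cmath
import math
from statistics import mean, pvariance

def split_windows(values, size):
    """Cut a series into consecutive equal-length windows (partial tail dropped)."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]

def fourier_top2(window):
    """Magnitude and frequency of the two largest DFT coefficients."""
    n = len(window)
    coeffs = []
    for k in range(n // 2 + 1):  # non-negative frequencies only (assumption)
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(window))
        coeffs.append((abs(c), k / n))
    coeffs.sort(reverse=True)  # largest magnitude first
    (a1, f1), (a2, f2) = coeffs[:2]
    return [a1, f1, a2, f2]

def haar_details(window):
    """Single-level Haar detail coefficients (the wavelet choice is an assumption)."""
    return [(window[i] - window[i + 1]) / math.sqrt(2)
            for i in range(0, len(window) - 1, 2)]

def window_features(window):
    """Return the three feature groups (v1, v2, v3) for one window."""
    v1 = [mean(window), pvariance(window)]
    v2 = fourier_top2(window)
    v3 = haar_details(window)
    return v1, v2, v3
```

Keeping v1, v2, and v3 as separate groups, rather than concatenating them, matches the per-group cosine distance the clustering step uses.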

The dynamic partition clustering module clusters the feature vectors of all windows of a single variable using a dynamic partition clustering method, as follows:

1) The first window forms a cluster by itself, and the cluster center is the comprehensive feature vector of that window;

2) Initial partition of clusters: compute the similarity between the second window and the center of the first cluster according to the following formula:

$\cos(v_{ik}, v_{jk}) = \frac{v_{ik} \cdot v_{jk}}{|v_{ik}| \times |v_{jk}|}$

where cos(v_ik, v_jk) is the cosine similarity between the v_k (k = 1, 2, 3) vector of window i and the v_k vector of window j, with cos(v_ik, v_jk) ∈ [-1, +1]. The distance between two windows is computed by the following formulas;

$\mathrm{dist}(v_{ik}, v_{jk}) = 1 - \frac{\cos(v_{ik}, v_{jk}) + 1}{2}$

$\mathrm{dist}(v_i, v_j) = \frac{\sum_{k=1}^{3} \mathrm{dist}(v_{ik}, v_{jk})}{3}$

where dist(v_i, v_j) is the distance between windows i and j, with dist(v_i, v_j) ∈ [0, 1];

If dist < d (d = 0.2), window 2 is merged into the first cluster and the cluster center is updated immediately according to the following formula:

$cv_k = \frac{\sum_{i=1}^{n} v_{ik}}{n}, \quad (k = 1, 2, 3)$

where cv_k is the k-th feature vector of cluster center c, equal to the mean of the k-th feature vectors of all windows in the cluster. The center c of a cluster is the combination of cv₁, cv₂, and cv₃, that is, the average of the comprehensive feature vectors of all windows in the cluster. If dist ≥ d, window 2 forms a cluster by itself. The remaining windows are processed in turn: compute the distance between the i-th window and the centers of all clusters created so far, and take the distance dist to the nearest cluster c; if dist < d, merge window i into cluster c, otherwise it forms a new cluster by itself;

3) Cluster adjustment: take the i-th window (i = 1, 2, …, m), compute its distance dist to the centers of all clusters, and select the smallest dist and the corresponding cluster c. If dist ≤ d and window i is not in cluster c, move window i from its original cluster to cluster c; if dist ≤ d and window i is already in cluster c, leave window i unchanged; if dist > d, remove window i from its original cluster and make it a cluster by itself. Repeat until all windows have been processed, then compute the centers of all clusters. If any cluster center has changed, repeat the adjustment process, i.e. step 3), until no cluster center changes; once all centers are unchanged, go to step 4);

4) Cluster merging: compute the distance between the centers of every pair of clusters, and select the two closest clusters c_i and c_j and their distance dist. If dist ≤ α (α = 0.3), merge c_i and c_j, compute the center of the merged cluster, and repeat the merging process 4). If dist > α, no two clusters are close enough to merge, so the merging process exits and the clustering algorithm ends. In the clustering result, windows in the same cluster have similar features and are regarded as one state, while windows in different clusters represent different states. All clusters are sorted by size: the largest is represented by 'a', the second largest by 'b', and so on; clusters smaller than the given threshold are treated as noise and represented by '?'. Each window is represented by the character of its cluster, so that the original numerical data is converted into a string, giving the state string of each variable.
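The four clustering steps above can be sketched as one function. This is a hedged reading rather than the patented implementation: a window is assumed to be a tuple of three feature groups (v1, v2, v3); cluster centers are held fixed within each adjustment pass and recomputed afterwards; convergence is tested by comparing cluster memberships; `max_rounds` is a safety cap added here; and the cosine similarity of a zero vector is taken as 0. All names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0  # zero-vector convention is an assumption

def distance(wi, wj):
    """Mean over the feature groups of 1 - (cos + 1) / 2."""
    return sum(1 - (cosine(a, b) + 1) / 2 for a, b in zip(wi, wj)) / len(wi)

def centroid(windows):
    """Element-wise mean of each feature group over the given windows."""
    n = len(windows)
    return tuple([sum(col) / n for col in zip(*group)] for group in zip(*windows))

def nearest(w, centers):
    """Return (distance, index) of the center closest to window w."""
    return min((distance(w, c), k) for k, c in enumerate(centers))

def dynamic_partition(windows, d=0.2, alpha=0.3, max_rounds=100):
    clusters = []  # each cluster is a list of window indices
    # Steps 1 and 2: initial partition, updating centers after every insertion.
    for i, w in enumerate(windows):
        centers = [centroid([windows[j] for j in c]) for c in clusters]
        dist, k = nearest(w, centers) if centers else (math.inf, -1)
        if dist < d:
            clusters[k].append(i)
        else:
            clusters.append([i])
    # Step 3: adjustment passes until the partition stops changing.
    for _ in range(max_rounds):
        centers = [centroid([windows[j] for j in c]) for c in clusters]
        new = [[] for _ in clusters]
        for i, w in enumerate(windows):
            dist, k = nearest(w, centers)
            if dist <= d:
                new[k].append(i)
            else:
                new.append([i])  # too far from every center: singleton cluster
        new = [c for c in new if c]
        if sorted(new) == sorted(clusters):
            break
        clusters = new
    # Step 4: merge the closest pair of clusters while their distance is <= alpha.
    while len(clusters) > 1:
        centers = [centroid([windows[j] for j in c]) for c in clusters]
        dist, a, b = min((distance(centers[a], centers[b]), a, b)
                         for a in range(len(clusters))
                         for b in range(a + 1, len(clusters)))
        if dist > alpha:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The function returns lists of window indices, one list per cluster, which is the form needed to assign a state character to each window.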

The multivariate state matrix generation module aligns the state strings of all variables in time. Suppose there are n variables and each has the same observation start and end time; they then necessarily have the same number of windows and state strings of the same length. If the state string length is m, an n×m multivariate state matrix is generated.
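A minimal sketch of the alignment is shown below. The text does not say how the matrix is turned into Apriori transactions; treating each time column as one transaction whose items are "variable=state" labels is an assumption made here, and the function names are hypothetical.

```python
def state_matrix(state_strings):
    """Build the n x m matrix from a {variable name: state string} mapping."""
    names = list(state_strings)
    if len({len(s) for s in state_strings.values()}) != 1:
        raise ValueError("state strings must have equal length")
    return names, [list(state_strings[name]) for name in names]

def to_transactions(names, matrix):
    """One transaction per window position (column), one item per variable."""
    width = len(matrix[0])
    return [frozenset(f"{name}={row[t]}" for name, row in zip(names, matrix))
            for t in range(width)]
```

For example, the two state strings "aab" and "bba" for variables x and y give a 2×3 matrix and three transactions, the first being {"x=a", "y=b"}.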

The Apriori state association mining module applies the Apriori algorithm to the multivariate state matrix to mine frequent itemsets and association rules. Frequent itemsets are mined as follows. First, the transaction records are scanned to find all frequent 1-itemsets, denoted L₁; L₁ is then used to find the frequent 2-itemsets L₂, and so on until no more frequent k-itemsets can be found. Each iteration has two steps: first, candidate itemsets are generated by a join step and a prune step; second, the support of each candidate is computed, and candidates whose support exceeds the minimum support threshold of 0.001 are considered frequent. Association rules are then mined from the frequent itemsets: first, for each frequent itemset L, generate all non-empty subsets of L; second, for each non-empty subset s of L, generate a candidate rule "s→(L-s)", where (L-s) is what remains of L after removing s; finally, if the confidence of the candidate rule exceeds the given threshold of 0.5, output the rule, otherwise discard it. The confidence of a rule is computed as follows:

$Cf(L, s) = \frac{Sp(L)}{Sp(s)}$

where Cf(L, s) is the confidence of the rule "s→(L-s)", Sp(L) is the support of L, and Sp(s) is the support of s.
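The frequent-itemset search and the confidence test can be sketched compactly. This is a plain textbook-style Apriori written for readability rather than speed; the function names are hypothetical, transactions are assumed to be frozensets of item labels, and only proper subsets s are used so that the consequent L-s is never empty.

```python
from itertools import combinations

def apriori(transactions, min_support=0.001):
    """Return {itemset: support} for every itemset with support > min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    singles = {frozenset([item]) for t in transactions for item in t}
    level = {s for s in singles if support(s) > min_support}
    frequent, k = {}, 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) > min_support}
    return frequent

def mine_rules(frequent, min_confidence=0.5):
    """Emit (antecedent, consequent, confidence) with Cf = Sp(L) / Sp(s)."""
    rules = []
    for itemset, sp in frequent.items():
        for r in range(1, len(itemset)):  # proper, non-empty subsets only
            for s in map(frozenset, combinations(itemset, r)):
                cf = sp / frequent[s]
                if cf > min_confidence:
                    rules.append((s, itemset - s, cf))
    return rules
```

Looking up `frequent[s]` is safe because every subset of a frequent itemset is itself frequent, which is the property the prune step relies on.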

The association rule reduction module merges or deletes the redundant rules that were generated. The reduction steps are:

1) Sort the mined association rules by confidence in descending order;

2) For each frequent K-itemset (K > 1), keep only the association rule with the highest confidence;

3) If two association rules have the same antecedent, compare their consequents; if one consequent contains the other and the confidences differ only slightly, delete the rule whose consequent is the contained one;

4) If two association rules have the same consequent, compare their antecedents; if one antecedent contains the other and the confidences differ only slightly, delete the rule with the larger antecedent and keep the rule with the smaller antecedent;

5) To keep the knowledge consistent and avoid circular reasoning, check whether the association rules contain a cycle. Cycle detection uses a directed acyclic graph in which one node represents the antecedent, another represents the consequent, and a directed edge connects the two; the association rules, sorted by confidence in descending order, are checked one by one.
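Steps 1) and 5) can be illustrated with a small graph check. The text says cycles are detected but not what happens to a rule that would close one; discarding that rule, so that the kept rules always form a directed acyclic graph, is an assumption of this sketch. Rules are assumed to be (antecedent, consequent, confidence) tuples with hashable antecedents and consequents, and the function names are hypothetical.

```python
def reachable(graph, src, dst):
    """Depth-first search: is dst reachable from src along directed edges?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

def drop_cyclic_rules(rules):
    """Scan rules by descending confidence, skipping any that would close a cycle."""
    graph, kept = {}, []
    for ante, cons, cf in sorted(rules, key=lambda r: r[2], reverse=True):
        if reachable(graph, cons, ante):
            continue  # adding ante -> cons would create a cycle (dropped by assumption)
        graph.setdefault(ante, set()).add(cons)
        kept.append((ante, cons, cf))
    return kept
```

Processing in descending confidence order means that when a cycle would form, the rule sacrificed is always the weakest one on that cycle.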

Compared with the prior art, the invention first detects the states of each variable by clustering all window feature vectors with the dynamic partition clustering method: windows in the same cluster have similar features and represent the same state, while windows in different clusters represent different states. The state strings of all variables are aligned in time to obtain a multivariate state matrix. The Apriori algorithm is used to mine frequent co-occurrence relationships between the state values of different variables, yielding the associations between different values of multiple variables together with their formal expression and association strength. Finally, the generated association rules are reduced to remove redundant information. The invention takes the fuzziness or uncertainty of variable values into account, is robust to noise, and is suited to fine-grained analysis of state-value correlations over small parameter sets, yielding the mapping between state values.

Description of the drawings

Fig. 1 is a block diagram of the modules of the system of the invention.

Fig. 2 is a flowchart of the dynamic partition clustering module of the invention.

Table 1 lists the state strings of the example time-series variables.

Table 2 lists the state association rule mining results for the example time-series variables.

Fig. 3 is a schematic diagram of some of the state association rules of the example time-series variables.

Fig. 4 is a schematic diagram of some of the state association rules of the example time-series variables.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, the system implementing the invention comprises a data preprocessing module 1-1, a feature extraction module 1-2, a dynamic partition clustering module 1-3, a multivariate state matrix generation module 1-4, an Apriori state association mining module 1-5, and an association rule reduction module 1-6. The specific steps of the method are:

1) First, the data preprocessing module 1-1 performs outlier removal, equal-interval interpolation, and normalization on the raw time-series data to obtain valid data;

Outlier removal proceeds as follows: compute the mean and standard deviation of each window, and check whether the difference between each data point and the mean of its observation window exceeds 5 times the standard deviation of that window; if so, the data point is an outlier and is removed. The outlier-free time series is then interpolated at equal intervals: with interval Δt and start time T, the set of time points after interpolation is {T + n·Δt | n = 0, 1, 2, 3, …}. The value assigned to time T + i·Δt is that of the nearest earlier observation in the original series, that is, the observation immediately preceding the first original time point greater than T + i·Δt. The interpolated data is then linearly normalized: the time series is scanned once to obtain the maximum (max) and minimum (min) of the observations, and the normalized value of each observation is computed by the following formula, which maps the range of the original time series onto the interval [0, 1];

$x_i' = \frac{x_i - \min}{\Delta}$

2) Second, the feature extraction module 1-2 divides the valid data of each time-series variable into windows of equal length and extracts features from each window, including Fourier features, statistical features, and wavelet features, which together form a feature vector;

First, the univariate data is cut into windows of the configured length; second, features are extracted from the data in each window, including statistical, Fourier, and wavelet features. v1 = [mean, variance] is the statistical feature: the mean reflects the average level of the data in a window, and the variance describes how much the data in the window fluctuates. v2 = [Fourier coefficient 1, Fourier frequency 1, Fourier coefficient 2, Fourier frequency 2] is the frequency-domain feature: a Fourier transform yields a series of Fourier coefficients, which are sorted by absolute value in descending order, and the two largest coefficients and their corresponding frequencies are selected. v3 = [wavelet detail coefficient 1, …, wavelet detail coefficient n] is the time-frequency feature: a discrete wavelet transform of each window yields n wavelet detail coefficients. These three groups of features are combined into the window feature vector v = [v1, v2, v3].

3) Then, the dynamic partition clustering module 1-3 clusters the feature vectors of all windows of a single variable. The resulting clusters are sorted by size: the largest cluster is represented by the character 'a', the second largest by 'b', and so on, while clusters smaller than the given threshold (2) are treated as noise and represented by '?'. Each window is represented by the character of its cluster, so that the original numerical data is converted into a string, namely the state string of the variable;

First, step 2-1 makes the first window a cluster by itself, with the comprehensive feature vector of that window as the cluster center. Step 2-2 takes the next data item. Step 2-3 computes the distance between that item and all cluster centers. Step 2-4 selects the nearest cluster c and records the distance to it as dist:

$\cos(v_{ik}, v_{jk}) = \frac{v_{ik} \cdot v_{jk}}{|v_{ik}| \times |v_{jk}|}$

where cos(v_ik, v_jk) is the cosine similarity between the v_k (k = 1, 2, 3) vector of window i and the v_k vector of window j, with cos(v_ik, v_jk) ∈ [-1, +1]. The distance between two windows is computed by the following formulas.

$\mathrm{dist}(v_{ik}, v_{jk}) = 1 - \frac{\cos(v_{ik}, v_{jk}) + 1}{2}$

$\mathrm{dist}(v_i, v_j) = \frac{\sum_{k=1}^{3} \mathrm{dist}(v_{ik}, v_{jk})}{3}$

where dist(v_i, v_j) is the distance between windows i and j, with dist(v_i, v_j) ∈ [0, 1].
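The window distance can be written directly from these definitions. In this sketch a window is assumed to be a sequence of feature groups (v1, v2, v3), each a list of numbers; the function names and the convention that a zero vector has cosine similarity 0 are assumptions, since the text does not cover that case.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors, in [-1, +1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # zero-vector convention is an assumption

def group_distance(u, v):
    """Map cosine similarity onto a distance in [0, 1]."""
    return 1.0 - (cosine(u, v) + 1.0) / 2.0

def window_distance(wi, wj):
    """Average the per-group distances over the feature groups."""
    return sum(group_distance(a, b) for a, b in zip(wi, wj)) / len(wi)
```

Because the measure depends only on direction, two windows whose feature groups differ by a positive scale factor are at distance 0, opposite directions give 1, and orthogonal groups give 0.5.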

Step 2-5 compares dist with the given threshold d (d = 0.2). If dist ≤ d, step 2-6 merges the data item into cluster c and immediately updates the cluster center:

$cv_k = \frac{\sum_{i=1}^{n} v_{ik}}{n}, \quad (k = 1, 2, 3)$

where cv_k is the k-th feature vector of cluster center c, equal to the mean of the k-th feature vectors of all windows in the cluster. The center c of a cluster is the combination of cv₁, cv₂, and cv₃, that is, the average of the comprehensive feature vectors of all windows in the cluster;

If dist > d, step 2-7 makes the data item a cluster by itself, with the item as the cluster center. Step 2-8 checks whether all data has been processed. If not, step 2-2 takes the next data item; otherwise step 2-9 takes the first data item. Step 2-10 computes the distance between that item and the centers of all clusters. Step 2-11 selects the nearest cluster c and records the distance to it as dist. Step 2-12 compares dist with the given threshold d. If dist ≤ d, step 2-13 checks whether the item is in cluster c: if it is not, step 2-14 moves the item into cluster c; otherwise step 2-15 leaves the item unchanged. If dist > d, step 2-16 makes the item a cluster by itself. Step 2-17 checks whether all data has been processed. If not, step 2-18 takes the next data item; otherwise step 2-19 computes the centers of all clusters. Step 2-20 checks whether any cluster center has changed, that is, whether the clustering result has changed. If so, step 2-9 is executed again; otherwise step 2-21 selects the two closest clusters. Step 2-22 compares the distance dist between these two cluster centers with the given threshold α (α = 0.3). If dist < α, the two clusters are merged, the center of the merged cluster is computed, and step 2-21 is executed again; if dist ≥ α, the clustering process exits. In the clustering result, windows in the same cluster have similar features and are regarded as one state, while windows in different clusters represent different states. All clusters are sorted by size: the largest is represented by 'a', the second largest by 'b', and so on; clusters smaller than the given threshold (2) are treated as noise and represented by '?'. Each window is represented by the character of its cluster, so that the original numerical data is converted into a string, giving the state string of each variable.

Table 1 shows the state strings mined by the dynamic partition clustering module for all example variables. For each variable, windows in the same cluster have similar features and are regarded as one state, while windows in different clusters represent different states. All clusters are sorted by size: the largest is represented by 'a', the second largest by 'b', and so on; clusters smaller than the given threshold are treated as noise and represented by '?'. Each window is represented by the character of its cluster, so that the original numerical data is converted into a string, giving the state string of each variable.

Table 1
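The encoding of clusters into a state string can be sketched as follows. The input format (a list of clusters, each a list of window indices) and the function name are assumptions; the minimum cluster size of 2 follows the threshold given in the text. This sketch supports at most 26 non-noise clusters, since it draws labels from the lowercase alphabet.

```python
import string

def state_string(clusters, n_windows, min_size=2):
    """Label windows 'a', 'b', ... by descending cluster size; small clusters get '?'."""
    ordered = sorted(clusters, key=len, reverse=True)
    states = ["?"] * n_windows  # windows in noise clusters keep '?'
    letters = iter(string.ascii_lowercase)
    for members in ordered:
        if len(members) < min_size:
            continue  # cluster below the size threshold is treated as noise
        symbol = next(letters)
        for i in members:
            states[i] = symbol
    return "".join(states)
```

For instance, clusters [[2, 4], [0, 1, 3], [5]] over six windows give the string "aabab?": the three-window cluster becomes 'a', the two-window cluster 'b', and the singleton is noise.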

4)多变量状态矩阵生成模块1-4将所有变量的状态字符串按照时间对齐，形成状态矩阵；4) The multi-variable state matrix generation module 1-4 aligns the state character strings of all variables according to time to form a state matrix;

多变量状态矩阵生成模块1-4是将所有变量的状态字符串按照时间对齐，假设有n个变量，每个变量的起始观测时间和截止时间相同，那么它们的窗口数目必然相同，状态字符串长度也相同，假设状态字符串长度为m，则生成n*m的多变量状态矩阵。Multi-variable state matrix generation modules 1-4 are to align the state strings of all variables according to time. Assuming that there are n variables, and the start observation time and cut-off time of each variable are the same, then their window numbers must be the same, and the state characters The length of the string is also the same, assuming that the length of the state string is m, a multivariate state matrix of n*m is generated.

5) The Apriori state association mining module 1-5 uses the Apriori algorithm to mine frequent itemsets and association rules from the multivariate state matrix;

The Apriori state association mining module uses the Apriori algorithm to mine frequent itemsets and association rules from the multivariate state matrix. The Apriori algorithm mines frequent itemsets as follows. First, the transaction records are scanned to find all frequent 1-itemsets; this set is denoted L₁. L₁ is then used to find the frequent 2-itemsets L₂, and so on, until no further frequent k-itemsets can be found. Each iteration has two steps: first, candidate itemsets are generated by a join step and a prune step; second, the support of each candidate is computed, and candidates whose support is greater than the minimum support threshold 0.001 are considered frequent. Association rules are then mined from the frequent itemsets as follows. First, for each frequent itemset L, all non-empty subsets of L are generated. Next, for each non-empty subset s of L, a candidate rule "s→(L-s)" is generated, where (L-s) denotes what remains of L after s is removed. Finally, if the confidence of the candidate rule is greater than the given threshold 0.5, the rule is output; otherwise it is discarded. The confidence of a rule is calculated as follows:

$Cf(L, s) = \frac{Sp(L)}{Sp(s)}$
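As a non-limiting illustration, the frequent-itemset search described above might be sketched as follows. The function name `apriori`, the representation of transactions as Python sets, and the definition of support as the fraction of transactions containing an itemset are assumptions of this example; the toy data is invented purely for demonstration.

```python
from itertools import combinations


def apriori(transactions, min_support=0.001):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets (L1).
    level = {}
    for item in {i for t in transactions for i in t}:
        s = frozenset([item])
        sp = support(s)
        if sp > min_support:
            level[s] = sp
    frequent = dict(level)

    k = 2
    while level:
        prev = list(level)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {}
        for c in candidates:
            sp = support(c)
            if sp > min_support:
                level[c] = sp
        frequent.update(level)
        k += 1
    return frequent


toy = [{"A=a", "B=b"}, {"A=a", "B=b"}, {"A=a"}, {"B=a"}]
freq = apriori(toy, min_support=0.3)
pair = frozenset({"A=a", "B=b"})
# Confidence of the rule A=a -> B=b is Sp(L) / Sp(s).
print(freq[pair] / freq[frozenset({"A=a"})])
```

A larger `min_support` is used in the toy call only because four transactions cannot illustrate a threshold as small as 0.001.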

6) Finally, the association rule reduction module 1-6 reduces the detected association rules, eliminating redundant information to obtain the final multivariate state association rules.

2) For each K-order frequent itemset (K>1), keep only the association rule with the highest confidence; for example, for the second-order frequent itemset (A, B), if the association rules A→B and B→A are both generated, keep only the one with the higher confidence;

3) If two association rules have the same antecedent, compare their consequents. If one consequent contains the other and the confidences differ only slightly, delete the rule with the contained consequent; for example, given A→B and A→B, C, if |Cf(A→B)-Cf(A→B, C)|<δ, delete A→B and keep A→B, C;

4) If two association rules have the same consequent, compare their antecedents. If one antecedent contains the other and the confidences differ only slightly, delete the rule with the larger antecedent and keep the rule with the smaller one; for example, given A→B and A, C→B, if |Cf(A→B)-Cf(A, C→B)|<δ, delete A, C→B and keep A→B;

5) To keep the knowledge consistent and avoid circular reasoning, the association rules must be checked for cycles. Cycle detection is implemented with a directed acyclic graph: one node represents the antecedent, one node represents the consequent, and a directed edge connects the two. The association rules, sorted in descending order of confidence, are checked one by one. Suppose A→B is currently under consideration. First check whether A can be reached from B in the directed acyclic graph, that is, whether A is among all the successor nodes of B. If it is, the directed graph already contains B→A, and A→B is deleted. If it is not, A→B is added to the directed graph.
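As a non-limiting illustration, the cycle check of step 5) might be sketched as follows. The function name `remove_cycles`, the tuple layout of a rule, and the use of an iterative depth-first search for the reachability test are assumptions of this example.

```python
def remove_cycles(rules):
    """rules: list of (antecedent, consequent, confidence) with hashable nodes.

    Rules are visited in descending order of confidence; a rule is dropped
    if its consequent can already reach its antecedent in the graph.
    """
    graph = {}  # node -> set of direct successors
    kept = []

    def reachable(src, dst):
        # Depth-first search over the edges accepted so far.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
        return False

    for a, b, cf in sorted(rules, key=lambda r: -r[2]):
        if reachable(b, a):  # adding a -> b would close a cycle
            continue
        graph.setdefault(a, set()).add(b)
        kept.append((a, b, cf))
    return kept


rules = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "A", 0.7), ("A", "C", 0.6)]
print(remove_cycles(rules))  # C -> A is dropped because A already reaches C
```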

Table 2 gives the variable state association rules after reduction.

Table 2

Fig. 3 is a schematic diagram of the association rule "P002=b→P004=b". The dashed line (upper part) shows the 'b' state of variable P002; the gaps in it do not mean that data is missing, but that the corresponding windows are in states other than 'b'. The solid line (lower part) shows the 'b' state of variable P004. The confidence of this rule is 47/50=0.940: among the 146 records, variables P002 and P004 are in state 'b' at the same time 47 times, and variable P002 by itself is in state 'b' 50 times.
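As a worked check of this figure against the confidence formula, take L = {P002=b, P004=b} and s = {P002=b}, and assume that support is measured as the fraction of the 146 records that contain the itemset:

$Cf(L, s) = \frac{Sp(L)}{Sp(s)} = \frac{47/146}{50/146} = \frac{47}{50} = 0.94$

Under that assumption Sp(L) ≈ 0.322 and Sp(s) ≈ 0.342, so the itemset exceeds the minimum support threshold of 0.001 and the rule exceeds the confidence threshold of 0.5.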

Fig. 4 is a schematic diagram of the association rule "P073=b→P075=b", in which the dashed line (upper half) shows the 'b' state of variable P075 and the solid line (lower half) shows the 'b' state of variable P073. The 'b' states of P073 are visibly not all alike: some have a spike at the top, while others have a flat, straight top. This is because, when the states of a variable are mined, all window feature vectors of that variable are clustered; the window feature vectors within one cluster are similar but not necessarily identical, and the windows in one cluster represent one state. The data in the windows of a given state are therefore similar in features and shape without being identical. The confidence of this rule is 44/44=1.0: among the 146 records, variables P073 and P075 are in state 'b' at the same time 44 times, and variable P073 by itself is in state 'b' 44 times as well. In other words, whenever P073 is in state 'b', P075 is also in state 'b'.

Claims

1. A method for mining time-series data state associations, characterized in that the system implementing the method comprises a data preprocessing module (1-1), a feature extraction module (1-2), a dynamic partition clustering module (1-3), a multivariate state matrix generation module (1-4), an Apriori state association mining module (1-5) and an association rule reduction module (1-6), and the specific steps are:

1) First, the data preprocessing module (1-1) performs outlier removal, equal-interval interpolation and normalization on the original time-series data to obtain valid data;

2) Next, the feature extraction module (1-2) divides the valid data of each time-series variable into windows of equal length and extracts features from the data in each window, including Fourier features, statistical features and wavelet features, to form feature vectors;

3) Then, the dynamic partition clustering module (1-3) dynamically clusters the feature vectors of all windows of a single variable and sorts the resulting clusters by size: the largest cluster is represented by the character 'a', the second largest by 'b', and so on; clusters smaller than the given threshold 2 are regarded as noise and represented by '?'. Each window is represented by the character of its cluster, so the original numerical data is converted into string form, namely the state string of the variable;

4) The multivariate state matrix generation module (1-4) aligns the state strings of all variables by time to form a state matrix;

5) The Apriori state association mining module (1-5) uses the Apriori algorithm to mine frequent itemsets and association rules for the multivariate state matrix;

6) Finally, the association rule reduction module (1-6) reduces the detected association rules, eliminates redundant information, and obtains the final multivariate state association rules.

2. The method for mining time-series data state associations according to claim 1, characterized in that the data preprocessing module (1-1) removes outliers as follows: compute the mean and standard deviation of each window and determine whether the difference between each data point and the mean of its observation window is greater than 5 times the standard deviation of that window; if so, the data point is an outlier and is removed. Equal-interval interpolation is then performed on the time series after outlier removal: let the interval be Δt and the starting time be T; the set of times after equal-interval interpolation is {T+n*Δt | n=0, 1, 2, 3...}, and the value assigned to time T+i*Δt is the value in the original sequence at the time that is closest to and less than T+i*Δt, that is, the observation just before the first time in the original sequence that is greater than T+i*Δt. The data after equal-interval interpolation is linearly normalized: first scan the time series to obtain the maximum (max) and minimum (min) observed values, then compute the normalized value of each observation point by the following formula, which maps the value range of the original time series to the interval [0,1];

$x_i = \frac{x_i - \min}{\Delta}$

where x_i is the value of the i-th observation point and Δ=max-min.
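As a non-limiting illustration, the three preprocessing operations might be sketched as follows. The function names, the window length of 50 points, the use of the population standard deviation, the "at or before" reading of the interpolation rule (with a fallback to the first observation), and the handling of a constant series are all assumptions of this example.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def remove_outliers(times, values, window=50, k=5.0):
    """Drop points farther than k standard deviations from their window mean."""
    kept_t, kept_v = [], []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        mu, sigma = mean(chunk), pstdev(chunk)  # population std (assumption)
        for t, v in zip(times[start:start + window], chunk):
            if abs(v - mu) <= k * sigma:
                kept_t.append(t)
                kept_v.append(v)
    return kept_t, kept_v


def resample(times, values, t0, dt, n):
    """Value at t0 + i*dt is the last original observation at or before it."""
    out = []
    for i in range(n):
        idx = bisect_right(times, t0 + i * dt) - 1
        out.append(values[max(idx, 0)])  # before the first sample: use it anyway
    return out


def normalize(values):
    """Linear min-max normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    delta = hi - lo
    if delta == 0:
        return [0.0 for _ in values]  # constant series (assumption)
    return [(v - lo) / delta for v in values]


t, v = remove_outliers(list(range(50)), [0.0] * 49 + [100.0])
print(len(v))
print(resample([0, 2, 5], [1.0, 3.0, 2.0], t0=0, dt=1, n=6))
print(normalize([2.0, 4.0, 6.0]))
```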

3. The method for mining time-series data state associations according to claim 1, characterized in that the feature extraction module (1-2) operates as follows: first, the data of a single variable is cut into windows of the set size; second, features are extracted from the data in each window, including statistical features, Fourier features and wavelet features. v1=[mean, variance] is the statistical feature, where the mean reflects the average level of the data in a window and the variance describes how much the data in the window fluctuates. v2=[Fourier coefficient 1, Fourier frequency 1, Fourier coefficient 2, Fourier frequency 2] is the frequency-domain feature: a series of Fourier coefficients is obtained by Fourier transform, the coefficients are sorted in descending order of absolute value, and the two largest coefficients and their corresponding frequencies are selected. v3=[wavelet detail coefficient 1, ..., wavelet detail coefficient n] is the time-frequency-domain feature: a discrete wavelet transform is applied to each window to obtain n wavelet detail coefficients. These three groups of features are combined to form the window feature vector v, v=[v1, v2, v3].
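As a non-limiting illustration, the three feature groups might be computed as in the sketch below. The claim does not name a wavelet family, a transform level, or whether the zero-frequency term is included, so the single-level Haar transform, the inclusion of frequency 0, the use of coefficient magnitudes, and all function names are assumptions of this example.

```python
import cmath
import math
from statistics import mean, pvariance


def statistical_features(window):
    """v1 = [mean, variance] (population variance, an assumption)."""
    return [mean(window), pvariance(window)]


def fourier_features(window, top=2):
    """v2 = [magnitude 1, frequency 1, magnitude 2, frequency 2]."""
    n = len(window)
    spectrum = []
    # Naive O(n^2) DFT over the non-negative frequencies 0 .. n // 2.
    for f in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(window))
        spectrum.append((abs(coeff), f))
    spectrum.sort(key=lambda p: -p[0])
    feats = []
    for magnitude, f in spectrum[:top]:
        feats.extend([magnitude, f / n])  # frequency in cycles per sample
    return feats


def haar_detail_features(window):
    """v3 = single-level Haar detail coefficients (wavelet choice assumed)."""
    return [(window[i] - window[i + 1]) / math.sqrt(2)
            for i in range(0, len(window) - 1, 2)]


def window_feature_vector(window):
    """v = [v1, v2, v3], kept as three separate groups."""
    return [statistical_features(window),
            fourier_features(window),
            haar_detail_features(window)]


v1, v2, v3 = window_feature_vector([0, 1, 0, 1, 0, 1, 0, 1])
print(v1)
print(v2)
print(v3)
```

Keeping the three groups separate, rather than concatenating them, makes it straightforward to compute one cosine similarity per group when windows are compared.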

4. The method for mining time-series data state associations according to claim 1, characterized in that the dynamic partition clustering module (1-3) uses a dynamic partition clustering method to cluster the feature vectors of all windows of a single variable as follows:

1) The first window forms a cluster by itself, and the cluster center is the comprehensive feature vector of that window;

2) Initial cluster partition: the similarity between the second window and the first cluster center is calculated according to the following formula:

$\cos(v_{ik}, v_{jk}) = \frac{v_{ik} \cdot v_{jk}}{|v_{ik}| \times |v_{jk}|}$

where cos(v_ik, v_jk) is the cosine similarity between the v_k (k=1, 2, 3) vector of window i and the v_k vector of window j, with cos(v_ik, v_jk)∈[-1, +1]. The distance between two windows is calculated with the following formulas:

$dist(v_{ik}, v_{jk}) = 1 - \frac{\cos(v_{ik}, v_{jk}) + 1}{2}$

$dist(v_i, v_j) = \frac{\sum_{k=1}^{3} dist(v_{ik}, v_{jk})}{3}$

where dist(v_i, v_j) is the distance between windows i and j, with dist(v_i, v_j)∈[0,1];

If dist<d (d=0.2), merge window 2 into the first cluster and immediately update the cluster center according to the following formula:

$cv_k = \frac{\sum_{i=1}^{n} v_{ik}}{n}, \quad (k = 1, 2, 3)$

where cv_k is the k-th feature vector of the cluster center c, equal to the mean of the k-th feature vectors of all windows in the cluster; the cluster center c of a cluster is the combination of cv₁, cv₂ and cv₃, that is, the mean of the comprehensive feature vectors of all windows in the cluster. If dist≥d, window 2 forms a cluster by itself. The remaining windows are processed in turn: calculate the distance between the i-th window and every cluster center generated so far, and select the nearest cluster c and its distance dist; if dist<d, merge window i into cluster c, otherwise window i forms a separate cluster;

3) Cluster adjustment: take the i-th window (i=1, 2, ..., m), calculate the distance dist between it and every cluster center, and select the smallest dist and its corresponding cluster c. If dist≤d and window i is not in cluster c, move window i from its original cluster to cluster c; if dist≤d and window i is already in cluster c, do nothing; if dist>d, remove window i from its original cluster and make it a cluster by itself. Repeat until all windows have been processed, then recompute the cluster centers of all clusters. If any cluster center has changed, repeat the cluster adjustment, that is, step 3), until no cluster center changes; once all cluster centers are unchanged, perform step 4);

4) Cluster merging: calculate the distance between the cluster centers of every pair of clusters and select the two closest clusters c_i, c_j and their distance dist. If dist≤α (α=0.3), merge clusters c_i and c_j, compute the cluster center of the merged cluster, and repeat the merging step 4). If dist>α, no two clusters are close enough to merge, so the merging process exits and the clustering algorithm ends. In the clustering result, windows in the same cluster have similar features and are regarded as one state, while windows in different clusters represent different states. All clusters are sorted by size: the largest cluster is represented by the character 'a', the second largest by 'b', and so on; clusters smaller than a given threshold are regarded as noise and represented by '?'. Each window is then represented by the character of its cluster, so the original numerical data is converted into string form, which yields the state string of each variable.
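As a non-limiting illustration, steps 1) to 4) might be sketched as follows. Each window is assumed to be a list of three feature groups [v1, v2, v3]. The function names, the treatment of zero-length vectors, the iteration cap `max_iter`, and the use of unchanged cluster membership as the stopping test (which implies unchanged centers) are assumptions of this example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 1.0  # zero vector handling is an assumption


def dist(w1, w2):
    """Mean over the feature groups of 1 - (cos + 1) / 2."""
    return sum(1 - (cosine(a, b) + 1) / 2 for a, b in zip(w1, w2)) / len(w1)


def center(windows, members):
    """Per feature group, the component-wise mean over the member windows."""
    groups = len(windows[members[0]])
    return [[sum(col) / len(members)
             for col in zip(*(windows[i][k] for i in members))]
            for k in range(groups)]


def nearest(window, centers):
    dists = [dist(window, c) for c in centers]
    j = min(range(len(centers)), key=dists.__getitem__)
    return j, dists[j]


def initial_partition(windows, d=0.2):
    clusters = [[0]]  # step 1): the first window is its own cluster
    for i in range(1, len(windows)):
        centers = [center(windows, c) for c in clusters]
        j, dj = nearest(windows[i], centers)
        if dj < d:
            clusters[j].append(i)  # center is recomputed on the next pass
        else:
            clusters.append([i])
    return clusters


def adjust(windows, clusters, d=0.2, max_iter=100):
    for _ in range(max_iter):
        centers = [center(windows, c) for c in clusters]
        new = [[] for _ in clusters]
        for i in range(len(windows)):
            j, dj = nearest(windows[i], centers)
            if dj <= d:
                new[j].append(i)
            else:
                new.append([i])  # too far from every center: separate cluster
        new = [c for c in new if c]
        if {frozenset(c) for c in new} == {frozenset(c) for c in clusters}:
            break
        clusters = new
    return clusters


def merge_clusters(windows, clusters, alpha=0.3):
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        centers = [center(windows, c) for c in clusters]
        best, a, b = min((dist(centers[a], centers[b]), a, b)
                         for a in range(len(clusters))
                         for b in range(a + 1, len(clusters)))
        if best > alpha:
            break
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters


wA = [[1.0, 0.0]] * 3
wA2 = [[1.0, 0.1]] * 3
wB = [[0.0, 1.0]] * 3
windows = [wA, wA2, wB, wA, wB]
clusters = merge_clusters(windows, adjust(windows, initial_partition(windows)))
print(clusters)
```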

5. The method for mining time-series data state associations according to claim 1, characterized in that the multivariate state matrix generation module (1-4) aligns the state strings of all variables by time: assuming there are n variables and every variable has the same start and end observation times, the variables necessarily have the same number of windows and state strings of the same length; if the state-string length is m, an n*m multivariate state matrix is generated.

6. The method for mining time-series data state associations according to claim 1, characterized in that the Apriori state association mining module (1-5) uses the Apriori algorithm to mine frequent itemsets and association rules from the multivariate state matrix. The Apriori algorithm mines frequent itemsets as follows: first, the transaction records are scanned to find all frequent 1-itemsets, and this set is denoted L₁; L₁ is then used to find the frequent 2-itemsets L₂, and so on, until no further frequent k-itemsets can be found. Each iteration has two steps: first, candidate itemsets are generated by a join step and a prune step; second, the support of each candidate is computed, and candidates whose support is greater than the minimum support threshold 0.001 are considered frequent. Association rules are mined from the frequent itemsets as follows: first, for each frequent itemset L, all non-empty subsets of L are generated; next, for each non-empty subset s of L, a candidate rule "s→(L-s)" is generated, where (L-s) denotes what remains of L after s is removed; finally, if the confidence of the candidate rule is greater than the given threshold 0.5, the rule is output, otherwise it is discarded. The confidence of a rule is calculated as follows:

$Cf(L, s) = \frac{Sp(L)}{Sp(s)}$

where Cf(L,s) is the confidence of the rule "s→(L-s)", Sp(L) is the support of L, and Sp(s) is the support of s.
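As a non-limiting illustration, rule generation from a table of frequent itemsets might be sketched as follows. The function name `generate_rules` and the dictionary layout are assumptions of this example, the support values are invented for demonstration, and only proper subsets s are enumerated so that the consequent (L-s) is never empty.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence=0.5):
    """frequent: {frozenset(itemset): support}, containing every subset of
    each listed itemset. Returns (antecedent, consequent, confidence) tuples
    with confidence = Sp(L) / Sp(s).
    """
    rules = []
    for itemset, sp_l in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), r):
                s = frozenset(subset)
                cf = sp_l / frequent[s]
                if cf > min_confidence:  # strictly greater, as in the claim
                    rules.append((s, itemset - s, cf))
    return rules


frequent = {
    frozenset({"X=a"}): 0.5,
    frozenset({"Y=b"}): 0.9,
    frozenset({"X=a", "Y=b"}): 0.4,
}
for antecedent, consequent, cf in generate_rules(frequent):
    print(sorted(antecedent), "->", sorted(consequent), cf)
```

In this toy table only X=a→Y=b survives, because the reverse rule has confidence 0.4/0.9, which is below the threshold.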

7. The method for mining time-series data state associations according to claim 1, characterized in that the association rule reduction module (1-6) merges or deletes the redundant rules that were generated, the reduction steps being:

1) Sort the obtained association rules in descending order of confidence;

2) For each K-order frequent itemset (K>1), keep only the association rule with the highest confidence;

3) If two association rules have the same antecedent, compare their consequents; if one consequent contains the other and the confidences differ only slightly, delete the rule with the contained consequent;

4) If two association rules have the same consequent, compare their antecedents; if one antecedent contains the other and the confidences differ only slightly, delete the rule with the larger antecedent and keep the rule with the smaller one;

5) To keep the knowledge consistent and avoid circular reasoning, check the association rules for cycles. Cycle detection is implemented with a directed acyclic graph: one node represents the antecedent, one node represents the consequent, and a directed edge connects the two; the association rules, sorted in descending order of confidence, are checked one by one.
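As a non-limiting illustration, reduction steps 1) to 4) might be sketched as follows. The claim only requires that the confidence difference be small, so the tolerance `delta=0.05` is a hypothetical value. The function name, the tuple layout, and the choice to evaluate steps 3) and 4) against the full list left after step 2) (rather than sequentially) are also assumptions of this example.

```python
def reduce_rules(rules, delta=0.05):
    """rules: list of (antecedent, consequent, confidence); both sides are
    frozensets of items. Returns the rules kept after steps 1) to 4).
    """
    # Step 1): sort by confidence, highest first.
    rules = sorted(rules, key=lambda r: -r[2])

    # Step 2): per frequent itemset (antecedent | consequent), keep the
    # first rule seen, which is the one with the highest confidence.
    best = {}
    for a, c, cf in rules:
        best.setdefault(a | c, (a, c, cf))
    rules = list(best.values())

    dropped = set()
    for i, (a1, c1, cf1) in enumerate(rules):
        for j, (a2, c2, cf2) in enumerate(rules):
            if i == j or abs(cf1 - cf2) >= delta:
                continue
            if a1 == a2 and c1 < c2:  # step 3): rule i has the contained consequent
                dropped.add(i)
            if c1 == c2 and a1 > a2:  # step 4): rule i has the larger antecedent
                dropped.add(i)
    return [r for k, r in enumerate(rules) if k not in dropped]


A, B = frozenset("A"), frozenset("B")
rules = [
    (A, B, 0.90),
    (B, A, 0.60),                # loses to A -> B in step 2)
    (A, frozenset("BC"), 0.88),  # absorbs A -> B in step 3)
    (frozenset("AD"), B, 0.89),  # removed in step 4) in favour of A -> B
]
print(reduce_rules(rules))
```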