CN104050070B

CN104050070B - High-dimensional flow data changing point detection method in distributed system

Info

Publication number: CN104050070B
Application number: CN201410243426.XA
Authority: CN
Inventors: 赵丽; 刘欣然; 曹玮; 付戈; 刘谦
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2014-03-28
Filing date: 2014-03-28
Publication date: 2017-02-22
Anticipated expiration: 2034-03-28
Also published as: CN104050070A

Abstract

The invention provides a method for detecting change points in high-dimensional traffic data in a distributed system. The method comprises the following steps: obtaining standardized high-dimensional raw traffic data from the distributed system; reducing the dimensionality of the high-dimensional raw traffic data; clustering the ordered sample data characterized by the principal components and determining the non-trivial points of the principal component data; and judging whether the raw traffic data of each dimension changes significantly at the corresponding non-trivial point. The method detects change points in the high-dimensional traffic data of a distributed system and thereby helps administrators monitor and analyze that traffic data.

Description

A method for detecting change points in high-dimensional traffic data in a distributed system

Technical field

The present invention relates to a detection method in the field of data mining technology, and in particular to a method for detecting change points in high-dimensional traffic data in a distributed system.

Background technology

Monitoring and analyzing traffic data in a distributed system helps administrators quickly grasp the load of the different applications in the system, and in turn assess the soundness of the software architecture and detect abnormal conditions in real time. Analysis of a distributed system's traffic data also helps investigate information such as website popularity, hot spots in the accessed content, and users' access habits.

However, a distributed system contains a large number of servers, and the applications deployed on each server continuously produce large volumes of traffic data. The resulting traffic data is high-dimensional (here, high-dimensional means two or more dimensions) and periodic, so administrators find it hard to observe and analyze the data directly. For example, the page-click volume of an HTTP server is often periodic: the daytime volume is much larger than the nighttime volume. If the daytime volume on some day drops markedly but is still larger than at night, administrators may well fail to notice the change. A change that differs from the periodic fluctuation of the data is called a non-trivial change, and a data point at which a non-trivial change occurs is called a non-trivial change point, i.e. a data change point. Furthermore, because the servers are numerous and the traffic data voluminous while administrators are comparatively few, observing these data directly is very costly or even infeasible. The prior art does not propose a method for detecting changes in high-dimensional traffic data, so an effective traffic-data change-point detection method is needed.

The techniques involved in the present invention include principal component analysis (PCA), ordered-sample clustering, and the F-test.

Principal component analysis describes the original high-dimensional traffic data with a small number of principal component features, reducing the dimensionality of the feature space while retaining the most important information in the samples. Its principle is to project a high-dimensional vector x, whose components may be correlated, through an eigenvector matrix into a new orthogonal space spanned by the principal components. The principal components are ordered by the variance of the original data projected onto each of them; a low-dimensional vector y formed from the few leading principal components characterizes the original high-dimensional data while losing only secondary information. Conversely, the original high-dimensional vector can be approximately reconstructed from the low-dimensional principal component vector and the eigenvector matrix.

The optimal segmentation algorithm (also called "ordered-sample clustering") finds an optimal segmentation of an ordered sample sequence. Its basic idea is: given a sample sequence and a number of classes, search all possible partitions and take the one that minimizes the total within-segment sum of squared deviations as the final partition. Because the total sum of squared deviations of a data sequence equals the within-segment sum plus the between-segment sum, minimizing the within-segment sum maximizes the between-segment sum; that is, each segment is as homogeneous as possible and the segments differ as much as possible, so the partition is optimal. The earliest optimal segmentation algorithm, of complexity O(n²), was proposed by Fisher in 1958; it classifies ordered samples on the principle of minimizing the differences among samples within each class.
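The identity used in this argument (total sum of squared deviations = within-segment sum + between-segment sum) can be checked numerically. The following is a minimal sketch; the helper names `sum_sq` and `decompose` are illustrative and do not come from the patent.

```python
def sum_sq(values):
    """Sum of squared deviations of the values from their own mean."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)


def decompose(segments):
    """Return (total, within, between) sums of squared deviations.

    `segments` is a list of consecutive segments of one ordered sequence.
    """
    everything = [v for seg in segments for v in seg]
    grand = sum(everything) / len(everything)
    # Within-segment part: each segment measured against its own mean.
    within = sum(sum_sq(seg) for seg in segments)
    # Between-segment part: each segment mean measured against the grand mean.
    between = sum(len(seg) * (sum(seg) / len(seg) - grand) ** 2 for seg in segments)
    return sum_sq(everything), within, between


total, within, between = decompose([[1, 2, 3], [7, 8, 9]])
print(total, within, between)  # total equals within + between
```

Because the total is fixed for a given sequence, any partition that lowers the within-segment part raises the between-segment part by the same amount.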

The F-test, also called the "test of homogeneity of variance", is a statistical method that judges whether two groups of samples differ significantly by testing whether their variances differ significantly. It compares the between-group variance with the within-group variance of the two groups of data to obtain the F statistic: if this ratio exceeds the critical value of the F distribution, the difference is considered significant; if it is below the critical value, the difference is not considered significant. The critical value depends on the degrees of freedom and the confidence level and can be read from a table of F-distribution critical values.

Summary of the invention

To overcome the above deficiencies of the prior art, the invention provides a method for detecting change points in high-dimensional traffic data in a distributed system.

The solution adopted to achieve the above purpose is:

A method for detecting change points in high-dimensional traffic data in a distributed system, the improvement being that the method comprises the following steps:

I. obtaining the standardized high-dimensional raw traffic data of the distributed system;

II. reducing the dimensionality of the high-dimensional raw traffic data;

III. clustering the ordered sample data characterized by the principal components and determining the non-trivial points of the principal component data;

IV. judging whether the raw traffic data of each dimension undergoes a non-trivial change at the non-trivial points.

Further, step I comprises:

S101: the servers of the distributed system are provided with traffic collectors, which obtain the raw traffic data of the applications per unit time;

S102: the raw traffic data obtained from the different servers at the same time is expressed as a high-dimensional vector, and the raw traffic data of the different time points form the raw traffic matrix

$$X = \begin{bmatrix} x_1(1) & x_2(1) & \cdots & x_p(1) \\ x_1(2) & x_2(2) & \cdots & x_p(2) \\ \vdots & \vdots & \ddots & \vdots \\ x_1(n) & x_2(n) & \cdots & x_p(n) \end{bmatrix}$$

where x_j(t) is the data volume produced by the j-th application at the t-th sampling time point; row t of the matrix gives the data volume produced by all applications at the t-th time point, and column j gives the data volume produced by the j-th application at all time points;

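S103: the raw traffic matrix X is standardized to obtain the standardized raw traffic matrix

$$X' = \begin{bmatrix} x'_1(1) & x'_2(1) & \cdots & x'_p(1) \\ x'_1(2) & x'_2(2) & \cdots & x'_p(2) \\ \vdots & \vdots & \ddots & \vdots \\ x'_1(n) & x'_2(n) & \cdots & x'_p(n) \end{bmatrix}$$

where $x'_j(t) = (x_j(t) - \bar{x}_j) / s_j$, and $\bar{x}_j$ and $s_j$ are the mean and standard deviation of column j of X.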

Further, in step II, principal component analysis (PCA) is applied to the raw traffic data to determine the principal components of the standardized raw traffic.

Further, in step III, the principal components obtained in step II are clustered as the features of the high-dimensional traffic data; the optimal split points of the ordered data samples, determined by the periodic ordered-sample clustering method, are the non-trivial points of the traffic data.

Further, in step IV, for each dimension of the raw traffic data, a periodic homogeneity-of-variance test is performed on the traffic data on the two sides of each non-trivial point to judge whether a non-trivial change exists there: if the F statistic exceeds the F-test critical value, a non-trivial change exists; otherwise it does not.

Further, step III comprises the following steps:

S301: the principal component data η(t) comprise one or more principal components y′_k(t); b(n, m) denotes a partition of n ordered samples into m classes, b(n, m): G₁ = {i₁, i₁+1, ..., i₂−1}, G₂ = {i₂, i₂+1, ..., i₃−1}, ..., G_m = {i_m, i_m+1, ..., n}, with split points 1 = i₁ < i₂ < ... < i_m < i_{m+1} − 1, i_{m+1} = n + 1;

S302: the principal component data are periodic; let s be the offset within a period and t_p the period; the mean of the samples of class G_k at offset s is

$$\bar{\eta}_{G_k}^{(s)} = \frac{1}{n_{G_k}^{(s)}} \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \eta(t)$$

where s.t. t_p | t − s denotes the constraint that t − s is divisible by t_p, and $n_{G_k}^{(s)}$ is the number of samples in G_k that satisfy it;

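S303: the periodic within-class sum of squared deviations is obtained as

$$D(i_k, i_{k+1} - 1) = \sum_{s=0}^{t_p - 1} \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \left(\eta(t) - \bar{\eta}_{G_k}^{(s)}\right) \left(\eta(t) - \bar{\eta}_{G_k}^{(s)}\right)^{T}$$

where s.t. t_p | t − s denotes the constraint that t − s is divisible by t_p, and T denotes the transpose of a vector; the loss function is defined as

$$L[b(n, m)] = \sum_{k=1}^{m} D(i_k, i_{k+1} - 1)$$

S304: the non-trivial points are determined by dynamic programming.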

Further, step IV comprises the following steps:

S401: let H₀ denote that the traffic data produced by the application has no non-trivial change at the split time point, and H₁ that it has a non-trivial change at the split time point;

S402: the F statistic is determined from the between-class variation SSA and the within-class variation SSE;

S403: given a significance level α, the critical value F_α corresponding to confidence level α is determined; if F > F_α, x′_j(t) is considered to have a non-trivial change at time point t; otherwise x′_j(t) has no non-trivial change at time point t.

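Further, step S402 comprises the following steps:

S4021: the between-class variation SSA is determined by

$$SSA_{G_{k-1}, G_k} = \sum_{s=0}^{t_p - 1} \left[ n_{G_{k-1}}^{(s)} \left( \overline{x'_j}_{G_{k-1}}^{(s)} - \overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} \right)^2 + n_{G_k}^{(s)} \left( \overline{x'_j}_{G_k}^{(s)} - \overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} \right)^2 \right]$$

where $\overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)}$ is the mean of x′_j(t) over the samples at offset s in both classes:

$$\overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} = \frac{1}{n_{G_{k-1}}^{(s)} + n_{G_k}^{(s)}} \sum_{t = i_{k-1},\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} x'_j(t)$$

S4022: the within-class variation SSE is determined by

$$SSE_{G_{k-1}, G_k} = \sum_{s=0}^{t_p - 1} \left[ \sum_{t = i_{k-1},\ \mathrm{s.t.}\ t_p \mid t - s}^{i_k - 1} \left( x'_j(t) - \overline{x'_j}_{G_{k-1}}^{(s)} \right)^2 + \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \left( x'_j(t) - \overline{x'_j}_{G_k}^{(s)} \right)^2 \right]$$

S4023: the total sum of squared deviations is $SST = SSA + SSE$;

S4024: the F statistic is determined by

$$F = \frac{SSA / f_{SSA}}{SSE / f_{SSE}}$$

where f_SSA and f_SSE are the degrees of freedom of SSA and SSE respectively.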

Compared with the prior art, the invention has the following advantages:

1. The method reduces the dimensionality of the raw traffic data by principal component analysis. Principal component analysis captures the correlation among the dimensions and characterizes the main variation of the sample data with a few principal components. When a non-trivial change occurs, the raw data no longer follow that correlation pattern and the principal components change accordingly, so changes in the raw data can be discovered by observing a few principal components.

2. Traffic data in a distributed system is periodic. The method extends ordered-sample clustering to periodic samples and uses a dynamic programming algorithm to compute the optimal split points of the ordered data, i.e. the non-trivial change points of the traffic data. The method redefines the loss function for dividing n ordered sample points into k classes: when computing the within-class distance of the sample points of a class, the within-class distances of the sample points that share the same offset in different periods are computed separately and then summed.

3. For traffic data characterized by ordered principal components, the method partitions the sample sequence by ordered-sample clustering. This avoids a problem of order-agnostic methods such as K-means and DBSCAN, which may place sample points of the same period that have undergone no non-trivial change in different classes.

4. For periodic traffic data, the method proposes an F-test extended to periodic data and redefines the F statistic on the basis of the traditional F-test, so as to accurately find the raw traffic that produces a non-trivial change. Given the non-trivial change points obtained, the extended F-test checks each raw traffic dimension for a significant difference between the two sides of the change point, identifying exactly which raw traffic has changed.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention for detecting change points in high-dimensional traffic data in a distributed system;

Fig. 2 is the application structure exemplary plot of the inventive method.

Detailed description of the embodiments

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, which is a flow chart of the method of the present invention for detecting change points in high-dimensional traffic data in a distributed system, the method comprises the following steps:

Step 1: obtaining the standardized raw traffic data of the distributed system;

Step 2: reducing the dimensionality of the high-dimensional raw traffic data;

Step 3: clustering the ordered sample data characterized by the principal components and determining the non-trivial points of the raw traffic data;

Step 4: obtaining the traffic data at the non-trivial points.

In step 1, the raw traffic data in the distributed system is obtained and standardized, specifically:

S101: in the distributed system, traffic collection clients deployed on the monitored servers obtain, at a fixed time interval, the data volume produced by the applications;

S102: the data produced by the different applications at the same sampling time is expressed as a high-dimensional vector, and the data of the different time points form the raw traffic data matrix, as follows:

$$X = \begin{bmatrix} x_1(1) & x_2(1) & \cdots & x_p(1) \\ x_1(2) & x_2(2) & \cdots & x_p(2) \\ \vdots & \vdots & \ddots & \vdots \\ x_1(n) & x_2(n) & \cdots & x_p(n) \end{bmatrix}$$

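S103: the raw traffic data matrix X is standardized to obtain the standardized raw traffic data matrix, as follows:

$$X' = \begin{bmatrix} x'_1(1) & x'_2(1) & \cdots & x'_p(1) \\ x'_1(2) & x'_2(2) & \cdots & x'_p(2) \\ \vdots & \vdots & \ddots & \vdots \\ x'_1(n) & x'_2(n) & \cdots & x'_p(n) \end{bmatrix}$$

where $x'_j(t) = (x_j(t) - \bar{x}_j) / s_j$, and $\bar{x}_j$ and $s_j$ are the mean and standard deviation of column j of X.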
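The standardization of S103 can be sketched as follows. This is a minimal illustration that assumes a column-wise z-score using the sample standard deviation (n − 1 in the denominator); the function name `standardize` and the handling of constant columns are assumptions, not details given in the patent.

```python
from statistics import mean, stdev


def standardize(X):
    """Column-wise z-score of an n x p traffic matrix given as a list of rows.

    Row t holds the data volumes of all p applications at time point t;
    column j holds the series of application j.
    """
    columns = list(zip(*X))                      # p tuples, each of length n
    means = [mean(col) for col in columns]
    # Assumption: sample standard deviation; a constant column gets divisor 1.0
    # so that it maps to all zeros instead of raising a division error.
    stds = [stdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in X]


X = [[1, 2], [3, 2], [5, 2]]                     # 3 time points, 2 applications
print(standardize(X))
```

Standardizing each column puts applications with very different traffic volumes on a common scale before the principal components are computed.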

In step 2, principal component analysis (PCA) is applied to reduce the dimensionality of the standardized raw traffic matrix, yielding a coefficient matrix and a principal component matrix, and the principal components of the standardized raw traffic matrix are determined.

PCA maps the original, possibly correlated, multidimensional data into a new orthogonal space spanned by the principal components. The principal components are ordered by the variance of the original data projected onto each of them; the few leading principal components have the largest variance and carry the most important information.

Step 2 specifically comprises the following steps:

The standardized raw traffic data X′ is multiplied by the eigenvector matrix A to obtain the principal component matrix Y′, i.e. Y′ = X′A, where A is a p × p matrix.

Each raw data vector x′(t) = [x′₁(t), x′₂(t), ..., x′_p(t)] in the raw traffic data matrix is converted into a new vector y′(t) = [y′₁(t), y′₂(t), ..., y′_p(t)] composed of principal components, i.e. y′(t) = x′(t)A;

where y′_k(t) = x′₁(t)a_{1k} + x′₂(t)a_{2k} + ... + x′_p(t)a_{pk}; Y′ is an n × p matrix, and λ denotes the variances of its columns, i.e. λ_k = var(y′_k). The first column y′₁ of Y′ has the largest variance λ₁, and the last column y′_p has the smallest variance λ_p.

The parameters Y′, A and λ can be obtained by algorithms such as covariance analysis.
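One way to obtain Y′, A and λ is an eigendecomposition of the covariance matrix of the standardized data. The sketch below assumes NumPy is available and that the input has at least two columns; the function name `principal_components` is illustrative and not taken from the patent.

```python
import numpy as np


def principal_components(X_std):
    """Return (Y, A, lam) with Y = X_std @ A and lam[k] = var(Y[:, k]).

    X_std is the n x p standardized traffic matrix (p >= 2 assumed).
    """
    X_std = np.asarray(X_std, dtype=float)
    cov = np.cov(X_std, rowvar=False)           # p x p covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder so lam[0] is the largest
    lam, A = eigvals[order], eigvecs[:, order]
    return X_std @ A, A, lam


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 1] += 2 * X[:, 0]                          # make two columns correlated
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Y, A, lam = principal_components(X)
print(lam)                                      # variances in descending order
```

Keeping only the first few columns of `Y` gives the low-dimensional series η(t) that the clustering step operates on; how many columns to keep is a choice the caller has to make.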

In step 3, the few leading principal components obtained in step 2 are taken as the new features of the high-dimensional traffic data. Taking the periodicity of the traffic data into account, a new ordered-sample clustering method is provided on the basis of the existing one, and a dynamic programming algorithm determines the optimal split points of the ordered data samples, which are the non-trivial points of the traffic data.

Step 3 specifically comprises the following steps:

S301: the selected leading principal components are denoted η(1), η(2), ..., η(n), the ordered data of n time points, where η(t) consists of one or more principal components y′_k(t), for example {y′₁(t), y′₂(t)};

A partition of the n ordered samples into m classes is denoted b(n, m) and written as:

b(n, m): G₁ = {i₁, i₁+1, ..., i₂−1}, G₂ = {i₂, i₂+1, ..., i₃−1}, ..., G_m = {i_m, i_m+1, ..., n}, with split points 1 = i₁ < i₂ < ... < i_m < n = i_{m+1} − 1, i_{m+1} = n + 1;

S302: the principal component data are periodic; let s be the offset within a period and t_p the period; the mean of the samples of class G_k at offset s is:

$$\bar{\eta}_{G_k}^{(s)} = \frac{1}{n_{G_k}^{(s)}} \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \eta(t)$$

where a | b means that b is divisible by a; s.t. t_p | t − s denotes the constraint that t − s is divisible by t_p; and $n_{G_k}^{(s)}$ is the number of samples in G_k that satisfy it.

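S303: the periodic within-class sum of squared deviations is:

$$D(i_k, i_{k+1} - 1) = \sum_{s=0}^{t_p - 1} \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \left(\eta(t) - \bar{\eta}_{G_k}^{(s)}\right) \left(\eta(t) - \bar{\eta}_{G_k}^{(s)}\right)^{T}$$

where s.t. t_p | t − s denotes the constraint that t − s is divisible by t_p, and T denotes the transpose of a vector. The loss function is defined as:

$$L[b(n, m)] = \sum_{k=1}^{m} D(i_k, i_{k+1} - 1)$$

When n and m are fixed, the smaller L[b(n, m)] is, the smaller the within-class sums of squared deviations and the more reasonable the classification; P(n, m) = {G₁, G₂, …, G_m} denotes the optimal classification;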
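The periodic within-class sum of squares and the loss of a given partition can be sketched in plain Python. Time indices are 1-based to match the text; the names `periodic_within_ss`, `loss` and `_as_vector` are illustrative assumptions, not identifiers from the patent.

```python
def _as_vector(x):
    """Treat a scalar sample as a one-component vector."""
    return list(x) if hasattr(x, "__iter__") else [x]


def periodic_within_ss(eta, start, end, period):
    """D(start, end): eta[t - 1] is the sample at time t, start..end inclusive."""
    total = 0.0
    for s in range(period):
        # Samples of the class whose time index t satisfies period | (t - s).
        group = [_as_vector(eta[t - 1])
                 for t in range(start, end + 1) if (t - s) % period == 0]
        if not group:
            continue
        centre = [sum(col) / len(group) for col in zip(*group)]
        total += sum((x - c) ** 2 for row in group for x, c in zip(row, centre))
    return total


def loss(eta, starts, period):
    """L[b(n, m)] for a partition given by its 1-based class starts i_1 < ... < i_m."""
    bounds = list(starts) + [len(eta) + 1]       # append i_{m+1} = n + 1
    return sum(periodic_within_ss(eta, bounds[k], bounds[k + 1] - 1, period)
               for k in range(len(starts)))


day_night = [0, 10, 0, 10, 0, 10]                # pure periodic pattern, period 2
print(periodic_within_ss(day_night, 1, 6, 2))    # 0.0: no deviation within offsets
print(periodic_within_ss(day_night, 1, 6, 1))    # 150.0: period ignored
```

Comparing the two calls shows why the offsets are handled separately: a regular day/night swing contributes nothing to D when the period is respected, but a large amount when it is ignored.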

S304: the dynamic programming method proposed by Fisher in 1958 is used as the search algorithm to determine the non-trivial points.
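The search in S304 can be illustrated with a small dynamic program over segment boundaries. The sketch below handles a single principal component (a scalar series), uses 0-based indices, and recomputes segment costs naively; it is an assumption-laden illustration of the recurrence, not the patent's implementation, and the names `within_ss` and `optimal_segmentation` are hypothetical.

```python
def within_ss(y, i, j, period):
    """Periodic within-class sum of squares of y[i:j] (0-based, j exclusive)."""
    total = 0.0
    for s in range(period):
        group = [y[t] for t in range(i, j) if (t - s) % period == 0]
        if group:
            centre = sum(group) / len(group)
            total += sum((v - centre) ** 2 for v in group)
    return total


def optimal_segmentation(y, m, period):
    """Split y into m consecutive classes minimising the periodic loss.

    Returns (starts, loss), where starts holds the 0-based index of the first
    sample of classes 2..m; the 1-based split point i_k is that index plus 1.
    """
    n, inf = len(y), float("inf")
    best = [[inf] * (n + 1) for _ in range(m + 1)]   # best[k][j]: y[:j] in k classes
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):                # last class is y[i:j]
                if best[k - 1][i] == inf:
                    continue
                cost = best[k - 1][i] + within_ss(y, i, j, period)
                if cost < best[k][j]:
                    best[k][j], back[k][j] = cost, i
    starts, j = [], n
    for k in range(m, 0, -1):                        # walk the stored choices back
        j = back[k][j]
        starts.append(j)
    return sorted(starts)[1:], best[m][n]


series = [0, 10] * 3 + [3, 4] * 3                    # pattern changes at index 6
print(optimal_segmentation(series, 2, 2))
```

The number of classes m and the period are inputs here; the text does not say how they are chosen, so the caller has to supply them.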

In step 4, for each dimension of the raw traffic data, a periodic homogeneity-of-variance test is performed on the traffic data on the two sides of each non-trivial point. The invention takes the periodicity of the traffic data into account and further improves the homogeneity-of-variance test, which is then used to judge whether a significant change occurs at the non-trivial point.

Step 4 specifically comprises the following steps:

S401: for the traffic produced by each application at the split time point i_k, two cases are possible, "no non-trivial change" and "a non-trivial change exists"; the hypotheses are:

H₀: the traffic data vector x_j produced by application j has no non-trivial change at time point i_k; H₁: the traffic data vector x_j produced by application j has a non-trivial change at time point i_k;

For a change time point i_k, the test determines whether the sample data in G_{k−1} and G_k differ significantly. The F statistic is computed; if it exceeds the F-test critical value, a non-trivial change is judged to exist; otherwise no non-trivial change is considered to exist.

S402: the F statistic is computed from the "between-class variation" SSA and the "within-class variation" SSE.

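S4021: the formula for SSA with periodicity is:

$$SSA_{G_{k-1}, G_k} = \sum_{s=0}^{t_p - 1} \left[ n_{G_{k-1}}^{(s)} \left( \overline{x'_j}_{G_{k-1}}^{(s)} - \overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} \right)^2 + n_{G_k}^{(s)} \left( \overline{x'_j}_{G_k}^{(s)} - \overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} \right)^2 \right]$$

For the data at offset s in classes G_{k−1} and G_k, $\overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)}$ is the mean of x′_j(t), i.e.:

$$\overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} = \frac{1}{n_{G_{k-1}}^{(s)} + n_{G_k}^{(s)}} \sum_{t = i_{k-1},\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} x'_j(t)$$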

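For samples of different classes that share the same data offset, the farther they deviate from the common center, the larger the value of SSA and the more likely the classes are different; conversely, the smaller SSA is, the more likely the classes are the same.

S4022: SSA is compared against the influence of random error, i.e. the within-class deviation SSE, which is computed as:

$$SSE_{G_{k-1}, G_k} = \sum_{s=0}^{t_p - 1} \left[ \sum_{t = i_{k-1},\ \mathrm{s.t.}\ t_p \mid t - s}^{i_k - 1} \left( x'_j(t) - \overline{x'_j}_{G_{k-1}}^{(s)} \right)^2 + \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \left( x'_j(t) - \overline{x'_j}_{G_k}^{(s)} \right)^2 \right]$$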

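S4023: let SST denote the total sum of squared deviations, computed as:

$$SST_{G_{k-1}, G_k} = \sum_{s=0}^{t_p - 1} \sum_{t = i_{k-1},\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \left( x'_j(t) - \overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} \right)^2$$

Then $SST = SSA + SSE$.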

S4024: the F statistic is computed as:

$$F = \frac{SSA / f_{SSA}}{SSE / f_{SSE}}$$

where f_SSA and f_SSE are the degrees of freedom of SSA and SSE; in this embodiment f_SSA = (I − 1) × t_p and f_SSE = N − I × t_p.

I denotes the number of levels of a factor; in the present invention the data on the two sides of a change point constitute 2 levels, and N is the number of samples tested. Therefore, in this embodiment, $N = n_{G_{k-1}} + n_{G_k}$, where $n_{G_{k-1}}$ and $n_{G_k}$ are the numbers of samples in classes G_{k−1} and G_k respectively, and I = 2.

S4025: given a significance level α, the value of F_(I−1,N−I) corresponding to confidence level α, i.e. F_(I−1,N−I),α, can be found in an F table.

If F > F_(I−1,N−I),α, G_{k−1} and G_k are considered significantly different and x′_j(t) has a non-trivial change at time t; otherwise x′_j(t) is considered to have no non-trivial change at time t.
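The periodic F statistic of S402 can be sketched for one traffic dimension as follows. The code assumes that every offset within the period occurs at least once on each side of the change point and that N exceeds 2 × t_p; it uses the degrees of freedom f_SSA = (I − 1) × t_p and f_SSE = N − I × t_p stated for this embodiment with I = 2. The function name `periodic_f` and its 0-based, half-open index convention are illustrative assumptions.

```python
def periodic_f(x, a, c, b, period):
    """F statistic for classes x[a:c] (G_{k-1}) and x[c:b] (G_k), 0-based indices.

    Returns (F, f_ssa, f_sse). Assumes each offset has samples on both sides.
    """
    ssa = sse = 0.0
    for s in range(period):
        left = [x[t] for t in range(a, c) if (t - s) % period == 0]
        right = [x[t] for t in range(c, b) if (t - s) % period == 0]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        grand = (sum(left) + sum(right)) / (len(left) + len(right))
        # Between-class variation: class means against the common mean at offset s.
        ssa += len(left) * (mean_l - grand) ** 2 + len(right) * (mean_r - grand) ** 2
        # Within-class variation: samples against their own class mean at offset s.
        sse += sum((v - mean_l) ** 2 for v in left) + sum((v - mean_r) ** 2 for v in right)
    f_ssa = (2 - 1) * period                 # (I - 1) * t_p with I = 2
    f_sse = (b - a) - 2 * period             # N - I * t_p
    f_stat = (ssa / f_ssa) / (sse / f_sse) if sse > 0 else float("inf")
    return f_stat, f_ssa, f_sse


print(periodic_f([1, 2, 3, 7, 8, 9], 0, 3, 6, 1))            # clear level shift
print(periodic_f([0, 10, 1, 11, 0, 10, 1, 11], 0, 4, 8, 2))  # same pattern on both sides
```

The returned statistic is then compared with the critical value for the chosen α. If SciPy is available, `scipy.stats.f.ppf(1 - alpha, f_ssa, f_sse)` is one way to obtain such a value in place of a printed F table.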

Finally, it should be noted that the above embodiments merely illustrate the technical solution of the application and do not limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, they may still make various changes, modifications or equivalent substitutions to its specific embodiments, and all such changes, modifications or equivalent substitutions fall within the scope of the pending claims.

Claims

1. A method for detecting change points in high-dimensional traffic data in a distributed system, characterized in that the method comprises the following steps:

II. reducing the dimensionality of the high-dimensional raw traffic data;

IV. judging whether the raw traffic data of each dimension undergoes a non-trivial change at the non-trivial points;

wherein in step III, the principal components obtained in step II are clustered as the features of the high-dimensional traffic data, and the optimal split points of the ordered data samples, determined by the periodic ordered-sample clustering method, are the non-trivial points of the traffic data.

2. The method of claim 1, characterized in that step I comprises:

S101: the servers of the distributed system are provided with traffic collectors, which obtain the raw traffic data of the applications per unit time;

S102: the raw traffic data obtained from the different servers at the same time is expressed as a high-dimensional vector, and the raw traffic data of the different time points form the raw traffic matrix

$$X = \begin{bmatrix} x_1(1) & x_2(1) & \cdots & x_p(1) \\ x_1(2) & x_2(2) & \cdots & x_p(2) \\ \vdots & \vdots & \ddots & \vdots \\ x_1(n) & x_2(n) & \cdots & x_p(n) \end{bmatrix}$$

where x_j(t) is the data volume produced by the j-th application at the t-th sampling time point; row t of the matrix gives the data volume produced by all applications at the t-th time point, and column j gives the data volume produced by the j-th application at all time points;

In formula,

p denotes the total number of applications, and n denotes the total number of sampling time points.

3. The method of claim 1, characterized in that in step II, principal component analysis (PCA) is applied to the raw traffic data to determine the principal components of the standardized raw traffic.

4. The method of claim 1, characterized in that in step IV, for each dimension of the raw traffic data, a periodic homogeneity-of-variance test is performed on the traffic data on the two sides of each non-trivial point to judge whether a non-trivial change exists there: if the F statistic exceeds the F-test critical value, a non-trivial change exists; otherwise it does not.

5. The method of claim 1, characterized in that step III comprises the following steps:

S301: the principal component data η(t) comprise one or more principal components y′_k(t); b(n, m) denotes a partition of n ordered samples into m classes, b(n, m): G₁ = {i₁, i₁+1, …, i₂−1}, G₂ = {i₂, i₂+1, …, i₃−1}, …, G_m = {i_m, i_m+1, ..., n}, with split points 1 = i₁ < i₂ < ... < i_m < i_{m+1} − 1, i_{m+1} = n + 1, where i_m denotes the first sample of the m-th class;

S302: the principal component data are periodic; let s be the offset within a period and t_p the period; the mean of the samples of class G_k at offset s is:

$$\bar{\eta}_{G_k}^{(s)} = \frac{1}{n_{G_k}^{(s)}} \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \eta(t)$$

$$D(i_k, i_{k+1} - 1) = \sum_{s=0}^{t_p - 1} \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \left(\eta(t) - \bar{\eta}_{G_k}^{(s)}\right) \left(\eta(t) - \bar{\eta}_{G_k}^{(s)}\right)^{T};$$

where s.t. t_p | t − s denotes the constraint that t − s is divisible by t_p, and T denotes the transpose of a vector; the loss function is defined as:

$$L[b(n, m)] = \sum_{k=1}^{m} D(i_k, i_{k+1} - 1)$$

S304: the non-trivial points are determined by dynamic programming.

6. The method of claim 4, characterized in that step IV comprises the following steps:

S401: let H₀ denote that the traffic data produced by the application has no non-trivial change at the split time point, and H₁ that it has a non-trivial change at the split time point;

S403: given a significance level α, the critical value F_α corresponding to confidence level α is determined; if F > F_α, x′_j(t) is considered to have a non-trivial change at time point t; otherwise x′_j(t) has no non-trivial change at time point t;

the between-class variation SSA is the between-class sum of squares of the two classes of samples on the two sides of the non-trivial change point, and the within-class variation SSE is the within-class sum of squares of the samples.

7. The method of claim 5, characterized in that step S402 comprises the following steps:

S4021: the between-class variation SSA is determined by:

$$SSA_{G_{k-1}, G_k} = \sum_{s=0}^{t_p - 1} \left[ n_{G_{k-1}}^{(s)} \left( \overline{x'_j}_{G_{k-1}}^{(s)} - \overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} \right)^2 + n_{G_k}^{(s)} \left( \overline{x'_j}_{G_k}^{(s)} - \overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} \right)^2 \right],$$

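where $\overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)}$ is the mean of x′_j(t) over the samples at offset s in both classes:

$$\overline{\overline{x'_j}}_{G_{k-1}, G_k}^{(s)} = \frac{1}{n_{G_{k-1}}^{(s)} + n_{G_k}^{(s)}} \sum_{t = i_{k-1},\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} x'_j(t)$$

S4022: the within-class variation SSE is determined by:

$$SSE_{G_{k-1}, G_k} = \sum_{s=0}^{t_p - 1} \left[ \sum_{t = i_{k-1},\ \mathrm{s.t.}\ t_p \mid t - s}^{i_k - 1} \left( x'_j(t) - \overline{x'_j}_{G_{k-1}}^{(s)} \right)^2 + \sum_{t = i_k,\ \mathrm{s.t.}\ t_p \mid t - s}^{i_{k+1} - 1} \left( x'_j(t) - \overline{x'_j}_{G_k}^{(s)} \right)^2 \right]$$

where s.t. t_p | t − s denotes the constraint that t − s is divisible by t_p;

S4023: the total sum of squared deviations is $SST = SSA + SSE$;

S4024: the F statistic is determined by:

$$F = \frac{SSA / f_{SSA}}{SSE / f_{SSE}}$$

where f_SSA and f_SSE are the degrees of freedom of SSA and SSE respectively.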