CN111191742A - Sliding window length self-adaptive adjustment method for multi-source heterogeneous data stream

Info

Publication number: CN111191742A
Application number: CN202010087229.9A
Authority: CN
Inventors: 王为; 张梦君
Original assignee: Tianjin Normal University
Current assignee: Tianjin University; Tianjin Normal University
Priority date: 2020-02-11
Filing date: 2020-02-11
Publication date: 2020-05-22

Abstract

The invention discloses a sliding window length self-adaptive adjustment method for multi-source heterogeneous data streams. First, extended models of the Gaussian-input restricted Boltzmann machine for matrix input and multi-modal input are provided to realize feature extraction and feature fusion of heterogeneous data streams. Second, changes in the probability distribution of the data stream are judged through the free energy function generated by the restricted Boltzmann machine and its extended models. Finally, the Hoeffding bound is adopted to ensure that changes in the data stream between adjacent windows are detected in time. The invention measures the change of the data stream by comparing the free energy values of the data stream in adjacent windows, adaptively adjusts the length of the sliding window, and divides the data stream into data blocks of different sizes for batch processing.

Description

Sliding window length self-adaptive adjustment method for multi-source heterogeneous data stream

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a sliding window adaptive adjustment algorithm for multi-source heterogeneous data streams.

Background

With the continuous growth of heterogeneous sensor networks, large amounts of data are generated continuously. The data volume is huge, and the data are dynamic, diverse and high-dimensional. Existing work addresses such streams; for example, Li Na et al. adopt a sliding window technique in a data stream clustering algorithm and propose an incremental data stream processing method and a data stream clustering algorithm; Lughofer proposes an incremental data stream clustering algorithm that adjusts the number of clusters through split and merge strategies as the data stream keeps growing. Sliding-window-based data stream processing can relieve the problem of large data volume to a certain extent, but a fixed-size window is often adopted, which ignores the dynamic changes of the data stream even though such changes directly affect its analysis.

When the input data are homogeneous, the above methods can adjust the sliding window by analyzing how the characteristics of the data stream change. However, most data in a heterogeneous sensor network come from multiple data sources with various data types, and different data types contain different features: for example, text data contain a large number of discrete character vectors, while picture data are composed of pixel features. Therefore, sliding window adjustment for multi-source heterogeneous input data requires the data to be fused and analyzed first, and the sliding window to be adjusted afterwards.

To this end, the present invention applies adaptive adjustment of the sliding window length to multi-source heterogeneous data streams. The invention adopts a data fusion technique to fuse the heterogeneous data streams, analyzes how the data stream changes, and introduces the Hoeffding bound to detect changes in the data stream in time, thereby realizing self-adaptive adjustment of the sliding window length. In addition, fusing and analyzing the information of all modalities of the data stream effectively avoids window adjustment errors caused by the loss of information in a single modality.

Disclosure of Invention

The invention aims to detect the change in the multi-source heterogeneous data stream and divide the data stream in real time, and therefore, the invention provides a sliding window length self-adaptive adjustment algorithm for the multi-source heterogeneous data stream.

In order to achieve the purpose, the invention discloses the following technical contents:

A sliding window length self-adaptive adjustment method for multi-source heterogeneous data streams, characterized by comprising the following steps:

step 1, performing feature extraction and feature fusion on heterogeneous data streams through extended models of the Gaussian-input restricted Boltzmann machine for matrix-variate input and multi-modal input; wherein each node of the visible layer of the matrix-variate restricted Boltzmann machine follows a Gaussian distribution with variance

whose energy function can be expressed as

represents the input data of the matrix-variate visible layer,

represents the output data of the hidden layer,

is the normalized input data,

the outer product of which gives the weight matrix between the visible layer and the hidden layer,

is the bias of the visible layer,

is the bias of the hidden layer; from the matrix input, the probability of each node of the hidden layer can be calculated

and the activation output matrix of the hidden-layer nodes is obtained from this probability; this process is feature extraction. The multi-modal-input matrix-variate restricted Boltzmann machine works on a similar principle, and its hidden-layer output is obtained from two visible-layer inputs

which realizes feature fusion, wherein

are the inputs of the visible layers,

is the output of the hidden layer,

are, respectively, the connection weights between each visible layer and the hidden layer,

is the bias of the hidden layer;

step 2, constructing a free energy function of the multi-source heterogeneous data stream from the model trained in step 1, and measuring the change in the probability distribution of the data stream according to the free energy change rate between adjacent windows;

the free energy and the probability distribution of the data stream are related as follows:

wherein

represents the probability density distribution of the input data

,

represents the free energy of the input data,

is the partition function, also called the normalization constant; for a trained restricted Boltzmann machine its value remains unchanged, so the free energy of the input data reflects the probability distribution of the input data;

step 3, comparing the free energy change rate between adjacent windows with a threshold determined by the Hoeffding bound, and adjusting the length of the sliding window according to the result.

The training of the matrix-variate restricted Boltzmann machine under Gaussian input in step 1 comprises the following steps:

(1) calculating the probability of each node of the hidden layer from the matrix input

and obtaining the activation output matrix of the hidden-layer nodes from this probability;

(2) calculating the outer product of the matrix input value and the activation output value, and defining it as the positive gradient;

(3) reconstructing the activation values of the visible layer from the activation output matrix obtained in step (1)

and repeating step (1) to obtain the corresponding activation output of the hidden layer;

(4) calculating the outer product of the reconstructed visible-layer input value and the corresponding hidden-layer output matrix, and defining it as the negative gradient;

(5) updating the parameters of the model at a given learning rate according to the difference between the positive gradient and the negative gradient.
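The five steps above follow the usual one-step contrastive divergence pattern. The sketch below is a minimal illustration under stated assumptions, not the patent's implementation: it uses an ordinary vector-input Gaussian-Bernoulli restricted Boltzmann machine with unit variance rather than the matrix-variate model, updates on a single sample, and uses plain Python lists to stay dependency-free. The names `sigmoid`, `hidden_probs` and `cd1_step` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    """Activation probability of each hidden node; W[i][j] links visible i to hidden j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def cd1_step(v0, W, b, c, lr=0.01, rng=random):
    """One CD-1 update on a single sample (assumed unit-variance Gaussian visible units)."""
    n_vis, n_hid = len(b), len(c)
    # (1) hidden probabilities and sampled binary activations from the input
    p_h0 = hidden_probs(v0, W, c)
    h0 = [1.0 if rng.random() < p else 0.0 for p in p_h0]
    # (3) reconstruct the visible layer (Gaussian mean), then repeat step (1)
    v1 = [b[i] + sum(W[i][j] * h0[j] for j in range(n_hid)) for i in range(n_vis)]
    p_h1 = hidden_probs(v1, W, c)
    # (2), (4), (5): positive gradient v0[i]*p_h0[j], negative gradient v1[i]*p_h1[j],
    # and a parameter update proportional to their difference
    new_W = [[W[i][j] + lr * (v0[i] * p_h0[j] - v1[i] * p_h1[j])
              for j in range(n_hid)] for i in range(n_vis)]
    new_b = [b[i] + lr * (v0[i] - v1[i]) for i in range(n_vis)]
    new_c = [c[j] + lr * (p_h0[j] - p_h1[j]) for j in range(n_hid)]
    return new_W, new_b, new_c
```

In practice the update would be averaged over mini-batches and repeated for many epochs; the single-sample form is kept here only to make the correspondence with steps (1) to (5) visible.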

The training process of the multi-modal-input matrix-variate restricted Boltzmann machine in step 1 is similar to that of the matrix-variate restricted Boltzmann machine under Gaussian input; the only difference is that the probability distribution of its hidden layer is determined jointly by two feature inputs, namely

The free energy function in step 2 of the present invention can be obtained from the following relationship:

the free energy and the probability distribution of the input data are related as follows:

the probability distribution of the input data can be expressed as

and thus the free energy function can be expressed as

wherein

represents the input data,

represents the output data,

represents the corresponding probability distribution of the data stream,

is the partition function,

is the free energy function corresponding to the input data, and

is the energy function of the entire model.
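The formulas at this point are not reproduced in this text. As a hedged reconstruction from the standard restricted Boltzmann machine definitions (the symbols $v$, $h$, $E$, $F$ and $Z$ are assumed here and may differ from the patent's own notation), the relations described above are commonly written as:

```latex
p(v) = \frac{1}{Z}\sum_{h} e^{-E(v,h)} = \frac{e^{-F(v)}}{Z},
\qquad
F(v) = -\ln \sum_{h} e^{-E(v,h)}
\quad\Longrightarrow\quad
-\ln p(v) = F(v) + \ln Z
```

Here $Z$ sums (or, for continuous inputs, integrates) $e^{-E(v,h)}$ over all configurations. Because $Z$ is fixed once the model is trained, a shift in $F(v)$ corresponds directly to a shift in the log-probability of the input, which is the property the method relies on.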

The free energy change rate in step 2 is defined as the ratio of the difference of the free energy of the data streams in two adjacent windows to the free energy value of the previous window; when the data between adjacent windows has no obvious change, the probability distribution of the data streams is similar, and the free energy values are not greatly different, so that the free energy change rate is close to zero, otherwise, the free energy change rate is a larger value.
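The definition above can be sketched as a small helper. This is an illustration only: the function name is hypothetical, and taking absolute values (so that negative free energies still give a non-negative rate) is an assumption that the text does not state.

```python
def free_energy_change_rate(prev_energy, curr_energy):
    """Relative change of free energy between two adjacent windows.

    Assumption: absolute values are used so that the rate is non-negative
    even when the free energies themselves are negative.
    """
    if prev_energy == 0:
        raise ValueError("free energy of the previous window must be non-zero")
    return abs(curr_energy - prev_energy) / abs(prev_energy)
```

For example, free energies of -100.0 and -110.0 in two adjacent windows give a change rate of 0.1, while identical values give 0.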

The threshold for window adjustment in step 3 of the invention can be obtained from the Hoeffding bound:

according to the Hoeffding inequality, at the given confidence level, the threshold for window adjustment should be less than or equal to

wherein

represents the maximum range of variation of the free energy change rate, and

is the total number of samples.

The adaptive adjustment of the length of the sliding window in step 3 comprises the following two conditions:

(1) when the free energy change rate is less than or equal to the threshold, the data streams in adjacent windows do not differ or the change is within a negligible range, and the window is expanded to twice its original size, provided the maximum size set by the algorithm is not exceeded;

(2) when the free energy change rate is greater than the threshold, the data streams in adjacent windows differ significantly, and the window size is reduced to the minimum window size set by the algorithm.
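The two cases can be written as a single rule. The sketch below is an assumption-laden illustration: the function and parameter names are hypothetical, and capping the doubled window at the maximum size (rather than leaving it unchanged when doubling would overshoot) is one possible reading of the text.

```python
def adjust_window(size, change_rate, threshold, n_min, n_max):
    """Return the next window size given the free energy change rate."""
    if change_rate <= threshold:
        # Case (1): no significant change, so double the window, capped at n_max (assumed).
        return min(2 * size, n_max)
    # Case (2): significant change, so fall back to the minimum window size.
    return n_min
```

With a minimum of 20 and a maximum of 60, a quiet stream grows the window from 20 to 40 to 60, and any change rate above the threshold resets it to 20.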

The invention further discloses the application of the sliding window length self-adaptive adjustment algorithm for multi-source heterogeneous data streams to data blocking; the experimental results show that, compared with single-modal data, fusion analysis of multi-modal data gives a higher sliding window adjustment accuracy.

By adaptively adjusting the sliding window length, the invention partitions the multi-source heterogeneous data stream into blocks and can improve the efficiency of data stream processing in heterogeneous data networks: when the data stream becomes abnormal, the window size is reduced so that the abnormal data can be processed in time; when the data stream shows no obvious change, the window size keeps growing so that similar data can be processed together, which improves processing efficiency.

The invention is described in more detail below:

step 1, preprocessing the multi-source heterogeneous signals separately, converting picture signals into matrix data according to pixel values, and converting sound signals into vector data according to Mel-frequency cepstral coefficients;

step 2, extracting features from the heterogeneous data using a Gaussian-input restricted Boltzmann machine and a matrix-variate restricted Boltzmann machine under Gaussian input, respectively, to obtain the corresponding feature vectors;

step 3, performing feature fusion on the features extracted from the heterogeneous data using a multi-modal-input restricted Boltzmann machine;

step 4, constructing a free energy function of the heterogeneous data stream from the models obtained in step 2 and step 3, wherein the free energy function can measure changes in the probability distribution of the data stream;

step 5, calculating the free energy change rate of the data stream in adjacent windows from the free energy function, and combining it with the Hoeffding bound to realize self-adaptive adjustment of the sliding window length, thereby partitioning the data stream into blocks.

In step 2, the matrix-variate restricted Boltzmann machine under Gaussian input is an extension of the Gaussian restricted Boltzmann machine. Its input layer and hidden layer are both matrix variables, so the spatial structure of high-dimensional data is well preserved. As in a restricted Boltzmann machine, the visible layer and the hidden layer of the model are connected through a weight matrix, and obtaining the hidden-layer output from the visible-layer input is the feature extraction process. The matrix-variate restricted Boltzmann machine under Gaussian input is trained with the contrastive divergence (CD) algorithm.

In step 3, the multi-modal-input matrix-variate restricted Boltzmann machine comprises two visible layers and one hidden layer. The visible layers are connected to the hidden layer through two independent sets of weights, and obtaining the hidden-layer data from the input-layer data is the feature fusion process. As with the matrix-variate restricted Boltzmann machine, training under multi-modal input also uses the CD algorithm; the only difference is that the output value of the hidden layer is determined jointly by the two input values.
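The fusion step can be illustrated with vector inputs. The sketch below assumes a sigmoid hidden layer whose pre-activation adds the contributions of both visible layers through their own weights plus a shared hidden bias; the patent's matrix-variate formula is not reproduced in this text, so the vector form and all names (`fused_hidden_probs`, `W1`, `W2`, `c`) are hypothetical.

```python
import math


def fused_hidden_probs(v1, v2, W1, W2, c):
    """Hidden activation probabilities determined jointly by two visible inputs.

    W1[i][j] links element i of the first input to hidden node j,
    W2[i][j] does the same for the second input; c is the hidden bias.
    """
    probs = []
    for j in range(len(c)):
        activation = c[j]
        activation += sum(x * W1[i][j] for i, x in enumerate(v1))
        activation += sum(x * W2[i][j] for i, x in enumerate(v2))
        probs.append(1.0 / (1.0 + math.exp(-activation)))
    return probs
```

Because both inputs feed the same hidden nodes, evidence from one modality can raise or lower an activation that the other modality alone would have produced.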

In step 4, the free energy and the probability distribution of the input data are related as follows:

the probability distribution of the input data can be expressed as

and the free energy function can thus be expressed as

wherein

represents the input data,

represents the output data,

represents the corresponding probability distribution of the data stream,

is the partition function,

is the free energy function corresponding to the input data, and

is the energy function of the entire model. For a trained model, the value of the partition function remains unchanged, so the change in the probability distribution can be measured by the free energy value of the input data.
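As a concrete illustration, the free energy of a vector-input Gaussian-Bernoulli restricted Boltzmann machine has a closed form. The sketch below assumes unit-variance visible units and binary hidden units, which is a simplification of the matrix-variate model described here, and it averages per-sample free energies to obtain one value per window, which is also an assumption. All names are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def free_energy(v, W, b, c):
    """F(v) = 0.5 * sum_i (v_i - b_i)^2 - sum_j softplus(c_j + sum_i v_i * W[i][j])."""
    quadratic = 0.5 * sum((vi - bi) ** 2 for vi, bi in zip(v, b))
    hidden = sum(softplus(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
                 for j in range(len(c)))
    return quadratic - hidden


def window_free_energy(window, W, b, c):
    """Mean free energy of the samples in one window (averaging is an assumption)."""
    return sum(free_energy(v, W, b, c) for v in window) / len(window)
```

Inputs that the trained model considers likely receive a low free energy, so a shift in the window average signals a shift in the distribution of the stream.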

In step 5, the Hoeffding bound is obtained from the Hoeffding inequality. Hoeffding's inequality gives an upper bound on the probability that the sum of random variables deviates from its expected value. For a given confidence level, the difference between the free energy change rate and its average change rate, which can be derived from the concept of a confidence interval, satisfies the following relationship:

the threshold for sliding window adjustment should be less than or equal to

wherein

represents the maximum range of variation of the free energy change rate, and

is the total number of samples. That is, when the free energy change rate is greater than the threshold, the window size is reduced; otherwise, the window size continues to be enlarged.
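The threshold formula itself is not reproduced in this text. The sketch below uses the standard one-sided Hoeffding bound, epsilon = sqrt(R^2 * ln(1/delta) / (2n)), where R is the range of the variable, n the number of samples and delta one minus the confidence level. Whether the patent uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def hoeffding_threshold(value_range, n_samples, confidence=0.95):
    """Standard Hoeffding bound; assumed, not confirmed, to match the patent's formula."""
    if n_samples <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need n_samples > 0 and 0 < confidence < 1")
    delta = 1.0 - confidence
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n_samples))
```

The bound shrinks as more samples are observed, so the test for a change becomes stricter as evidence accumulates, and it widens when the confidence level is raised.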

Drawings

FIG. 1 is a schematic diagram of heterogeneous data feature extraction and feature fusion;

FIG. 2 is a flowchart of sliding window length adaptive adjustment;

FIG. 3 is an overall process diagram of the algorithm;

FIGS. 4 (a) and (b) show the free energy variation of the data stream and the adjustment of the sliding window length under voice input, respectively;

FIGS. 5 (a) and (b) show the free energy variation of the data stream and the adjustment of the sliding window length under image input, respectively;

FIGS. 6 (a) and (b) show the free energy variation of the data stream and the adjustment of the sliding window length under voice-image input, respectively;

FIG. 7 is a comparison of the sliding window length adaptive adjustment accuracy obtained in five independent experiments.

Detailed Description

The invention is described below by means of specific embodiments. Unless otherwise specified, the technical means used in the present invention are well known to those skilled in the art. In addition, the embodiments should be considered illustrative, and not restrictive, of the scope of the invention, which is defined solely by the claims. It will be apparent to those skilled in the art that various changes and modifications can be made in these embodiments without departing from the spirit and scope of the invention. The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Example 1

FIG. 1 is a schematic diagram of feature extraction and feature fusion for heterogeneous data streams, taking two-dimensional picture data and one-dimensional voice data as an example. The specific steps are as follows:

step 1, training a matrix-variate restricted Boltzmann machine under Gaussian input, and performing feature extraction on the matrix data with the trained model;

step 2, training a Gaussian-input restricted Boltzmann machine, and performing feature extraction on the vector data with the trained model;

step 3, training a matrix-variate restricted Boltzmann machine under multi-modal input, and performing feature fusion with the feature vectors obtained in step 1 and step 2 as input.

In step 1, each node of the visible layer of the matrix-variate restricted Boltzmann machine follows a Gaussian distribution with variance

whose energy function can be expressed as

wherein

is the bias of the visible layer,

is the bias of the hidden layer.

The specific training process comprises the following steps:

(4) calculating the outer product of the reconstructed visible-layer input value and the corresponding hidden-layer output matrix, and defining it as the negative gradient;

(5) updating the parameters of the model at a given learning rate according to the difference between the positive gradient and the negative gradient.

In step 3, the training process of the multi-modal-input matrix-variate restricted Boltzmann machine is similar to step 1, except that the probability distribution of its hidden layer is determined jointly by two feature inputs, namely

wherein

is the bias of the hidden layer.

Fig. 2 is a flowchart of adaptive adjustment of the length of the sliding window, which specifically includes the following steps:

step 1, setting the maximum window size Nmax and the minimum window size Nmin in the algorithm, and setting the current window size to N;

step 2, calculating the free energy value of the data in the previous window from the free energy function, and denoting it energy 1;

step 3, calculating the free energy value of the data in the next window from the free energy function, and denoting it energy 2;

step 4, calculating the free energy change rate of the current window relative to the previous window from the values obtained in step 2 and step 3, and comparing the free energy change rate with a threshold, wherein the change rate is the ratio of the difference between the free energy values of adjacent windows to the free energy value of the previous window, and the threshold is obtained from the Hoeffding bound;

step 5, adjusting the window, which mainly involves two cases:

(1) when the free energy change rate is less than or equal to the threshold, the data streams in adjacent windows do not differ or the change is within a negligible range, and the window size should be enlarged; specifically, the window size is

(2) when the free energy change rate is greater than the threshold, the data streams in adjacent windows differ significantly, and the window size should be reduced to its minimum size Nmin.
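The flow of FIG. 2 can be sketched end to end as a loop that cuts a stream into blocks. This is an illustration under several assumptions: the free energy of a window is supplied by a caller-provided function `energy_fn`, the threshold is passed in as a fixed number, the first window starts at the minimum size, and enlargement doubles the window up to Nmax. All names are hypothetical.

```python
def segment_stream(samples, energy_fn, threshold, n_min, n_max):
    """Split `samples` into consecutive blocks whose length adapts to the stream."""
    blocks = []
    size = n_min            # assumed starting size
    start = 0
    prev_energy = None      # "energy 1" in the steps above
    while start < len(samples):
        block = samples[start:start + size]
        energy = energy_fn(block)          # "energy 2"
        if prev_energy is not None and prev_energy != 0:
            rate = abs(energy - prev_energy) / abs(prev_energy)
            if rate <= threshold:
                size = min(2 * size, n_max)   # case (1): enlarge the next window
            else:
                size = n_min                  # case (2): shrink the next window
        blocks.append(block)
        prev_energy = energy
        start += len(block)
    return blocks
```

As a usage example, with `energy_fn` set to the block mean, `n_min=2`, `n_max=8` and a threshold of 0.1, a stream of six values of 1.0 followed by six values of 10.0 is cut into blocks of lengths 2, 2, 4, 2, 2: the window grows while the stream is stable and drops back to the minimum once the block straddling the jump is seen.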

FIG. 3 is an overall process diagram of the algorithm. As shown in FIG. 3, for a multi-source heterogeneous data stream in a heterogeneous sensor network, part of the stream is first taken as training data, and the matrix-variate restricted Boltzmann machine under Gaussian input and the multi-modal-input matrix-variate restricted Boltzmann machine are trained according to the process shown in FIG. 1, which completes feature extraction and feature fusion of the heterogeneous data stream; the free energy function is then constructed from the trained models, and self-adaptive adjustment of the sliding window length is carried out according to the algorithm shown in FIG. 2.

Example 2

The testing of the algorithm on the CUAVE data set is given below:

The CUAVE data set consists of recordings of 36 speakers, each reading the ten digits 0 to 9 several times, and contains two kinds of heterogeneous data: images and speech. The picture data can be converted into a 75 × 50 matrix input according to the pixel values, and the sound data can be converted into 1 × 534 vector data according to Mel-frequency cepstral coefficients. In the experiment, the hidden-layer sizes of the Gaussian-input restricted Boltzmann machine and the Gaussian-input matrix-variate restricted Boltzmann machine are set to

The maximum and minimum sizes of the sliding window were set to 60 and 20, respectively, and the experiment was performed at a 95% confidence level. FIGS. 4, 5, 6 and 7 show the experimental results.

FIGS. 4, 5 and 6, panels (a) and (b), show the free energy change rate of the current window relative to the previous window and the size of the sliding window under voice input, image input and voice-image input, respectively. The free energy change rate between adjacent windows in panel (a) is compared with the threshold: when the change rate does not exceed the threshold, the window size in panel (b) is increased, up to the maximum window size set by the algorithm; when the change rate exceeds the threshold, the window size in panel (b) is reduced to the minimum size. In the figures, black vertical lines mark the points where the data stream changes.

The comparison shows that under voice input the overall change of the data stream is very small, so anomalies in the data stream cannot be detected in time and the sliding window length is adjusted incorrectly because adjustments are missed. Under image input the data stream fluctuates strongly: when the data stream becomes abnormal, the free energy change rate rises sharply and the sliding window is reduced once the rate exceeds the threshold, but misadjustments can also occur when the data have not changed, owing to external factors and the like. Under voice-image input the sliding window can still be misadjusted, but the accuracy is higher than under voice input or image input alone, because the free energy change rate under multi-modal input is determined jointly by the information in the voice data and the image data; this both compensates for the missed adjustments under voice input and avoids the misadjustments under image-only input.

FIG. 7 compares the accuracy of sliding window size adaptive adjustment under voice input, image input and voice-image input on five different sets of data. As FIG. 7 shows, under multi-modal input, fusing heterogeneous data from different data sources before analysis improves the accuracy of sliding window length adaptive adjustment.

Claims

1. A sliding window length self-adaptive adjustment method for multi-source heterogeneous data streams, characterized by comprising the following steps:

whose energy function can be expressed as

represents the input data of the matrix-variate visible layer,

represents the output data of the hidden layer,

is the normalized input data,

is the bias of the visible layer,

which realizes feature fusion, wherein

are the inputs of the visible layers,

is the output of the hidden layer,

is the bias of the hidden layer;

wherein

represents the probability density distribution of the input data,

represents the free energy of the input data,

is the partition function, also called the normalization constant; for a trained restricted Boltzmann machine,

the free energy of the input data reflects the probability distribution of the input data;

2. The method of claim 1, wherein the training of the matrix-variate restricted Boltzmann machine under Gaussian input in step 1 comprises the following steps:

and obtaining the activation output matrix of the hidden-layer nodes from this probability;

(3) reconstructing the activation values of the visible layer from the activation output matrix obtained in step (1)

3. The method of claim 1, wherein the training process of the multi-modal-input matrix-variate restricted Boltzmann machine in step 1 is similar to the training process of the matrix-variate restricted Boltzmann machine under Gaussian input, except that the probability distribution of its hidden layer is determined jointly by two feature inputs, namely

.

4. The method of claim 1, wherein the free energy function in step 2 is derived from the following relationship:

the probability distribution of the input data can be expressed as

and thus the free energy function can be expressed as

wherein

represent the input data and the output data, respectively,

represents the corresponding probability distribution of the data stream,

is the partition function,

is the free energy function corresponding to the input data, and

is the energy function of the entire model.

5. The method of claim 1, wherein the free energy change rate in step 2 is defined as a ratio of a difference between free energy of data streams in two adjacent windows to a free energy value of a previous window; when the data between adjacent windows has no obvious change, the probability distribution of the data streams is similar, and the free energy values are not greatly different, so that the free energy change rate is close to zero, otherwise, the free energy change rate is a larger value.

6. The method of claim 1, wherein the threshold for window adjustment in step 3 is obtained from the Hoeffding bound:

according to the Hoeffding inequality, at the given confidence level, the threshold for window adjustment should be less than or equal to

wherein

represents the maximum range of variation of the free energy change rate, and

is the total number of samples.

7. The method of claim 1, wherein the adaptive adjustment of the length of the sliding window in step 3 comprises the following two cases:

8. The method of claim 1, wherein, by fusing the multi-source heterogeneous data, the information contained in each data source is fully taken into account, giving a higher adjustment accuracy than with a single data source.