CN109299185B - Analysis method for convolutional neural network extraction features aiming at time sequence flow data - Google Patents

Analysis method for convolutional neural network extraction features aiming at time sequence flow data

Info

Publication number
CN109299185B
Authority
CN
China
Prior art keywords
data
neural network
dimension
convolutional neural
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811216349.3A
Other languages
Chinese (zh)
Other versions
CN109299185A (en)
Inventor
周同明
汪卫
邢宏岩
刁广州
杨勇
秦嘉岷
姜军
王旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shipbuilding Technology Research Institute
Original Assignee
Shanghai Shipbuilding Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shipbuilding Technology Research Institute filed Critical Shanghai Shipbuilding Technology Research Institute
Priority to CN201811216349.3A priority Critical patent/CN109299185B/en
Publication of CN109299185A publication Critical patent/CN109299185A/en
Application granted granted Critical
Publication of CN109299185B publication Critical patent/CN109299185B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an analysis method for extracting features from time-series stream data with a convolutional neural network. The stream data are first preprocessed (data cleaning, data integration, data transformation, data merging and data reshaping) to ensure the accuracy of the subsequent stream data analysis. The stream data are then sampled, usually with a decaying window, to generate analysis samples. The data characteristics and the relations among the different dimensions are examined carefully; if the dimensions have little or no correlation, a dimension-wise convolutional neural network structure is tried for mining and analysis, which both preserves the temporal characteristics of the stream data and discovers combined features across dimensions. The method helps to find a better way of applying convolutional neural networks to stream data.

Description

Analysis method for convolutional neural network extraction features aiming at time sequence flow data
Technical Field
The invention relates to the field of stream data analysis, in particular to an analysis method for extracting characteristics of a convolutional neural network aiming at time sequence stream data.
Background
Vast amounts of data are produced every moment, so the volume of data grows explosively. This explosive growth of data streams, together with large and stable data storage and the availability of data applications in industry, provides abundant raw material for the era of artificial intelligence.
Precisely because we are in an era of data explosion, powerful data processing and analysis tools are urgently needed, so that information previously overlooked, valuable insights, and even knowledge relevant to human life can be found in massive time-series stream data.
Disclosure of Invention
The invention provides an analysis method for extracting features from time-series stream data with a convolutional neural network, aiming to solve the technical problems that conventional models and methods struggle to extract implicit features effectively, and that it is difficult for a deep-learning convolutional neural network model to take temporal characteristics and dimensional characteristics into account at the same time.
In order to solve the technical problems, the invention provides the following technical scheme:
the invention provides an analysis method for convolutional neural network extraction characteristics aiming at time sequence flow data, which comprises the following steps:
s1: preprocessing stream data;
s2, selecting a sample by an attenuation window method;
s3, designing and building a convolutional neural network model architecture;
s4, extracting features by dimensionality by adopting a convolution model;
s5, displaying and comparing the effect graphs generated by the deep learning logs;
s6, visualizing a deep learning effect graph;
aiming at time sequence characteristics and dimension information characteristics in stream data, a dimension-based convolutional neural network model is set up and adopted, strong characteristics and strong rules contained in basic information in the data are extracted, and time sequence characteristics of the stream data are considered; after the feature extraction and the reinforcement of the multidimensional data, a model which comprises the time sequence feature and the dimension feature is synthesized.
In the step S1, stream data preprocessing is carried out according to the characteristics of the stream data, including identification of key data information and of redundant attributes; the factors that have the greatest influence on the result are screened out manually, the stream data of all these factors are preprocessed, and abnormal items, missing items, redundant items and divergent items of the historical data are handled by the preprocessing means of data cleaning, data integration, data transformation, data merging and data reshaping; the preprocessed sample data are then examined carefully, the screened important information is described numerically, and the dimensions used to screen the target features are established manually.
As a preferred technical solution of the present invention, in the step S1 the ways of preprocessing abnormal data include the following (a minimal code sketch of these conventions is given after this list):
data missing: missing records are filtered out, which raises the overall level of the data while reducing the data volume;
data abnormality: preprocessing is performed by deleting the data, by comprehensive analysis and substitution in combination with the overall model, or by treating the abnormal value as a missing value and filling it equivalently, so that the deviation between the processed abnormal value and the other values is minimized;
data redundancy: if two attributes of the data are strongly correlated, the less important of the two is removed;
data normalization: the data of the different dimensions do not lie in a uniform range, and in cross-dimension calculation the weights would otherwise swing up and down too much to allow convenient adjustment and calculation;
labels and timestamps: for supervised learning on the classification problem, the data set is labeled and each record is also given a timestamp.
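The following is a minimal, illustrative pandas sketch of the preprocessing conventions listed above; the DataFrame layout, the "label" column name and the correlation threshold are assumptions made purely for illustration and are not prescribed by the patent.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    """Illustrative cleaning pipeline: missing items, abnormal items,
    redundant attributes, normalization; the 'label' column and the
    timestamp index are kept untouched for supervised learning."""
    numeric = [c for c in df.select_dtypes("number").columns if c != "label"]

    # Missing items: interpolate along the time axis, then drop what remains.
    df[numeric] = df[numeric].interpolate(method="linear")
    df = df.dropna(subset=numeric)

    # Abnormal items: clip each column to mean +/- 3 std so the processed
    # outliers deviate as little as possible from the other values.
    for col in numeric:
        mu, sigma = df[col].mean(), df[col].std()
        df[col] = df[col].clip(mu - 3 * sigma, mu + 3 * sigma)

    # Redundant items: if two attributes are strongly correlated, drop the
    # second (less important) one of the pair.
    corr = df[numeric].corr().abs()
    drop = {b for i, a in enumerate(numeric) for b in numeric[i + 1:]
            if corr.loc[a, b] > corr_threshold}
    df = df.drop(columns=sorted(drop))
    kept = [c for c in numeric if c not in drop]

    # Normalization: bring every dimension into a uniform range (z-score).
    df[kept] = (df[kept] - df[kept].mean()) / df[kept].std()
    return df
```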
As a preferred technical solution of the present invention, the step S2 specifically includes: acquiring a streaming data sample, filtering the streaming data and acquiring the streaming data;
in the acquisition of a stream data sample, in the general sampling problem, the stream consists of tuples of n fields, and a subset of those fields is called the key fields; assuming that the sample size after sampling is a/b of the stream, the key value of each tuple is hashed into one of b buckets, and the tuples whose hash value is less than a are put into the sample; if there is more than one key field, the hash function combines the values of these fields to form a single hash value; the finally obtained sample consists of all tuples with certain specific key values, and the ratio of the number of selected key values to the total number of key values in the stream is a/b;
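A minimal sketch of this hash-based key sampling, assuming a hypothetical stream of dictionaries and a "user_id" key field chosen purely for illustration:

```python
import hashlib

def keep_tuple(record: dict, key_fields: tuple, a: int, b: int) -> bool:
    """Keep a stream tuple iff its key hashes into one of the first a of b
    buckets, so roughly a/b of all key values end up in the sample."""
    key = "|".join(str(record[f]) for f in key_fields)   # combine the key fields
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % b
    return bucket < a

# Usage sketch: a hypothetical stream keyed on "user_id", sampled at 1/10.
stream = [{"user_id": i % 25, "price": 10.0 + i} for i in range(100)]
sample = [r for r in stream if keep_tuple(r, ("user_id",), a=1, b=10)]
```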
in the stream data filtering, a Bloom filter is adopted; the Bloom filter comprises an array of n bits, each initially 0, and a collection of hash functions h1, h2, ..., hk, each of which maps a key value to one of the n buckets, together with a set S of m key values; the Bloom filter allows all stream elements whose key value is in S to pass through and blocks most stream elements whose key value is not in S;
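A minimal Bloom filter sketch along these lines; the array size, the number of hash functions and the use of MD5 are illustrative assumptions, not values taken from the patent:

```python
import hashlib

class BloomFilter:
    """n-bit array, all bits initially 0, with k hash functions; add() sets
    bits for keys in S, query() passes every key in S and blocks most keys
    outside S (false positives are possible, false negatives are not)."""
    def __init__(self, n: int = 1024, k: int = 3):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _hashes(self, key: str):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key: str) -> None:
        for pos in self._hashes(key):
            self.bits[pos] = 1

    def query(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._hashes(key))
```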
the stream data are obtained with a decaying-window method, which extracts the stream data and computes a smooth aggregate value; the weights used decay continuously, which is called an exponentially decaying window and is written as

$$\sum_{i=1}^{t} a_i (1 - c)^{t-i}$$

where a_1 is the first arriving element, a_t is the current element, and c is taken as 10^{-9}.
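A minimal sketch of maintaining this exponentially decaying aggregate incrementally; the incremental update multiplies the previous aggregate by (1 − c) and adds the newly arrived element, which matches the window formula above:

```python
class DecayingWindow:
    """Exponentially decaying window: keeps the smooth running aggregate
    sum_i a_i * (1 - c)^(t - i); each new element multiplies the previous
    aggregate by (1 - c) and then adds the current value a_t."""
    def __init__(self, c: float = 1e-9):
        self.c = c
        self.value = 0.0

    def update(self, a_t: float) -> float:
        self.value = self.value * (1.0 - self.c) + a_t
        return self.value

# Usage sketch on a toy stream of values.
window = DecayingWindow(c=1e-9)
for x in [1.0, 2.0, 3.0]:
    print(window.update(x))
```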
As a preferred technical solution of the present invention, in the step S3, the input data are fed into the model. The first layer of the model is a convolutional layer whose input is the filtered stream data; unlike a traditional fully connected layer, the input of each node in the convolutional layer is only a small block of the previous layer of the neural network. The convolutional layer analyses each small block more deeply so as to obtain features with a higher degree of abstraction; the node matrix becomes deeper after passing through the convolutional layer, i.e. its depth increases. The second layer is a pooling layer; the pooling layer does not change the depth of the matrix but reduces its size. The pooling operation can be thought of as converting a high-resolution picture into a low-resolution one: the amount of data is reduced while the data characteristics are retained. Through the pooling layer, the number of nodes in the final fully connected layer can be further reduced, and thus the number of parameters in the whole neural network is reduced. After the processing of the convolutional and pooling layers, one or two fully connected layers at the end of the convolutional neural network give the final classification result. After several rounds of convolution and pooling, the information in the data has been abstracted into features with higher information content; the convolutional and pooling layers perform automatic feature extraction, and once feature extraction is complete the classification task is completed with the fully connected layers.
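A minimal sketch of such a convolution–pooling–fully-connected stack for windowed stream data; the layer sizes, channel counts and the use of PyTorch are illustrative assumptions and not the architecture prescribed by the patent:

```python
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """Convolution -> pooling -> two fully connected layers, for a window of
    T timesteps with D dimensions laid out as (batch, D, T)."""
    def __init__(self, dims: int = 8, window: int = 90, classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(dims, 32, kernel_size=3, padding=1),  # deepens the node matrix
            nn.ReLU(),
            nn.MaxPool1d(2),                                # halves the length, keeps the depth
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (window // 2), 64),
            nn.ReLU(),
            nn.Linear(64, classes),                         # final classification result
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Usage sketch: a batch of 4 windows, 8 dimensions, 90 timesteps.
logits = StreamCNN()(torch.randn(4, 8, 90))
```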
As a preferred technical solution of the present invention, in the step S4 an independent convolution is performed for each dimension: the features of each dimension are extracted separately, in the form of per-dimension feature extraction, and reinforced separately; finally the strongest features of all dimensions are integrated and combined to decide the final classification result.
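A minimal sketch of the dimension-wise idea: one independent convolution branch per input dimension, with the strongest response of each branch kept and then combined for the final classification. The branch width and the global-max reduction are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class PerDimensionCNN(nn.Module):
    """Each input dimension gets its own 1-D convolution branch; the strongest
    feature of each branch is concatenated and a fully connected layer makes
    the joint classification decision."""
    def __init__(self, dims: int = 8, classes: int = 2, channels: int = 16):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(1, channels, kernel_size=3, padding=1) for _ in range(dims)]
        )
        self.head = nn.Linear(dims * channels, classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dims, T)
        feats = []
        for d, conv in enumerate(self.branches):
            h = torch.relu(conv(x[:, d:d + 1, :]))   # convolve one dimension alone
            feats.append(h.max(dim=-1).values)        # keep its strongest response
        return self.head(torch.cat(feats, dim=-1))    # combine across dimensions

# Usage sketch: same (batch, dims, timesteps) layout as above.
logits = PerDimensionCNN()(torch.randn(4, 8, 90))
```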
As a preferred embodiment of the present invention, in step S5, after deep learning has been performed, the accuracy of the classification or prediction is calculated, and a log and an accuracy result are generated for improving and modifying the model and for debugging.
The invention has the following beneficial effects: in data processing and feature extraction, the invention extracts the strong features and strong rules contained in the basic information while also taking into account the inherent temporal characteristics of the stream data; it realizes a method for applying a convolutional neural network to time-series stream data that extracts the strong features and strong rules in the basic information and remains compatible with the inherent temporal characteristics of the stream data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
FIG. 1 is an overall flow chart of the multidimensional model construction of the present invention;
FIG. 2 is a diagram of the convolutional neural network structure of the present invention;
FIG. 3 is a block diagram of the multidimensional convolutional neural network of the present invention;
FIG. 4 is a flow diagram of the present invention's multidimensional convolutional neural network structure;
FIG. 5 is an example of data samples of the present invention, exemplified by financial time series flow data;
FIG. 6 is an exemplary sampling method for streaming data filtering, as exemplified by financial time series streaming data, in accordance with the present invention;
FIG. 7 illustrates a general neural network convolution approach to the financial time series flow data of the present invention;
FIG. 8 illustrates a multidimensional neural network convolution approach, such as financial time-series flow data, in accordance with the present invention;
FIG. 9 is a graph of the accuracy effect of the present invention on a multidimensional convolutional neural network architecture, exemplified by financial time-series flow data;
FIG. 10 is a graph illustrating the training effect of the present invention on a multidimensional convolutional neural network structure, which is illustrated by financial time series flow data;
FIG. 11 is a parameter comparison example of various model algorithms of the present invention, using financial time series flow data as an example.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it should be understood that they are presented herein only to illustrate and explain the present invention and not to limit the present invention.
Embodiment: as shown in FIGS. 1 to 11, the present invention provides an analysis method for extracting features of a convolutional neural network for time-series stream data, comprising the following steps:
s1: preprocessing stream data;
s2, selecting a sample by an attenuation window method;
s3, designing and building a convolutional neural network model architecture;
s4, extracting features by dimensionality by adopting a convolution model;
s5, displaying and comparing the effect graphs generated by the deep learning logs;
s6, visualizing a deep learning effect graph;
aiming at the time sequence characteristics and the dimension information characteristics in the stream data, a dimension-based convolutional neural network model is built and adopted to extract the strong characteristics and the strong rules contained in the basic information in the data, and the self-contained time sequence characteristics of the stream data are considered; after the feature extraction and the reinforcement of the multidimensional data, a model which comprises the time sequence feature and the dimension feature is synthesized.
In the step S1, stream data preprocessing is carried out according to the characteristics of the stream data, including identification of key data information and of redundant attributes; the factors that have the greatest influence on the result are screened out manually, the stream data of all these factors are preprocessed, and abnormal items, missing items, redundant items and divergent items of the historical data are handled by the preprocessing means of data cleaning, data integration, data transformation, data merging and data reshaping; the preprocessed sample data are then examined carefully, the screened important information is described numerically, and the dimensions used to screen the target features are established manually.
Further, in step S1, the ways of preprocessing abnormal data include:
data missing: missing records are filtered out, which raises the overall level of the data while reducing the data volume;
data abnormality: preprocessing is performed by deleting the data, by comprehensive analysis and substitution in combination with the overall model, or by treating the abnormal value as a missing value and filling it equivalently, so that the deviation between the processed abnormal value and the other values is minimized;
data redundancy: if two attributes of the data are strongly correlated, the less important of the two is removed;
data normalization: the data of the different dimensions do not lie in a uniform range, and in cross-dimension calculation the weights would otherwise swing up and down too much to allow convenient adjustment and calculation;
labels and timestamps: for supervised learning on the classification problem, the data set is labeled and each record is also given a timestamp.
Further, the specific process in step S2 is as follows: acquiring a streaming data sample, filtering the streaming data and acquiring the streaming data;
in the acquisition of a stream data sample, in a general sampling problem, the stream data consists of a series of n field tuples, and a subset of the fields is called as key fields; assuming that the sample size after sampling is a/b, hashing a key value of each tuple to one of b buckets, and then putting the tuple of which the hash value is smaller than a into a sample; if there is more than one key field, the hash function combines the values of these fields to form a single hash value; the finally obtained sample consists of all tuples of certain specific key values; the ratio of the number of the selected key values to the total number of the key values in the stream is a/b;
in the stream data filtering, a Bloom filter is adopted; the Bloom filter comprises an array of n bits, each initially 0, and a collection of hash functions h1, h2, ..., hk, each of which maps a key value to one of the n buckets, together with a set S of m key values; the Bloom filter allows all stream elements whose key value is in S to pass through and blocks most stream elements whose key value is not in S;
the stream data are obtained with a decaying-window method, which extracts the stream data and computes a smooth aggregate value; the weights used decay continuously, which is called an exponentially decaying window and is written as

$$\sum_{i=1}^{t} a_i (1 - c)^{t-i}$$

where a_1 is the first arriving element, a_t is the current element, and c is a very small constant, for example 10^{-9}.
Further, in the step S3, input data is input into the model, the first layer of the model is a convolutional layer, and the input of this layer is filtered stream data, unlike the conventional fully-connected layer, the input of each node in the convolutional layer is only a small block of the neural network in the previous layer; each small block in the neural network is analyzed more deeply by the convolutional layer so as to obtain the characteristic with higher abstraction degree; the node matrix processed by the convolutional layer becomes deeper, and the depth of the node matrix after the convolutional layer is increased; the second layer is a pooling layer, and the neural network of the pooling layer does not change the depth of the matrix but reduces the size of the matrix; the pooling operation is to convert a high-resolution picture into a low-resolution picture (the data size is reduced but the data characteristics are still kept); through the pooling layer, the number of nodes in the last full-connection layer can be further reduced, so that the number of parameters in the whole neural network is reduced; after processing of the convolutional layers and the pooling layers, giving a final classification result by 1 to 2 fully-connected layers at the end of the convolutional neural network; after several rounds of processing of convolutional and pooling layers, the information in the data has been abstracted into more information-rich features; the convolutional layer and the pooling layer are processes for automatically extracting features, and after the feature extraction is completed, a classification task is completed by using a fully connected layer.
Further, in the step S4, for each dimension, independent convolution is performed; and (3) respectively extracting the feature of each dimension by independently extracting the form of the dimension feature of each dimension, respectively strengthening the feature of each dimension, finally integrating the strongest features of each dimension, and judging the final classification result by combination.
Further, in step S5, after the deep learning is performed, the accuracy of classification or prediction after the deep learning is calculated, and a log and an accuracy result are generated for improving, modifying the model and debugging.
Specifically, the method comprises the following steps: in step S1, the stream data are preprocessed, the preprocessing covering data loss, data abnormality, data redundancy, data normalization, labeling and time-stamping.
In the step S2, the stream data samples from step S1 are processed mainly with the attenuation (decaying) window method; the main processes are stream data acquisition, stream data filtering and the decaying-window operation, and the dimensions used to screen the target features are established manually according to the target data set and the characteristics of the project. The decaying-window method is chosen for sample selection because, in time-series stream data, recently generated stream data influence the current stream data and the data of the near future, the influence factor being determined by the actual situation and the actual data. Typically there is an important correlation between adjacently generated stream data, whereas stream data far apart have a much smaller correlation with the data being generated at the current moment. Therefore, stream data filtering and the decaying-window operation are applied during sample acquisition. Taking financial time-series prediction as an example, the experiments used a contiguous window of 90 time steps, i.e. the past 270 minutes of data (about one trading day), as the basis for predicting the price three minutes into the future.
The meaning of the decaying window can be illustrated with the exponential moving average (EMA) used in financial time-series data: the weighting factor of each day's price decreases in an exponential, geometric proportion. The closer a time is to the present moment, the greater its weight, which means the EMA gives more weight to recent prices and reflects recent price fluctuations more promptly. The exponential moving average is therefore more informative than the simple moving average.
Similarly, in comparable time-series data, the closer a record is to the current moment, the greater the weight assigned to it; that is, different weights can be assigned to past data according to how recently they were generated, the weight of recent data is strengthened, and recent value fluctuations are reflected more promptly.
In an application experiment on financial time-series stream data, every K-line (candlestick) record is first time-stamped; the 240 minutes of each trading day are then divided chronologically into 80 buckets, and the transaction stream data are hashed into these 80 "buckets". A Bloom filter is used whose bit array covers the 80 time positions; as described above, each hash function maps a "key" value (the time of a data sample) into the set S of buckets corresponding to the daily trading period. The Bloom filter passes the stream elements whose data sample lies in S and blocks the stream elements whose key value is not in S. This filters out part of the data near the opening and closing times and avoids possible bias caused by inactive market liquidity. For data sources in which the moving average can show the short-term trend and the price-volume indicators, the decaying-window operation is introduced: because in time-series stream data the closer a record is to the current moment, the greater its weight, different weights can be assigned to recently generated data, the weight of recent stream data is strengthened, and recent value fluctuations are reflected more promptly. Accordingly, the exponential moving average (EMA) is used on the financial time-series stream data; the EMA effectively reduces the weight of data far from the current moment and increases the weight of recent, influential factors, allowing the machine-learning model to better learn the underlying regularities.
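As an illustration of the EMA mentioned above, a minimal sketch with an assumed smoothing span and toy prices (neither value is taken from the patent):

```python
def exponential_moving_average(prices, span: int = 12):
    """EMA over a price series: weights decay geometrically, so recent prices
    dominate and the curve reacts faster than a simple moving average."""
    alpha = 2.0 / (span + 1)            # standard EMA smoothing factor
    ema, out = prices[0], [prices[0]]
    for p in prices[1:]:
        ema = alpha * p + (1 - alpha) * ema
        out.append(ema)
    return out

# Usage sketch on a toy price series.
print(exponential_moving_average([10.0, 10.5, 10.2, 10.8, 11.0]))
```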
In step S3, the convolutional neural network is a variant of the multilayer perceptron and was the first learning algorithm to successfully train a multilayer neural network structure in the true sense. The weight-sharing network structure of the convolutional neural network makes it closer to a biological neural network, greatly reduces the complexity of the model, reduces the number of weights and simplifies the computation. Convolution is a mathematical operation on two real-valued functions, written s(t) = (x * w)(t), where x is the input, w is the kernel (which must be a valid probability density function, otherwise the output is no longer a weighted average), the output is sometimes called the feature map, and t is the time axis. In discrete form:

$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$$

In machine learning, the input is a multidimensional array of data and the kernel is a multidimensional array of parameters optimized by the learning algorithm. Convolution is often performed over several dimensions at once, for example in two dimensions:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)$$
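A short numeric check of the one- and two-dimensional discrete convolutions above, using NumPy and SciPy reference implementations; the toy input and kernel values are illustrative only:

```python
import numpy as np
from scipy.signal import convolve2d

# One-dimensional discrete convolution s(t) = sum_a x(a) * w(t - a).
x = np.array([1.0, 2.0, 3.0, 4.0])       # input
w = np.array([0.25, 0.5, 0.25])           # kernel (a valid probability density)
s = np.convolve(x, w, mode="same")        # feature map, same length as x
print(s)                                   # [1.0, 2.0, 3.0, 2.75] -> weighted averages

# Two-dimensional case S(i, j) = sum_m sum_n I(m, n) * K(i - m, j - n).
I = np.arange(9, dtype=float).reshape(3, 3)
K = np.ones((2, 2)) / 4.0
print(convolve2d(I, K, mode="valid"))
```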
In step S4, an individual convolution is performed for each dimension. The features of each dimension are extracted separately, in the form of per-dimension feature extraction, and reinforced separately; finally the strongest features of all dimensions are integrated and combined to decide the final classification result. During data processing, every input column vector is convolved independently in its own dimension. In this way the temporal characteristics of each single dimension are preserved, the dimensional characteristics can still be captured, and the most important features of every dimension are finally integrated. In step S5, after deep learning has been performed, the accuracy of the classification or prediction is calculated, and a log and an accuracy result are generated for improving and modifying the model and for debugging; at the same time, if optimization is needed, step S6 is carried out; otherwise, step S3 is repeated.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that various changes, modifications and substitutions can be made without departing from the spirit and scope of the invention as defined by the appended claims. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An analysis method for extracting features of a convolutional neural network aiming at time sequence flow data is characterized by comprising the following steps of:
s1: preprocessing stream data;
s2, selecting a sample by an attenuation window method;
s3, designing and building a convolutional neural network model architecture;
s4, extracting features by dimensionality by adopting a convolution model;
s5, displaying and comparing the effect graphs generated by the deep learning logs;
s6, visualizing a deep learning effect graph;
aiming at the time sequence characteristics and the dimension information characteristics in the stream data, a dimension-based convolutional neural network model is built and adopted to extract the strong characteristics and the strong rules contained in the basic information in the data, and the self-contained time sequence characteristics of the stream data are considered; after the feature extraction and the reinforcement of the multidimensional data, a model which comprises a time sequence feature and a dimension feature is synthesized;
the specific process of the step S2 is as follows: acquiring a streaming data sample, filtering the streaming data and acquiring the streaming data;
in the acquisition of a stream data sample, in a general sampling problem, the stream data consists of a series of n field tuples, and a subset of the fields is called as key fields; assuming that the sample size after sampling is a/b, hashing a key value of each tuple to one of b buckets, and then putting the tuple of which the hash value is smaller than a into a sample; if there is more than one key field, the hash function combines the values of these fields to form a single hash value; the finally obtained sample is composed of all tuples of certain specific key values; the ratio of the number of the selected key values to the total number of the key values in the stream is a/b;
in the stream data filtering, a Bloom filter is adopted, wherein the Bloom filter comprises an array of n bits, each initially 0, and a collection of hash functions h1, h2, ..., hk, each of which maps a key value to one of the n buckets, together with a set S of m key values; the Bloom filter allows all stream elements whose key value is in S to pass through and blocks most stream elements whose key value is not in S;
in the stream data acquisition, there is an important correlation between adjacently generated stream data, and the earlier an element appears in the stream the smaller its correlation, so the decaying-window method is adopted to extract the stream data and compute a smooth aggregate value, wherein the weights used decay continuously; this is called an exponentially decaying window and is written as

$$\sum_{i=1}^{t} a_i (1 - c)^{t-i}$$

where a_1 is the first arriving element, a_t is the current element, and c = 10^{-9};
In the step S3, input data is input into the model, the first layer of the model is a convolutional layer, the input of the layer is flow data after being screened, and unlike the traditional full connection layer, the input of each node in the convolutional layer is only a small block of the neural network of the previous layer; the convolutional layer analyzes each small block in the neural network more deeply so as to obtain the characteristic with higher abstraction degree; the node matrix processed by the convolutional layer becomes deeper, and the depth of the node matrix after the convolutional layer is increased; the second layer is a pooling layer, and the neural network of the pooling layer does not change the depth of the matrix but reduces the size of the matrix; the pooling operation is to convert a high-resolution picture into a low-resolution picture, and the data size is reduced while the data characteristics are still kept; through the pooling layer, the number of nodes in the last full-connection layer can be further reduced, so that the number of parameters in the whole neural network is reduced; after processing of the convolutional layers and the pooling layers, giving a final classification result by 1 to 2 fully-connected layers at the end of the convolutional neural network; after several rounds of processing of convolutional and pooling layers, the information in the data has been abstracted into more information-rich features; convolutional and pooling layers are processes that automatically extract features, and after feature extraction is complete, classification tasks are completed using fully-connected layers.
2. The analysis method for extracting features of the convolutional neural network for time-series flow data according to claim 1, wherein in step S1, stream data preprocessing is carried out according to the characteristics of the stream data, including identification of key data information and of redundant attributes; the factors that have the greatest influence on the result are screened out manually, the stream data of all these factors are preprocessed, and abnormal items, missing items, redundant items and divergent items of the historical data are handled by the preprocessing means of data cleaning, data integration, data transformation, data merging and data reshaping; the preprocessed sample data are then examined carefully, the screened important information is described numerically, and the dimensions used to screen the target features are established manually.
3. The analysis method for extracting features of the convolutional neural network for time-series flow data according to claim 2, wherein in step S1 the ways of preprocessing abnormal data include:
data missing: missing records are filtered out, which raises the overall level of the data while reducing the data volume;
data abnormality: preprocessing is performed by deleting the data, by comprehensive analysis and substitution in combination with the overall model, or by treating the abnormal value as a missing value and filling it equivalently, so that the deviation between the processed abnormal value and the other values is minimized;
data redundancy: if two attributes of the data are strongly correlated, the less important of the two is removed;
data normalization: the data of the different dimensions do not lie in a uniform range, and in cross-dimension calculation the weights would otherwise swing up and down too much to allow convenient adjustment and calculation;
labels and timestamps: for supervised learning on the classification problem, the data set is labeled and each record is also given a timestamp.
4. The analysis method for extracting features of the convolutional neural network for time-series flow data according to claim 1, wherein in step S4, independent convolution is performed for each dimension; and (3) respectively extracting the feature of each dimension by independently extracting the form of the dimension feature of each dimension, respectively strengthening the feature of each dimension, finally integrating the strongest features of each dimension, and judging the final classification result by combination.
5. The analysis method for extracting features of the convolutional neural network for time-series flow data according to claim 1, wherein in step S5, after deep learning is performed, the accuracy of classification or prediction after deep learning is calculated, and a log and an accuracy result are generated for improving, modifying a model and debugging.
CN201811216349.3A 2018-10-18 2018-10-18 Analysis method for convolutional neural network extraction features aiming at time sequence flow data Active CN109299185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811216349.3A CN109299185B (en) 2018-10-18 2018-10-18 Analysis method for convolutional neural network extraction features aiming at time sequence flow data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811216349.3A CN109299185B (en) 2018-10-18 2018-10-18 Analysis method for convolutional neural network extraction features aiming at time sequence flow data

Publications (2)

Publication Number Publication Date
CN109299185A CN109299185A (en) 2019-02-01
CN109299185B true CN109299185B (en) 2023-04-07

Family

ID=65157370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811216349.3A Active CN109299185B (en) 2018-10-18 2018-10-18 Analysis method for convolutional neural network extraction features aiming at time sequence flow data

Country Status (1)

Country Link
CN (1) CN109299185B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967616B (en) * 2020-08-18 2024-04-23 深延科技(北京)有限公司 Automatic time series regression method and device
CN111966740A (en) * 2020-08-24 2020-11-20 安徽思环科技有限公司 Water quality fluorescence data feature extraction method based on deep learning
CN112232197A (en) * 2020-10-15 2021-01-15 武汉微派网络科技有限公司 Juvenile identification method, device and equipment based on user behavior characteristics
CN112184056B (en) * 2020-10-19 2024-02-09 中国工商银行股份有限公司 Data feature extraction method and system based on convolutional neural network
CN114385699A (en) * 2022-01-06 2022-04-22 云南电网有限责任公司信息中心 Abnormal analysis method for user price rate of power grid
CN115065560A (en) * 2022-08-16 2022-09-16 国网智能电网研究院有限公司 Data interaction leakage-prevention detection method and device based on service time sequence characteristic analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194404A (en) * 2017-04-13 2017-09-22 哈尔滨工程大学 Submarine target feature extracting method based on convolutional neural networks
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
CN108647834A (en) * 2018-05-24 2018-10-12 浙江工业大学 A kind of traffic flow forecasting method based on convolutional neural networks structure

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
CN107194404A (en) * 2017-04-13 2017-09-22 哈尔滨工程大学 Submarine target feature extracting method based on convolutional neural networks
CN108647834A (en) * 2018-05-24 2018-10-12 浙江工业大学 A kind of traffic flow forecasting method based on convolutional neural networks structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王勇; 周慧怡; 俸皓; 叶苗; 柯文龙. 基于深度卷积神经网络的网络流量分类方法 [Network traffic classification method based on deep convolutional neural networks]. 通信学报 (Journal on Communications), 2018, (01), full text. *

Also Published As

Publication number Publication date
CN109299185A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299185B (en) Analysis method for convolutional neural network extraction features aiming at time sequence flow data
CN111585997B (en) Network flow abnormity detection method based on small amount of labeled data
CN110298663B (en) Fraud transaction detection method based on sequence wide and deep learning
CN109993100B (en) Method for realizing facial expression recognition based on deep feature clustering
CN112116001B (en) Image recognition method, image recognition device and computer-readable storage medium
CN112308288A (en) Particle swarm optimization LSSVM-based default user probability prediction method
CN112232604B (en) Prediction method for extracting network traffic based on Prophet model
Malathi et al. Evolving data mining algorithms on the prevailing crime trend–an intelligent crime prediction model
CN114255447A (en) Unsupervised end-to-end video abnormal event data identification method and unsupervised end-to-end video abnormal event data identification device
CN116307227A (en) Service information processing method, device and computer equipment
CN114723003A (en) Event sequence prediction method based on time sequence convolution and relational modeling
CN114117029A (en) Solution recommendation method and system based on multi-level information enhancement
Acharya et al. Efficacy of CNN-bidirectional LSTM hybrid model for network-based anomaly detection
CN115883424A (en) Method and system for predicting traffic data between high-speed backbone networks
CN114493858A (en) Illegal fund transfer suspicious transaction monitoring method and related components
Ulizko et al. Graph visualization of the characteristics of complex objects on the example of the analysis of politicians
Xie et al. Models and Features with Covariate Shift Adaptation for Suspicious Network Event Recognition
Kaushal Deep RNN-based Traffic Analysis Scheme for Detecting Target Applications
ARSLAN et al. Crime Classification using Categorical Feature Engineering and Machine Learning
CN115660221B (en) Oil and gas reservoir economic recoverable reserve assessment method and system based on hybrid neural network
CN113688229B (en) Text recommendation method, system, storage medium and equipment
Ramazan Classification of Historical Anatolian Coins with Machine Learning Algorithms
Goutham et al. A Study of incremental Learning model using deep neural network
Falissard et al. Learning a binary search with a recurrent neural network. A novel approach to ordinal regression analysis
APURVA Analysis of disease detection in cotton plant leaves using convolutional neural networks.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 200032 No. two, 851 South Road, Xuhui District, Shanghai, Zhongshan

Applicant after: Shanghai Shipbuilding Technology Research Institute (the 11th Research Institute of China Shipbuilding Corp.)

Address before: 200032 No. two, 851 South Road, Xuhui District, Shanghai, Zhongshan

Applicant before: SHIPBUILDING TECHNOLOGY Research Institute (NO 11 RESEARCH INSTITUTE OF CHINA STATE SHIPBUILDING Corp.,Ltd.)

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant