Disclosure of Invention
The invention aims to provide a big data intelligent analysis method based on deep learning so as to solve the problems in the background technology.
In order to achieve the above purpose, the invention provides a big data intelligent analysis method based on deep learning, comprising the following steps:
S1, acquiring original data through a big data acquisition module, and preprocessing the original data with a big data preprocessing algorithm to handle missing values, abnormal values and noise;
S2, obtaining data features with a deep learning model, and selecting various types of feature vectors with a feature selection algorithm and combining them to obtain a prediction result;
S3, on a deep learning network architecture based on an attention mechanism, training and classifying the data with a stacked autoencoder model;
and S4, performing data analysis with a data compression algorithm, and displaying the analysis results through various visualization modes.
As a further improvement of the technical solution, when the original data is acquired, the big data acquisition module in S1 automatically acquires various types of data using web crawler technology, and stores and organizes the data;
the web crawler technology comprises the following steps:
determining a target website and analyzing the website structure of the target website;
writing a web crawler program to process data grabbing and extracting of a target website;
acquiring target data according to the previously analyzed website layout and element information;
and storing the crawled data into a database.
As a further improvement of the present technical solution, the big data preprocessing algorithm in S1 includes the following steps: collecting and checking the data, processing missing values, processing abnormal values, processing noise, sampling the data to reduce the size of the original data, reducing and transforming the data, standardizing the data, splitting the data set, and validating the model.
As a further improvement of the technical solution, the deep learning model in step S2 uses a convolutional neural network model to perform data analysis in the feature extraction process and to extract the features;
the convolutional neural network model comprises the following steps:
extracting the original data preprocessed by the big data preprocessing algorithm;
defining a CNN model;
training a model;
and extracting the characteristics of the data by using the trained CNN model.
As a further improvement of the present technical solution, in the feature selection algorithm in S2, the feature selection process includes the following two steps:
selecting the most relevant features from each type of features to form a new feature vector;
the selected new feature vectors are classified by a learner.
As a further improvement of the present technical solution, the deep learning network architecture of the attention mechanism in S3 is implemented by the following algorithm:
let the input feature be x ∈ {C×H×W}, where C represents the number of channels, and H and W represent the height and width of the feature map, respectively; assuming there are K attention heads, each head outputting different weights, the weight of each head is defined as {K×C}; the input x passes through K different convolution operations with the corresponding weights to obtain K attention transformation results, namely:
Vj = Conv(x, wj) ∈ {D×H×W}, j = 1, 2, ..., K
wherein D represents the output depth of each head, set to C/K, and the K outputs are spliced to obtain a weighted feature representation:
Concat(V1, V2, ..., VK) = {D1×H×W}
wherein D1 = D×K, and the weighted features are sent to the subsequent layers for training, where the attention weights are:
α(i, j) = e(i, j) / Σk e(i, k)
e(i, j) = f(hi, hj)
wherein α(i, j) represents the attention weight of the ith row and jth column in the attention mechanism, e(i, j) is the element of the ith row and jth column in the attention score matrix, and f(hi, hj) is the score function of the feature vector at position i and the feature vector at position j in the input sequence.
As a further improvement of the present technical solution, the stacked autoencoder model in S3 includes the following steps:
dividing the weighted-feature data into a training set, a validation set and a test set in proportion;
constructing a plurality of autoencoders, each including an input layer, an encoding layer and a decoding layer, and performing feature learning in an unsupervised manner;
forming a deep neural network model from the plurality of autoencoders;
training the model on the training data set, using an adaptive-learning-rate optimization algorithm and regularization methods to avoid overfitting during training;
performing feature extraction on the data with the trained stacked autoencoder model, and constructing a classification model;
and validating the classification model on the validation set and test set, selecting a suitable model and parameters, and predicting new data with the trained model.
As a further improvement of the present technical solution, the data compression algorithm in S4 includes the following steps:
carrying out data coding on the predicted data and the analyzed data;
counting the occurrence frequency of each symbol, sorting the frequencies in ascending order, merging the two smallest frequencies each time to construct a tree in which the smaller-frequency branch is represented by 0 and the larger by 1, and outputting the codes in leaf-node order;
and compressing and transmitting the encoded data.
As a further improvement of the present technical solution, the visualization modes in S4 include: scatter plots, bar charts, histograms, box plots, heat maps, and overall relationship graphs.
Compared with the prior art, the invention has the beneficial effects that:
the deep-learning-based big data intelligent analysis method covers all links of data processing, feature extraction, model training and result analysis; a novel model structure and feature selection method guarantee the objectivity of the results; excellent data processing tools such as a data compression algorithm guarantee the efficiency of the data analysis; and the final results are clear and intuitive, further improving practicability.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-6, embodiment 1 of the present invention provides a big data intelligent analysis method based on deep learning, comprising the following steps:
S1, acquiring original data through a big data acquisition module, preprocessing the original data with a big data preprocessing algorithm to handle missing values, abnormal values and noise, and performing operations such as sampling, reduction and transformation of the data, so as to ensure high data quality suitable for deep learning training, reduce the footprint of the data in subsequent analysis, lower the computational load, and make the data analysis more stable;
when the original data is acquired, the big data acquisition module in S1 automatically acquires various types of data using web crawler technology, and stores and organizes the data;
as shown in fig. 2, the web crawler technology includes the following steps:
determining a target website and analyzing its structure: identifying the page structure, elements, class names, tags, CSS selectors and the like of the website; this information helps in writing a program to extract the required data;
writing a web crawler program to handle data grabbing and extraction from the target website; Python is a common language for this, with many crawler frameworks and libraries to choose from;
acquiring the target data according to the previously analyzed website layout and element information, which typically involves parsing the HTML pages, for example with regular expressions, and extracting the necessary data from them;
and storing the crawled data into a database.
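The crawler steps above can be sketched with the Python standard library alone; the `headline` class name, the sample page, and the one-column table are illustrative assumptions rather than part of the invention, and a real crawler would additionally fetch live pages (e.g. with urllib or a crawler framework):

```python
import sqlite3
from html.parser import HTMLParser

class TargetParser(HTMLParser):
    """Collects the text of every element whose class matches `wanted`;
    the class name comes from analyzing the target site's layout beforehand."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self._grab = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if self.wanted in (dict(attrs).get("class") or "").split():
            self._grab = True

    def handle_data(self, data):
        if self._grab and data.strip():
            self.items.append(data.strip())
            self._grab = False

def store(items, db_path=":memory:"):
    """Final step: store the crawled records in a database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS crawled (content TEXT)")
    conn.executemany("INSERT INTO crawled VALUES (?)", [(i,) for i in items])
    conn.commit()
    return conn

# A stand-in for one downloaded page of the analyzed target website.
page = ('<div class="headline">First item</div>'
        '<div class="ad">skip</div>'
        '<div class="headline">Second item</div>')
parser = TargetParser("headline")
parser.feed(page)
conn = store(parser.items)
print(parser.items)
```

In practice the parsing rules (tag, class, selector) are derived from the site analysis in the first step, and one parser instance is fed each fetched page.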
As shown in fig. 3, the big data preprocessing algorithm in S1 includes the following steps:
collecting and checking data: the raw data are first collected, then organized and checked to understand the size, format and quality of the data, including conditions such as missing data, abnormal values and noise;
missing value processing: for missing values, missing data records can be deleted and interpolation methods can be used for processing, such as methods of mean, median, model fitting and the like;
outlier processing: for outliers, two steps of detection and removal may be used; detection uses methods such as statistical analysis and machine learning, and cleaning can be performed by modifying or deleting the values;
noise treatment: for noise, filters may be used to smooth the data, such as mean filtering, median filtering, gaussian filtering, etc.;
data sampling to reduce the original data size: when the data set is too large, a data sampling algorithm such as random sampling or stratified sampling is used to reduce its size while preserving the data distribution and reducing noise;
data reduction and transformation: for high-dimensional data, dimensionality reduction can shrink the feature space and computational complexity; common dimensionality reduction algorithms include PCA, LDA, t-SNE and the like; in addition, the original data can be transformed by feature scaling, feature selection, feature transformation and the like to improve model performance;
data normalization: normalization is a technique for scaling feature data to the same range, typically using the statistical z-score transformation or min-max normalization;
splitting a data set: in real-scene applications, we need to divide the data set into a training set for building the model and a test set for verifying the model;
model validation: verifying the quality of the model is critical; various metrics such as classification accuracy, regression error, F1 score and ROC curve can be used to determine the accuracy of the model.
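The preprocessing pipeline above can be sketched in NumPy on synthetic one-dimensional data; the imputation rule, z-score threshold, filter width and 80/20 split are illustrative choices, not prescribed by the method:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50.0, 10.0, size=200)
data[5] = np.nan          # a missing value
data[17] = 500.0          # an abnormal value (outlier)

# Missing-value processing: impute with the median.
median = np.nanmedian(data)
data = np.where(np.isnan(data), median, data)

# Outlier processing: detect by z-score, clean by replacing with the median.
z = (data - data.mean()) / data.std()
data = np.where(np.abs(z) > 3, median, data)

# Noise processing: smooth with a 3-point mean filter.
data = np.convolve(data, np.ones(3) / 3, mode="same")

# Data normalization: min-max scaling to [0, 1].
scaled = (data - data.min()) / (data.max() - data.min())

# Splitting the data set: 80% training, 20% test.
idx = rng.permutation(len(scaled))
split = int(0.8 * len(scaled))
train, test = scaled[idx[:split]], scaled[idx[split:]]
print(len(train), len(test))
```

Sampling, dimensionality reduction (PCA etc.) and model validation would follow the same pattern, each as one further pipeline stage.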
S2, obtaining data features with a deep learning model so as to obtain high-quality, interpretable features as input data, and selecting various types of feature vectors with a feature selection algorithm and combining them to obtain a prediction result with improved accuracy;
the deep learning model in the S2 uses a convolutional neural network model to conduct data analysis in the feature extraction process, and extracts features;
as shown in fig. 4, the convolutional neural network model includes the steps of:
extracting the original data preprocessed by the big data preprocessing algorithm;
defining a CNN model: the method comprises a convolution layer, an activation function, a pooling layer, a full connection layer and the like, and is designed according to the type of the characteristics and the characteristics of data;
model training: comprising model compilation, hyperparameter selection, model training and the like, where the goal of training is to find the optimal weights and bias values;
extracting features from the data using the trained CNN model; typically the outputs of selected convolution layers are taken as the required features and then fed into a fully connected layer for classification or regression tasks.
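The convolution-based feature extraction above can be illustrated with a single filter in plain NumPy; the 8×8 input, 3×3 kernel and 2×2 pooling are arbitrary stand-ins, and a trained CNN would use many learned filters rather than a random one:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNN layers)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling, discarding any ragged border."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

rng = np.random.default_rng(1)
image = rng.random((8, 8))            # stand-in for one preprocessed sample
kernel = rng.standard_normal((3, 3))  # stand-in for one trained filter
features = max_pool(np.maximum(conv2d(image, kernel), 0.0))  # conv -> ReLU -> pool
print(features.shape)  # (3, 3)
```

The pooled map plays the role of the "output of certain convolution layers" that the text describes feeding to a fully connected layer.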
As shown in fig. 5, in the feature selection algorithm in S2, let the sample set be D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi represents the feature vector of sample i, yi represents the corresponding category label, and xi contains m different types of features: xi = (x{i1}, x{i2}, ..., x{im}). The feature selection process comprises the following two steps:
selecting the k most relevant features from each type of features to form a new feature vector, namely selecting the k most relevant features from each of the m different types of features to generate a new feature vector x̃ = (x̃1, x̃2, ..., x̃m), wherein x̃1, x̃2, ..., x̃m respectively represent the selection results for the m different types of features;
classifying the selected new feature vector x̃ by a learner (such as a decision tree, a support vector machine, etc.);
in particular, the k features with the highest correlation within each type can be selected by a Wrapper method, and the importance of different features can be calculated and compared according to specific evaluation indexes of the learner (such as information gain, Gini coefficient, average precision, etc.).
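As a rough sketch of the two-step selection, the NumPy example below ranks each feature type by absolute correlation with the label (a filter criterion standing in for the Wrapper evaluation described above) and combines the per-type selections into the new feature vector x̃; the synthetic data, the two feature types and the choice k = 1 are illustrative assumptions:

```python
import numpy as np

def select_top_k(X, y, k):
    """Rank the features of one type by |correlation| with the label
    and keep the k strongest (a stand-in for the Wrapper evaluation)."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(2)
n = 300
y = rng.integers(0, 2, n).astype(float)
# Two hypothetical feature types; in each, column 0 is informative, the rest noise.
type_a = np.column_stack([y + 0.1 * rng.standard_normal(n),
                          rng.standard_normal(n),
                          rng.standard_normal(n)])
type_b = np.column_stack([2 * y + 0.2 * rng.standard_normal(n),
                          rng.standard_normal(n)])

keep_a = select_top_k(type_a, y, k=1)
keep_b = select_top_k(type_b, y, k=1)
# Combine the per-type selections into the new feature vector x~.
x_tilde = np.column_stack([type_a[:, keep_a], type_b[:, keep_b]])
print(keep_a, keep_b, x_tilde.shape)
```

In the method as described, x̃ would then be passed to a learner such as a decision tree or support vector machine for classification.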
S3, on a deep learning network architecture based on an attention mechanism, performing data training and classification with a stacked autoencoder model; the architecture introduces an attention mechanism on top of a traditional deep learning network and can weight all the features so as to improve training precision and model generalization;
the deep learning network architecture of the attention mechanism in the step S3 is realized by the following algorithm:
let the input feature be x ∈ {C×H×W}, where C represents the number of channels, and H and W represent the height and width of the feature map, respectively; assuming there are K attention heads, each head outputting different weights, the weight of each head is defined as {K×C}; the input x passes through K different convolution operations with the corresponding weights to obtain K attention transformation results, namely:
Vj = Conv(x, wj) ∈ {D×H×W}, j = 1, 2, ..., K
wherein D represents the output depth of each head, set to C/K, and the K outputs are spliced to obtain a weighted feature representation:
Concat(V1, V2, ..., VK) = {D1×H×W}
wherein D1 = D×K, and the weighted features are sent to the subsequent layers for training, where the attention weights are:
α(i, j) = e(i, j) / Σk e(i, k)
e(i, j) = f(hi, hj)
wherein α(i, j) represents the attention weight of the ith row and jth column in the attention mechanism, e(i, j) is the element of the ith row and jth column in the attention score matrix, and f(hi, hj) is the score function of the feature vector at position i and the feature vector at position j in the input sequence. The sum of all weight scores normalizes the scores so that the weights of each position sum to 1; the attention weight represents the contribution of each input position to the target position, and the weighted summation computes weighted average feature vectors and extracts the most relevant information, so that the attention network can adaptively adjust the weight of each position, thereby improving training precision and model generalization.
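The weight computation α(i, j) = e(i, j) / Σk e(i, k) and the weighted summation can be sketched in NumPy as follows, taking the score function f to be a dot product and implementing the normalization with the usual softmax (both are common choices assumed here, not mandated by the text):

```python
import numpy as np

def softmax(e):
    """Row-wise normalization of scores into weights that sum to 1."""
    e = e - e.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    p = np.exp(e)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d = 5, 4
H = rng.standard_normal((n, d))  # feature vectors h_1 .. h_n

e = H @ H.T          # e(i, j) = f(h_i, h_j), here a dot product
alpha = softmax(e)   # alpha(i, j): normalized attention weights
context = alpha @ H  # weighted average feature vector per position
print(alpha.sum(axis=1))  # each row of weights sums to 1
```

A multi-head version would split the channels into K groups and run this computation once per head before concatenating, as the convolutional formulation above describes.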
The stacked autoencoder model in S3 includes the following steps:
dividing the weighted-feature data into a training set, a validation set and a test set in proportion, and performing preprocessing operations such as standardization and normalization;
constructing a plurality of autoencoders, each including an input layer, an encoding layer and a decoding layer, and performing feature learning in an unsupervised manner;
forming a deep neural network model from the plurality of autoencoders, i.e. stacking the autoencoders;
training the model on the training data set, using an adaptive-learning-rate optimization algorithm and regularization methods to avoid overfitting during training;
performing feature extraction on the data with the trained stacked autoencoder model, and constructing a classification model, such as an SVM, KNN or logistic regression machine learning model, or a neural network model;
validating the classification model on the validation set and test set, selecting a suitable model and parameters, and predicting new data with the trained model, which ensures the controllability and stability of each link and avoids the time and resources wasted by repeated retraining.
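The construction and greedy layer-wise training of stacked autoencoders can be sketched in NumPy as below; the layer sizes, plain gradient descent with a fixed learning rate, and synthetic data are illustrative simplifications (the method as described would add an adaptive-learning-rate optimizer, regularization, the validation/test split, and a downstream classifier):

```python
import numpy as np

class AutoEncoder:
    """One tanh encoding layer plus a linear decoding layer,
    trained unsupervised to reconstruct its own input."""
    def __init__(self, d_in, d_hidden, rng):
        self.W1 = 0.1 * rng.standard_normal((d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = 0.1 * rng.standard_normal((d_hidden, d_in))
        self.b2 = np.zeros(d_in)

    def encode(self, X):
        return np.tanh(X @ self.W1 + self.b1)

    def step(self, X, lr):
        """One full-batch gradient step on the reconstruction MSE."""
        h = self.encode(X)
        Xhat = h @ self.W2 + self.b2
        g = 2.0 * (Xhat - X) / Xhat.size        # dLoss/dXhat for the MSE
        dz = (g @ self.W2.T) * (1.0 - h ** 2)   # backprop through tanh
        self.W2 -= lr * (h.T @ g)
        self.b2 -= lr * g.sum(0)
        self.W1 -= lr * (X.T @ dz)
        self.b1 -= lr * dz.sum(0)
        return float(np.mean((Xhat - X) ** 2))

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 8))

# Greedy layer-wise training: each autoencoder learns the previous one's codes.
sizes = [(8, 6), (6, 4)]
layers, inp, losses = [], X, []
for d_in, d_h in sizes:
    ae = AutoEncoder(d_in, d_h, rng)
    hist = [ae.step(inp, lr=0.5) for _ in range(300)]
    losses.append((hist[0], hist[-1]))
    layers.append(ae)
    inp = ae.encode(inp)

codes = inp  # stacked-model features, ready for a downstream classifier
print(codes.shape, losses)
```

The `codes` array is what the text calls the extracted features, which a classifier (SVM, KNN, logistic regression, etc.) would then consume.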
S4, performing data analysis with a data compression algorithm, which avoids bottlenecks in large-scale data transmission and processing, improves the stability and processing speed of the system, and guarantees efficient data analysis; the analysis results are presented through various visualization modes, so that the problems faced by large-scale data analysis can be comprehensively and effectively solved, achieving an innovative, lightweight and fully visualized solution of high practicability;
as shown in fig. 6, the data compression algorithm in S4 includes the following steps:
performing data coding on the predicted and analyzed data, where the data coding comprises a series of information-theoretic codes such as error-correcting codes, Markov chain coding, Huffman coding, compressed index coding, entropy coding, Lempel-Ziv-Welch coding and the like;
counting the occurrence frequency of each symbol, sorting the frequencies in ascending order, merging the two smallest frequencies each time to construct a tree in which the smaller-frequency branch is represented by 0 and the larger by 1, and outputting the codes in leaf-node order, which effectively reduces the volume of the transmitted data and improves transmission efficiency;
compressing and transmitting the coded data; common compression modes include lossless compression and lossy compression: lossless compression encodes the data and removes its redundant and repeated parts according to the redundancy and repetitiveness of the data, for example the LZW and gzip lossless compression algorithms, while lossy compression allows part of the data information to be lost during compression in exchange for a higher compression ratio, common lossy compression algorithms including JPEG and MPEG; the data compression algorithm thus guarantees efficient data analysis, reduces the load on the system and improves operational stability.
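The frequency-merging procedure described above is Huffman coding; a compact Python sketch using a heap is shown below (breaking frequency ties by insertion order is an implementation choice of this sketch):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Builds a Huffman code table: repeatedly merge the two lowest
    frequencies, labeling the smaller branch 0 and the larger branch 1."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate one-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, partial code table).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        n0, _, lo = heapq.heappop(heap)      # smallest frequency -> prefix 0
        n1, _, hi = heapq.heappop(heap)      # next smallest      -> prefix 1
        merged = {s: "0" + c for s, c in lo.items()}
        merged.update({s: "1" + c for s, c in hi.items()})
        heapq.heappush(heap, (n0 + n1, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
encoded = "".join(codes[s] for s in "aaaabbc")
print(codes, encoded, len(encoded))
```

More frequent symbols receive shorter codes (here "a" gets a 1-bit code), which is exactly how the encoding reduces transmission volume.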
The visualization modes in S4 can better reveal the features and patterns of the data and extract useful information, including: a scatter plot, showing the relationship between two numerical variables, where each point represents an observation; a bar chart, for comparing categorical variables, where each bar represents a category and its height represents the corresponding value; a histogram, for showing the distribution of a numerical variable by dividing its range into intervals, where the height of each interval represents the number of observations falling into it; a box plot, for showing the distribution of a numerical variable by displaying five summary statistics of the data set as a box, where the lower and upper ends of the box represent the first and third quartiles respectively, the line inside the box represents the median, and points outside the box are outliers; a heat map, for showing relationships between variables by rendering the data as different shades of color, typically used for multivariate data plotted over two variables such as time and value; an overall relationship graph, for showing associations among multiple variables, where each variable is represented as a node and relationships between variables as edges; and a map visualization, for showing the distribution and relevance of geographic data, where each geographic location is represented as a point or area, and different colors, sizes, labels and symbols represent different features;
in conclusion, the visual presentation of the analysis results makes the final results clearer and more intuitive, further improving practicability.
In summary, the invention is illustrated by the following examples (news data, mobile device user data):
(1) in S1, news data is collected from the network and processed with denoising and deduplication modules; in S2, a convolutional neural network model extracts the text features of the news data; in S3, a stacked autoencoder model performs data training and classification, classifying the news data by topic; in S4, the classification results are visually analyzed with interactive visualization technology, and the classification effect is enhanced by adjusting the model parameters;
(2) in S1, mobile device user data is collected and preprocessed, for example by deduplication and filtering; in S2, a convolutional neural network model extracts mobile device user data features such as user behavior and location information; in S3, a stacked autoencoder model trains and classifies the data to realize accurate mobile advertisement delivery; in S4, the categories and analysis results are visually displayed with interactive visualization technology, and the model is continuously optimized to improve advertisement delivery accuracy;
the invention is highly innovative, providing a deep-learning-based big data intelligent analysis method with a wide application range that can be applied to many fields such as mobile advertising, financial risk control and medical image analysis; it provides a complete, intelligent, all-round technical solution that covers every link of data processing, feature extraction, model training and result analysis; a novel model structure and feature selection method improve the objectivity of the results, and excellent data processing tools such as the data compression algorithm guarantee the efficiency of data analysis, so that the final results are presented more clearly and intuitively, further improving practicability.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.