CN117421684B - Abnormal data monitoring and analyzing method based on data mining and neural network - Google Patents

Abnormal data monitoring and analyzing method based on data mining and neural network

Publication number
CN117421684B
Authority
CN
China
Prior art keywords
data
abnormal
neural network
model
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311718358.3A
Other languages
Chinese (zh)
Other versions
CN117421684A (en
Inventor
林明
胡琴
卢山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yizhigu Technology Group Co ltd
Original Assignee
Yizhigu Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yizhigu Technology Group Co ltd
Priority to CN202311718358.3A
Publication of CN117421684A
Application granted
Publication of CN117421684B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/26 Discovering frequent patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of abnormal data monitoring, and in particular to an abnormal data monitoring and analyzing method based on data mining and a neural network. The method comprises: preprocessing real-time communication data and encoding the data; detecting and identifying abnormal communication data with an abnormal data detection method and standardizing the abnormal communication data; accurately identifying and controlling abnormal device behavior through anomaly detection with an enhanced weighted random forest algorithm; identifying the classified abnormal data with an enhanced streaming anomaly detection algorithm and a trained neural network model; and detecting abnormal data and issuing early warnings in real time with an adaptive method, in which an adaptive learning rate is adjusted according to an adaptive gradient adjustment factor. The method offers good effectiveness and accuracy, high efficiency, low energy consumption and a high degree of intelligence.

Description

Abnormal data monitoring and analyzing method based on data mining and neural network
Technical Field
The invention relates to the technical field of abnormal data monitoring, in particular to a method for monitoring and analyzing abnormal data based on data mining and a neural network.
Background
Anomaly detection is a hot topic in various fields at present, and is widely applied to the fields of health care, intelligent transportation, large-scale production systems, network security and the like. Abnormality detection targets in different fields are different, for example, in the field of health care, abnormality detection is used to monitor human diseases; in intelligent traffic, it is used to find traffic accidents; in large production systems, for equipment failure diagnosis; in network security, it is used to detect network intrusion and the like. In the field of mobile communication systems, common abnormal data include signal quality anomalies, dropped call rate anomalies, call completion rate anomalies, data transmission rate anomalies, base station failure or anomalies, and traffic anomalies.
Existing anomaly detection methods include clustering, random forests and one-class support vector machines: a machine learning algorithm is trained on the data, and the resulting model then detects abnormal data, judging whether an anomaly exists by analyzing the statistical characteristics of the data. Existing methods also include the ARIMA model and exponential smoothing, which treat the data as a time series and judge whether an anomaly exists by analyzing the trend and periodicity of the series. Current communication systems focus on providing high-speed, stable and secure connections, on meeting the needs of different users and of diverse communication modes and application scenarios, and on good scalability and interoperability. At the same time, requirements for mobility, low latency, high reliability and privacy protection keep increasing. Because existing anomaly detection methods cannot meet the needs of current communication systems, such abnormal data are monitored and analyzed here by combining data mining with neural networks.
In the prior art, little research has been done on abnormal data monitoring using neural networks and data mining. Chinese patent application No. 201811522835.8 discloses a mobile communication data traffic abnormality monitoring system, which includes a traffic monitoring unit, an information following unit, a traffic analysis unit, a personal database, a processor, a display unit, an event recording unit, a reminding unit and a data confirmation unit. The traffic monitoring unit monitors the communication traffic of the communication device, which comprises data traffic and real-time rate information. The data traffic is the total data consumed so far in the current month, and the real-time rate information is the real-time network speed during network access. In that scheme, the traffic monitoring unit monitors the data traffic usage of the mobile device in real time and transmits the communication traffic to the traffic analysis unit. The traffic analysis unit applies the abnormality analysis step to calculate an instant stable value, then calculates a stable difference value from the instant stable value and uses this difference to judge whether the user's data traffic access is abnormal. However, normal traffic typically makes up the vast majority and abnormal traffic is comparatively rare, which causes a label imbalance problem.
Most existing anomaly detection methods train the anomaly detection model with a neural network. Chinese patent application No. 202110397166.1 discloses an abnormal data detection method based on an improved EMD and a neural network model, which comprises the following steps: drawing an envelope curve on the original signal by using an envelope function; inputting the drawn envelope signal into a modified EMD algorithm to extract characteristic variables (IMF components); modifying the characteristic variable extraction flow, namely replacing the cubic spline interpolation function with the fminbnd function, the envelope being drawn with the envelope function; inputting the extracted characteristic variables into a neural network model; and, after three layers of screening by the neural network model and matching against the frequency spectrum of the fault cause, finding the fault point and the fault cause. However, that method suffers from modal aliasing: noise interferes with the sampled signal, and the characteristic variables of the signal cannot be extracted accurately.
The invention aims to solve the technical problems of low analysis speed and inaccurate analysis results in existing detection of abnormal communication data.
To this end, an abnormal data monitoring and analyzing method based on data mining and a neural network is proposed.
Disclosure of Invention
The invention aims to provide an abnormal data monitoring and analyzing method based on data mining and a neural network, in which real-time communication data are preprocessed and then encoded, abnormal communication data are detected and identified using an abnormal data detection method, and data mining and neural network techniques are used to predict the trend of the real-time communication data and issue early warnings. Anomalies in the standardized data are identified using neural-network-based algorithms, realizing intelligent early warning of abnormal communication data.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the abnormal data monitoring and analyzing method based on the data mining and the neural network comprises the following steps:
acquiring a high-dimensional real-time communication data set of a user side, and classifying and marking the high-dimensional real-time communication data set;
preprocessing the classified high-dimensional real-time communication data set;
extracting features of the preprocessed high-dimensional real-time communication data set by using a bidirectional recurrent neural network (BiRNN) model;
performing dimension reduction on the extracted features of the high-dimensional real-time communication data set by using principal component analysis (PCA), and encoding the dimension-reduced features by using a discrete encoding method to form encoded data;
classifying the encoded data into a normal data set and an abnormal data set by an enhanced weighted random forest algorithm;
inputting the abnormal data set into a convolutional neural network detection model for training;
calculating a local anomaly factor value by comparing the average density of each data point with that of its neighboring points, data points whose local anomaly factor value is less than a certain threshold being anomaly points;
periodically updating the average density of the neighborhood set, and updating the local anomaly factor value of the data points in the neighborhood according to the updated average density;
processing the abnormal data set in real time by using an enhanced streaming anomaly detection algorithm and the trained convolutional neural network model, adjusting dynamically as the environment and the data distribution of the data set change, and automatically identifying the anomaly type of the abnormal data;
outputting results and early warnings: detecting the abnormal data set in real time by an adaptive method; judging whether a data sample is abnormal according to the threshold, performing the corresponding early warning processing, and feeding the early warning information back to an early warning system;
the adaptive method tracks changes in the data in real time and dynamically adjusts the model and its parameters, so that it adapts better to different data types and distributions, provides more accurate anomaly detection and early warning results, and achieves an early warning accuracy of more than 90 percent.
Preferably, the data preprocessing specifically includes:
unique attribute processing: a unique attribute is a feature that uniquely identifies a sample but has no influence on the identification of abnormal data, so it is deleted directly;
missing data processing: locating the missing values and filling them; index information of the missing values in the feature data is obtained quickly, and the missing values are processed using positional logical indexing;
correlated attribute merging: for feature data with obvious correlation, merging the correlated data into a single item by a data operation and deleting the uncorrelated data; the data are merged by adding the two attribute columns feature 1 and feature 2 into the data, and the deletion is performed after the attributes are merged;
automatic cleaning: automatically identifying and handling missing values, outliers and duplicate values in the data using automated algorithms and tools, by means of a semi-supervised learning algorithm and an anomaly detection algorithm; the specific steps are as follows:
loading the data to be cleaned into an appropriate computing environment;
missing value processing: missing value detection: detecting missing values in the data using a statistical index, a visualization method, or a clustering algorithm; missing value filling: filling the missing values by interpolation according to the characteristics and meaning of the data;
outlier processing: outlier detection: identifying outliers in the data using an outlier detection algorithm; outlier handling: processing the outliers by deletion, chosen according to the nature of the outliers;
duplicate value processing: duplicate value detection: detecting duplicate records in the data using a unique identifier of the data or a combination of fields; duplicate value handling: keeping the first record, according to service requirements;
verification and evaluation: verifying and evaluating the cleaned data, checking the cleaning effect and comparing it with the original data; the missing value ratio and the outlier ratio are used to evaluate the accuracy and completeness of the cleaning results.
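As an illustration only, the automatic cleaning steps can be sketched in plain Python. The function names, the use of linear interpolation for filling, and the median-based (MAD) outlier rule with a cutoff of 3.5 are assumptions chosen for the sketch; the text above only calls for "an interpolation method" and "an outlier detection algorithm".

```python
from statistics import median

def drop_duplicates(records, key_fields):
    """Keep the first record for each combination of key fields."""
    seen, kept = set(), []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

def missing_ratio(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def fill_missing_linear(values):
    """Fill None entries by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return filled
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:            # gap at the start: copy the next known value
            filled[i] = values[right]
        elif right is None:         # gap at the end: copy the last known value
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled

def remove_outliers_mad(values, cutoff=3.5):
    """Delete values whose modified z-score (median/MAD based) exceeds cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= cutoff]
```

The missing value ratio before and after filling, and the share of values removed as outliers, can then be compared against the original data in the verification step.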
Preferably, the encoding the dimension reduction feature by using a discrete encoding method to form encoded data specifically includes:
among the relevant attributes, one or more discrete attributes are grouped into one category, and each category is individually encoded using the one_hot encoding method.
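A minimal sketch of one-hot encoding for a single discrete attribute is given below. The helper name `one_hot_encode` and the example category values are hypothetical; ordering the categories alphabetically is a choice made here so that the codes are reproducible.

```python
def one_hot_encode(values):
    """Return (categories, codes): one bit per distinct category, in sorted order."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    codes = []
    for v in values:
        bits = [0] * len(categories)
        bits[index[v]] = 1          # exactly one bit is set per value
        codes.append(bits)
    return categories, codes

# Example with a hypothetical "communication mode" attribute.
categories, codes = one_hot_encode(["voice", "data", "voice"])
```

Each attribute encoded this way contributes as many bits as it has distinct categories.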
Preferably, the training process of the convolutional neural network detection model comprises the following steps:
initializing parameters with random values;
inputting the coded data in the abnormal data set, and obtaining an output value through forward propagation of a convolution layer, a pooling layer and a full connection layer;
calculating the training error between the output value and the target value for each of the convolution, pooling and fully connected layers; the target value refers to the classified abnormal data set;
performing back propagation to update the weights according to the training error;
terminating the training process when the training error does not change significantly within 100 iterations; if the termination condition is not met, returning to the data input step;
and obtaining a trained convolutional neural network model.
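The termination rule in the training process can be sketched as follows. The tolerance `tol` that defines "does not change significantly" is an assumption, as are the function names; the 100-iteration window comes from the text. The toy `step` function stands in for one forward and backward pass and is not the convolutional network itself.

```python
def should_stop(errors, window=100, tol=1e-4):
    """True when the last `window` training errors vary by less than `tol`."""
    if len(errors) < window:
        return False
    recent = errors[-window:]
    return max(recent) - min(recent) < tol

def train(step, max_iters=10_000, window=100, tol=1e-4):
    """Call step() (one update, returning the training error) until the stop rule fires."""
    errors = []
    for _ in range(max_iters):
        errors.append(step())
        if should_stop(errors, window, tol):
            break
    return errors

# Toy stand-in for a training step: gradient descent on error = w**2.
w = [5.0]

def step():
    w[0] -= 0.1 * 2 * w[0]      # gradient of w**2 is 2w
    return w[0] ** 2

errors = train(step)
```

Using the spread of the error over the window, rather than the difference between two consecutive errors, avoids stopping on a single flat step.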
Preferably, the number of neurons in the input layer of the convolutional neural network is set according to the number of the input abnormal communication data features and the coding bit number of each abnormal communication data feature, and the calculation method is as follows:
$N_{in} = \sum_{i=1}^{M} b_i$
where $M$ is the number of features of the input abnormal communication data, $b_i$ is the number of coding bits of the $i$-th feature, and $N_{in}$ is the number of neurons of the input layer; the abnormal communication data feature types include frequency anomaly, delay anomaly, abnormal data packet frequency, abnormal data packet size, signal strength anomaly, abnormal protocol behavior and data integrity anomaly.
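Read this way, the input layer size is the total number of coding bits over all features. A small worked sketch follows; the bit counts assigned to the seven feature types are invented for illustration and do not come from the text.

```python
def input_layer_size(bits_per_feature):
    """Number of input neurons: the sum of coding bits over all features."""
    if any(b <= 0 for b in bits_per_feature):
        raise ValueError("each feature needs at least one coding bit")
    return sum(bits_per_feature)

# Hypothetical coding widths for the seven feature types listed above.
bits = {
    "frequency anomaly": 3,
    "delay anomaly": 3,
    "abnormal packet frequency": 4,
    "abnormal packet size": 4,
    "signal strength anomaly": 2,
    "abnormal protocol behavior": 5,
    "data integrity anomaly": 2,
}
n_input = input_layer_size(bits.values())   # 3 + 3 + 4 + 4 + 2 + 5 + 2 = 23
```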
Preferably, the encoded data in the abnormal data set is input into a convolutional neural network detection model M1 for training;
the convolutional neural network model M1 comprises 12 convolutional layers and 8 pooling layers; the convolutional layers use three small 3×3 convolution kernels, and the 8 pooling layers use max pooling;
residual connection modules are added at the 3rd, 4th and 5th convolutional layers of the model M1, where one residual connection module consists of two convolutional layers with batch normalization and an activation function between them; the output and the input of the 3rd convolutional layer of the model M1 are added and the sum is passed to the 4th convolutional layer as input; the residual connection module is applied again, the output and the input of the 4th convolutional layer are added and the sum is passed to the 5th convolutional layer; the residual connection module is applied once more, yielding the convolutional neural network model M2;
inputting the encoded data in the abnormal data set into the convolutional neural network model M2, and performing convolutional, residual error connection and pooling operation;
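The residual connection module can be illustrated with a deliberately simplified sketch: two convolutions with batch normalization and an activation between them, whose output is added to the module input. To stay dependency-free the sketch uses 1-D signals and length-3 kernels instead of the 3×3 kernels of the model, and ReLU as the activation; both are assumptions, and the function names are hypothetical.

```python
def conv1d_same(x, kernel):
    """1-D convolution (cross-correlation) with zero padding, output length == len(x)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]

def batch_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, k1, k2):
    """conv -> batch norm -> activation -> conv, then add the block input."""
    h = relu(batch_norm(conv1d_same(x, k1)))
    h = conv1d_same(h, k2)
    return [a + b for a, b in zip(h, x)]    # the skip connection

out = residual_block([1.0, 2.0, 3.0], k1=[0, 1, 0], k2=[0, 1, 0])
```

Because the input is added back, a block whose second convolution outputs zero passes its input through unchanged, which is what makes stacked residual blocks easier to train.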
the middle layer is 1 layer, and the neuron number of the middle layer is
Preferably, the number of neurons of the output layer is equal to the number of demand categories; the one_hot encoding method is used to encode the demand state and the function of the output neuron is selected as the log function.
Preferably, classification is performed by an enhanced weighted random forest algorithm, comprising the steps of:
inputting the encoded data into an enhanced weighted random forest model for training;
Performing anomaly detection and classification using a trained, enhanced weighted random forest model, separating the encoded data into a normal data set and an anomaly data set;
assigning a weight to each abnormal data sample;
for unlabeled new samples, the trained model uses the learned parameters and weights thereof to conduct classification inference according to the learned rules, and the new samples are distributed to normal categories or abnormal categories; the parameters and weights of the model are learned from the labeled samples by an optimization algorithm in the training process.
Preferably, the label classification of the data means dividing all data into a normal data set and an abnormal data set; the encoded data are classified by an enhanced weighted random forest algorithm, and the weight function of the enhanced weighted random forest algorithm is as follows:
where the weight function is expressed in terms of the imbalance of each decision tree, the number N of decision trees, and the voting weight of each decision tree;
given N balanced sub-training sets, training is carried out on them to obtain N decision tree classifiers, indexed by a natural number from 1 to N;
the final classifier is obtained by weighted voting and is expressed as follows:
where Y is the abnormal data set;
the finally obtained classifier is evaluated on the test set to test the classification effect.
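The two ideas in this step, balanced sub-training sets and weighted voting, can be sketched as below. The weight function itself is not legible in the text, so the weights are passed in as plain numbers; building the balanced subsets by random undersampling of the larger class is likewise an assumption, and the stub classifiers stand in for trained decision trees.

```python
import random
from collections import defaultdict

def balanced_subsets(normal, abnormal, n_subsets, seed=0):
    """Draw n_subsets training sets with equal numbers of normal and abnormal samples."""
    rng = random.Random(seed)
    size = min(len(normal), len(abnormal))
    return [rng.sample(normal, size) + rng.sample(abnormal, size)
            for _ in range(n_subsets)]

def weighted_vote(classifiers, weights, x):
    """Return the label with the largest total voting weight."""
    scores = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        scores[clf(x)] += w
    return max(scores, key=scores.get)

# Stub "trees": each maps a sample to a label. Weights are illustrative only.
trees = [lambda x: "abnormal", lambda x: "normal", lambda x: "normal"]
label = weighted_vote(trees, [0.5, 0.2, 0.2], x=None)
```

Here the single tree with weight 0.5 outvotes the two trees with weight 0.2 each, which is the difference between weighted voting and a simple majority.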
Preferably, the abnormal data set is identified by using an enhanced flow type abnormal detection algorithm and a trained convolutional neural network model, and the calculation process of the local abnormal factor in the enhanced flow type abnormal detection algorithm is as follows:
for each abnormal data point $p$ in the real-time communication data set:
calculating the k nearest neighbors of $p$ and obtaining its neighborhood set $N_k(p)$;
calculating the k-reachability distance from data point $p$ to data point $o$, which is the maximum of the k-distance $d_k(o)$ of data point $o$ and the Euclidean distance $d(p, o)$ between $p$ and $o$:
$\mathrm{reach\text{-}dist}_k(p, o) = \max\{ d_k(o),\ d(p, o) \}$;
according to a set distance threshold $\varepsilon$, defining the data points whose distance is less than or equal to the threshold as the data points in the neighborhood of the target point; the neighborhood of target point $p$ is expressed as:
$N_\varepsilon(p) = \{ q \in D \mid d(p, q) \le \varepsilon \}$
where $N_\varepsilon(p)$ represents the set of data points in the neighborhood of target point $p$; $D$ is the data set; $d(p, q)$ represents the Euclidean distance between data points $p$ and $q$; $d(p, q) \le \varepsilon$ indicates that data point $q$, whose distance is less than or equal to the set threshold $\varepsilon$, belongs to the neighborhood of target point $p$; different k-reachability distances correspond to different distance thresholds, and the threshold $\varepsilon$ is adjusted dynamically according to the k-reachability distance; the k-reachability distance from data point $p$ to data point $o$ is compared with the threshold $\varepsilon$; if the k-reachability distance of a data point is smaller than the threshold $\varepsilon$, it is marked as an anomaly of the same type; if successive outliers occur, the threshold $\varepsilon$ is adjusted to improve accuracy; if no outlier occurs, the threshold $\varepsilon$ is adjusted to increase sensitivity;
calculating the average density of the neighborhood set $N_k(p)$ of $p$;
calculating the local anomaly factor of $p$,
where M is the number of data points; by dynamically adjusting the threshold $\varepsilon$, more abnormal data points in neighboring regions are detected.
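For orientation, the standard local outlier factor computation (k-distance, k-reachability distance, local reachability density, then the factor itself) can be sketched in plain Python. This is the textbook formulation, not the enhanced variant described above: it omits the dynamic distance threshold, and the function name and example points are assumptions. It requires distinct points and k smaller than the number of points.

```python
import math

def local_outlier_factors(points, k):
    """Standard LOF score for every point; scores near 1 indicate inliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]              # distance to the k-th nearest neighbor
        k_dist.append(kd)
        neighbors.append([j for j in order if dist[i][j] <= kd])

    def reach(i, j):
        """k-reachability distance from point i to point j."""
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [len(neighbors[i]) / sum(reach(i, j) for j in neighbors[i])
           for i in range(n)]
    # LOF: mean neighbor density divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factors(points, k=2)
```

In this standard form the four clustered points score 1 and the isolated point scores far above 1; which side of a threshold counts as abnormal, and how the threshold moves over time, is left to the caller.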
Preferably, the adaptive method allows the anomaly detection model to adjust itself automatically as the data change, so as to adapt to new data distributions and pattern changes; the parameter update formula in the adaptive method is as follows:
where the terms of the formula are: the updated value of the parameter, the adaptive learning rate, the adaptive gradient adjustment factor, the first moment of the gradient (the mean), the second moment of the gradient (the variance), a smoothing term, and the current gradient;
the learning rate is adjusted according to the self-adaptive gradient adjustment factor, and the parameters of the model are updated by using the adjusted learning rate; when the adaptive gradient adjustment factor becomes large, the learning rate becomes small; when the self-adaptive gradient adjustment factor becomes smaller, the learning rate becomes larger; the adaptive gradient adjustment factor affects not only the absolute magnitude of the learning rate, but also the rate of change of the learning rate; the change amplitude of the learning rate in the training process is adjusted by controlling the change rate of the self-adaptive gradient adjustment factor; an adaptive gradient adjustment factor greater than a certain threshold may result in a slower change in learning rate, while an adaptive gradient adjustment factor less than a certain threshold may result in a faster change in learning rate.
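Since the update formula itself is not legible in the text, the following is only a sketch under stated assumptions: it uses the well-known Adam update (first and second moment estimates with bias correction and a smoothing term `eps`) and adds a hypothetical scalar `gamma` that divides the learning rate, so that a larger adjustment factor gives a smaller step, as the paragraph above describes. The names `adam_step` and `gamma` and all default values are illustrative.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01, gamma=1.0,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update; gamma is a hypothetical gradient adjustment factor."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (running mean)
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment (running mean of squares)
    m_hat = m / (1 - beta1 ** t)                 # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    step_size = lr / gamma                       # larger gamma -> smaller learning rate
    theta = theta - step_size * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta**2 from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

How `gamma` itself would be computed from the gradients is not specified here; in the sketch it is a constant supplied by the caller.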
Compared with the prior art, the invention has the beneficial effects that:
1. Classification with the anomaly detection of the enhanced weighted random forest algorithm identifies abnormal data points effectively, improves classification accuracy, captures the characteristics of abnormal data better, handles large-scale high-dimensional data sets effectively, and meets real-time anomaly detection requirements. The enhanced weighted random forest algorithm lets the model cope better with differences in sample size between classes: by giving different weights to the data samples, the model pays more attention to samples of minority classes, which improves classification accuracy. The weighted random forest algorithm also takes feature importance into account and, by evaluating feature weights, identifies the most discriminative features for classification or anomaly detection tasks. This helps to improve the accuracy and efficiency of the model and reduces the impact of irrelevant or redundant features. The random forest algorithm is robust to noise and outliers in the data. The improvements to the weighted random forest algorithm further increase the stability of the model and reduce over-fitting to abnormal samples, which improves overall model performance.
2. The trend of the real-time communication data is predicted and early warnings are issued using the enhanced streaming anomaly detection algorithm, which processes the high-dimensional data stream in real time, identifies abnormal data immediately, and continuously updates the model with new data to adapt to changes in the data distribution. The algorithm is self-adaptive: it automatically learns and adjusts the model as the data distribution changes. Different k-reachability distances correspond to different distance thresholds, and the threshold is adjusted dynamically according to the k-reachability distance. This improves the robustness and adaptability of the algorithm and ensures that the model stays effective during long-term operation. The streaming anomaly detection algorithm uses incremental learning to update the model with new data samples without reprocessing the entire data set, which saves computing resources and improves efficiency. By dynamically adjusting the threshold, more abnormal data points in neighboring regions are detected.
3. The abnormal data is detected and early-warned through a self-adaptive method, and a detection model or rule can be automatically adjusted according to the actual condition of the data so as to adapt to different data distribution and abnormal modes. The method ensures that the learning rate is automatically adapted to different gradient conditions in the training process, thereby achieving better optimization effect; the method can adapt to dynamic change of data and new abnormal modes, and improves robustness and adaptability of the system. And an abnormality detection algorithm is dynamically adjusted according to the distribution and the change of the actual data by using an adaptive method, so that the detection accuracy is improved. The self-adaptive method can capture the change modes and characteristics of the data and correspondingly adjust the abnormality detection model so as to better identify the abnormal data. The self-adaptive method can analyze and process the data in real time, timely detect abnormal data and make early warning.
Drawings
FIG. 1 is a flow chart of a method for monitoring and analyzing abnormal data based on data mining and neural network according to the present invention;
FIG. 2 is a graph of classification effects of various forms of models of the present invention;
FIG. 3 is a graph comparing weight functions of the present invention;
FIG. 4 is a graph comparing local anomaly factors of the present invention;
FIG. 5 is a graph comparing adaptive learning rates of the adaptive method of the present invention;
FIG. 6 is a graph comparing early warning accuracy of the adaptive method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. The embodiments described herein are merely some, but not all, embodiments of the invention. All other embodiments obtained from these embodiments without inventive work fall within the scope of protection of the invention.
Referring to fig. 1 to 6, the present invention provides an abnormal data monitoring and analyzing method based on data mining and neural network, and the technical scheme is as follows:
the abnormal data monitoring and analyzing method based on the data mining and the neural network comprises the following steps:
Acquiring a high-dimensional real-time communication data set of a user side, and carrying out marking classification on the high-dimensional real-time communication data set; the high-dimensional dataset includes the following 10 dimensional data: timestamp, sender and receiver identification, communication mode, communication duration, communication quality indicator, bandwidth usage, communication location, data traffic, network topology, and user behavior; removing users without abnormal communication behaviors, and retaining abnormal data samples; the abnormal data comprise abnormal signal quality, abnormal call drop rate, abnormal call completing rate, abnormal data transmission rate, base station fault or abnormal state and abnormal flow;
preprocessing the classified high-dimensional real-time communication data set; the data preprocessing comprises automatic cleaning, multi-source data integration, abnormality detection and data restoration, unique attribute deletion, relevant attribute integration and missing value processing, and the acquired high-dimensional data is converted into a data set which does not contain missing data and only contains effective characteristics;
the data preprocessing specifically comprises the following steps:
unique attribute processing: a unique attribute is a feature that uniquely identifies a sample but has no influence on the identification of abnormal data; such attributes are directly deleted;
Missing data processing: finding the positions of the missing values and filling the missing values; the index information of the missing values in the feature data is acquired quickly, and the missing values are processed by using positional logical indexing;
correlated attribute merging: for feature data with obvious correlation, the correlated data are combined into one item by a data operation, and non-correlated data are deleted; the data are combined by adding the two attribute columns feature 1 and feature 2, and the deletion is performed after the attributes are merged.
Automatic cleaning: automatically identifying and processing the problems of missing values, abnormal values and repeated values in the data by using an automatic algorithm and a tool through a semi-supervised learning algorithm and an abnormal detection algorithm; the method comprises the following specific steps:
loading data to be cleaned into an appropriate computing environment;
missing value processing: missing value detection: detecting missing values in the data using statistical indicators, visualization methods, or a clustering algorithm; missing value filling: filling the missing values by interpolation according to the characteristics and meaning of the data;
outlier processing: outlier detection: identifying outliers in the data using an outlier detection algorithm; outlier handling: processing the outliers by deletion, selected according to the nature of the outliers;
Duplicate value processing: duplicate value detection: detecting duplicate records in the data using a unique identifier of the data or a combination of fields; duplicate value handling: retaining the first record of each group of duplicates, according to the service requirement;
verification and evaluation: verifying and evaluating the cleaned data, checking the cleaning effect, and comparing the cleaned data with the original data; the missing value ratio and the outlier ratio are used to evaluate the accuracy and integrity of the cleaning results.
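The cleaning steps above can be illustrated with a minimal sketch in plain Python. It is not the implementation of the invention. The record layout, the field names `id` and `rate`, the use of `None` for a missing value, linear interpolation as the filling method, and the median-absolute-deviation rule with cutoff 3.5 for outlier detection are all assumptions chosen for illustration.

```python
from statistics import median


def drop_duplicates(records, key_fields):
    """Keep the first record for each combination of key fields."""
    seen, kept = set(), []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def interpolate_missing(values):
    """Fill None entries by linear interpolation; edge gaps copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


def outlier_mask(values, cutoff=3.5):
    """Flag values whose MAD-based modified z-score exceeds the cutoff (assumed rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > cutoff for v in values]


def clean(records, key_fields=("id",), field="rate"):
    """Deduplicate, fill missing values, delete outliers, and report both ratios."""
    records = drop_duplicates(records, key_fields)
    raw = [r[field] for r in records]
    missing_ratio = sum(v is None for v in raw) / len(raw)
    filled = interpolate_missing(raw)
    mask = outlier_mask(filled)
    outlier_ratio = sum(mask) / len(mask)
    cleaned = [dict(r, **{field: v})
               for r, v, bad in zip(records, filled, mask) if not bad]
    return cleaned, missing_ratio, outlier_ratio
```

The returned missing value ratio and outlier ratio correspond to the two indicators named in the verification and evaluation step; comparing them before and after cleaning gives a simple check of the cleaning effect.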
Extracting features of the preprocessed high-dimensional real-time communication data set by using a bidirectional recurrent neural network (BiRNN) model, extracting features suitable for anomaly detection, removing data in which no anomaly can exist, and extracting the abnormal data;
normalizing the abnormal data, performing dimension reduction on the extracted characteristics of the high-dimension real-time communication data set by using a Principal Component Analysis (PCA) method, and encoding the dimension reduction characteristics by using a discrete encoding method to form encoded data;
the step of encoding the dimension reduction feature by using a discrete encoding method to form encoded data specifically comprises the following steps:
among the relevant attributes, one or more discrete attributes are grouped into one category, and each category is individually encoded using the one_hot encoding method.
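The discrete encoding step can be sketched as follows. This is a minimal illustration only: the function names, the row layout and the example field names `mode` and `quality_bin` are hypothetical, and the category order (sorted distinct values) is an assumption.

```python
def fit_categories(column):
    """Return the sorted distinct categories of one discrete attribute."""
    return sorted(set(column))


def one_hot(value, categories):
    """Encode one value as a list holding a single 1 at the category's position."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_rows(rows, discrete_fields):
    """Concatenate the one-hot codes of every discrete field for each row."""
    cats = {f: fit_categories([r[f] for r in rows]) for f in discrete_fields}
    encoded = []
    for r in rows:
        bits = []
        for f in discrete_fields:
            bits.extend(one_hot(r[f], cats[f]))
        encoded.append(bits)
    return encoded, cats
```

Each discrete attribute contributes as many bits as it has categories, so the length of one encoded row is the total number of coding bits over all encoded attributes.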
Classifying the encoded data into a normal data set and an abnormal data set by an enhanced weighted random forest algorithm;
assigning a weight to each abnormal data sample;
for unlabeled new samples, the trained model uses the learned parameters and weights thereof to conduct classification inference according to the learned rules, and the new samples are distributed to normal categories or abnormal categories; the parameters and weights of the model are learned from the labeled samples by an optimization algorithm in the training process.
The marking classification of the data means that the whole data set is divided into a normal data set and an abnormal data set; the encoded data are classified by the enhanced weighted random forest algorithm, and the weight function of the enhanced weighted random forest algorithm is given by formula (1):
(1);
wherein the quantities in formula (1) are the imbalance of each decision tree, the number of decision trees N, and the voting weight of each decision tree;
given N balanced sub-training sets, training is carried out on the N balanced sub-training sets to obtain N decision tree classifiers, indexed by a natural number from 1 to N;
the final classifier is obtained by weighted voting and is expressed by formula (2):
(2);
wherein Y is the abnormal data set;
the classification effect of the finally obtained classifier is tested on the test set.
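The weighted voting of the N decision tree classifiers can be sketched as below. The weight function of formula (1) is not reproduced here, so the sketch assumes the voting weights have already been computed and are simply passed in. The decision threshold of 0.5, the convention that a tree returns 1 for abnormal and 0 for normal, and the function name are likewise assumptions.

```python
def weighted_vote(classifiers, weights, sample, threshold=0.5):
    """Return 1 (abnormal) when the weighted share of abnormal votes exceeds threshold."""
    if len(classifiers) != len(weights):
        raise ValueError("one weight per classifier is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each classifier votes 0 or 1; the vote is scaled by that tree's weight.
    score = sum(w * clf(sample) for clf, w in zip(classifiers, weights)) / total
    return 1 if score > threshold else 0
```

A tree with a larger voting weight therefore has more influence on whether a new sample is assigned to the normal or the abnormal category.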
Inputting the abnormal data into a convolutional neural network detection model for training;
the training process of the convolutional neural network detection model comprises the following steps:
initializing parameters with random values;
inputting the coded data in the abnormal data set, and obtaining an output value through forward propagation of a convolution layer, a pooling layer and a full connection layer;
calculating training errors between the output value and the target value of each layer of the convolution layer, the pooling layer and the full-connection layer; the target value refers to the classified abnormal data set;
performing back propagation updating weight according to the training error;
when the training error does not change significantly within 100 iterations, the training process is terminated; if the termination condition is not met, the step of inputting the data is executed again;
and obtaining a trained convolutional neural network model.
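The termination condition of the training process can be sketched as a plateau check. In this minimal illustration, `step` is a hypothetical callable that performs one forward and backward pass and returns the training error; the tolerance `tol` that defines a "significant" change and the cap `max_iters` are assumptions, since the text only fixes the window of 100 iterations.

```python
def train_until_plateau(step, patience=100, tol=1e-4, max_iters=100_000):
    """Call step() until the error fails to improve by more than tol for `patience` iterations."""
    best = float("inf")
    stale = 0  # consecutive iterations without a significant improvement
    for iteration in range(1, max_iters + 1):
        error = step()
        if best - error > tol:
            best = error
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return iteration, best
    return max_iters, best
```

With `patience=100` the loop stops at the first point where 100 consecutive iterations have passed without the error decreasing by more than `tol`.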
The number of neurons in the input layer of the convolutional neural network is set according to the number of input abnormal communication data features and the number of coding bits of each abnormal communication data feature, and is calculated by the following formula:
$$n_{\mathrm{in}} = \sum_{i=1}^{M} b_i \quad (3)$$
where M is the number of features of the input abnormal communication data, $b_i$ is the number of coding bits of the i-th feature, and $n_{\mathrm{in}}$ is the number of neurons of the input layer; the abnormal communication data feature types comprise frequency abnormality, time delay abnormality, abnormal data packet frequency, abnormal data packet size, signal strength abnormality, abnormal protocol behavior and data integrity abnormality.
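As a small worked illustration, and assuming that the input layer size is the plain sum of the coding bits of the M features, the count can be computed as follows. The function name and the example bit counts are hypothetical.

```python
def input_layer_size(coding_bits):
    """Sum the coding bits of all M features to get the input neuron count."""
    if any(b <= 0 for b in coding_bits):
        raise ValueError("every feature needs at least one coding bit")
    return sum(coding_bits)


# Hypothetical example: three features encoded with 3, 2 and 4 bits.
example_size = input_layer_size([3, 2, 4])
```

Under this assumption, adding a category to any one-hot encoded feature increases the input layer by exactly one neuron.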
And inputting the encoded data in the abnormal data set into a convolutional neural network detection model M1 for training.
The convolutional neural network model M1 comprises 12 convolutional layers and 8 pooling layers; the convolutional layers adopt 3 small convolution kernels of size 3×3, and the 8 pooling layers adopt maximum pooling;
respectively adding residual error connection modules into a 3 rd convolution layer, a 4 th convolution layer and a 5 th convolution layer in the model M1, wherein one residual error connection module consists of two convolution layers, and batch normalization and activation functions are contained between the convolution layers; adding the output and the input of the 3 rd convolution layer of the model M1, transmitting the added result to the 4 th convolution layer as input, applying the residual error connection module again, adding the output and the input of the 4 th convolution layer, transmitting the added result to the 5 th convolution layer, continuously applying the residual error connection module, and obtaining a convolution neural network model M2;
inputting the encoded data in the abnormal data set into the convolutional neural network model M2, and performing convolutional, residual error connection and pooling operation;
the middle layer is 1 layer, and the neuron number of the middle layer is
The number of neurons of the output layer is equal to the number of required classes; the one_hot encoding method is used to encode the required class states, and the log function is selected as the function of the output neurons.
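The residual connection module described above (two convolutions with normalization and an activation function between them, whose output is added to the module input) can be sketched in one dimension with plain Python. This is a simplification under stated assumptions: the model in the text uses 3×3 two-dimensional kernels, whereas the sketch uses odd-length one-dimensional kernels; per-vector normalization stands in for batch normalization; ReLU is assumed as the activation function; all kernel values are hypothetical.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding so the output keeps the input length."""
    pad = len(kernel) // 2  # assumes an odd kernel length
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance scaling of one vector (stand-in for batch normalization)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


def residual_block(x, kernel1, kernel2):
    """conv -> normalize -> ReLU -> conv, then add the block input (skip connection)."""
    h = conv1d_same(x, kernel1)
    h = relu(normalize(h))
    h = conv1d_same(h, kernel2)
    return [a + b for a, b in zip(x, h)]
```

Because the input is added back to the output, a block whose second kernel is all zeros passes its input through unchanged, which is the property that makes stacked residual modules easier to train.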
Calculating a local anomaly factor value by comparing the average density of each data point with that of its nearest neighbor points, the data points whose local anomaly factor value is less than a certain threshold being abnormal points;
periodically updating the average density of the neighborhood set, and updating the local anomaly factor value of the data points in the neighborhood according to the updated average density;
processing the abnormal data set in real time by using an enhanced flow type abnormal detection algorithm and a trained convolutional neural network model, dynamically adjusting according to the environment and the data distribution of the data set change, and automatically identifying the abnormal type of the abnormal data;
identifying the abnormal data set by using an enhanced flow type abnormal detection algorithm and a trained convolutional neural network model, wherein the calculation process of local abnormal factors in the enhanced flow type abnormal detection algorithm is as follows:
for each abnormal data point $p$ in the real-time communication data set:
calculating the k nearest neighbors of $p$ and obtaining its neighborhood set $N_k(p)$;
calculating the k reachable distance from data point $p$ to data point $q$, $\mathrm{reach\text{-}dist}_k(p, q) = \max\{\,k\text{-}\mathrm{dist}(q),\ d(p, q)\,\}$, which is the maximum of the k-neighbor distance of data point $q$ and the Euclidean distance $d(p, q)$ between points $p$ and $q$;
according to the set distance threshold $\delta$, defining the data points whose distance is smaller than or equal to the threshold as the data points in the neighborhood of the target point; the neighborhood of the target point $p$ is expressed as:
$$N_\delta(p) = \{\, q \in D \mid d(p, q) \le \delta \,\}$$
wherein $N_\delta(p)$ represents the set of data points within the neighborhood of the target point $p$; $D$ is the data set; $d(p, q)$ represents the Euclidean distance between data points $p$ and $q$; $d(p, q) \le \delta$ indicates that a data point $q$ whose distance is less than or equal to the set threshold $\delta$ belongs to the neighborhood of the target point $p$; different k reachable distances correspond to different distance thresholds, and the threshold $\delta$ is dynamically adjusted according to the k reachable distance; the k reachable distance from data point $p$ to data point $q$ is compared with the threshold $\delta$; if the k reachable distance of a data point is smaller than the threshold $\delta$, the data point is marked as an anomaly of the same type; if a number of consecutive abnormal points occur, the threshold $\delta$ is adjusted to improve accuracy; if no or few abnormal points occur, the threshold $\delta$ is adjusted to increase sensitivity;
calculating the average density of the neighborhood set $N_k(p)$ of $p$, expressed by formula (4):
(4);
calculating the local anomaly factor of $p$ by formula (5):
(5);
wherein M is the number of data points; by dynamically adjusting the threshold $\delta$, abnormal data points in more neighboring regions are detected.
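The local anomaly factor calculation can be illustrated with a minimal sketch of the standard local outlier factor (LOF). Formulas (4) and (5) are not reproduced in this text, so the sketch assumes the commonly used definitions: the local reachability density is the inverse of the mean k reachable distance, and the factor is the ratio of the neighbors' mean density to the point's own density. Under that convention, values well above 1 indicate outliers; the threshold direction stated in the method may follow a different convention. The dynamic adjustment of the distance threshold is not modelled.

```python
from math import dist


def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding the point itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]


def lof_scores(points, k):
    """Standard LOF for every point; assumes points are distinct (no zero distances)."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    neigh = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [dist(points[i], points[neigh[i][-1]]) for i in range(n)]

    def reach(i, j):
        # k reachable distance from point i to point j.
        return max(k_dist[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachable distance.
    lrd = [k / sum(reach(i, j) for j in neigh[i]) for i in range(n)]
    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neigh[i]) / (k * lrd[i]) for i in range(n)]
```

For example, for four points on a unit square plus one distant point, the four clustered points receive a factor of 1 and the distant point receives a much larger factor.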
Outputting results and early warning, and detecting the abnormal data set in real time by a self-adaptive method; judging whether the data sample is abnormal or not according to the threshold value, performing corresponding early warning processing, and feeding early warning information back to an early warning system;
The self-adaptive method tracks the change of the data in real time, dynamically adjusts the model and the parameters, can be more suitable for different types and distribution of data, provides more accurate abnormality detection and early warning results, and ensures that the early warning accuracy reaches more than 90 percent.
The self-adaptive method enables the anomaly detection model to adjust automatically as the data change and to adapt to new data distributions and pattern changes; the parameter update in the adaptive method is given by formula (6):
(6);
wherein the quantities in formula (6) are the updated value of the parameter, the adaptive learning rate, the adaptive gradient adjustment factor, the first moment of the gradient, i.e. the mean, the second moment of the gradient, i.e. the variance, the smoothing term, and the current gradient;
the learning rate is adjusted according to the self-adaptive gradient adjustment factor, and the parameters of the model are updated by using the adjusted learning rate; when the adaptive gradient adjustment factor becomes large, the learning rate becomes small; when the self-adaptive gradient adjustment factor becomes smaller, the learning rate becomes larger; the method ensures that the learning rate is automatically adapted to different gradient conditions in the training process, thereby achieving better optimization effect; the adaptive gradient adjustment factor affects not only the absolute magnitude of the learning rate, but also the rate of change of the learning rate; the change amplitude of the learning rate in the training process is adjusted by controlling the change rate of the self-adaptive gradient adjustment factor; an adaptive gradient adjustment factor greater than a certain threshold may result in a slower change in learning rate, while an adaptive gradient adjustment factor less than a certain threshold may result in a faster change in learning rate.
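The quantities listed for the adaptive update (first and second moments of the gradient, a smoothing term, and a learning rate scaled by gradient statistics) match the widely used Adam optimizer. Since formula (6) is not reproduced in this text, the sketch below assumes the standard Adam update with bias correction; it does not model the adaptive gradient adjustment factor of the invention specifically, and the decay rates `beta1` and `beta2` and all default values are assumptions.

```python
from math import sqrt


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update of a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)               # bias correction, t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (sqrt(v_hat) + eps)  # eps is the smoothing term
    return theta, m, v


def minimize_quadratic(target=3.0, steps=200, lr=0.1):
    """Usage example: minimise (theta - target)^2 starting from theta = 0."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - target)
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta
```

In this update the effective step shrinks when the squared-gradient average grows and widens when it falls, which is the same qualitative behaviour the text attributes to the adaptive gradient adjustment factor.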
As an embodiment of the present invention, a mobile communication network operator in a certain area desires to find out problems and potential failure causes by analyzing abnormal data such as signal quality abnormality, call drop rate abnormality, call completion rate abnormality, data transmission rate abnormality, base station failure or abnormal state, and traffic abnormality, so as to take corresponding measures to improve network quality. Various anomaly data including signal quality, call drop rate, call completion rate, data transmission rate, base station status, and traffic data are first collected. The data is preprocessed, including data cleaning, outlier removal, normalization, etc. Exploratory analysis, including statistical description, visual analysis, etc., is then performed on each anomaly data. For example, a line graph or a bar graph is drawn to observe the time-series change condition of abnormal data and the correlation with other indexes. An anomaly detection algorithm (random forest algorithm) is used to identify outlier data points. For each outlier data point, the cause and influencing factors behind it are further analyzed, e.g. looking at the network equipment, base station location, weather conditions etc. related to the outlier. And establishing an abnormality detection model according to the existing data and the characteristics of the known abnormality. At the same time, features such as signal strength, network load, antenna direction, etc. are extracted based on existing knowledge. And determining specific reasons for the abnormality according to the result of the model and the feature importance analysis, and providing corresponding problem solutions. For example, if a base station failure rate in a certain area is found to be high, engineering maintenance or increased investment may be required to improve the stability of the base station apparatus. 
According to the proposed problem solution, corresponding improvement measures are implemented, and the changes in the data after the improvement are tracked. The improvement effect is monitored and evaluated; if the results are still unsatisfactory, the solution is further optimized or other potential anomalies are identified again.
As an embodiment of the invention, reference is made to fig. 1, which is a flow chart of the method according to the invention.
Acquiring a high-dimensional real-time communication data set of a user side, and carrying out marking classification on the high-dimensional real-time communication data set; removing users without abnormal communication behaviors, and retaining abnormal data samples;
preprocessing the classified high-dimensional real-time communication data set; the data preprocessing comprises automatic cleaning, multi-source data integration, abnormality detection and data restoration, unique attribute deletion, relevant attribute integration and missing value processing, and the acquired high-dimensional data is converted into a data set which does not contain missing data and only contains effective characteristics;
extracting features of the preprocessed high-dimensional real-time communication data set by using a bidirectional recurrent neural network (BiRNN) model, extracting features suitable for anomaly detection, removing data in which no anomaly can exist, and extracting the abnormal data;
normalizing the abnormal data, performing dimension reduction on the extracted high-dimension real-time communication data set features by using a Principal Component Analysis (PCA) method, and encoding the dimension reduction features by using a discrete encoding method to form encoded data;
Classifying by an anomaly detection algorithm of the enhanced weighted random forest algorithm; dividing the encoded data into a normal data set and an abnormal data set;
inputting the abnormal data into a convolutional neural network detection model for training;
processing the data flow in real time by using an enhanced flow type anomaly detection algorithm and a trained convolutional neural network model, dynamically adjusting according to the continuously changing data distribution, and automatically identifying the anomaly type of the anomaly data;
and outputting and early warning results, detecting and early warning the abnormal data set in real time through a self-adaptive method, and feeding early warning information back to an early warning system.
As an embodiment of the present invention, referring to fig. 2, a classification effect diagram of various form models is shown.
The abnormal communication data is processed using a time sequence construction method and then input into a classifier. After operation, the classification effect of the traditional model and various models is obtained. The results of the operation of each model are shown in fig. 2.
As can be seen from fig. 2, the recall of the time series exponential model in the form of the ratio and the first relative value is highest. The time series exponential model in differential form is best in terms of accuracy. It is also seen from fig. 2 that the recall and accuracy exhibit a one-time fluctuation law. The comparison of classification accuracy under different hidden layer structures is shown in table 1.
As seen from Table 1, hidden layer structures 1, 2, 3 and 4 all show good classification accuracy, reaching 91% or more. After 1000 iterations, the classification accuracy of the second stage reaches 99.22%, which is the maximum classification accuracy. Therefore, it is concluded that the convolutional neural network model with the 4-layer hidden layer structure has good classification accuracy.
Table 1 comparison of classification accuracy under different hidden layer structures
As can be seen from Table 2, the classification and identification error rate of the classifier designed by the invention is 27.51%, and the accuracy is 72.49%. The values and actual values in Table 2 are counts and have no physical unit. Experiments show that the main cause of the error rate is the delay of the communication anomaly detection result. According to the example analysis, the abnormal data monitoring and analyzing method based on data mining and a neural network has good effectiveness and accuracy and has a certain practical value.
Table 2 error rate of neural network classification
As an embodiment of the present invention, refer to fig. 3, a comparison graph of weight functions.
Assuming 5 decision trees, different weights are given to different samples according to the frequency of the samples in the data set or the importance of specific attributes. When the imbalances of the five decision trees are taken as 10, 20, 30, 40 and 50, respectively, and N is equal to 5, the voting weights of the decision trees are obtained from formula (1); when the imbalances are taken as 50, 40, 30, 20 and 10, respectively, and N is equal to 5, the voting weights are obtained in the same way.
As shown in FIG. 3, as the number of decision trees increases, the weight value decreases accordingly. As the imbalance of a decision tree decreases, the weight value increases. The enhanced weighted random forest algorithm weights the samples by introducing a weight function so as to improve the accuracy and robustness of the model. By introducing the weight function, the weighted random forest pays more attention to important or scarce samples and effectively handles unbalanced data sets. This improves the accuracy of the classification model on minority classes or important samples and reduces the risk of misclassification. By weighting the samples, the weighted random forest enables the model to cope more robustly with noise and outliers in the data. This helps to reduce the impact of these interference factors on the model and improves the stability of the model. By weighting the samples, the weighted random forest better captures the important features and patterns of the sample distribution, so that the generalization capability of the model is improved. This helps to reduce over-fitting and improves the predictive power of the model on unknown data.
As an embodiment of the present invention, reference is made to fig. 4, which is a comparison graph of local anomaly factors.
Assuming that for each data point, its 10 nearest neighbors neighborhood is determined, the reachable distance of each data point to each point in its neighborhood is calculated, and then the average of the inverse of the reachable distance is calculated. For each data point, the LOF value for each point in its neighborhood is calculated and then the average of these LOF values is taken as the LOF value for that data point. A threshold is set based on the calculated LOF value, and if the LOF value of the data point exceeds the threshold, it is determined as an outlier.
As shown in fig. 4, as the number of nearest neighbors increases, the LOF value decreases. As the achievable distance increases, the LOF value decreases. The LOF algorithm can effectively find and identify outliers in the flow anomaly detection. The LOF algorithm is able to capture local anomaly patterns, not just global anomaly patterns, by taking into account the density and outlier degree of the data points relative to their neighborhood. Because local neighborhood calculation is adopted, the LOF algorithm is sensitive to the change of data distribution, and a new abnormal mode in the data stream can be captured in time. When the LOF algorithm calculates the average density and the local anomaly factors, some optimization techniques such as approximate calculation and index structure are adopted, so that the calculation complexity is reduced, and the algorithm efficiency is improved.
As an embodiment of the present invention, referring to fig. 5, a comparison chart of adaptive learning rates of an adaptive method is shown.
Assume that the parameters of formula (6) are taken as 1, 0.3 to 0.7, 300 and 100, respectively.
As shown in FIG. 5, the update value increases as the smoothing term increases, and the update value decreases as the learning rate increases. Bias correction is performed through the adaptive gradient adjustment factor so as to adaptively adjust the learning rate, and the learning rate is divided by the square root of these quantities to perform the parameter update. This not only adjusts the learning rate adaptively but also adapts to the update amplitudes and dynamic ranges of different parameters, thereby improving the training effect and the training stability. The adaptive learning rate method can dynamically adjust the learning rate according to the current situation, thereby improving the efficiency and performance of the algorithm.
As an embodiment of the present invention, referring to fig. 6, a comparison chart of early warning accuracy of the adaptive method is shown.
As shown in FIG. 6, as the learning rate increases, the early warning accuracy increases and exceeds 90%; as the adaptive gradient adjustment factor increases, the early warning accuracy also improves. The adaptive learning rate dynamically adjusts the learning rate based on the performance of the model during training, so that the model can converge to an optimal solution more quickly. The adaptive gradient adjustment factor dynamically adjusts the update amplitude according to the gradient information, so that the model parameters are updated towards the optimal direction more quickly. A proper adaptive learning rate and gradient adjustment factor help the model converge better to the optimal solution and improve the accuracy and stability of the model parameters. When the model parameters are more accurate and stable, the early warning accuracy of the anomaly detection and prediction model improves correspondingly. The adaptive method dynamically adjusts the model and parameters according to the change and heterogeneity of the data and adapts better to different abnormal conditions. This dynamic adaptability improves the accuracy of anomaly detection and prediction, thereby improving the early warning accuracy.
In summary, the abnormal data monitoring and analyzing method based on data mining and neural network comprises the following steps:
acquiring a high-dimensional real-time communication data set of a user side, and carrying out marking classification on the high-dimensional real-time communication data set; removing users without abnormal communication behaviors, and retaining abnormal data samples;
preprocessing the classified high-dimensional real-time communication data set; the data preprocessing comprises automatic cleaning, multi-source data integration, abnormality detection and data restoration, unique attribute deletion, relevant attribute integration and missing value processing, and the acquired high-dimensional data is converted into a data set which does not contain missing data and only contains effective characteristics;
the data preprocessing specifically comprises the following steps:
unique attribute processing: a unique attribute is a feature that uniquely identifies a sample but has no influence on the identification of abnormal data; such attributes are directly deleted;
missing data processing: finding the positions of the missing values and filling the missing values; the index information of the missing values in the feature data is acquired quickly, and the missing values are processed by using positional logical indexing;
correlated attribute merging: for feature data with obvious correlation, the correlated data are combined into one item by a data operation, and non-correlated data are deleted; the data are combined by adding the two attribute columns feature 1 and feature 2, and the deletion is performed after the attributes are merged.
Automatic cleaning: automatically identifying and processing the problems of missing values, abnormal values and repeated values in the data by using an automatic algorithm and a tool through a semi-supervised learning algorithm and an abnormal detection algorithm; the method comprises the following specific steps:
loading data to be cleaned into an appropriate computing environment;
missing value processing: missing value detection: detecting missing values in the data using statistical indicators, visualization methods, or a clustering algorithm; missing value filling: filling the missing values by interpolation according to the characteristics and meaning of the data;
outlier processing: outlier detection: identifying outliers in the data using an outlier detection algorithm; outlier handling: processing the outliers by deletion, selected according to the nature of the outliers;
duplicate value processing: duplicate value detection: detecting duplicate records in the data using a unique identifier of the data or a combination of fields; duplicate value handling: retaining the first record of each group of duplicates, according to the service requirement;
verification and evaluation: verifying and evaluating the cleaned data, checking the cleaning effect, and comparing the cleaned data with the original data; the missing value ratio and the outlier ratio are used to evaluate the accuracy and integrity of the cleaning results.
Extracting features of the preprocessed high-dimensional real-time communication data set by using a bidirectional recurrent neural network (BiRNN) model, extracting features suitable for anomaly detection, removing data in which no anomaly can exist, and extracting the abnormal data;
normalizing the abnormal data, performing dimension reduction on the extracted characteristics of the high-dimension real-time communication data set by using a Principal Component Analysis (PCA) method, and encoding the dimension reduction characteristics by using a discrete encoding method to form encoded data;
the step of encoding the dimension reduction feature by using a discrete encoding method to form encoded data specifically comprises the following steps:
among the relevant attributes, one or more discrete attributes are grouped into one category, and each category is individually encoded using the one_hot encoding method.
Carrying out unified mark classification on the coded data;
classifying the encoded data into a normal data set and an abnormal data set by an enhanced weighted random forest algorithm;
assigning a weight to each abnormal data sample;
for unlabeled new samples, the trained model uses the learned parameters and weights thereof to conduct classification inference according to the learned rules, and the new samples are distributed to normal categories or abnormal categories; the parameters and weights of the model are learned from the labeled samples by an optimization algorithm in the training process.
The marking classification of the data means that the whole data set is divided into a normal data set and an abnormal data set; the encoded data are classified by the enhanced weighted random forest algorithm, and the weight function of the enhanced weighted random forest algorithm is given by formula (1);
wherein the quantities in formula (1) are the imbalance of each decision tree, the number of decision trees N, and the voting weight of each decision tree;
given N balanced sub-training sets, training is carried out on the N balanced sub-training sets to obtain N decision tree classifiers, indexed by a natural number from 1 to N;
the final classifier is obtained by weighted voting and is expressed by formula (2);
wherein Y is the abnormal data set;
the classification effect of the finally obtained classifier is tested on the test set.
The invention classifies through the abnormal detection algorithm of the reinforced weighted random forest algorithm, improves the accuracy of a classification model, better processes the unbalanced data problem and improves the stability and the reliability of the overall classification. The method helps to reduce dimensionality, redundancy characteristics and time consumption of model training and prediction, and is suitable for abnormal detection and classification tasks of large data volume and high-frequency data.
Inputting the abnormal data into a convolutional neural network detection model for training;
The training process of the convolutional neural network detection model comprises the following steps:
initializing parameters with random values;
inputting the coded data in the abnormal data set, and obtaining an output value through forward propagation of a convolution layer, a pooling layer and a full connection layer;
calculating training errors between the output value and the target value of each layer of the convolution layer, the pooling layer and the full-connection layer; the target value refers to the classified abnormal data set;
performing back propagation updating weight according to the training error;
when the training error does not change significantly within 100 iterations, the training process is terminated; if the termination condition is not met, the step of inputting the data is executed again;
and obtaining a trained convolutional neural network model.
The number of neurons in the input layer of the convolutional neural network is set according to the number of input abnormal communication data features and the number of coding bits of each feature, calculated as follows:
$N_{in} = \sum_{i=1}^{M} b_i$ ;(3)
where M is the number of features of the input abnormal communication data, $b_i$ is the number of coding bits of the i-th feature, and $N_{in}$ is the number of neurons of the input layer; the abnormal communication data feature types comprise frequency abnormality, time delay abnormality, abnormal data packet frequency, abnormal data packet size, signal strength abnormality, abnormal protocol behavior and data integrity abnormality.
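As a small worked example, assuming the input-layer size is the sum of the per-feature coding bit counts, the calculation looks like this. The bit counts are hypothetical (the patent gives none), and the dictionary keys are shorthand for the seven feature types listed above.

```python
# Hypothetical number of coding bits per abnormal-communication feature type.
coding_bits = {
    "frequency": 4,
    "time_delay": 4,
    "packet_frequency": 3,
    "packet_size": 3,
    "signal_strength": 4,
    "protocol_behavior": 2,
    "data_integrity": 2,
}


def input_layer_size(bits_per_feature):
    """Sum the coding bits over all features to get the input neuron count."""
    return sum(bits_per_feature.values())


n_features = len(coding_bits)            # M in the text
n_input = input_layer_size(coding_bits)  # number of input-layer neurons
```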
Inputting the abnormal data set into a convolutional neural network detection model M1 for training;
the convolutional neural network model M1 comprises 12 convolution layers and 8 pooling layers; the convolution layers use 3 small convolution kernels of size 3×3, and the 8 pooling layers use max pooling;
residual connection modules are added at the 3rd, 4th and 5th convolution layers of the model M1, where one residual connection module consists of two convolution layers with batch normalization and an activation function between them; the output of the 3rd convolution layer of the model M1 is added to its input, and the sum is passed to the 4th convolution layer as input; the residual connection module is applied again, the output of the 4th convolution layer is added to its input, and the sum is passed to the 5th convolution layer, where the residual connection module is applied once more, yielding the convolutional neural network model M2;
inputting the encoded data in the abnormal data set into the convolutional neural network model M2, and performing convolution, residual connection and pooling operations;
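The residual connection module described above (two convolution layers with batch normalization and an activation between them, output added to input) can be sketched in plain Python. This is a one-dimensional simplification with size-3 kernels rather than the 3×3 two-dimensional kernels of the patent; ReLU is an assumed activation, and all function names are illustrative.

```python
import math


def conv1d_same(x, kernel):
    """1-D convolution (cross-correlation) with zero padding; output length equals input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def batch_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, kernel_a, kernel_b):
    """Two convolutions with batch norm and activation between them, plus a skip connection."""
    h = conv1d_same(x, kernel_a)
    h = relu(batch_norm(h))
    h = conv1d_same(h, kernel_b)
    return [a + b for a, b in zip(x, h)]  # output of the block added to its input


signal = [1.0, 2.0, 3.0, 4.0]
out = residual_block(signal, [0.25, 0.5, 0.25], [0.0, 1.0, 0.0])
```

With all-zero kernels the block reduces to the identity, which is the property that lets residual connections ease the training of deep stacks.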
the middle layer is 1 layer, and the neuron number of the middle layer is
The number of neurons in the output layer equals the number of required classifications; the one_hot encoding method is used to encode the required states, and the log function is selected as the activation function of the output neurons.
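A minimal sketch of one_hot target encoding for the output layer follows. The class names are hypothetical labels loosely based on the anomaly categories mentioned in the patent; the patent does not define an explicit label list.

```python
# Hypothetical list of required classifications; its length fixes the output-layer size.
classes = ["signal_quality", "call_drop_rate", "transmission_rate", "traffic"]


def one_hot(label, class_list):
    """Return a vector with a single 1 at the position of `label`."""
    if label not in class_list:
        raise ValueError(f"unknown class: {label}")
    return [1 if name == label else 0 for name in class_list]


n_output = len(classes)             # one output neuron per class
target = one_hot("traffic", classes)
```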
Calculating a local anomaly factor value by comparing the average density of each data point with that of its neighboring points, data points whose local anomaly factor value is less than a certain threshold being anomaly points;
periodically updating the average density of the neighborhood set, and updating the local anomaly factor values of the data points in the neighborhood according to the updated average density;
processing the abnormal data set in real time by using the enhanced streaming anomaly detection algorithm and the trained convolutional neural network model, adjusting dynamically as the environment and the data distribution of the data set change, and automatically identifying the anomaly type of the abnormal data;
identifying the classified abnormal data by using the enhanced streaming anomaly detection algorithm and the trained convolutional neural network model, wherein the local anomaly factor in the enhanced streaming anomaly detection algorithm is calculated as follows:
for each abnormal data point $p$ in the real-time communication data set:
calculating the k nearest neighbors of $p$ to obtain the neighborhood set $N(p)$;
calculating the k reachable distance $\mathrm{reach\_dist}_k(p, o)$ from data point $p$ to data point $o$, which is the maximum of the k adjacent distance $d_k(o)$ of data point $o$ and the Euclidean distance $d(p, o)$ between points $p$ and $o$;
according to the set distance threshold $\varepsilon$, defining the data points whose distance is less than or equal to the threshold as the data points in the neighborhood of the target point; the neighborhood of the target point $p$ is expressed as:
$N(p) = \{\, q \in D \mid d(p, q) \le \varepsilon \,\}$
wherein $N(p)$ represents the set of data points within the neighborhood of the target point $p$; D is the data set; $d(p, q)$ represents the Euclidean distance between data points $p$ and $q$; $d(p, q) \le \varepsilon$ indicates that the distance is less than or equal to the set threshold $\varepsilon$, so that data point $q$ belongs to the neighborhood of the target point $p$; different k reachable distances correspond to different distance thresholds, and the threshold $\varepsilon$ is dynamically adjusted according to the k reachable distance; the k reachable distance from data point $p$ to data point $o$ is compared with the threshold $\varepsilon$; if the k reachable distance of a data point is smaller than the threshold $\varepsilon$, it is marked as an anomaly of the same type; if a run of consecutive outliers occurs, the threshold $\varepsilon$ is adjusted to improve accuracy; if no or few outliers occur, the threshold $\varepsilon$ is adjusted to increase sensitivity;
calculating the average density of the neighborhood set $N(p)$ of $p$, expressed as:
;(4)
calculating the local anomaly factor of $p$: ;(5)
wherein M is the number of data points; by dynamically adjusting the threshold $\varepsilon$, abnormal data points in more neighboring regions are detected.
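For orientation, the following is a minimal sketch of the standard local outlier factor (LOF) computation, which uses the same building blocks named above (k nearest neighbors, k adjacent distance, k reachable distance, neighborhood density). It is not the patent's formulas (4) and (5), which are not reproduced in this text, and it omits the dynamic threshold. In the standard definition, scores well above 1 indicate outliers, which may differ from the patent's own convention. All names and the sample points are illustrative assumptions.

```python
import math


def k_neighbors(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself, nearest first."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    return math.dist(points[i], points[k_neighbors(points, i, k)[-1]])


def reach_dist(points, i, j, k):
    """k reachable distance from point i to point j: max of j's k-distance and d(i, j)."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))


def local_reachability_density(points, i, k):
    """Inverse of the mean reachable distance from point i to its neighbors.

    Assumes no exact duplicate points, so the mean is never zero.
    """
    neighbors = k_neighbors(points, i, k)
    mean_reach = sum(reach_dist(points, i, j, k) for j in neighbors) / len(neighbors)
    return 1.0 / mean_reach


def local_outlier_factor(points, i, k):
    """Mean neighbor density divided by the point's own density."""
    neighbors = k_neighbors(points, i, k)
    own = local_reachability_density(points, i, k)
    neighbor_sum = sum(local_reachability_density(points, j, k) for j in neighbors)
    return neighbor_sum / (len(neighbors) * own)


# Four clustered points and one isolated point (hypothetical 2-D samples).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = [local_outlier_factor(points, i, k=2) for i in range(len(points))]
```

This brute-force version recomputes neighbor lists on every call, so it is only suitable for small samples; a streaming variant would cache densities and refresh them periodically, as the text describes.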
By identifying the classified abnormal data with the enhanced streaming anomaly detection algorithm and the trained model, the invention monitors and analyzes abnormal data in real time, detects newly appearing abnormal data promptly, processes real-time data streams efficiently, and issues early warnings and responses to abnormal data in time.
Outputting results and early warnings, and detecting the abnormal data set in real time by an adaptive method; judging whether a data sample is abnormal according to the threshold, performing the corresponding early warning processing, and feeding the early warning information back to an early warning system;
the adaptive method tracks changes in the data in real time and dynamically adjusts the model and its parameters, so that it adapts better to different data types and distributions, provides more accurate anomaly detection and early warning results, and brings the early warning accuracy to more than 90 percent.
The adaptive method enables the anomaly detection model to adjust automatically as the data changes and to adapt to new data distributions and pattern changes; the update formula for the model parameter in the adaptive method is as follows:
;(6)
wherein the quantities in the formula are the updated value of the parameter, the adaptive learning rate, the adaptive gradient adjustment factor, the first moment of the gradient (its mean), the second moment of the gradient (its variance), a smoothing term, and the current gradient;
the learning rate is adjusted according to the self-adaptive gradient adjustment factor, and the parameters of the model are updated by using the adjusted learning rate; when the adaptive gradient adjustment factor becomes large, the learning rate becomes small; when the self-adaptive gradient adjustment factor becomes smaller, the learning rate becomes larger; the adaptive gradient adjustment factor affects not only the absolute magnitude of the learning rate, but also the rate of change of the learning rate; the change amplitude of the learning rate in the training process is adjusted by controlling the change rate of the self-adaptive gradient adjustment factor; an adaptive gradient adjustment factor greater than a certain threshold may result in a slower change in learning rate, while an adaptive gradient adjustment factor less than a certain threshold may result in a faster change in learning rate.
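The quantities listed for the update rule (first and second moments of the gradient, a smoothing term, an adaptive learning rate) resemble an Adam-style optimizer, so the sketch below uses a standard Adam step as a stand-in. This is an assumption: formula (6) is not reproduced in this text. Dividing the learning rate by `factor` is likewise a hypothetical way to express the stated relation that a larger adjustment factor gives a smaller learning rate.

```python
import math


def adaptive_step(theta, grad, m, v, t, lr=0.01, factor=1.0,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style parameter update with a hypothetical adjustment factor."""
    m = beta1 * m + (1.0 - beta1) * grad         # first moment (running mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second moment (uncentered, running mean of squares)
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    effective_lr = lr / factor                   # larger factor -> smaller learning rate
    theta = theta - effective_lr * m_hat / (math.sqrt(v_hat) + eps)  # eps is the smoothing term
    return theta, m, v


def minimize(factor, steps=2000):
    """Minimize the toy loss (theta - 3)^2 starting from theta = 0."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - 3.0)
        theta, m, v = adaptive_step(theta, grad, m, v, t, factor=factor)
    return theta


theta_final = minimize(factor=1.0)
```

On the first step the update has magnitude close to `lr / factor` regardless of the gradient scale, which makes the effect of the factor easy to see in isolation.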
The invention detects abnormal data and issues early warnings by an adaptive method: the parameters and thresholds of the anomaly detection algorithm are adjusted flexibly according to the characteristics and change patterns of the abnormal data, which improves detection accuracy; the adjustment follows changes in the abnormal data and adapts to abnormal conditions in different time periods and under different data distributions, so that the anomaly detection system can respond to new anomaly patterns in real time and issue early warnings promptly.
The present disclosure may be embodied as a system and/or a computer program product. The computer program product includes a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (6)

1. The abnormal data monitoring and analyzing method based on the data mining and the neural network is characterized by comprising the following steps:
acquiring a high-dimensional real-time communication data set of a user side, and performing classification marking and preprocessing;
extracting features by using a bidirectional recurrent neural network (BiRNN) model, and reducing the dimension by principal component analysis;
encoding by a discrete encoding method to form encoded data;
classifying the encoded data into a normal data set and an abnormal data set by an enhanced weighted random forest algorithm;
inputting the abnormal data set into a convolutional neural network detection model for training;
calculating a local anomaly factor value by comparing the average density of each data point with that of its neighboring points, data points whose local anomaly factor value is less than a certain threshold being anomaly points;
periodically updating the average density of the neighborhood set, and updating the local anomaly factor values of the data points in the neighborhood according to the updated average density;
processing the abnormal data set in real time by using the enhanced streaming anomaly detection algorithm and the trained convolutional neural network model, adjusting dynamically as the environment and the data distribution of the data set change, and automatically identifying the anomaly type of the abnormal data;
outputting results and early warnings, and detecting the abnormal data set in real time by an adaptive method; judging whether a data sample is abnormal according to the threshold, performing the corresponding early warning processing, and feeding the early warning information back to an early warning system;
the adaptive method tracks changes in the data in real time and dynamically adjusts the model and its parameters, so that it adapts better to different data types and distributions, provides more accurate anomaly detection and early warning results, and brings the early warning accuracy to more than 90 percent;
classification is performed by an enhanced weighted random forest algorithm, comprising the steps of:
inputting the encoded data into an enhanced weighted random forest model for training;
performing anomaly detection and classification using a trained, enhanced weighted random forest model, separating the encoded data into a normal data set and an anomaly data set;
assigning a weight to each abnormal data sample;
for unlabeled new samples, the trained model uses the learned parameters and weights thereof to conduct classification inference according to the learned rules, and the new samples are distributed to normal categories or abnormal categories; the parameters and weights of the model are learned from the marked samples through an optimization algorithm in the training process;
the label classification of the data refers to dividing all data into a normal data set and an abnormal data set; the encoded data is classified by the enhanced weighted random forest algorithm, whose weight function is as follows:
wherein the quantities in the weight function are the imbalance degree of the decision tree, the number N of decision trees, and the voting weight of the decision tree;
given N balanced sub-training sets, training is carried out on them to obtain N decision tree classifiers, indexed by a natural number from 1 to N;
the final classifier is obtained by weighted voting and is expressed as follows:
wherein Y is the abnormal data set;
the finally obtained classifier is evaluated on the test set to measure its classification performance;
identifying the abnormal data set by using the enhanced streaming anomaly detection algorithm and the trained convolutional neural network model, wherein the local anomaly factor in the enhanced streaming anomaly detection algorithm is calculated as follows:
for each abnormal data point $p$ in the real-time communication data set:
calculating the k nearest neighbors of $p$ to obtain the neighborhood set $N(p)$;
calculating the k reachable distance $\mathrm{reach\_dist}_k(p, o)$ from data point $p$ to data point $o$, which is the maximum of the k adjacent distance $d_k(o)$ of data point $o$ and the Euclidean distance $d(p, o)$ between points $p$ and $o$;
according to the distance threshold $\varepsilon$, defining the data points whose distance is less than or equal to the threshold as the data points in the neighborhood of the target point; the neighborhood of the target point $p$ is expressed as:
$N(p) = \{\, q \in D \mid d(p, q) \le \varepsilon \,\}$
wherein $N(p)$ represents the set of data points within the neighborhood of the target point $p$; D is the real-time communication data set; $d(p, q)$ represents the Euclidean distance between data points $p$ and $q$; $d(p, q) \le \varepsilon$ indicates that the distance is less than or equal to the threshold $\varepsilon$, so that data point $q$ belongs to the neighborhood of the target point $p$; different k reachable distances correspond to different distance thresholds, and the threshold $\varepsilon$ is dynamically adjusted according to the k reachable distance; the k reachable distance from data point $p$ to data point $o$ is compared with the threshold $\varepsilon$; if the k reachable distance of an abnormal data point is smaller than the threshold $\varepsilon$, it is marked as an anomaly of the same type; if consecutive outliers occur, the threshold $\varepsilon$ is adjusted to improve accuracy; if no outlier occurs, the threshold $\varepsilon$ is adjusted to increase sensitivity;
calculating the average density of the neighborhood set $N(p)$ of $p$;
calculating the local anomaly factor of $p$;
wherein M is the number of data points;
the adaptive method enables the anomaly detection model to adjust automatically as the data changes and to adapt to new data distributions and pattern changes; the update formula for the model parameter in the adaptive method is as follows:
wherein the quantities in the formula are the updated value of the parameter, the adaptive learning rate, the adaptive gradient adjustment factor, the first moment of the gradient (its mean), the second moment of the gradient (its variance), a smoothing term, and the current gradient;
the learning rate is adjusted according to the self-adaptive gradient adjustment factor, and the parameters of the model are updated by using the adjusted learning rate; when the adaptive gradient adjustment factor becomes large, the learning rate becomes small; when the self-adaptive gradient adjustment factor becomes smaller, the learning rate becomes larger;
acquiring a high-dimensional real-time communication data set of a user side, and carrying out marking classification on the high-dimensional real-time communication data set;
the high-dimensional dataset includes the following 10 dimensional data: timestamp, sender and receiver identification, communication mode, communication duration, communication quality indicator, bandwidth usage, communication location, data traffic, network topology, and user behavior;
removing users without abnormal communication behaviors, and retaining abnormal data samples;
the abnormal data comprise abnormal signal quality, abnormal call drop rate, abnormal call completing rate, abnormal data transmission rate, base station fault or abnormal state and abnormal flow.
2. The method for monitoring and analyzing abnormal data based on data mining and neural network according to claim 1, wherein the step of encoding the dimension-reduction feature using a discrete encoding method to form encoded data specifically comprises:
among the relevant attributes, one or more discrete attributes are grouped into one category, and each category is encoded separately using the one_hot encoding method.
3. The method for monitoring and analyzing abnormal data based on data mining and neural network according to claim 1, wherein the training process of the convolutional neural network detection model comprises:
initializing parameters with random values;
inputting the coded data in the abnormal data set, and obtaining an output value through forward propagation of a convolution layer, a pooling layer and a full connection layer;
calculating the training error between the output value and the target value at each of the convolution, pooling and fully connected layers; the target value refers to the classified abnormal data set;
updating the weights by back propagation according to the training error;
terminating the training process when the training error does not change significantly within 100 iterations; if the termination condition is not met, returning to the data input step;
and obtaining a trained convolutional neural network model.
4. The abnormal data monitoring and analyzing method based on data mining and neural network according to claim 3, wherein the number of neurons in the input layer of the convolutional neural network is set according to the number of input abnormal communication data features and the number of coding bits of each abnormal communication data feature, and the calculating method is as follows:
$N_{in} = \sum_{i=1}^{M} b_i$
where M is the number of features of the input abnormal communication data, $b_i$ is the number of coding bits of the i-th feature, and $N_{in}$ is the number of neurons of the input layer; the types of the abnormal communication data features comprise frequency abnormality, time delay abnormality, abnormal data packet frequency, abnormal data packet size, signal strength abnormality, abnormal protocol behavior and data integrity abnormality.
5. The method for monitoring and analyzing abnormal data based on data mining and neural network according to claim 4, wherein the encoded data in the abnormal data set is inputted into a convolutional neural network detection model M1 for training,
the convolutional neural network model M1 comprises 12 convolution layers and 8 pooling layers; the convolution layers use 3 small convolution kernels of size 3×3, and the 8 pooling layers use max pooling;
residual connection modules are added at the 3rd, 4th and 5th convolution layers of the model M1, where one residual connection module consists of two convolution layers with batch normalization and an activation function between them; the output of the 3rd convolution layer of the model M1 is added to its input, and the sum is passed to the 4th convolution layer as input; the residual connection module is applied again, the output of the 4th convolution layer is added to its input, and the sum is passed to the 5th convolution layer, where the residual connection module is applied once more, yielding the convolutional neural network model M2;
inputting the encoded data in the abnormal data set into the convolutional neural network model M2, and performing convolution, residual connection and pooling operations;
the middle layer is 1 layer, and the neuron number of the middle layer is
6. The abnormal data monitoring and analyzing method based on data mining and neural network according to claim 3, wherein the number of neurons in the output layer equals the number of required classifications; the one_hot encoding method is used to encode the required states, and the log function is selected as the activation function of the output neurons.
CN202311718358.3A 2023-12-14 2023-12-14 Abnormal data monitoring and analyzing method based on data mining and neural network Active CN117421684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311718358.3A CN117421684B (en) 2023-12-14 2023-12-14 Abnormal data monitoring and analyzing method based on data mining and neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311718358.3A CN117421684B (en) 2023-12-14 2023-12-14 Abnormal data monitoring and analyzing method based on data mining and neural network

Publications (2)

Publication Number Publication Date
CN117421684A CN117421684A (en) 2024-01-19
CN117421684B true CN117421684B (en) 2024-03-12

Family

ID=89526888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311718358.3A Active CN117421684B (en) 2023-12-14 2023-12-14 Abnormal data monitoring and analyzing method based on data mining and neural network

Country Status (1)

Country Link
CN (1) CN117421684B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117849700B (en) * 2024-03-07 2024-05-24 南京国网电瑞电力科技有限责任公司 Modular electric energy metering system capable of controlling measurement

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961017A (en) * 2019-02-26 2019-07-02 杭州电子科技大学 A kind of cardiechema signals classification method based on convolution loop neural network
CN111143838A (en) * 2019-12-27 2020-05-12 北京科东电力控制系统有限责任公司 Database user abnormal behavior detection method
CN112001788A (en) * 2020-08-21 2020-11-27 东北大学 Credit card default fraud identification method based on RF-DBSCAN algorithm
CN113256066A (en) * 2021-04-23 2021-08-13 新疆大学 PCA-XGboost-IRF-based job shop real-time scheduling method
CN113378990A (en) * 2021-07-07 2021-09-10 西安电子科技大学 Traffic data anomaly detection method based on deep learning
CN114124482A (en) * 2021-11-09 2022-03-01 中国电子科技集团公司第三十研究所 Access flow abnormity detection method and device based on LOF and isolated forest
CN114859351A (en) * 2022-06-10 2022-08-05 重庆地质矿产研究院 Method for detecting surface deformation field abnormity based on neural network
CN115577275A (en) * 2022-11-11 2023-01-06 山东产业技术研究院智能计算研究院 Time sequence data anomaly monitoring system and method based on LOF and isolated forest
CN115964258A (en) * 2022-12-30 2023-04-14 天翼物联科技有限公司 Internet of things network card abnormal behavior grading monitoring method and system based on multi-time sequence analysis
CN116595465A (en) * 2023-04-10 2023-08-15 哈尔滨工程大学 High-dimensional sparse data outlier detection method and system based on self-encoder and data enhancement
CN116955926A (en) * 2023-07-03 2023-10-27 保定耘云信息技术咨询有限公司 Bank data analysis method based on deep learning
CN117216660A (en) * 2023-09-12 2023-12-12 杭州安恒信息技术股份有限公司 Method and device for detecting abnormal points and abnormal clusters based on time sequence network traffic integration

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112543465B (en) * 2019-09-23 2022-04-29 中兴通讯股份有限公司 Abnormity detection method, abnormity detection device, terminal and storage medium
DE102019135608A1 (en) * 2019-12-20 2021-06-24 Bayerische Motoren Werke Aktiengesellschaft Method, device and system for the detection of abnormal operating conditions of a device
EP3862927A1 (en) * 2020-02-05 2021-08-11 Another Brain Anomaly detector, method of anomaly detection and method of training an anomaly detector

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
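Online Anomaly Detection Leveraging Stream-Based Clustering and Real-Time Telemetry; Andrian Putina et al.; IEEE Transactions on Network and Service Management; 2021-03-31; Vol. 18 (No. 1); pp. 839-854 *
Research on fault detection and diagnosis of refrigeration systems based on LOF-RF; 熊坤; China Master's Theses Full-text Database, Engineering Science and Technology II; 2022-03-15 (No. 3); pp. C028-662 *
Research on an electricity theft detection method based on an improved LOF algorithm; 殷锋 et al.; Journal of South-Central Minzu University (Natural Science Edition); 2022-09-30; Vol. 41 (No. 5); pp. 579-585 *
Research and application of gas usage behavior anomaly detection based on transfer learning; 刘可立; China Master's Theses Full-text Database, Engineering Science and Technology I; 2023-02-15 (No. 2); pp. B017-327 *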

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant