CN113391976A - Distributed data node abnormal behavior detection method, system and storage medium - Google Patents

Distributed data node abnormal behavior detection method, system and storage medium

Info

Publication number
CN113391976A
Authority
CN
China
Prior art keywords
data
node
layer
data set
slave node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110660410.9A
Other languages
Chinese (zh)
Inventor
马樱
吴桐雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University of Technology
Original Assignee
Xiamen University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University of Technology filed Critical Xiamen University of Technology
Priority to CN202110660410.9A priority Critical patent/CN113391976A/en
Publication of CN113391976A publication Critical patent/CN113391976A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G06F11/3082 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting the data filtering being achieved by aggregating or compressing the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a distributed data node abnormal behavior detection method, applied to a gateway data abnormal behavior detection system, that improves the security of internet data. The method comprises the following steps: a first slave node collects a real-time data set from the gateway; a second slave node preprocesses the real-time data set; a main detection node detects the data in the real-time data set; a block node stores the data set. The system comprises a data acquisition layer, a data preprocessing layer, a data analysis layer and a data storage layer; the data acquisition layer is in data connection with the data preprocessing layer, the data preprocessing layer is in data connection with the data analysis layer, and the data analysis layer is in data connection with the data storage layer.

Description

Distributed data node abnormal behavior detection method, system and storage medium
Technical Field
The invention relates to the technical field of network security, in particular to a distributed data node abnormal behavior detection method, a distributed data node abnormal behavior detection system and a storage medium.
Background
The growth of the internet has made data security an increasingly prominent problem, so a data detection system suited to large data volumes needs to be established. With data as the focus of detection, detecting abnormal behavior amounts to finding abnormal data within the data; combining detection technology with data mining technology can effectively reduce the false alarm rate and improve detection efficiency.
For example, prior art CN105591836A discloses a data flow detection method and apparatus that filters on the rule the data flow finally matches, but it can only detect one type of data flow.
Another typical prior art, CN109542772A, discloses an anomaly detection method based on data flow analysis. It targets a new BPEL software paradigm and takes into account language characteristics that conventional software does not have, but the technique can only identify a single category of data.
Prior art CN101459554A discloses a method and apparatus for detecting data streams that combines main stream characteristic information with auxiliary stream characteristic information, thereby detecting data streams that carry no obvious characteristic information, but the total amount of data stream is excessively large.
The invention aims to solve problems common in this field: traditional network anomaly detection is limited by data storage and processing capacity, its accuracy is low, and its false alarm rate is high.
Disclosure of Invention
The invention aims to improve the detection capability and speed of a network anomaly detection model. To address the shortcomings of traditional network anomaly detection, namely that it is limited by data storage and processing capacity and has low accuracy and a high false alarm rate, the invention provides a method, a system and a storage medium for detecting abnormal behavior of distributed data nodes.
In order to overcome the defects of the prior art, the invention adopts the following technical scheme:
a distributed data node abnormal behavior detection method, applied to a distributed data node abnormal behavior detection system, characterized by comprising:
the first slave node is responsible for collecting a real-time data set from the gateway;
the second slave node preprocesses the real-time data set;
the main detection node is responsible for detecting the data in the real-time data set;
the block node stores the data set.
Optionally, the first slave node formats the real-time data set, the processed real-time data set is stored in a temporary storage area, and the first slave node sends the real-time data set stored in the temporary storage area to a second slave node.
Optionally, the second slave node is responsible for receiving the formatted data set of the first slave node, and the data set is stored in the second slave node in a Topic queue manner.
Optionally, the second slave node performs a preprocessing operation on the data set, where a data feature set is built in the second slave node; the preprocessing operation carries out feature analysis on data in the formatted data set, if the data features of the data belong to the data feature set, the data are divided into a first data set, if the data features in the formatted data set do not belong to the data feature set, the data are divided into a second data set, and the second slave node respectively sends the first data set and the second data set to a main detection node.
Optionally, the main detection node performs data detection on the data set. A cluster analysis model is built into the main detection node and is responsible for detection analysis of the first data set. If the first data set does not meet the cluster analysis requirement, the first data set is considered to exhibit abnormal behavior; the main detection node then sends a deletion message to the first slave node, and after the first slave node receives the message, the temporary storage area deletes the data stream to which the first data set belongs. If the first data set meets the cluster analysis requirement, it is sent to the block node.
Optionally, a computing engine is built into the main detection node; after the second data set is encapsulated as a hash table, the computing engine performs computation on it.
Optionally, the block node is responsible for data storage and data reading operations on the first data set; once the block node has stored the first data set, the block node sends the deletion message to the first slave node, and the temporary storage area deletes the stored information to which the first data set belongs.
Optionally, the main detection node reads the first data set stored in the block node, and the main detection node performs encryption analysis on the first data set.
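The routing of the first data set described above can be sketched as follows. This is only an illustrative model under stated assumptions: the class names `StagingArea` and `BlockNode`, the function `handle_first_data_set`, and the predicate standing in for the "cluster analysis requirement" are all hypothetical, since the text does not specify the clustering criterion or any message format.

```python
from dataclasses import dataclass, field


@dataclass
class StagingArea:
    """Hypothetical model of the temporary storage area on the first slave node."""
    streams: dict = field(default_factory=dict)

    def delete(self, stream_id):
        # Handling of a deletion message: drop the staged copy if present.
        self.streams.pop(stream_id, None)


@dataclass
class BlockNode:
    """Hypothetical model of the block node."""
    stored: dict = field(default_factory=dict)

    def store(self, stream_id, data, staging):
        self.stored[stream_id] = data
        # After storing, the block node asks the first slave node to delete its copy.
        staging.delete(stream_id)


def handle_first_data_set(stream_id, data, meets_requirement, staging, block_node):
    """Route one first data set according to the cluster-analysis verdict."""
    if meets_requirement(data):
        block_node.store(stream_id, data, staging)
        return "stored"
    # Abnormal behavior: the staged data stream is deleted and nothing is stored.
    staging.delete(stream_id)
    return "abnormal"


# Usage with a made-up requirement (value spread below 1.0), purely for illustration.
staging = StagingArea({"s1": [1.0, 1.2], "s2": [1.0, 9.0]})
block = BlockNode()
tight = lambda data: max(data) - min(data) < 1.0
results = {
    sid: handle_first_data_set(sid, data, tight, staging, block)
    for sid, data in list(staging.streams.items())
}
```

In both branches the staged copy ends up deleted; the branches differ only in whether the block node keeps the data.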
Optionally, the system is configured to process abnormal behavior of a data stream and to perform abnormal behavior detection on network data; the system comprises a data acquisition layer, a data preprocessing layer, a data analysis layer and a data storage layer; the data acquisition layer is in data connection with the data preprocessing layer, the data preprocessing layer is in data connection with the data analysis layer, and the data analysis layer is in data connection with the data storage layer.
Optionally, the data collection layer is configured to collect, by the first slave node, the gateway real-time data set.
Optionally, the data preprocessing layer is configured to perform preprocessing operation on the data set by the second slave node.
Optionally, the data analysis layer is configured to perform the cluster analysis operation on the main detection node.
Optionally, the data storage layer is used for performing data storage on the block node and performing read-write operation on the data block.
Optionally, a computer readable storage medium, on which a distributed data node abnormal behavior detection program is stored, the program being executable by one or more processors to implement the distributed data node abnormal behavior detection method according to any one of claims 1 to 5.
The beneficial effects obtained by the invention are as follows:
1. The data analysis layer performs cluster analysis on data streams with and without abnormal points; a new detection model is established during the cluster analysis, and an algorithm for quickly removing isolated points is designed to improve detection efficiency;
2. By adopting a real-time analysis technology, abnormal data behavior can be detected accurately and quickly even when large volumes of data are being detected;
3. The characteristic values of the data stream are stored using a blockchain storage technology, which increases the characteristic value storage capacity of the data detection system and improves the detection capability of the system;
4. By adopting two data classification methods, two types of data streams can be detected, which improves detection diversity.
Drawings
The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Like reference numerals designate corresponding parts throughout the different views.
Fig. 1 is a flowchart of a distributed data node abnormal behavior detection method according to the present invention.
Fig. 2 is a schematic structural diagram of a data acquisition layer in the distributed data node abnormal behavior detection system according to the present invention.
Fig. 3 is a schematic structural diagram of a data preprocessing layer in the distributed data node abnormal behavior detection system according to the present invention.
Fig. 4 is a schematic structural diagram of a distributed data node abnormal behavior detection system according to the present invention.
Detailed Description
In order to make the objects and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the embodiments thereof; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Other systems, methods, and/or features of the present embodiments will become apparent to one with skill in the art upon examination of the following detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims. Additional features of the disclosed embodiments are described in, and will be apparent from, the detailed description that follows.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by the terms "upper" and "lower" and "left" and "right" etc., which is based on the orientation or positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but it is not intended to indicate or imply that the device or assembly referred to must have a specific orientation.
The first embodiment is as follows:
a distributed data node abnormal behavior detection method is applied to a distributed data node abnormal behavior detection system; it is characterized by comprising:
the first slave node is responsible for collecting a real-time data set from the gateway;
the second slave node preprocesses the real-time data set;
the main detection node is responsible for detecting the data in the real-time data set;
the block node stores the data set;
the distributed data node abnormal behavior detection method is applied to the distributed data node abnormal behavior detection system, wherein the system comprises a data acquisition layer, a data preprocessing layer, a data analysis layer and a data storage layer; the data acquisition layer is in data connection with the data preprocessing layer, the data preprocessing layer is in data connection with the data analysis layer, and the data analysis layer is in data connection with the data storage layer; the layers are independent of one another, data formats are converted to a standard form between layers, and data calls between layers use standardized interfaces;
the step that the first slave node is responsible for collecting the real-time data set from the gateway comprises the following steps:
the first slave node formats the real-time data set, the processed real-time data set is stored in a temporary storage area, and the first slave node sends the real-time data set stored in the temporary storage area to a second slave node;
the data acquisition layer is used for the first slave node to acquire the gateway real-time data set; the data acquisition step comprises the following specific implementation steps:
the data acquisition layer uses the pcap_loop() function to parse a Jpacket object from the gateway, obtains the original data information of the data packet from the Jpacket object, stores the captured data packet, and generates a basic feature object containing the data packet; it then judges whether the data features of the data packet contain the basic features, namely source/destination IP address, service and protocol; if the data packet does not contain the basic features, the data acquisition layer filters the data packet out; if the data packet contains the basic features, the data acquisition layer packs the data packet into a .cdv file and sends the .cdv file to the data preprocessing layer.
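The basic-feature check in the acquisition step can be sketched as below. This is a minimal illustration, not the capture code itself: packets are modeled as plain dictionaries, the field names in `REQUIRED_FEATURES` are hypothetical, and packing into a .cdv file is omitted because the text does not describe that file's layout.

```python
# Hypothetical field names for the basic features named in the text:
# source/destination IP address, service and protocol.
REQUIRED_FEATURES = ("src_ip", "dst_ip", "service", "protocol")


def has_basic_features(packet):
    """Return True when every basic feature is present and non-empty."""
    return all(packet.get(name) not in (None, "") for name in REQUIRED_FEATURES)


def filter_packets(packets):
    """Keep only packets that carry all basic features; the rest are filtered out."""
    return [p for p in packets if has_basic_features(p)]


captured = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "service": "http", "protocol": "tcp"},
    {"src_ip": "10.0.0.3", "dst_ip": "", "service": "dns", "protocol": "udp"},
    {"src_ip": "10.0.0.4", "service": "ftp", "protocol": "tcp"},
]
kept = filter_packets(captured)
```

Only the packets in `kept` would go on to be packed and forwarded to the next layer.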
Example two: the present embodiment should be understood to include at least all the features of any one of the foregoing embodiments, and further improve on the foregoing features, in particular, to provide a distributed data node abnormal behavior detection system, and to apply the distributed data node abnormal behavior detection method to the system;
the distributed data node abnormal behavior detection method is characterized by comprising the following steps:
the first slave node is responsible for collecting a real-time data set from the gateway;
the second slave node preprocesses the real-time data set;
the main detection node is responsible for detecting the data in the real-time data set;
the block node stores the data set;
the distributed data node abnormal behavior detection method is applied to the distributed data node abnormal behavior detection system, wherein the system comprises a data acquisition layer, a data preprocessing layer, a data analysis layer and a data storage layer; the data acquisition layer is in data connection with the data preprocessing layer, the data preprocessing layer is in data connection with the data analysis layer, and the data analysis layer is in data connection with the data storage layer; the layers are independent of one another, data formats are converted to a standard form between layers, and data calls between layers use standardized interfaces;
the step that the first slave node is responsible for collecting the real-time data set from the gateway comprises the following steps:
the first slave node formats the real-time data set, the processed real-time data set is stored in a temporary storage area, and the first slave node sends the real-time data set stored in the temporary storage area to a second slave node;
the data acquisition layer is used for the first slave node to acquire the gateway real-time data set; the data acquisition step comprises the following specific implementation steps:
the data acquisition layer uses the pcap_loop() function to parse a Jpacket object from the gateway, obtains the original data information of the data packet from the Jpacket object, stores the captured data packet, and generates a basic feature object containing the data packet; it then judges whether the data features of the data packet contain the basic features, namely source/destination IP address, service and protocol; if the data packet does not contain the basic features, the data acquisition layer filters the data packet out; if the data packet contains the basic features, the data acquisition layer packs the data packet into a .cdv file and sends the .cdv file to the data preprocessing layer;
the data preprocessing layer receives the .cdv file and is used by the second slave node to carry out the data preprocessing; the second slave node is responsible for receiving the formatted data set from the first slave node, and the data set is stored in the second slave node as a Topic queue; the second slave node performs a preprocessing operation on the data set, wherein a data feature set is built into the second slave node; the preprocessing operation performs feature analysis on the data in the formatted data set: if the data features of the data belong to the data feature set, the data is assigned to a first data set; if they do not belong to the data feature set, the data is assigned to a second data set; the second slave node sends the first data set and the second data set separately to the main detection node;
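The split into a first and a second data set can be sketched as follows. The sketch rests on assumptions: a `collections.deque` stands in for the Topic queue, each record carries a single `feature` tag, and the contents of `FEATURE_SET` are hypothetical placeholders because the text does not enumerate the built-in data feature set.

```python
from collections import deque

# Hypothetical contents of the built-in data feature set.
FEATURE_SET = {"discrete", "continuous", "high_dimensional"}


def partition(queue, feature_set):
    """Drain the queue in arrival order and split records by feature-set membership."""
    first_data_set, second_data_set = [], []
    while queue:
        record = queue.popleft()
        if record["feature"] in feature_set:
            first_data_set.append(record)
        else:
            second_data_set.append(record)
    return first_data_set, second_data_set


topic_queue = deque([
    {"id": 1, "feature": "continuous"},
    {"id": 2, "feature": "unlabeled"},
    {"id": 3, "feature": "discrete"},
])
first, second = partition(topic_queue, FEATURE_SET)
```

Both resulting lists would then be sent, separately, to the main detection node.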
and a Euclidean space is built in the data preprocessing layer, wherein the data characteristics contained in the data characteristic set comprise: the data preprocessing layer divides the data into the first data set if the data features of the data belong to the data feature set, and the preprocessing steps of the data preprocessing layer on the first data set are as follows:
1. digitizing the discrete data; the data in the .cdv file is extracted, and the data preprocessing layer performs feature recognition on the data according to the data features; the discrete data types comprise back, land and nmap; the discrete attributes of the discrete data are projected into the Euclidean space, where each discrete attribute has its own spatial position; the Euclidean distance between the spatial positions of the data in the Euclidean space is then calculated; the discrete attributes are encoded using binary coding;
2. standardizing and normalizing the continuous data; the continuous feature data in the .cdv file is standardized and normalized, with part of the continuous feature data selected as sample data; the processing steps are as follows:
standardizing the continuous attributes according to formula (1);
$B = (v_i - r) / \sigma$ (1)
wherein B represents the standardized value of a continuous attribute of the continuous data; $\sigma$ represents the standard deviation of the sample data, calculated by formula (2); $v_i$ represents the i-th attribute of the continuous data, where i is the attribute serial number and takes any positive integer, and the number of attributes is chosen as 10; r represents the average value of the attributes, obtained by formula (3);
[Formula (2): standard deviation $\sigma$ of the sample data; source image BDA0003114958850000091]
wherein 10 represents the number of said attributes;
$r = \frac{1}{10} \sum_{i=1}^{10} v_i$ (3)
wherein 10 represents the number of said attributes;
normalizing the numerical values; the data standardized in the preceding step is normalized by mapping its value range into the [0,1] interval, calculated by formula (4);
$C'_i = 1 + (C_i - C_{\max}) / (C_{\max} - C_{\min})$ (4)
wherein $C'_i$ is the normalized value of $C_i$ and lies in [0,1]; $C_{\min}$ is the minimum value of $C_i$ and $C_{\max}$ is the maximum value of $C_i$; $C_i$ is the standardized data value;
3. performing principal component analysis for dimensionality reduction; the data with high-dimensional data features in the .cdv file is reduced in dimension, with the following specific steps:
1) combining the high-dimensional feature data column by column into a matrix 1 with n rows and m columns, wherein each row of the matrix 1 represents an attribute field;
2) zero-averaging each row of the matrix 1, that is, subtracting the row's average value from every element in the row;
3) solving the covariance matrix of the matrix 1;
4) solving the eigenvalues of the covariance matrix and the eigenvector corresponding to each eigenvalue;
5) arranging the eigenvectors as rows, from top to bottom in descending order of their corresponding eigenvalues, to form a matrix 2, and taking the first k rows to form a matrix 3, wherein k is any positive integer not exceeding the number of rows and the number of columns of the matrix 2;
6) multiplying the matrix 3 by the matrix 1 to obtain the low-dimensional data, that is, the dimension-reduced form of the data with high-dimensional data features in the .cdv file;
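Step 2 of the list above (standardization, then normalization into [0,1]) can be sketched as follows. Two assumptions are made and should be read as such: the standard deviation uses the population form (division by the number of attributes), which the text does not confirm, and formula (4) is taken in the form C'_i = 1 + (C_i - C_max) / (C_max - C_min) as written out in the third embodiment. The function names are illustrative.

```python
import math


def standardize(values):
    """Formula (1): B = (v_i - r) / sigma for each attribute value."""
    n = len(values)
    r = sum(values) / n  # average value of the attributes
    # Assumption: population standard deviation (divide by n).
    sigma = math.sqrt(sum((v - r) ** 2 for v in values) / n)
    if sigma == 0:
        raise ValueError("all values are equal; standard deviation is zero")
    return [(v - r) / sigma for v in values]


def normalize(values):
    """Formula (4): C'_i = 1 + (C_i - C_max) / (C_max - C_min), mapping into [0, 1]."""
    c_min, c_max = min(values), max(values)
    if c_max == c_min:
        raise ValueError("all values are equal; value range is zero")
    return [1 + (c - c_max) / (c_max - c_min) for c in values]


sample = [2.0, 4.0, 6.0]
standardized = standardize(sample)
normalized = normalize(standardized)
```

Formula (4) is algebraically the same as the usual min-max form (C_i - C_min) / (C_max - C_min), so the smallest value maps to 0 and the largest to 1.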
if the data feature of the data does not belong to any feature element in the data feature set, the data is divided into a second data set by the data preprocessing layer, and the preprocessing step of the second data set by the data preprocessing layer is as follows:
if the data features of the second data set belong to one of five category labels, namely Normal, Dos, Probe, U2R and R2L, the category labels are represented by the positive integers 1 to 5 respectively; the remaining data, which does not belong to any of the five category labels, is represented by 0;
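The label coding for the second data set amounts to a lookup table. The sketch below assumes the integers 1 to 5 are assigned in the order the labels are listed; the names `CATEGORY_CODES` and `encode_label` are illustrative.

```python
# Assumption: codes follow the listed order Normal, Dos, Probe, U2R, R2L.
CATEGORY_CODES = {"Normal": 1, "Dos": 2, "Probe": 3, "U2R": 4, "R2L": 5}


def encode_label(label):
    """Map a category label to its integer code; anything else becomes 0."""
    return CATEGORY_CODES.get(label, 0)


codes = [encode_label(label) for label in ["Normal", "Probe", "R2L", "unknown"]]
```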
the data preprocessing layer vectorizes the data features of the preprocessed first data set and second data set; the vectorization is implemented through the VectorAssembler class;
and after completing the vectorization of the data characteristics, the data preprocessing layer sends the first data set and the second data set to the data analysis layer.
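Step 1 of the first data set preprocessing (binary coding of discrete attributes, then a Euclidean distance between their positions) can be sketched as follows. The text does not say which binary coding is used; this sketch assumes one-hot 0/1 vectors, which is only one possible reading, and the function name `one_hot` is illustrative.

```python
import math

# Discrete data types named in the text.
DISCRETE_TYPES = ["back", "land", "nmap"]


def one_hot(value, categories):
    """Place a discrete value at its own position in Euclidean space (assumed one-hot coding)."""
    if value not in categories:
        raise ValueError(f"unknown discrete value: {value!r}")
    return [1 if value == c else 0 for c in categories]


back_vec = one_hot("back", DISCRETE_TYPES)
nmap_vec = one_hot("nmap", DISCRETE_TYPES)
# Euclidean distance between the two positions (math.dist needs Python 3.8+).
distance = math.dist(back_vec, nmap_vec)
```

With one-hot coding every pair of distinct types is equally far apart, so no artificial ordering is imposed on the categories; a different binary coding would give different distances.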
Example three: the present embodiment should be understood to include at least all the features of any one of the foregoing embodiments, and further improve on the foregoing features, in particular, to provide a distributed data node abnormal behavior detection system, and to apply the distributed data node abnormal behavior detection method to the system;
the distributed data node abnormal behavior detection method is characterized by comprising the following steps:
the first slave node is responsible for collecting a real-time data set from the gateway;
the second slave node preprocesses the real-time data set;
the main detection node is responsible for detecting the data in the real-time data set;
the block node stores the data set;
the distributed data node abnormal behavior detection method is applied to the distributed data node abnormal behavior detection system, wherein the system comprises a data acquisition layer, a data preprocessing layer, a data analysis layer and a data storage layer; the data acquisition layer is in data connection with the data preprocessing layer, the data preprocessing layer is in data connection with the data analysis layer, and the data analysis layer is in data connection with the data storage layer; the layers are independent of one another, data formats are converted to a standard form between layers, and data calls between layers use standardized interfaces;
the step that the first slave node is responsible for collecting the real-time data set from the gateway comprises the following steps:
the first slave node formats the real-time data set, the processed real-time data set is stored in a temporary storage area, and the first slave node sends the real-time data set stored in the temporary storage area to a second slave node;
the data acquisition layer is used for the first slave node to acquire the gateway real-time data set; the data acquisition step comprises the following specific implementation steps:
the data acquisition layer uses the pcap_loop() function to parse a Jpacket object from the gateway, obtains the original data information of the data packet from the Jpacket object, stores the captured data packet, and generates a basic feature object containing the data packet; it then judges whether the data features of the data packet contain the basic features, namely source/destination IP address, service and protocol; if the data packet does not contain the basic features, the data acquisition layer filters the data packet out; if the data packet contains the basic features, the data acquisition layer packs the data packet into a .cdv file and sends the .cdv file to the data preprocessing layer;
the data preprocessing layer receives the .cdv file and is used by the second slave node to carry out the data preprocessing; the second slave node is responsible for receiving the formatted data set from the first slave node, and the data set is stored in the second slave node as a Topic queue; the second slave node performs a preprocessing operation on the data set, wherein a data feature set is built into the second slave node; the preprocessing operation performs feature analysis on the data in the formatted data set: if the data features of the data belong to the data feature set, the data is assigned to a first data set; if they do not belong to the data feature set, the data is assigned to a second data set; the second slave node sends the first data set and the second data set separately to the main detection node;
and a Euclidean space is built in the data preprocessing layer, wherein the data characteristics contained in the data characteristic set comprise: the data preprocessing layer divides the data into the first data set if the data features of the data belong to the data feature set, and the preprocessing steps of the data preprocessing layer on the first data set are as follows:
a1, digitizing the discrete data; the data in the .cdv file is extracted, and the data preprocessing layer performs feature recognition on the data according to the data features; the discrete data types comprise back, land and nmap; the discrete attributes of the discrete data are projected into the Euclidean space, where each discrete attribute has its own spatial position; the Euclidean distance between the spatial positions of the data in the Euclidean space is then calculated; the discrete attributes are encoded using binary coding;
a2, standardizing and normalizing the continuous data; the continuous feature data in the .cdv file is standardized and normalized, with part of the continuous feature data selected as sample data; the processing steps are as follows:
aa1, normalizing said continuity attribute according to formula (1);
B=(vi-r)/σ(1)
wherein B represents continuous attribute normalization of the continuous data, σ represents standard deviation of the sample data, the standard deviation is calculated by formula (2), viRepresenting the ith attribute of the continuous data, wherein i represents the attribute serial number of the continuous data, taking any positive integer, and selecting the number of the attributes to be 10; r represents the average value of the attribute, the average value being taken by equation (3);
(2) [formula image BDA0003114958850000131: standard deviation σ of the sample data]
wherein 10 represents the number of said attributes;
(3) [formula image BDA0003114958850000132: mean value r of the attributes]
wherein 10 represents the number of said attributes;
aa2, normalizing the values: the data processed in step aa1 are normalized into the interval [0,1], that is, the value range of the data is mapped onto [0,1]; the normalization is calculated by formula (4);
$$C'_i = 1 + \frac{C_i - C_{\max}}{C_{\max} - C_{\min}} \tag{4}$$
wherein $C'_i$ is the normalized value of $C_i$ and lies in [0,1]; $C_{\min}$ is the minimum value of $C_i$; $C_{\max}$ is the maximum value of $C_i$; and $C_i$ is the data value obtained from step aa1;
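Steps aa1 and aa2 can be illustrated with a minimal Python sketch. It assumes that formula (2) is the population standard deviation and formula (3) the arithmetic mean over the 10 attributes; the function names and the sample values are hypothetical and are not taken from the embodiment.

```python
import math


def standardize(values):
    """Formula (1): B = (v_i - r) / sigma for every attribute value."""
    n = len(values)
    r = sum(values) / n  # mean of the attributes (assumed form of formula (3))
    # Population standard deviation (assumed form of formula (2)).
    sigma = math.sqrt(sum((v - r) ** 2 for v in values) / n)
    if sigma == 0:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - r) / sigma for v in values]


def min_max(values):
    """Formula (4): C'_i = 1 + (C_i - C_max) / (C_max - C_min)."""
    c_min, c_max = min(values), max(values)
    if c_max == c_min:
        return [0.0 for _ in values]  # degenerate range
    return [1 + (c - c_max) / (c_max - c_min) for c in values]


# Ten hypothetical attribute values, matching the attribute count of 10.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, 3.0, 7.0]
standardized = standardize(sample)
normalized = min_max(standardized)
```

Formula (4) is algebraically equal to the usual min-max form $(C_i - C_{\min}) / (C_{\max} - C_{\min})$, so the smallest value maps to 0 and the largest to 1.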
a3, performing principal component analysis and dimensionality reduction treatment; carrying out dimension reduction processing on the data with the high-dimensional data characteristics in the cdv file, wherein the specific steps are as follows:
ab1, combining the high-dimensional characteristic data into n rows and m columns of matrix 1 according to columns, wherein each row of the matrix 1 represents an attribute field;
ab2, zero-averaging each row of the matrix 1, subtracting the average in each row;
ab3, calculating a covariance matrix of the matrix 1;
ab4, calculating the eigenvalue of the covariance matrix and the eigenvector corresponding to the eigenvalue;
ab5, arranging the eigenvectors into a matrix 2 from top to bottom according to the corresponding eigenvalue size, and taking the first k rows to form a matrix 3, wherein the k value is any positive integer not exceeding the number of rows and columns of the matrix 2;
ab6, multiplying the matrix 3 by the matrix 1 to obtain the k-dimensional data, that is, the low-dimensional data resulting from the dimensionality reduction of the data with the high-dimensional data features in the cdv file;
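Steps ab1 to ab6 can be sketched with NumPy as follows. This is an illustrative sketch only: the function name `pca_reduce`, the use of `numpy.linalg.eigh`, and the division of the covariance by the number of columns m are assumptions, not details given in the embodiment.

```python
import numpy as np


def pca_reduce(matrix1, k):
    """Reduce an n x m matrix (rows = attribute fields) to k x m."""
    x = np.asarray(matrix1, dtype=float)
    n, m = x.shape
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of rows")
    x = x - x.mean(axis=1, keepdims=True)   # ab2: zero-mean every row
    cov = (x @ x.T) / m                     # ab3: n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # ab4: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]       # ab5: sort by decreasing eigenvalue
    matrix3 = eigvecs[:, order[:k]].T       # ab5: first k eigenvectors as rows
    return matrix3 @ x                      # ab6: k x m low-dimensional data


# Hypothetical data: 3 attribute fields (rows), 5 records (columns).
data = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.8],
    [0.5, 0.1, 0.4, 0.2, 0.3],
])
low = pca_reduce(data, 2)
```

The rows of the result are ordered by decreasing variance, so keeping the first k rows retains the directions that carry the most information.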
if the data feature of the data does not belong to any feature element in the data feature set, the data is divided into a second data set by the data preprocessing layer, and the preprocessing step of the second data set by the data preprocessing layer is as follows:
if the data features of the second data set belong to five category labels, the five category labels being Normal, Dos, Probe, U2R and R2L, the category labels are represented by the positive integers 1 to 5 respectively, and the remaining data not belonging to the five category labels are represented by 0;
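The label encoding of the second data set can be shown with a short Python sketch. The assignment of the integers 1 to 5 in the order in which the labels are listed is an assumption, and the names `CATEGORY_CODES` and `encode_label` are hypothetical.

```python
# Assumed mapping: labels numbered 1..5 in the order they are listed.
CATEGORY_CODES = {"Normal": 1, "Dos": 2, "Probe": 3, "U2R": 4, "R2L": 5}


def encode_label(label):
    """Return the integer code of a category label, or 0 for any other label."""
    return CATEGORY_CODES.get(label, 0)


codes = [encode_label(name) for name in ["Normal", "Probe", "unknown", "R2L"]]
```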
the data preprocessing layer vectorizes the data features of the preprocessed first data set and the preprocessed second data set, the vectorization being implemented with the VectorAssembler class;
after completing the vectorization of the data characteristics, the data preprocessing layer sends the first data set and the second data set to the data analysis layer;
the data analysis layer is used for the main detection node to perform the clustering analysis operation; the specific clustering analysis method of the main detection node comprises the following steps:
the main detection node performs data detection operation on the data set, a cluster analysis model is built in the main detection node, the cluster analysis model is responsible for performing the detection analysis operation on the first data set, if the first data set does not meet the cluster analysis requirement, the first data set is considered to have abnormal behavior, at the moment, the main detection node sends deletion information to the first slave node, the temporary storage area deletes the data stream to which the first data set belongs after the first slave node receives the information, and if the first data set meets the cluster analysis requirement, the first data set is sent to the block node;
a calculation engine is built in the main detection node, and the calculation engine performs data calculation operation after packaging the second data set in a Hash table mode on the second data set;
the real-time steps of the data analysis layer for performing cluster analysis on the first data set are as follows:
b1, data sorting: the data analysis layer distinguishes normal data streams from isolated points according to the data features of the first data set; the normal data streams are processed into a training set, which is used to mine the data having a normal information profile; if a normal data stream contains isolated points, the process proceeds to step b2 to remove the isolated points, and if it contains none, the process proceeds to step b3 to cluster the data;
b2, removing isolated points: in this embodiment, a recursive method is used to divide the space domain of the data stream into symmetric left and right sub-domains; space 1 is recursively divided by a vertical line 1 into a left subspace 1 and a right subspace 1, and in each subspace 1 the points whose distance from the line 1 does not exceed a specified minimum value are found, wherein the specified minimum value is denoted x, the points satisfying this condition form a point set 1, the number of points in the point set 1 does not exceed 6, and the points in the point set 1 are distributed symmetrically on the left and right; the pair of points 1 closest to the line 1 is found in the point set 1; the subspace containing the closest pair of points 1 is then divided, with a vertical line 2 as the symmetry axis, into square regions 2 that are symmetric vertically and horizontally, and the pair of points 2 closest to this symmetry axis is found in the regions 2; the pair of points 2 replaces the pair of points 1, and the space domain containing the pair of points 2 is further divided according to the same division rule to find the closest pair of points; if no such pair exists, that is, only a single point is closest to the dividing line, that point is an isolated point and the data stream space domain containing the isolated point is rejected; if a pair can be found, the space division operation continues until the isolated points present in the space are found; step b3 is then carried out on the data stream from which the isolated points have been removed;
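The recursive left/right division in step b2 resembles the classic divide-and-conquer closest-pair search, in which only a few points near the dividing line need to be compared. The Python sketch below shows that classic algorithm for two-dimensional points; it is not the exact procedure of the embodiment, and the function names are hypothetical.

```python
import math


def closest_pair(points):
    """Smallest distance between any two 2-D points (inf if fewer than two)."""
    return _closest(sorted(points))  # sort once by x coordinate


def _closest(pts):
    n = len(pts)
    if n <= 3:
        # Small case: compare every pair directly.
        return min(
            (math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]),
            default=math.inf,
        )
    mid = n // 2
    mid_x = pts[mid][0]  # x position of the vertical dividing line
    best = min(_closest(pts[:mid]), _closest(pts[mid:]))
    # Keep only points closer to the dividing line than the best distance so far.
    strip = sorted(
        (p for p in pts if abs(p[0] - mid_x) < best), key=lambda p: p[1]
    )
    for i, p in enumerate(strip):
        for q in strip[i + 1:i + 8]:  # a bounded number of neighbours suffices
            if q[1] - p[1] >= best:
                break
            best = min(best, math.dist(p, q))
    return best


distance = closest_pair([(0, 0), (5, 5), (1, 1), (9, 9), (5, 6)])
```

Under this reading, a point that has no partner within the specified minimum value x would be a candidate isolated point; that interpretation is an assumption.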
b3, clustering data: the data stream sorted in step b1 that has no isolated points is combined with the data from which the isolated points were removed in step b2 to form an integrated data stream, and cluster analysis is performed on the integrated data stream; in this embodiment, the data candidate micro-cluster of the existing training data stream is denoted D, the critical candidate micro-cluster in the training data stream is denoted E, and the micro-cluster radius of the training data stream is denoted R; the micro-cluster radius is set as follows:
ba1, selecting partial data in the training data stream as sample data by using a random sampling method;
ba2, randomly pairing every two sample data to generate N pairs of data, and calculating the distance between each pair of data;
ba3, calculating the expectation EX and the variance DX of the distances between the N pairs of data;
ba4, constructing a threshold radius R, wherein the threshold radius is calculated by formula (5);
$$R = P \times EX + P \times DX \tag{5}$$
wherein P in formula (5) is a fixed value, and 1/5 is selected in this embodiment to achieve the optimal effect;
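Steps ba1 to ba4 can be sketched in Python as follows. The Euclidean distance, the population variance, and the names `threshold_radius`, `sample_size` and `n_pairs` are assumptions made for the illustration; only P = 1/5 and formula (5) come from the embodiment.

```python
import math
import random
import statistics


def threshold_radius(stream, sample_size, n_pairs, p=0.2, seed=None):
    """Formula (5): R = P * EX + P * DX over N randomly paired sample points."""
    rng = random.Random(seed)
    sample = rng.sample(stream, sample_size)   # ba1: random sampling
    distances = []
    for _ in range(n_pairs):                   # ba2: N random pairs
        a, b = rng.sample(sample, 2)
        distances.append(math.dist(a, b))
    ex = statistics.fmean(distances)           # ba3: expectation EX
    dx = statistics.pvariance(distances)       # ba3: variance DX (population)
    return p * ex + p * dx                     # ba4: threshold radius R


# Two hypothetical points 5 apart: EX = 5, DX = 0, so R = 0.2 * 5 = 1.0.
radius = threshold_radius([(0.0, 0.0), (3.0, 4.0)], sample_size=2, n_pairs=4)
```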
the subsets of the data candidate micro-cluster D are denoted the candidate micro-cluster set T1, and the subsets of the critical candidate micro-cluster E are denoted the candidate critical micro-cluster T2; each data point in the integrated data stream is treated as a new data point n, and an attempt is made to merge it into the candidate micro-cluster D in the training data stream: if the radius R of the micro-cluster after merging the data point n does not exceed a specified range k, the data point n is merged into the candidate micro-cluster D, where k is not fixed in this embodiment and may take any value; if the radius after merging exceeds the specified range k, an attempt is made to merge the data point n into the critical candidate micro-cluster E in the training data stream: if the radius of the critical candidate micro-cluster E after merging the data point n does not exceed a specified range l, where l is likewise not fixed and may take any value, the data point n is merged into the critical candidate micro-cluster E and the new weight w of the critical candidate micro-cluster E is checked; if the increment of the new weight w exceeds the maximum weight A given by the system, the critical candidate micro-cluster E is converted into a new candidate micro-cluster E', the data analysis layer removes it from the critical candidate micro-clusters, and a new critical candidate micro-cluster is established; if the data point n can be included in neither the candidate micro-cluster D nor the critical candidate micro-cluster E, a new critical candidate micro-cluster E' is created from the data point n and inserted into a critical candidate micro-cluster cache for subsequent processing, the data point n being treated either as outlier noise or as a new micro-cluster seed in the new critical candidate micro-cluster E'; if no new data point is merged into any of the candidate micro-clusters, the weights of the candidate micro-clusters are decreased, and if the weight of a candidate micro-cluster is smaller than the maximum weight value A, the data analysis layer performs elimination processing on the candidate micro-cluster set T1 and releases the memory space; the time span of the weight check on the candidate micro-cluster set T1 is determined by formula (6);
(6) [formula image RE-GDA0003144637610000171: weight check time span T]
wherein A in formula (6) represents the maximum weight value, which is an arbitrary positive number greater than 1; T represents the time span, which is an arbitrary positive number; n represents the number of subsets of the candidate micro-cluster set T1 in the training data stream, which is an arbitrary positive integer; through step b3, the data points in the integrated data stream are merged into the candidate micro-cluster set T1 or the candidate critical micro-cluster T2; the remaining data that cannot be merged into either are collected into a set, which is the abnormal cluster set, and the data merged into the candidate micro-cluster set T1 or the candidate critical micro-cluster T2 form the normal cluster set;
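The assignment of a new data point to the candidate and critical candidate micro-clusters in step b3 can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not in the embodiment: a single radius is used in place of the ranges k and l, the radius test is a distance-to-centre test, promotion happens when the weight itself exceeds A, and weight decay and formula (6) are omitted. All names are hypothetical.

```python
import math


class MicroCluster:
    """A micro-cluster summarised by a running-mean centre and a weight."""

    def __init__(self, point):
        self.center = list(point)
        self.weight = 1.0

    def absorb(self, point):
        self.weight += 1.0
        for i, x in enumerate(point):
            # Incremental update of the mean of all absorbed points.
            self.center[i] += (x - self.center[i]) / self.weight


def assign(point, candidates, critical, radius, max_weight):
    """Merge a point into the nearest fitting micro-cluster and report where."""
    for group, name in ((candidates, "candidate"), (critical, "critical")):
        near = min(
            group, key=lambda c: math.dist(c.center, point), default=None
        )
        if near is not None and math.dist(near.center, point) <= radius:
            near.absorb(point)
            if name == "critical" and near.weight > max_weight:
                critical.remove(near)    # promote the critical candidate
                candidates.append(near)
                return "promoted"
            return name
    critical.append(MicroCluster(point))  # the point seeds a new micro-cluster
    return "new-critical"


candidates, critical = [], []
stream = [(0.0, 0.0), (0.5, 0.0), (0.2, 0.0), (10.0, 10.0), (0.1, 0.0)]
results = [assign(p, candidates, critical, 1.0, 2.0) for p in stream]
```

In this run the point (10.0, 10.0) fits no existing micro-cluster and only seeds a new critical candidate, which is how data that cannot be merged would end up outside the normal cluster set.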
the data analysis layer stores the data characteristics of the data stream which does not belong to any micro-cluster into the block chain database and serves as a training sample set, the data analysis layer can rapidly identify the data stream with abnormal nodes according to the training sample set, the data analysis layer sends the data stream with the abnormal nodes to the gateway and sends the data stream which belongs to a normal micro-cluster to the encryption analysis module, and the analysis clustering operation of the data analysis layer is completed;
the data analysis layer detects the second data set in real time based on a Spark framework, a calculation engine is built in the main detection node, and the calculation engine performs data calculation operation after packaging the second data set in a Hash table mode; the calculation engine is built under the Spark framework, and the building process of the calculation engine is as follows:
c1, initializing the program: the Spark running configuration is set using a SparkConf object, and an sc object and an ssc object are created in the Spark framework, wherein the sc is responsible for connecting to the Spark cluster, and the data in the second data set enter the calculation engine through the ssc for data processing;
c2, calling a function: the calculation engine calls the ssc.start() function, and the ssc.start() function calls the cluster analysis model to perform data detection analysis on the second data set;
after the data analysis layer completes the data detection analysis, transmitting the analysis result to the data storage layer, wherein the data storage layer is used for storing the data set by the block node; the method in which the block node stores the data set comprises: the block node is responsible for carrying out data storage and data reading operations on the first data set, if the block node stores the first data set, the block node sends the deletion message to the first slave node, and the temporary storage area deletes the storage information to which the first data set belongs;
the data storage layer is used for storing data in the block nodes and performing read-write operation on the data blocks;
the computer readable storage medium has stored thereon a distributed data node abnormal behavior detection program, which is executable by one or more processors to implement the distributed data node abnormal behavior detection method according to any one of claims 1 to 7.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described or illustrated in detail in a certain embodiment, reference may be made to the descriptions of other embodiments.
In summary, the invention provides a method, a system and a storage medium for detecting abnormal behavior of distributed data nodes.
Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from the scope of the invention. That is, the methods, systems, and devices discussed above are examples. Various configurations may omit, substitute, or add various procedures or components as appropriate. For example, in alternative configurations, the methods may be performed in an order different than that described, and/or various components may be added, omitted, and/or combined. Moreover, features described with respect to certain configurations may be combined in various other configurations, as different aspects and elements of the configurations may be combined in a similar manner. Further, elements therein may be updated as technology evolves, i.e., many elements are examples and do not limit the scope of the disclosure or claims.
Specific details are set forth in the description in order to provide a thorough understanding of the exemplary configurations including implementations. However, configurations may be practiced without these specific details, for example, well-known circuits, processes, algorithms, structures, and techniques have been shown without unnecessary detail in order to avoid obscuring the configurations. This description provides example configurations only, and does not limit the scope, applicability, or configuration of the claims. Rather, the foregoing description of the configurations will provide those skilled in the art with an enabling description for implementing the described techniques. Various changes may be made in the function and arrangement of elements without departing from the spirit or scope of the disclosure.
In conclusion, it is intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that these examples are illustrative only and are not intended to limit the scope of the invention. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1. A distributed data node abnormal behavior detection method, applied to a gateway data abnormal behavior detection system to improve the security of internet data, characterized by comprising:
the first slave node is responsible for collecting a real-time data set from the gateway;
the second slave node preprocesses the real-time data set;
the main detection node is responsible for detecting the data in the real-time data set;
the block node stores the set of data.
2. The method according to claim 1, wherein the step of collecting the real-time data set from the gateway by the first slave node comprises:
the first slave node formats the real-time data set, the processed real-time data set is stored in a temporary storage area, and the first slave node sends the real-time data set stored in the temporary storage area to a second slave node.
3. A method for detecting abnormal behaviour in a distributed data node according to any of the preceding claims, wherein said step of pre-processing said set of real-time data by said second slave node comprises:
the second slave node is responsible for receiving the formatted data set of the first slave node, and the data set is stored in the second slave node in a Topic queue manner.
4. A method for detecting abnormal behaviour in a distributed data node according to any of the preceding claims, wherein said step of pre-processing said set of real-time data by said second slave node comprises:
the second slave node carries out preprocessing operation on the data set, wherein a data feature set is built in the second slave node; the preprocessing operation carries out feature analysis on data in the formatted data set, if the data features of the data belong to the data feature set, the data are divided into a first data set, if the data features in the formatted data set do not belong to the data feature set, the data are divided into a second data set, and the second slave node respectively sends the first data set and the second data set to a main detection node.
5. The method according to one of the preceding claims, wherein the step of detecting the data in the real-time data set by the main detection node comprises:
the main detection node performs data detection operation on the data set, a cluster analysis model is built in the main detection node, the cluster analysis model is responsible for performing the detection analysis operation on the first data set, if the first data set does not meet the cluster analysis requirement, the first data set is considered to have abnormal behavior, at the moment, the main detection node sends deletion information to the first slave node, the temporary storage area deletes the data stream to which the first data set belongs after the first slave node receives the information, and if the first data set meets the cluster analysis requirement, the first data set is sent to the block node.
6. A distributed data node abnormal behavior detection system as claimed in one of the preceding claims, wherein the system is responsible for processing abnormal behavior of data flows and performs abnormal behavior detection on network data; the system comprises a data acquisition layer, a data preprocessing layer, a data analysis layer and a data storage layer; the data acquisition layer is in data connection with the data preprocessing layer, the data preprocessing layer is in data connection with the data analysis layer, and the data analysis layer is in data connection with the data storage layer; the data acquisition layer is used by the first slave node to acquire the real-time data set from the gateway.
7. A distributed data node anomalous behavior detection system as claimed in any preceding claim wherein said data pre-processing layer is arranged to pre-process said data set by a second slave node.
8. A distributed data node anomalous behavior detection system as claimed in any preceding claim wherein said data analysis layer is operative to perform said cluster analysis operation by said primary detection node.
9. A distributed data node abnormal behavior detection system according to any of the preceding claims, wherein said data storage layer is used for data storage by said block nodes and for read and write operations to said data blocks.
10. The computer-readable storage medium according to any one of the preceding claims, wherein the computer-readable storage medium has stored thereon a distributed data node abnormal behavior detection program, the program being executable by one or more processors to implement the distributed data node abnormal behavior detection method according to any one of claims 1 to 5.
CN202110660410.9A 2021-06-15 2021-06-15 Distributed data node abnormal behavior detection method, system and storage medium Withdrawn CN113391976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660410.9A CN113391976A (en) 2021-06-15 2021-06-15 Distributed data node abnormal behavior detection method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110660410.9A CN113391976A (en) 2021-06-15 2021-06-15 Distributed data node abnormal behavior detection method, system and storage medium

Publications (1)

Publication Number Publication Date
CN113391976A true CN113391976A (en) 2021-09-14

Family

ID=77621064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660410.9A Withdrawn CN113391976A (en) 2021-06-15 2021-06-15 Distributed data node abnormal behavior detection method, system and storage medium

Country Status (1)

Country Link
CN (1) CN113391976A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170034195A1 (en) * 2015-07-27 2017-02-02 Electronics And Telecommunications Research Institute Apparatus and method for detecting abnormal connection behavior based on analysis of network data
CN108040074A (en) * 2018-01-26 2018-05-15 华南理工大学 A kind of real-time network unusual checking system and method based on big data
CN112615881A (en) * 2020-12-28 2021-04-06 马樱 Data flow detection system based on block chain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210914