CN116865994A

CN116865994A - Network data security prediction method based on big data

Info

Publication number: CN116865994A
Application number: CN202310617495.1A
Authority: CN
Inventors: 张益鸣; 刘兴龙; 杨子阳; 杨茗; 赵毅涛; 杨晓华; 茶建华; 孙立元; 代盛国; 赵永辉; 艾渊; 杨昊; 任建宇; 李家浩
Original assignee: Yunnan Power Grid Co Ltd
Current assignee: Yunnan Power Grid Co Ltd
Priority date: 2023-05-29
Filing date: 2023-05-29
Publication date: 2023-10-10

Abstract

The application discloses a network data security prediction method based on big data, comprising the following steps: collecting log data and pushing the log data to Kafka; preliminarily classifying the useful features extracted from Kafka with a random forest algorithm; performing cluster analysis on the data with a K-means algorithm; labeling the data according to the classification samples and modeling with a Bayesian classification algorithm; and handling the classified data differently according to its label. By collecting, processing and analyzing massive network data and combining machine learning algorithms with cluster analysis, the method rapidly predicts and identifies network security risks. Network attack behavior is classified, modeled and predicted using several methods, including the random forest algorithm, the K-means algorithm and the Bayesian classification algorithm, so that unknown security threats can be detected and responded to efficiently, thereby protecting enterprise network security. The method offers real-time operation, accuracy and scalability.

Description

Network data security prediction method based on big data

Technical Field

The application relates to the technical field of network data security, in particular to a network data security prediction method based on big data.

Background

In recent years, the internet has developed rapidly and its technical fields have diversified. With the rise of emerging technologies such as cloud computing, the Internet of Things, big data and 5G, the boundaries of network information security keep weakening while the scope of security protection keeps growing, posing great challenges to data security and information security. In China, policies on network security and data management are being issued continuously, the legislative system is steadily being improved, and supervision of network security is steadily being strengthened.

Combining the key technologies of a network security situation awareness system with its awareness process, the main activities of network situation awareness can be summarized as follows: measuring and processing raw network data; extracting feature data; detecting various network activities and computing their risk assessments; and capturing abnormal network activity data. Through this series of processes, awareness of the factors that may affect the normal operation of the network system is improved, network security trends are predicted on the basis of that awareness, and attack tracing and visual display support the formulation of effective defense measures.

Because network attacks are random and uncertain, the resulting change in security situation is a complex nonlinear process. Traditional prediction methods (manual matching and rule-based matching) are increasingly unable to meet requirements, and more research is moving toward intelligent prediction methods, such as artificial intelligence methods including machine learning algorithms. These methods offer self-learning capability, high short- and medium-term prediction accuracy, and little need for human participation.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the application and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description of the application and in the title of the application, which may not be used to limit the scope of the application.

The present application has been made in view of the above-described problems.

Therefore, the technical problem solved by the application is as follows: because network attacks are random and uncertain, the resulting change in security situation is a complex nonlinear process, and traditional prediction methods cannot meet the requirements.

In order to solve the technical problems, the application provides the following technical scheme: a network data security prediction method based on big data comprises the following steps:

collecting log data and pushing the log data to Kafka;

preliminarily classifying the useful features extracted from Kafka with a random forest algorithm;

performing cluster analysis on the data by using a K-means algorithm;

labeling according to the classification samples, and modeling by using a Bayesian classification algorithm;

and carrying out different treatments on the classified data labels.

As a preferable scheme of the big data based network data security prediction method of the present application, the method comprises: the process of the preliminary classification includes,

when useful features are extracted from Kafka, the system preliminarily classifies them according to the existing classification samples and a classifier that incorporates the random forest algorithm, and feeds the analysis results back to an administrator;

when classifying network attack behaviors, known attack samples are first collected and a plurality of decision trees are constructed; new data features collected from Kafka are then compared with the samples, and the prediction results of the decision trees are combined by voting to determine whether the new data features belong to a known attack type;

when the new data features are primarily classified into a certain attack type, the system feeds back an analysis result to an administrator, informs the administrator of possible risks of the new data features, and provides detailed analysis reports including attack types, attack targets and attack modes;

when the new data features cannot be initially classified into a certain known attack type, the system feeds back an analysis result to an administrator, and informs the administrator that the new data features are not of the known attack type, but potential risks still exist; the system may suggest an administrator to add relevant classification samples, and the administrator may perform corresponding operations based on the suggestions, add new classification samples, and retrain the model.

As a preferable scheme of the big data based network data security prediction method of the present application, the method comprises: the step of classifying the network attack by the random forest algorithm includes,

collecting and preparing data sets, collecting known attack samples, extracting useful features from the attack samples, preprocessing the data, and ensuring the quality and the integrity of the data sets;

selecting features, namely selecting a random sample from the cleaned data set to establish a first decision tree, randomly selecting k features from a feature list for each node, and calculating an optimal dividing point by using the features;

training a decision tree, randomly selecting k features, selecting an optimal dividing point, dividing data into two subsets, and recursively repeating the feature selection and the decision tree training until a stop condition is met;

constructing a random forest, repeating the feature selection and decision tree training to generate a plurality of decision trees, each of which is independent and random, evaluating the performance of the model using techniques such as cross-validation, and selecting the optimal parameters and features;

predicting a new sample, comparing the new sample with the existing attack sample when the new data characteristic is collected from Kafka, and determining whether the new data characteristic belongs to a known attack type or not by integrating the prediction results of a plurality of decision trees in a voting method; and transmitting the samples to be predicted to each decision tree to obtain a prediction result of each tree, and synthesizing the results of all the decision trees in a majority voting or average value mode to obtain a final classification result.

As a preferable scheme of the big data based network data security prediction method of the present application, the method comprises: the step of cluster analysis includes the steps of,

for each cluster, calculating the centroid of the cluster, and assigning each data point to the cluster whose centroid is closest to it;

when the distances between one data point and a plurality of centroids are equal, randomly selecting one centroid for allocation;

after the allocation is completed, the data points belonging to the same cluster are marked as the same class.

As a preferable scheme of the big data based network data security prediction method of the present application, the method comprises: the labeling process based on the classified samples includes,

for each data point, calculating the similarity between the data point and each classified sample;

according to the similarity, assigning the data points to the categories to which the most similar classification samples belong;

after the distribution is completed, the data points are labeled correspondingly, and the label types are classified into safe events and unsafe events.
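As a non-limiting illustration, the labeling steps above can be sketched in Python. The application does not specify the similarity measure; the sketch assumes a similarity derived from Euclidean distance, and the names `similarity` and `label_point` as well as the two-feature sample vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, label)


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed measure: 1 / (1 + Euclidean distance), so identical vectors score 1."""
    return 1.0 / (1.0 + math.dist(a, b))


def label_point(point: Sequence[float], samples: List[Sample]) -> str:
    """Assign the point the label of its most similar classification sample."""
    _, best_label = max(samples, key=lambda s: similarity(point, s[0]))
    return best_label


# Hypothetical classification samples: (packet size in bytes, requests per second).
samples = [((500.0, 1.0), "safe"), ((1500.0, 40.0), "unsafe")]
labels = [label_point(p, samples) for p in [(600.0, 2.0), (1400.0, 35.0)]]
```

Any other similarity measure could be substituted by replacing `similarity`; the assignment rule itself stays the same.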

As a preferable scheme of the big data based network data security prediction method of the present application, the method comprises: the bayesian classification algorithm is represented as,

$$P(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{\sum_{l} \pi_l f_l(x)}$$

wherein $P(Y=k \mid X=x)$ is the posterior probability; the variable Y has k possible results; X is the given observation data; $\pi_k$ is the prior probability that the variable Y belongs to the k-th class; $f_k(x)$ is the likelihood function, that is, the probability that the observation data is x given that the classification result is k; $\pi_l$ is the prior probability that the variable Y belongs to the l-th class; and $f_l(x)$ is the corresponding likelihood function;

Bayesian classification assumes that the features are independent: given that the classification result is k, the feature variables are mutually independent, expressed as:

$$P(x_1, x_2, \ldots, x_n \mid y=k) = P(x_1 \mid y=k)\,P(x_2 \mid y=k)\cdots P(x_n \mid y=k)$$

wherein $x_1$ to $x_n$ are the acquired data.

As a preferable scheme of the big data based network data security prediction method of the present application, the method comprises: the Bayesian classification algorithm generates a Bayesian classification model according to the existing classification sample under a system modeling module, the model is adjusted in the way that,

adding a new classification sample: when an administrator finds that new data features are not covered by the current classification samples, the new sample is added to the classification sample library and the model is retrained;

deleting unnecessary classification samples: when an administrator finds that some classification samples are no longer needed or are outdated, they are deleted from the classification sample library and the model is retrained;

adjusting the weights of classification samples: when an administrator finds that the influence of some classification samples on the model is too large or too small, their weights are adjusted and the model is retrained.

Whenever the administrator adjusts the model, the system retrains the model and, once training is completed, sends a notification to the administrator reporting the model update.
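As a non-limiting illustration, the three adjustment modes and the retrain-then-notify behavior can be sketched in Python as follows. The class `SampleLibrary`, the callback names `retrain` and `notify`, and the toy `weighted_priors` trainer are hypothetical; the application does not describe how sample weights enter the Bayesian model, so the sketch only assumes that weights scale each sample's contribution to the class priors.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[tuple, str, float]  # (features, label, weight)


def weighted_priors(samples: List[Sample]) -> Dict[str, float]:
    """Toy trainer (assumption): class priors proportional to total sample weight."""
    totals: Dict[str, float] = {}
    for _, label, weight in samples:
        totals[label] = totals.get(label, 0.0) + weight
    grand = sum(totals.values())
    return {label: w / grand for label, w in totals.items()} if grand else {}


class SampleLibrary:
    """Classification sample library; every change retrains and notifies."""

    def __init__(self, retrain: Callable[[List[Sample]], object],
                 notify: Callable[[str], None]) -> None:
        self.samples: Dict[str, Sample] = {}
        self.model = None
        self._retrain = retrain
        self._notify = notify

    def _refresh(self, reason: str) -> None:
        self.model = self._retrain(list(self.samples.values()))
        self._notify(f"Model retrained after {reason}; {len(self.samples)} samples in use")

    def add(self, sample_id: str, features: tuple, label: str, weight: float = 1.0) -> None:
        self.samples[sample_id] = (features, label, weight)
        self._refresh(f"adding sample {sample_id}")

    def remove(self, sample_id: str) -> None:
        del self.samples[sample_id]
        self._refresh(f"deleting sample {sample_id}")

    def set_weight(self, sample_id: str, weight: float) -> None:
        features, label, _ = self.samples[sample_id]
        self.samples[sample_id] = (features, label, weight)
        self._refresh(f"reweighting sample {sample_id}")


notices: List[str] = []
library = SampleLibrary(weighted_priors, notices.append)
library.add("a", (500,), "safe")
library.add("b", (1500,), "unsafe")
library.set_weight("b", 3.0)   # the unsafe sample now counts three times as much
```

Passing the trainer in as a callback keeps the library independent of the particular classifier that is retrained.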

In order to solve the technical problems, the application also provides the following technical scheme: a big data based network data security prediction system comprising:

a data collection module for collecting network data security-related information from various channels;

the data cleaning and preprocessing module is used for cleaning, deduplicating and standardizing the collected data;

the data storage module is used for storing the preprocessed data into a database or a data warehouse for subsequent analysis and modeling;

the data analysis and modeling module is used for analyzing and modeling the data stored in the data warehouse, finding out potential network security threats and vulnerabilities and predicting the potential network security threats and vulnerabilities;

and the prediction result display module is used for displaying the prediction result to a user, and the user views the prediction result through the visual interface.
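As a non-limiting illustration of the data cleaning and preprocessing module, the Python sketch below drops incomplete records, removes duplicates and rescales one numeric field. The field names, the function names `clean` and `standardize`, and the choice of min-max scaling as the standardization are assumptions made for the example and are not taken from the application.

```python
from typing import Dict, List, Sequence

REQUIRED = ("timestamp", "source_ip", "target_ip", "size_bytes")


def clean(records: List[Dict], required: Sequence[str] = REQUIRED) -> List[Dict]:
    """Drop records with missing fields, then remove exact duplicates."""
    seen = set()
    kept = []
    for record in records:
        if any(record.get(field) is None for field in required):
            continue                       # cleaning: incomplete record
        key = tuple(record[field] for field in required)
        if key in seen:
            continue                       # deduplication
        seen.add(key)
        kept.append(dict(record))
    return kept


def standardize(records: List[Dict], field: str = "size_bytes") -> List[Dict]:
    """Add a min-max scaled copy of one numeric field (assumed standardization)."""
    if not records:
        return []
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1               # avoid division by zero when all values match
    return [{**r, field + "_scaled": (r[field] - low) / span} for r in records]


raw = [
    {"timestamp": 1621963185, "source_ip": "192.168.1.10", "target_ip": "10.0.0.2", "size_bytes": 500},
    {"timestamp": 1621963185, "source_ip": "192.168.1.10", "target_ip": "10.0.0.2", "size_bytes": 500},
    {"timestamp": 1621963186, "source_ip": None, "target_ip": "192.168.1.12", "size_bytes": 1000},
    {"timestamp": 1621963187, "source_ip": "192.168.1.14", "target_ip": "10.0.0.5", "size_bytes": 2000},
]
prepared = standardize(clean(raw))
```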

A computer device, comprising: a memory and a processor; the memory stores a computer program characterized in that: the processor, when executing the computer program, implements the steps of the method of any of the present application.

A computer-readable storage medium having stored thereon a computer program, characterized by: which when executed by a processor, carries out the steps of the method described in the application.

The application has the beneficial effects that: according to the network data security prediction method based on big data, the rapid prediction and recognition of network security risks are realized by collecting, processing and analyzing massive network data and combining a machine learning algorithm and a cluster analysis technology. The network attack behaviors are classified, modeled and predicted by adopting a plurality of methods such as a random forest algorithm, a K-means algorithm, a Bayesian classification algorithm and the like, and unknown security threats can be efficiently detected and responded, so that the network security of enterprises is effectively protected. In addition, the method has the advantages of instantaneity, accuracy, expandability and the like, and can be flexibly adjusted and optimized according to different requirements, so that the method shows more excellent technical effects in practice.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

fig. 1 is an overall flowchart of a network data security prediction method based on big data according to a first embodiment of the present application;

fig. 2 is a schematic diagram of a scheme architecture of a network data security prediction method based on big data according to a first embodiment of the present application;

fig. 3 is a schematic application scenario diagram of a network data security prediction method based on big data according to a first embodiment of the present application;

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present application can be understood in detail, a more particular description of the application, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present application is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the application. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

While the embodiments of the present application have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the application. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Also in the description of the present application, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present application will be understood in specific cases by those of ordinary skill in the art.

Example 1

Referring to fig. 1-3, for one embodiment of the present application, there is provided a network data security prediction method based on big data, including:

s1: collecting log data and pushing the log data to Kafka;

Further, when log data such as a new data security event is generated, the jump tool collects it and pushes it to Kafka (a distributed stream processing platform). At the same time, the system sends a notification to the administrator as a reminder to pay attention to the event.
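As a non-limiting illustration, step S1 can be sketched in Python as follows. To keep the sketch self-contained, the Kafka producer and the administrator notification channel are represented by plain callables; the names `build_log_record`, `push_log`, `send` and `notify`, the topic name `raw-logs` and the record fields are hypothetical and would be replaced by a real producer client in practice.

```python
import json
import time
from typing import Callable, Dict, Optional


def build_log_record(source_ip: str, target_ip: str, size_bytes: int,
                     timestamp: Optional[int] = None) -> Dict[str, object]:
    """Build one log record; the field names are illustrative assumptions."""
    return {
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "source_ip": source_ip,
        "target_ip": target_ip,
        "size_bytes": size_bytes,
    }


def push_log(record: Dict[str, object],
             send: Callable[[str, bytes], None],
             notify: Callable[[str], None],
             topic: str = "raw-logs") -> bytes:
    """Serialize the record, hand it to the producer, then alert the administrator."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    send(topic, payload)   # hypothetical stand-in for a Kafka producer's send call
    notify(f"New log event from {record['source_ip']} pushed to topic '{topic}'")
    return payload


# Usage with in-memory stand-ins for the producer and the notification channel.
sent, notices = [], []
record = build_log_record("192.168.1.10", "10.0.0.2", 500, timestamp=1621963185)
push_log(record, lambda topic, payload: sent.append((topic, payload)), notices.append)
```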

S2: preliminarily classifying the useful features extracted from Kafka with a random forest algorithm;

Further, when useful features are extracted from Kafka, the system preliminarily classifies them according to the existing classification samples and a classifier that incorporates the random forest algorithm, pushes one copy of the data in real time to another Kafka topic, and at the same time saves an offline copy to the HBase database; the analysis result is fed back to an administrator, who can take further action based on it.

Furthermore, when classifying network attack behaviors, known attack samples are first collected and a plurality of decision trees are constructed; new data features collected from Kafka are then compared with the samples, and the prediction results of the decision trees are combined by voting to determine whether the new data features belong to a known attack type.

Further, when the new data features are preliminarily classified as a certain attack type, the system feeds the analysis result back to the administrator, informs the administrator of the possible risk of the new data features, and provides a detailed analysis report including the attack type, attack target and attack mode. The administrator then takes further action based on this information, such as strengthening security protection measures and promptly checking for system vulnerabilities.

Further, when the new data feature cannot be initially classified as a known attack type, the system will also feed back the analysis result to the administrator, informing the administrator that the new data feature is not of the known attack type, but may still have a potential risk. At the same time, the system may suggest that the administrator add relevant classification samples; the administrator may perform the corresponding operations based on the suggestions, add new classification samples, and retrain the model.

Further, the step of classifying the network attack by the random forest algorithm includes,

Collecting and preparing the data set: known attack samples are collected and useful features, including the attack type and the attacker IP address, are extracted from them. The data then undergoes basic processing such as cleaning and missing-value handling to ensure the quality and integrity of the data set.

Selecting features: a random sample is drawn from the cleaned data set to build the first decision tree. For each node, k features are randomly chosen from the feature list, where k ≪ m (m is the total number of features), and these features are used to calculate the optimal dividing point.

Training the decision tree: k features are randomly selected, the optimal dividing point is chosen, and the data is divided into two subsets; feature selection and decision tree training are repeated recursively until a stopping condition is met, for example when the number of leaf nodes or the depth reaches its maximum.

Constructing the random forest: feature selection and decision tree training are repeated to generate a plurality of decision trees, each of which is independent and random; the performance of the model is evaluated using techniques such as cross-validation, and the optimal parameters and features are selected to improve the accuracy and stability of the random forest.

Predicting a new sample: when new data features are collected from Kafka, they are compared with the existing attack samples, and the prediction results of the decision trees are combined by voting to determine whether the new data features belong to a known attack type. The sample to be predicted is passed to every decision tree to obtain each tree's prediction, and the results of all decision trees are then combined, for example by majority voting or averaging, to obtain the final classification result.
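As a non-limiting illustration, the random forest steps above can be sketched in plain Python. The application does not name a split criterion, so Gini impurity is assumed; the function names, the two features (packet size and request rate) and the labels `normal` and `flood` are hypothetical example values.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) pair with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, k, rng, depth=0, max_depth=5):
    """Recursively split on k randomly chosen features until a stop condition holds."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), k))
    if split is None:
        return majority
    f, t = split
    lo = [i for i, r in enumerate(rows) if r[f] <= t]
    hi = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in lo], [labels[i] for i in lo], k, rng, depth + 1, max_depth),
            build_tree([rows[i] for i in hi], [labels[i] for i in hi], k, rng, depth + 1, max_depth))


def predict_tree(tree, row):
    """Walk from the root to a leaf; leaves are plain labels, inner nodes are tuples."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def build_forest(rows, labels, n_trees=15, k=1, seed=0):
    """Train each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], k, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


rows = [[500, 2], [800, 3], [600, 1], [700, 2], [1500, 40], [2000, 55], [1800, 60], [1600, 45]]
labels = ["normal"] * 4 + ["flood"] * 4
forest = build_forest(rows, labels)
```

The sketch uses k = 1 of m = 2 features per node only to keep the example small; in a realistic feature set k would be much smaller than m, as stated above.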

S3: performing cluster analysis on the data by using a K-means algorithm;

Further, when clustering with the K-means algorithm (a clustering algorithm), data is read from the HBase database, and the system generates classification samples from the clustering result and stores them in the database. The clustering step includes,

for each cluster, calculating the centroid of the cluster, and then assigning each data point to the cluster whose centroid is closest to it;
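As a non-limiting illustration, the clustering step can be sketched in plain Python, including the random choice among equally distant centroids described in the disclosure. The function names, the optional `initial` centroids argument and the sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids, rng):
    """Give each point the index of its nearest centroid; ties are broken at random."""
    labels = []
    for p in points:
        dists = [squared_distance(p, c) for c in centroids]
        nearest = min(dists)
        ties = [i for i, d in enumerate(dists) if d == nearest]
        labels.append(rng.choice(ties))
    return labels


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members; empty clusters keep theirs."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, k, initial=None, max_iter=100, seed=0):
    """Alternate assignment and centroid update until the labels stop changing."""
    rng = random.Random(seed)
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    labels = assign(points, centroids, rng)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids, rng)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0), (101.0, 100.0), (100.0, 101.0)]
labels, centroids = kmeans(points, 2, initial=[(0.0, 0.0), (100.0, 100.0)])
```

Points that end up with the same label belong to the same cluster and can be stored as one class of classification samples.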

S4: labeling according to the classification samples, and modeling by using a Bayesian classification algorithm;

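Furthermore, when the feature data of the real-time data stream is labeled using the Bayesian classification algorithm, the system classifies the feature data according to the existing classification samples and feeds the classification result back to the administrator. The labeling step includes,

for each data point, calculating the similarity between the data point and each classification sample;

assigning the data point, according to the similarity, to the category of the most similar classification sample;

after the assignment is completed, labeling the data points accordingly, the label types being safe events and unsafe events.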

Further, modeling is performed by using a Bayesian classification algorithm to generate a Bayesian classification model, the Bayesian classification model is operated periodically, the Bayesian classification algorithm is expressed as,

$$P(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{\sum_{l} \pi_l f_l(x)}$$

where $P(Y=k \mid X=x)$ is the posterior probability; the variable Y has k possible results (in the present application k takes the values 1 and 2: a safe event is class 1 and an unsafe event is class 2); X is the given observation data; $\pi_k$ is the prior probability that the variable Y belongs to the k-th class; $f_k(x)$ is the likelihood function, that is, the probability that the observation data is x given that the classification result is k; $\pi_l$ is the prior probability that the variable Y belongs to the l-th class; and $f_l(x)$ is the corresponding likelihood function;

It should be noted that Bayesian classification assumes that the features are independent: given that the classification result is k, the feature variables are mutually independent, expressed as:

$$P(x_1, x_2, \ldots, x_n \mid y=k) = P(x_1 \mid y=k)\,P(x_2 \mid y=k)\cdots P(x_n \mid y=k)$$

wherein $x_1$ to $x_n$ are the acquired data.
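As a non-limiting illustration, the posterior probability and the independence assumption can be sketched in plain Python for categorical features. Class 1 denotes a safe event and class 2 an unsafe event, as stated above. The add-one (Laplace) smoothing, the function names and the example features (protocol and size bucket) are assumptions of the sketch and are not taken from the application.

```python
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (feature tuple, class). Returns the counts the model needs."""
    class_counts = Counter(label for _, label in samples)
    priors = {k: c / len(samples) for k, c in class_counts.items()}   # prior of each class
    value_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    values_seen = defaultdict(set)        # feature index -> distinct values
    for features, label in samples:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return priors, class_counts, value_counts, values_seen


def likelihood(model, features, k):
    """Likelihood of the features given class k, as a product of per-feature terms."""
    _, class_counts, value_counts, values_seen = model
    f = 1.0
    for i, value in enumerate(features):
        # Add-one smoothing (an assumption) keeps unseen values from zeroing the product.
        f *= (value_counts[(k, i)][value] + 1) / (class_counts[k] + len(values_seen[i]))
    return f


def posterior(model, features):
    """Posterior of each class: prior times likelihood, normalized over all classes."""
    priors = model[0]
    joint = {k: priors[k] * likelihood(model, features, k) for k in priors}
    total = sum(joint.values())
    return {k: p / total for k, p in joint.items()}


samples = [
    (("tcp", "small"), 1), (("tcp", "small"), 1), (("udp", "small"), 1),
    (("udp", "large"), 2), (("udp", "large"), 2), (("tcp", "large"), 2),
]
model = train(samples)
result = posterior(model, ("tcp", "small"))
```

For the input `("tcp", "small")` the smoothed likelihoods are 0.6 × 0.8 = 0.48 for class 1 and 0.4 × 0.2 = 0.08 for class 2; with equal priors of 0.5 the posterior of class 1 is 0.24 / (0.24 + 0.04) = 6/7.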

Further, when the Bayesian classification model is run, the system generates the model based on the existing classification samples and updates the model periodically. The administrator can check the state of the model at any time and adjust the model according to the requirement, and the model state comprises:

the number of classification samples currently in use;

accuracy and recall of the model;

training progress and time of the model;

the time of the last model update.
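As a non-limiting illustration, the model state listed above can be held in a small record, with accuracy and recall computed from true and predicted labels. The class `ModelState`, its field names and the predicted labels in the usage lines are hypothetical; recall is computed here for the unsafe class, which is an assumption about which class the administrator cares about.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "unsafe") -> float:
    """Fraction of truly positive records that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


@dataclass
class ModelState:
    sample_count: int          # number of classification samples currently in use
    accuracy: float
    recall: float
    training_seconds: float    # time spent on the last training run
    last_updated: datetime     # time of the last model update


# Hypothetical predictions for five records; one safe record is flagged as unsafe.
y_true = ["safe", "unsafe", "safe", "unsafe", "safe"]
y_pred = ["safe", "unsafe", "unsafe", "unsafe", "safe"]
state = ModelState(
    sample_count=len(y_true),
    accuracy=accuracy(y_true, y_pred),
    recall=recall(y_true, y_pred),
    training_seconds=0.0,
    last_updated=datetime.now(timezone.utc),
)
```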

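Further, depending on the model state, the administrator may decide whether the model needs to be adjusted. The model is adjusted in the following ways:

adding a new classification sample: when the administrator finds that new data features are not covered by the current classification samples, the new sample is added to the classification sample library and the model is retrained;

deleting unnecessary classification samples: when the administrator finds that some classification samples are no longer needed or are outdated, they are deleted from the classification sample library and the model is retrained;

adjusting the weights of classification samples: when the administrator finds that the influence of some classification samples on the model is too large or too small, their weights are adjusted and the model is retrained.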

Furthermore, when the administrator adjusts the model, the system retrains the model, and sends a notification to the administrator after the training is completed to inform the administrator of the update condition of the model, and the administrator can check the state of the model at any time and adjust the model according to the requirement.

S5: and carrying out different treatments on the classified data labels.

Furthermore, when the classified data label is obtained, the system handles the data differently according to the type of the label, as follows:

When the label indicates an unsafe event, the label data is pushed to Kafka in real time for the service system platform to call; at the same time, the system sends a notification to the administrator as a reminder to pay attention to the event.

When the label indicates a safe event, the label is stored in a database;
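As a non-limiting illustration, the label-dependent handling of step S5 can be sketched in Python. The Kafka push, the administrator notification and the database write are represented by plain callables so that the sketch stays self-contained; the names `handle_label`, `push`, `notify` and `store` are hypothetical.

```python
from typing import Callable, Dict, List


def handle_label(record: Dict, label: str,
                 push: Callable[[Dict], None],
                 notify: Callable[[str], None],
                 store: Callable[[Dict], None]) -> None:
    """Route one labeled record: unsafe events are pushed and reported, safe events stored."""
    if label == "unsafe":
        push(record)     # stand-in for the real-time push to Kafka
        notify(f"Unsafe event from {record.get('source_ip', 'unknown source')}")
    elif label == "safe":
        store(record)    # stand-in for the database write
    else:
        raise ValueError(f"unknown label: {label!r}")


pushed: List[Dict] = []
stored: List[Dict] = []
alerts: List[str] = []
handle_label({"source_ip": "10.0.0.3"}, "unsafe", pushed.append, alerts.append, stored.append)
handle_label({"source_ip": "192.168.1.10"}, "safe", pushed.append, alerts.append, stored.append)
```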

furthermore, through the data labels in the database, real-time monitoring and quick response are realized, the network security protection level is effectively improved, unknown attacks are predicted and analyzed by using the existing classification samples and algorithm models, security risks are found timely, corresponding measures are taken for protection and repair, and the probability and loss of the network being attacked are reduced.

The present embodiment also provides a computing device comprising, a memory and a processor; the memory is used for storing computer executable instructions, and the processor is used for executing the computer executable instructions to realize the network data security prediction method based on big data as proposed by the above embodiment.

The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements a network data security prediction method based on big data as proposed in the above embodiments.

The storage medium according to the present embodiment belongs to the same inventive concept as the network data security prediction method based on big data according to the above embodiment, and technical details not described in detail in the present embodiment can be seen in the above embodiment, and the present embodiment has the same beneficial effects as the above embodiment.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile memory may include read only memory, magnetic tape, floppy disk, flash memory, optical memory, high density embedded nonvolatile memory, resistive memory, magnetic memory, ferroelectric memory, phase change memory, graphene memory, and the like. Volatile memory can include random access memory, external cache memory, or the like. By way of illustration, and not limitation, RAM can take many forms, such as static random access memory or dynamic random access memory. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.

Example 2

In order to verify the beneficial effects of the present application, the following is a scientific demonstration through simulation experiments.

In this example, the conventional technical scheme and the technical scheme of the present application are each used to process the same web log data set;

Table 1. Simulated network data packets

Timestamp	Source IP	Target IP	Packet size	Label type
1621963185	192.168.1.10	10.0.0.2	500 bytes	Safe event
1621963186	10.0.0.3	192.168.1.12	1000 bytes	Unsafe event
1621963187	192.168.1.14	10.0.0.5	2000 bytes	Safe event
1621963188	10.0.0.7	192.168.1.16	1500 bytes	Unsafe event
1621963189	192.168.1.18	10.0.0.9	800 bytes	Safe event

The conventional technique was used to process and classify this data set of 5 web log records, with 2 each of safe events and unsafe events, taking a total of 1 hour. The final classification accuracy was 80%, the false alarm rate 20%, and the missed report rate 0%.

The same web log data set was then processed and classified using the scheme described above. The data was first pushed into Kafka; the preliminary classification by the random forest algorithm took 10 minutes, the cluster analysis by the K-means algorithm took 5 minutes, and the modeling by the Bayesian classification algorithm took 30 minutes. During classification, the data was handled differently according to its classification label: safe events were passed through directly, while for unsafe events the relevant information was recorded and a corresponding alarm notification was triggered. The whole process took 45 minutes in total; the classification accuracy reached 100%, with a false alarm rate of 0% and a missed report rate of 0%.

Even with this small number of samples, the scheme of the application shows clear advantages in accuracy and processing time, and its advantages over the conventional scheme are expected to be more pronounced with the large sample volumes of a big data network.

It should be noted that the above embodiments are only for illustrating the technical solution of the present application and not for limiting the same, and although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present application may be modified or substituted without departing from the spirit and scope of the technical solution of the present application, which is intended to be covered in the scope of the claims of the present application.

Claims

1. The network data security prediction method based on big data is characterized by comprising the following steps:

collecting log data and pushing the log data to Kafka;

performing cluster analysis on the data by using a K-means algorithm;

and carrying out different treatments on the classified data labels.

2. The big data based network data security prediction method of claim 1, wherein: the process of the preliminary classification includes,

3. The big data based network data security prediction method of claim 2, wherein: the step of classifying the network attack by the random forest algorithm includes,

constructing a random forest, repeating the feature selection and decision tree training to generate a plurality of decision trees, each of which is independent and random, evaluating the performance of the model using a cross-validation technique, and selecting the optimal parameters and features;

4. The big data based network data security prediction method of claim 1, wherein: the step of cluster analysis includes the steps of,

5. The big data based network data security prediction method of claim 1, wherein: the labeling process based on the classified samples includes,

6. The big data based network data security prediction method of claim 1, wherein: the bayesian classification algorithm is represented as,

$$P(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{\sum_{l} \pi_l f_l(x)}$$

wherein $P(Y=k \mid X=x)$ is the posterior probability; the variable Y has k possible results; X is the given observation data; $\pi_k$ is the prior probability that the variable Y belongs to the k-th class; $f_k(x)$ is the likelihood function, that is, the probability that the observation data is x given that the classification result is k; $\pi_l$ is the prior probability that the variable Y belongs to the l-th class; and $f_l(x)$ is the corresponding likelihood function;

$$P(x_1, x_2, \ldots, x_n \mid y=k) = P(x_1 \mid y=k)\,P(x_2 \mid y=k)\cdots P(x_n \mid y=k)$$

wherein $x_1$ to $x_n$ are the acquired data.

7. The big data based network data security prediction method of claim 6, wherein: the Bayesian classification algorithm generates a Bayesian classification model according to the existing classification sample under a system modeling module, the model is adjusted in the way that,

adding a new classification sample, and when an administrator finds that the new data features are not covered by the current classification sample, adding the new classification sample into a classification sample library and retraining a model;

deleting unnecessary classification samples, deleting some classification samples from the classification sample library when an administrator finds that the classification samples are no longer needed or outdated, and retraining the model;

adjusting the weights of the classified samples, and when an administrator finds that the influence of some classified samples on the model is too large or too small, adjusting the weights of the classified samples and retraining the model;

8. A big data based network data security prediction system, comprising:

9. A computer device, comprising: a memory and a processor; the memory stores a computer program characterized in that: the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program implementing the steps of the method of any of claims 1 to 7 when executed by a processor.