CN117176482A - Big data network safety protection method and system

Info

Publication number: CN117176482A
Application number: CN202311454852.3A
Authority: CN
Inventors: 徐志华; 高云; 姚磊
Original assignee: Guoren Property Insurance Co ltd
Current assignee: Guoren Property Insurance Co ltd
Priority date: 2023-11-03
Filing date: 2023-11-03
Publication date: 2023-12-05
Anticipated expiration: 2043-11-03
Also published as: CN117176482B

Abstract

The application discloses a big data network security protection method and system. The method comprises the following steps: S1: a data collection module collects network traffic parameters, user behavior parameters and device information; S2: a preprocessing module cleans and formats the collected data; S3: malicious content is identified using an improved SimHash algorithm, which includes optimizing a security weight factor with the whale algorithm; S4: end. Because the improved SimHash algorithm uses the whale algorithm to optimize the security weight factor for network security judgment, and calculates the adjusted weight from the basic weight and the security weight factor, the method enables accurate automatic judgment of insurance network security and greatly enhances the security of insurance network data transactions.

Description

Big data network safety protection method and system

Technical Field

The application relates to the technical field of insurance data network security, in particular to a big data network security protection method and system.

Background

With the rapid development of internet technology and the advent of the big data era, network security problems have become increasingly prominent. Insurance enterprises are an important part of the financial sector, and their need for network security is particularly urgent. Traditional network security protection methods often rely on fixed rules and known malicious signatures, and struggle to cope with today's increasingly complex and varied attack techniques. It is therefore important to develop a network security protection method that can adaptively identify and defend against unknown threats.

The SimHash algorithm is a widely used method for text similarity calculation. By converting text content into numeric fingerprints and calculating the Hamming distance between fingerprints, SimHash can quickly evaluate the similarity of texts. However, the conventional SimHash algorithm is mainly used for general text processing and lacks optimization specific to network security scenarios. In the prior art, SimHash identifies malicious content poorly, and no method combines the characteristics of insurance data with network security factors when making a judgment, so the anomaly identification rate is low and various network attack behaviors cannot be accurately identified.

Disclosure of Invention

To solve the above problems in the prior art, the present application provides a big data network security protection method and system. The method identifies malicious content using an improved SimHash algorithm, which includes optimizing the security weight factor with the whale algorithm for network security judgment, and calculates the adjusted weight from the basic weight and the security weight factor. This enables accurate automatic judgment of insurance network security and greatly enhances the security of insurance network data transactions.

The application relates to a big data network safety protection method, which comprises the following steps:

S1: data collection module: collect network traffic parameters, user behavior parameters and device information;

S2: preprocessing module: clean and format the collected data;

S3: identify malicious content using an improved SimHash algorithm, including optimizing the security weight factor with the whale algorithm;

S31: perform word segmentation on the preprocessed network content and calculate the basic weight of each word;

S32: calculate the security weight factor: for each word, calculate its security weight factor and determine α and β,

where the terms of the security weight factor are the i-th word, the weight coefficient α, the threshold β, and a function representing the degree of association between that word and malicious content;

S321: initialize the whale population: each whale represents a set of parameters {α, β};

S322: define the fitness function: the fitness function evaluates the effect of each set of parameters based on the precision and recall metrics;

in the fitness function, Precision(α, β) represents the proportion of content identified as malicious that is actually malicious, calculated based on the parameters α and β, and Recall(α, β) represents the proportion of malicious content that is correctly identified, calculated based on the parameters α and β;

S323: simulate whale predation behavior: continuously update the positions of the whales by simulating whale predation behavior, searching for the optimal parameter combination;

S324: find the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;

S33: calculating the adjusted weight:

S34: calculate the hash value;

S35: compute the SimHash fingerprint: calculate the SimHash fingerprint of the network content by combining the adjusted weights and the hash values;

S36: compare with known malicious attack parameters: compare the calculated SimHash fingerprint with the parameters of known malicious content, and mark the content as abnormal if the similarity exceeds a set threshold;

S4: end.

Preferably, the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time and session end time; the user behavior parameters include: user name, login time, login location, accessed websites, dwell time, click behavior, search keywords, search results, and the type, size and time of uploaded or downloaded files; the device information includes: device type, operating system and browser version.

Preferably, the preprocessing module cleans and formats the collected data, the cleaning including deleting duplicate entries when the same operation of the same user is recorded multiple times, and the formatting including normalizing the data with Z-Score.

Preferably, the function in the security weight factor represents the degree of association between the i-th word and malicious content,

。

The application also provides a big data network security protection system, which comprises:

a data collection module: collecting network traffic parameters, user behavior parameters and device information;

a preprocessing module: cleaning and formatting the collected data;

an improved SimHash malicious content identification module: identifying malicious content using the improved SimHash algorithm, including optimizing the security weight factor with the whale algorithm;

first, word segmentation is performed on the preprocessed network content, and the basic weight of each word is calculated;

second, the security weight factor is calculated: for each word, its security weight factor is calculated and α and β are determined;

initializing the whale population: each whale represents a set of parameters {α, β};

defining the fitness function: the fitness function evaluates the effect of each set of parameters based on the precision and recall metrics;

simulating whale predation behavior: the positions of the whales are continuously updated by simulating whale predation behavior, searching for the optimal parameter combination;

finding the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;

Calculating the adjusted weight:

calculating a hash value;

computing the SimHash fingerprint: the SimHash fingerprint of the network content is calculated by combining the adjusted weights and the hash values;

a comparison module for comparison with known malicious attack parameters: comparing the calculated SimHash fingerprint with known malicious attack parameters, and marking the content as abnormal if the similarity exceeds a set threshold;

S4: end.


The application provides a big data network security protection method and system, which can achieve the following beneficial technical effects:

1. The application identifies malicious content using an improved SimHash algorithm, which includes optimizing the security weight factor with the whale algorithm for network security judgment. By combining the SimHash algorithm with the whale algorithm, applying the combination to insurance network cases, and taking the interaction characteristics of insurance data networks into account when identifying and judging network security, the application greatly improves the accuracy of detecting abnormal attacks on insurance networks and improves security.

2. The application calculates the adjusted weight from the basic weight and the security weight factor, comprehensively considering weight factors from multiple aspects. This enables accurate automatic judgment of insurance network security, greatly enhances the security of insurance network data transactions, and, by including these influencing factors in the calculation, improves the accuracy of judging network attack behavior.

3. The application uses the whale algorithm to optimize the weight coefficient α and the threshold β, finding their optimal values and thereby greatly improving the accuracy of these coefficients. The terms of the security weight factor are the i-th word, the weight coefficient α, the threshold β, and a function representing the degree of association between that word and malicious content. Through this optimization with the whale algorithm, the accuracy of the security weight factor is greatly improved, which in turn greatly improves the accuracy of judging attacks on insurance networks.

Drawings

To illustrate the embodiments of the application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the application; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of the steps of a security protection method for a big data network according to the present application;

FIG. 2 is a schematic diagram of a big data network security protection system of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.

Example 1:

To solve the problems in the prior art mentioned above, as shown in FIG. 1, the application provides a big data network security protection method, which comprises the following steps:

network traffic parameters

Network traffic parameters are mainly concerned with the transmission and communication of data in the network. The following are some of the main parameters:

IP address: a source IP address and a destination IP address for identifying a sender and a receiver of the data packet.

Port number: a source port and a destination port for identifying a particular network service.

Protocol type: such as TCP, UDP, ICMP, for determining the transmission mode of the data packet.

Packet size: the size of each packet is used to analyze the network load.

Transmission rate: the transmission speed of the data can be used to detect network congestion or attacks.

Session information: including the start and end times of the session, for analyzing the persistence of the network connection.

User behavior parameters

The user behavior parameters are primarily concerned with the user's activities and behaviors in the network. The following are some of the main parameters:

Login information: including user name, login time, login location, etc.

Browsing behavior: accessed web address, residence time, click behavior, etc.

Searching records: search keywords and search results for the user.

Upload/download behavior: the type, size, time, etc. of the file being uploaded or downloaded.

Interaction behavior: interaction records with other users or systems, such as chat, comments, etc.

Other possible parameters

In addition to the above parameters, the following parameters may be included:

Device information: the type of device used by the user, the operating system, the browser version, etc.

Network status: such as delay, packet loss rate, etc., for evaluating network quality.

Security event: any security related event such as login failure, abnormal access, etc.

Third party application behavior: such as the user logging in through social media, paying using a third party, etc.

Geographic location information: the physical location information of the user can be used to analyze user behavior and risk assessment.

Illustrative examples

For example, a user accesses an insurance company's online service from IP address 192.168.1.1 through port 443. The user logs in with the Chrome browser on the Windows operating system, stays on a product page for 5 minutes, and then downloads an insurance contract. During this process, the system collects the network traffic parameters and user behavior parameters above, together with other possible parameters such as device information and the size of the downloaded file, for subsequent analysis and security protection.
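As a rough illustration of the record that such a collection step might produce, the following Python sketch models the example above as a simple data structure. The class name `CollectedRecord` and all field names are hypothetical and are not defined by the application; they only mirror the parameters mentioned in the example.

```python
from dataclasses import dataclass


@dataclass
class CollectedRecord:
    """Hypothetical container for one collected observation (names are illustrative)."""
    source_ip: str          # network traffic parameter
    port: int               # network traffic parameter
    operating_system: str   # device information
    browser: str            # device information
    dwell_seconds: int      # user behavior parameter (time spent on the page)
    downloaded_file: str    # user behavior parameter (download behavior)


# The illustrative example from the text: port 443, Windows, Chrome, 5 minutes.
record = CollectedRecord(
    source_ip="192.168.1.1",
    port=443,
    operating_system="Windows",
    browser="Chrome",
    dwell_seconds=5 * 60,
    downloaded_file="insurance contract",
)
```

Grouping the fields in one record keeps the traffic, behavior and device parameters together for the later preprocessing step; a real system would likely use a richer schema.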

S2: preprocessing module: clean and format the collected data;

1. Data cleansing

Data cleansing is the process of removing or correcting erroneous, inconsistent or irrelevant information in a data set.

Examples: removing duplicate records: if the same operation of the same user is recorded multiple times, duplicate entries may be deleted.

Filling the missing value: for example, if some records lack port numbers, these missing values may be populated according to the protocol type (e.g., HTTP typically uses port 80).

Correcting the error value: for example, if the IP address format is incorrect (e.g., "192.300.1.1"), it may be marked as erroneous and corrected or deleted.

2. Data conversion

Data conversion is the process of converting data into a format or structure suitable for analysis.

Examples: unifying units: for example, all data transmission rates are converted to a single unit, such as Mbps.

Unifying the time format: all dates and times are converted to a single format, such as "YYYY-MM-DD HH:MM:SS".

3. Data normalization

Data normalization is the process of converting numeric attributes of different ranges into similar ranges for subsequent analysis.

Examples: min-max normalization: for example, packet sizes ranging from 0 to 1500 bytes are scaled to the range 0 to 1.

Z-Score normalization: for example, the user dwell time is converted to Z-Score in order to identify abnormal dwell times.

4. Data aggregation

Data aggregation is a process of combining multiple data points into a single data point, typically used to reduce the complexity of the data.

Examples: time period aggregation: for example, per-minute network traffic is aggregated into per-hour traffic.

User behavior aggregation: for example, all click actions within a single user's day are aggregated into a single record.

5. Feature extraction

Feature extraction is the process of extracting useful information from raw data to facilitate subsequent analysis and modeling.

Examples: extracting the geographic position of the IP address: for example, geographic information of a country, a city, etc. is extracted from the IP address.

Extracting a theme of browsing behaviors: for example, a main topic or keyword is extracted from web page content accessed by a user.

Through the above preprocessing steps, the data will be cleaned, transformed, normalized, aggregated and feature extracted, providing accurate and consistent inputs for subsequent optimization analysis and decision execution.
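Some of the preprocessing steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the application's implementation: the record keys `user`, `operation` and `timestamp`, the function names, and the use of the population standard deviation for Z-Score are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def deduplicate(records):
    """Keep the first record for each (user, operation, timestamp) key (hypothetical keys)."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["user"], rec["operation"], rec["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def min_max(values, lo=0.0, hi=1500.0):
    """Scale values from [lo, hi] (e.g. packet sizes in bytes) to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Z-Score normalization; returns zeros when all values are identical."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


logs = [
    {"user": "u1", "operation": "login", "timestamp": "2023-11-03 10:00:00"},
    {"user": "u1", "operation": "login", "timestamp": "2023-11-03 10:00:00"},
    {"user": "u2", "operation": "download", "timestamp": "2023-11-03 10:05:00"},
]
cleaned = deduplicate(logs)
scaled_sizes = min_max([0, 750, 1500])
dwell_z = z_scores([60, 120, 180])
```

A dwell time whose Z-Score is far from zero could then be flagged for closer inspection, in line with the use of Z-Score described above.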

Calculating word weights is a key step in many text analysis tasks, including the analysis of network content with the SimHash algorithm. The weight may reflect the importance of a word in the text or its degree of association with a particular task. The following describes, with examples, how the weights can be calculated:

Method 1: term frequency-inverse document frequency (TF-IDF)

TF-IDF is a commonly used weight calculation method that combines the frequency of a word in the text (TF) with the rarity of the word across the whole document collection (IDF).

Examples:

Assume that we have an article about network security that contains 100 words in total, of which the word "attack" appears 10 times, and that out of the entire document collection (e.g., 1000 articles) only 100 articles mention "attack".

Calculate TF (term frequency):

TF=10/100=0.1

Calculate IDF (inverse document frequency):

IDF=ln(1000/100)=ln(10)≈2.3

Calculate the TF-IDF weight:

TF-IDF=TF*IDF=0.1*2.3=0.23
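The TF-IDF arithmetic in this example can be checked with a short Python sketch. It assumes that the article contains 100 words in total and that the IDF uses the natural logarithm, which is what yields the value 2.3 used in the text; the function name `tf_idf` is illustrative only.

```python
import math


def tf_idf(term_count, total_terms, total_docs, docs_with_term):
    """TF-IDF with a natural-log IDF (an assumption consistent with IDF ≈ 2.3)."""
    tf = term_count / total_terms                 # 10 / 100 = 0.1
    idf = math.log(total_docs / docs_with_term)   # ln(1000 / 100) ≈ 2.3
    return tf * idf


weight = tf_idf(term_count=10, total_terms=100, total_docs=1000, docs_with_term=100)
print(round(weight, 2))  # 0.23
```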

Method 2: topic-based weights

If we know the topic or classification of text, we can assign weights according to the degree of association of words with topics.

Examples:

assuming we are analyzing an article about a firewall, we can assign a higher weight to the words associated with the firewall.

The word "firewall" weight: 5

The word "security" weight: 3

General vocabulary weights: 1

Method 3: weights based on expert knowledge

In some cases, expert knowledge may be relied upon to assign weights, particularly when the content of the analysis relates to a particular field or a particular pattern needs to be identified.

Examples:

Assuming we are analyzing an article about phishing attacks, we can assign higher weights to words related to phishing attacks based on expert knowledge.

The word "phishing" weight: 5

The word "fraud" weight: 4

General vocabulary weights: 1

With the methods above, the weight of each word can be calculated according to different requirements and scenarios. These weights can be used in the SimHash algorithm or in other text analysis tasks to capture the key information and patterns of the text.

The relevance function of the Security Weight Factor (SWF) is used to gauge the relevance of words to a particular security topic or malicious behavior. This degree of association can be calculated in a number of ways, some possible methods and specific formula expressions being as follows:

Method 1: based on the frequency of occurrence of words in malicious content

If we have a set of known malicious content (e.g., malware descriptions, phishing website text, etc.), we can calculate the frequency of occurrence of words in these content as the degree of association.

Examples:

assume that the word "attack" occurs 50 times in 100 known malicious articles, and 200 times in the entire document collection (e.g., 1000 articles).

S322: define the fitness function: the fitness function evaluates the effect of each set of parameters based on the precision and recall metrics. Assume that we are optimizing two parameters: the weight coefficient α and the threshold β in the security weight factor (SWF). The following metrics are available for defining the fitness function:

Accuracy: the proportion of all content, malicious and non-malicious, that is correctly identified.

Recall: the proportion of malicious content that is correctly identified.

Precision: the proportion of content identified as malicious that is actually malicious.

We can define the fitness function by combining these metrics. For example, we can use the F1 score, which is the harmonic mean of precision and recall:

F1=(2*Precision*Recall)/(Precision+Recall)

Assume that we have the following data:

The precision calculated using the parameters α=0.5 and β=0.3 is 0.8.

The recall calculated using the parameters α=0.5 and β=0.3 is 0.7.

Substituting these values into the fitness function formula gives:

Fitness=(2*0.8*0.7)/(0.8+0.7)=1.12/1.5≈0.747

This fitness value can be used in the whale algorithm to evaluate the effect of the parameter combination α=0.5 and β=0.3. The whale algorithm tries to find the combination of parameters that maximizes the fitness function, and thus the optimal α and β, to improve the accuracy of identifying malicious content.
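The fitness evaluation for this example can be sketched in Python as follows. The sketch assumes that the value 0.8 is the precision and 0.7 the recall obtained with α=0.5 and β=0.3, and that the fitness is the F1 score; the function name `f1_fitness` is illustrative.

```python
def f1_fitness(precision, recall):
    """F1 score: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the example for the parameter combination α=0.5, β=0.3.
fitness = f1_fitness(precision=0.8, recall=0.7)
print(round(fitness, 3))  # 0.747
```

In an optimization loop, this function would be called once per candidate (α, β) pair after precision and recall have been measured on labeled data.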

S33: calculating the adjusted weight:

S34: calculate the hash value;

S4: end.

During its calculation, the SimHash algorithm can segment login information, interaction behavior, search records and similar data into words and calculate the weight of each word. The following illustrates how word segmentation and weight calculation can be performed for these different types of data:

1. Login information analysis

Login information may include text such as the login location and device information. Word segmentation and weight calculation can be performed by the following steps:

Word segmentation: segment text content such as the login location and device information.

Weight calculation: each word is assigned a weight based on the user's historical login behavior. For example, unusual login locations may have higher weights.

2. Interactive behavior analysis

The interactive behavior may include text content such as chat, comments, etc. The word segmentation and weight calculation can be performed by the following steps:

Word segmentation: segment text content such as chat messages and comments.

Weight calculation: each word is assigned a weight based on the sensitivity of the content or similarity to known malicious behavior. For example, content containing sensitive words may have a higher weight.

3. Search record analysis

The search record includes text content of search keywords and search results. The word segmentation and weight calculation can be performed by the following steps:

Word segmentation: segment the text content of the search keywords and search results.

Weight calculation: each word is assigned a weight based on the user's search history and interest preferences. For example, words that are highly relevant to the user's interests may have higher weights.

Through word segmentation and weight calculation, the SimHash algorithm can convert different types of data such as login information, interaction behaviors, search records and the like into fingerprints in a numerical form. The fingerprints can be used for subsequent applications such as similarity calculation, anomaly detection, recommendation systems and the like, and support is provided for network security protection and user experience optimization.
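How weighted words can be turned into a numeric fingerprint may be sketched with a standard weighted SimHash in Python. This is a generic sketch, not the improved algorithm of the application: the 64-bit fingerprint length, the use of MD5 as the per-word hash, and the example weights are all assumptions, and the weights passed in stand in for the adjusted weights described above.

```python
import hashlib


def token_hash(token, bits=64):
    """Hash a word to a `bits`-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_tokens, bits=64):
    """Weighted SimHash: add the weight where a hash bit is 1, subtract where it is 0."""
    totals = [0.0] * bits
    for token, weight in weighted_tokens.items():
        h = token_hash(token, bits)
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # positive column sum -> fingerprint bit set to 1
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical weighted words from two pieces of content.
fp_a = simhash({"phishing": 5.0, "fraud": 4.0, "account": 1.0})
fp_b = simhash({"phishing": 5.0, "fraud": 4.0, "login": 1.0})
distance = hamming_distance(fp_a, fp_b)
```

Because high-weight words dominate the column sums, two texts that share their heavily weighted words tend to produce fingerprints with a small Hamming distance.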


Example 2:

The application also provides a big data network security protection system, as shown in FIG. 2, comprising:

The SimHash algorithm can be used to analyze various types of data, such as login information, interaction behavior and search records. The following illustrates how the SimHash algorithm can be applied to these different types of data:

1. Login information analysis

The login information typically includes a user name, a password, a login time, a login location, and the like. SimHash may be used to detect abnormal login behavior.

Feature extraction: features are extracted from the login information, such as login location, login device, etc.

Weight calculation: each feature is assigned a weight, possibly based on historical login behavior of the user.

SimHash calculation: the SimHash algorithm is used to calculate the fingerprint of the login information.

Anomaly detection: the fingerprint is compared with the user's historical login fingerprints; if the similarity is below a certain threshold, the login may be abnormal.

2. Interactive behavior analysis

The interaction may include chat, comment, share, etc. with other users. SimHash can be used to detect malicious or abnormal interactions.

Feature extraction: and extracting the characteristics of keywords, topics and the like from the interactive contents.

Weight calculation: each feature is assigned a weight, possibly based on the sensitivity of the content or similarity to known malicious behavior.

SimHash calculation: the SimHash algorithm is used to compute the fingerprint of the interaction behavior.

Malicious content detection: the fingerprint is compared with the fingerprints of known malicious content; if the similarity is above a certain threshold, the content may be malicious.

3. Search record analysis

The search record includes search keywords and search results of the user. SimHash can be used for user behavior analysis and recommendation systems.

Feature extraction: features, such as topics, categories, etc., are extracted from the search keywords and results.

Weight calculation: each feature is assigned a weight, possibly based on the user's search history and interest preferences.

SimHash calculation: the SimHash algorithm is used to calculate the fingerprint of the search record.

Recommendation system: the search fingerprint is compared with the search fingerprints of other users to find similar users for the recommendation system.

In general, the flexibility and versatility of the SimHash algorithm make it applicable to various types of data analysis, including login information, interaction behavior and search records. With appropriate feature extraction and weight calculation, SimHash can be used to detect anomalies, identify malicious content and analyze user behavior, supporting network security protection.
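The threshold comparison against known malicious fingerprints can be sketched in Python as follows. The similarity measure (one minus the normalized Hamming distance), the 64-bit fingerprint length, the threshold of 0.9 and the function names are assumptions for illustration; the application only states that content is marked abnormal when the similarity exceeds a set threshold.

```python
def similarity(fp_a, fp_b, bits=64):
    """Similarity in [0, 1]: 1 minus the fraction of differing fingerprint bits."""
    distance = bin(fp_a ^ fp_b).count("1")
    return 1.0 - distance / bits


def is_abnormal(fingerprint, known_malicious, threshold=0.9, bits=64):
    """Mark content abnormal if it is similar enough to any known malicious fingerprint."""
    return any(
        similarity(fingerprint, known, bits) > threshold
        for known in known_malicious
    )


# Hypothetical fingerprints: the candidate differs from the known sample in one bit.
known = [0xDEADBEEFCAFEF00D]
candidate = known[0] ^ 0b1
flagged = is_abnormal(candidate, known)
```

For abnormal login detection the comparison runs the other way: a login would be suspicious when its similarity to the user's historical fingerprints falls below a threshold.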

a preprocessing module: cleaning and formatting the collected data;

finding the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;

calculating the adjusted weight:

calculating the hash value;

S4: end.

The whale algorithm is a heuristic optimization algorithm that finds the best solution by simulating the predation behavior of whales. The following is a detailed description and formulation of how the whale algorithm is used to find the best parameter combination:

1. Initializing the whale population

First, we need to create a virtual population of "whales", each representing a set of parameters, such as the weight coefficient α and the threshold β in the security weight factor (SWF).

Whale_i={α_i,β_i}

2. Defining the fitness function

The fitness function is used to evaluate the effect of each set of parameters. For example, we can use the F1 score as the fitness function:

Fitness(Whale_i)=(2*Precision(α_i,β_i)*Recall(α_i,β_i))/(Precision(α_i,β_i)+Recall(α_i,β_i))

3. Simulating whale predation behavior

By simulating the predation behavior of whales, the positions of the whales are continuously updated and the optimal parameter combination is found.

Examples:

Let us assume that we have 5 whales, each representing a different set of α and β. We can find the best parameter combination by:

Calculating the fitness of each whale:

Fitness(Whale_i), i=1,2,…,5

Finding the current best whale:

BestWhale=argmax_i Fitness(Whale_i)

Computing the new position of each whale:

NewPosition_i=Whale_i+A*(BestWhale−Whale_i)

where A is an adjustment factor that may be gradually reduced as the number of iterations increases.

Updating the whale positions:

Whale_i=NewPosition_i

Repeating these steps until the algorithm converges or the maximum number of iterations is reached.

4. Finding an optimal solution

When the algorithm converges, the best whale found represents the optimal α and β.

{α_optimal,β_optimal}=BestWhale

Through this process, the whale algorithm can effectively find the optimal parameter combination to improve the accuracy of identifying malicious content. The process fully utilizes the global searching capability and the local searching capability of the whale algorithm, so that the parameter optimization is more accurate and robust.
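The simplified update rule described above can be sketched in Python as follows. This is a minimal sketch of only that rule, NewPosition_i = Whale_i + A*(BestWhale − Whale_i); the full whale optimization algorithm also includes spiral and random-search phases that are omitted here. The random factor inside A, the linear decay schedule, the initialization in [0, 1] and the toy fitness function are assumptions; in practice the fitness would be the F1 score measured for each (α, β) pair.

```python
import random


def whale_search(fitness, n_whales=5, iterations=50, seed=0):
    """Move each whale toward the current best whale with a shrinking step factor A."""
    rng = random.Random(seed)
    # Each whale is a candidate (alpha, beta) pair, initialized uniformly in [0, 1].
    whales = [(rng.random(), rng.random()) for _ in range(n_whales)]
    best = max(whales, key=fitness)
    for t in range(iterations):
        decay = 1.0 - t / iterations  # shrinks from 1 toward 0 over the run
        moved = []
        for alpha, beta in whales:
            a = decay * rng.random()  # adjustment factor A (randomized, an assumption)
            moved.append((alpha + a * (best[0] - alpha),
                          beta + a * (best[1] - beta)))
        whales = moved
        candidate = max(whales, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate  # keep the best whale found so far
    return best


def toy_fitness(params):
    """Stand-in fitness peaking at alpha=0.5, beta=0.3 (purely illustrative)."""
    alpha, beta = params
    return -((alpha - 0.5) ** 2 + (beta - 0.3) ** 2)


alpha_opt, beta_opt = whale_search(toy_fitness)
```

Because every step moves whales only toward the current best, this sketch mainly refines around good candidates; the exploration phases of the full algorithm are what give it a broader search.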


The big data network security protection method and system have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the application, and the above examples are only intended to aid understanding of its core idea. As will be apparent to those skilled in the art in light of these teachings, the application should not be limited to the specific embodiments and applications described herein.

Claims

1. A big data network security protection method, characterized by comprising the following steps:

S2: preprocessing module: clean and format the collected data;

Precision(α, β) represents the proportion of content identified as malicious that is actually malicious, calculated based on the parameters α and β, and Recall(α, β) represents the proportion of malicious content that is correctly identified, calculated based on the parameters α and β;

S33: calculating the adjusted weight:

S34: calculate the hash value;

S35: compute the SimHash fingerprint: calculate the SimHash fingerprint of the network content by combining the adjusted weights and the hash values;

S36: compare with known malicious attack parameters: compare the calculated SimHash fingerprint with the parameters of known malicious content, and mark the content as abnormal if the similarity exceeds a set threshold;

S4: end.

2. The method of claim 1, wherein the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time and session end time; the user behavior parameters include: user name, login time, login location, accessed websites, dwell time, click behavior, search keywords, search results, and the type, size and time of uploaded or downloaded files; the device information includes: device type, operating system and browser version.

3. The big data network security protection method according to claim 1, wherein the preprocessing module cleans and formats the collected data, the cleaning including deleting duplicate entries when the same operation of the same user is recorded multiple times, and the formatting including normalizing the data with Z-Score.

4. The method of claim 1, wherein the function in the security weight factor represents the degree of association between the i-th word and malicious content,

。

5. A big data network security protection system, comprising:

a preprocessing module: cleaning and formatting the collected data;

Precision(α, β) represents the proportion of content identified as malicious that is actually malicious, calculated based on the parameters α and β, and Recall(α, β) represents the proportion of malicious content that is correctly identified, calculated based on the parameters α and β;

Calculating the adjusted weight:

calculating a hash value;

S4: end.

6. The big data network security protection system of claim 5, wherein the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time and session end time; the user behavior parameters include: user name, login time, login location, accessed websites, dwell time, click behavior, search keywords, search results, and the type, size and time of uploaded or downloaded files; the device information includes: device type, operating system and browser version.

7. The big data network security protection system of claim 5, wherein the preprocessing module cleans and formats the collected data, the cleaning including deleting duplicate entries when the same operation of the same user is recorded multiple times, and the formatting including normalizing the data with Z-Score.

8. The big data network security protection system of claim 5, wherein the function in the security weight factor represents the degree of association between the i-th word and malicious content,

。

9. an electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, which when executed by the processor implements the steps of the big data network security protection method of any of claims 1 to 4.

10. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the big data network security protection method according to any of claims 1 to 4.