CN117176482A - Big data network safety protection method and system - Google Patents

Publication number
CN117176482A
Authority
CN
China
Prior art keywords
parameters
word
weight
whale
safety
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311454852.3A
Other languages
Chinese (zh)
Other versions
CN117176482B (en)
Inventor
徐志华
高云
姚磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoren Property Insurance Co ltd
Original Assignee
Guoren Property Insurance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guoren Property Insurance Co ltd
Priority to CN202311454852.3A
Publication of CN117176482A
Application granted
Publication of CN117176482B
Legal status: Active
Anticipated expiration

Abstract

The application discloses a big data network security protection method and system. The method comprises the following steps: S1: data collection: collect network traffic parameters, user behavior parameters, and device information; S2: preprocessing: clean and format the collected data; S3: identify malicious content using an improved SimHash algorithm, which includes optimizing a security weight factor with a whale algorithm; S4: end. The adjusted weight of each word is calculated from its basic weight and its security weight factor, which enables automatic and accurate judgment of insurance network security and enhances the security of insurance network data transactions.

Description

Big data network safety protection method and system
Technical Field
The application relates to the technical field of insurance data network security, in particular to a big data network security protection method and system.
Background
With the rapid development of big data and internet technology, network security problems are increasingly prominent. Insurance enterprises are an important part of the financial sector, and their need for network security is particularly urgent. Traditional network security protection methods often rely on fixed rules and known malicious features, so they struggle to cope with today's increasingly complex and diverse network attacks. It is therefore important to develop a network security protection method that can adaptively identify and defend against unknown threats.
The SimHash algorithm is a widely applied method for text similarity calculation. By converting text content into numeric fingerprints and calculating the Hamming distance between the fingerprints, SimHash can quickly evaluate the similarity of texts. However, the conventional SimHash algorithm is mainly used for general text processing and lacks specific optimization for network security scenarios. In the prior art, SimHash identifies malicious content poorly, and no method combines the characteristics of insurance data with network security factors in its judgment, so the anomaly identification rate is low and various network attack behaviors cannot be accurately identified.
Disclosure of Invention
To solve the above problems in the prior art, the present application provides a big data network security protection method and system. The method identifies malicious content using an improved SimHash algorithm, which includes optimizing a security weight factor with a whale algorithm to judge network security. The adjusted weight is calculated from the basic weight and the security weight factor, which enables automatic and accurate judgment of insurance network security and greatly enhances the security of insurance network data transactions.
The application relates to a big data network security protection method, which comprises the following steps:
S1: Data collection: collect network traffic parameters, user behavior parameters, and device information;
S2: Preprocessing: clean and format the collected data;
S3: Identify malicious content using an improved SimHash algorithm, which includes optimizing a security weight factor with a whale algorithm;
S31: Segment the preprocessed network content into words and calculate the basic weight of each word;
S32: Calculate the security weight factor: for each word, calculate its security weight factor and determine the parameters α and β.
In the security weight factor, the i-th word is the input, α is a weight coefficient, β is a threshold, and the association function represents the degree of association between that word and malicious content;
S321: Initialize the whale population: each whale represents a set of parameters {α, β};
S322: Define the fitness function: the fitness function evaluates the effect of each set of parameters based on the precision and recall metrics.
Here, precision is the proportion of content identified as malicious that is actually malicious, and recall is the proportion of malicious content that is correctly identified, both calculated based on the parameters α and β;
S323: Simulate whale predation behavior: continuously update the positions of the whales by simulating their predation behavior, searching for the optimal parameter combination;
S324: Find the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;
S33: Calculate the adjusted weight;
S34: Calculate the hash value;
S35: Compute the SimHash fingerprint: calculate the SimHash fingerprint of the network content by combining the adjusted weights with the hash values;
S36: Compare with known malicious attack parameters: compare the calculated SimHash fingerprint with the fingerprints of known malicious content, and mark the content as abnormal if the similarity exceeds a set threshold;
S4: End.
Preferably, the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time, and session end time. The user behavior parameters include: user name, login time, login location, accessed website, dwell time, click behavior, search keywords, search results, and the type, size, and time of uploaded or downloaded files. The device information includes: device type, operating system, and browser version.
Preferably, the preprocessing module cleans and formats the collected data. Cleaning includes deleting duplicate entries when the same operation by the same user is recorded multiple times, and formatting includes normalizing the data with Z-Score.
Preferably, the association function represents the degree of association between the word and malicious content.
The application also provides a big data network security protection system, which comprises:
A data collection module: collects network traffic parameters, user behavior parameters, and device information;
A preprocessing module: cleans and formats the collected data;
A module that identifies malicious content using the improved SimHash algorithm, which includes optimizing a security weight factor with a whale algorithm;
First, the preprocessed network content is segmented into words and the basic weight of each word is calculated;
Second, the security weight factor is calculated: for each word, calculate its security weight factor and determine the parameters α and β.
In the security weight factor, the i-th word is the input, α is a weight coefficient, β is a threshold, and the association function represents the degree of association between that word and malicious content;
Initialize the whale population: each whale represents a set of parameters {α, β};
Define the fitness function: the fitness function evaluates the effect of each set of parameters based on the precision and recall metrics.
Here, precision is the proportion of content identified as malicious that is actually malicious, and recall is the proportion of malicious content that is correctly identified, both calculated based on the parameters α and β;
Simulate whale predation behavior: continuously update the positions of the whales by simulating their predation behavior, searching for the optimal parameter combination;
Find the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;
Calculate the adjusted weight;
Calculate the hash value;
Compute the SimHash fingerprint: calculate the SimHash fingerprint of the network content by combining the adjusted weights with the hash values;
A comparison module for known malicious attack parameters: compares the calculated SimHash fingerprint with the fingerprints of known malicious content, and marks the content as abnormal if the similarity exceeds a set threshold;
S4: End.
Preferably, the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time, and session end time. The user behavior parameters include: user name, login time, login location, accessed website, dwell time, click behavior, search keywords, search results, and the type, size, and time of uploaded or downloaded files. The device information includes: device type, operating system, and browser version.
Preferably, the preprocessing module cleans and formats the collected data. Cleaning includes deleting duplicate entries when the same operation by the same user is recorded multiple times, and formatting includes normalizing the data with Z-Score.
Preferably, the association function represents the degree of association between the word and malicious content.
The application provides a big data network security protection method and system, which can achieve the following beneficial technical effects:
1. The application identifies malicious content using an improved SimHash algorithm, which includes optimizing a security weight factor with a whale algorithm to judge network security. It combines the SimHash algorithm with the whale algorithm, applies them to judging insurance network cases, and identifies and judges network security in light of the interaction characteristics of insurance data networks, which greatly improves the accuracy of judging abnormal attacks on insurance networks and improves security.
2. The application calculates the adjusted weight from the basic weight and the security weight factor. Weight factors from multiple aspects are considered together, which enables automatic and accurate judgment of insurance network security, greatly enhances the security of insurance network data transactions, and improves the accuracy with which influencing factors take part in the calculation and network attack behaviors are judged.
3. The application uses the whale algorithm to optimize the weight coefficient α and the threshold β and find their optimal values, which greatly improves the accuracy of these coefficients. In the security weight factor, the i-th word is the input, α is a weight coefficient, β is a threshold, and the association function represents the degree of association between the word and malicious content. Through this optimization with the whale algorithm, the accuracy of the security weight factor is greatly improved, which in turn greatly improves the accuracy of judging insurance network attack behaviors.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the steps of the big data network security protection method of the present application;
FIG. 2 is a schematic diagram of the big data network security protection system of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Example 1:
To solve the above-mentioned problems in the prior art, as shown in FIG. 1, the application provides a big data network security protection method, which comprises the following steps:
S1: Data collection: collect network traffic parameters, user behavior parameters, and device information;
network traffic parameters
Network traffic parameters are mainly concerned with the transmission and communication of data in the network. The following are some of the main parameters:
IP address: a source IP address and a destination IP address for identifying a sender and a receiver of the data packet.
Port number: a source port and a destination port for identifying a particular network service.
Protocol type: such as TCP, UDP, ICMP, for determining the transmission mode of the data packet.
Packet size: the size of each packet is used to analyze the network load.
Transmission rate: the transmission speed of the data can be used to detect network congestion or attacks.
Session information: including the start and end times of the session, for analyzing the persistence of the network connection.
User behavior parameters
The user behavior parameters are primarily concerned with the user's activities and behaviors in the network. The following are some of the main parameters:
Login information: including user name, login time, login location, etc.
Browsing behavior: accessed web address, dwell time, click behavior, etc.
Search records: the user's search keywords and search results.
Upload/download behavior: the type, size, time, etc. of the file being uploaded or downloaded.
Interaction behavior: interaction records with other users or systems, such as chat, comments, etc.
Other possible parameters include
In addition to the above parameters, the following parameters may be included:
Device information: the type of device used by the user, the operating system, the browser version, etc.
Network status: such as delay, packet loss rate, etc., for evaluating network quality.
Security event: any security related event such as login failure, abnormal access, etc.
Third party application behavior: such as the user logging in through social media, paying using a third party, etc.
Geographic location information: the physical location information of the user can be used to analyze user behavior and risk assessment.
Illustrative examples
For example, a user accesses an insurance company's online service from IP address 192.168.1.1 through port 443. The user runs the Windows operating system, logs in through the Chrome browser, stays on the product page for 5 minutes, and then downloads an insurance contract. During this process, the system collects the above network traffic parameters and user behavior parameters, as well as other possible parameters such as device information and the size of the downloaded file, for subsequent analysis and security protection.
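The record collected in this example could be represented roughly as follows. This is only a sketch: the class name `CollectedRecord`, its field names, and the user name, file name, and file size values are hypothetical placeholders that do not come from the application.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class CollectedRecord:
    """Hypothetical container for one collected observation."""
    # Network traffic parameters
    source_ip: str
    port: int
    # User behavior parameters
    user_name: str
    visited_page: str
    dwell_time: timedelta
    downloaded_file: str
    downloaded_size_bytes: int
    # Device information
    operating_system: str
    browser: str


# Values for IP, port, OS, browser, and dwell time follow the example in the
# text; the remaining values are made-up placeholders.
record = CollectedRecord(
    source_ip="192.168.1.1",
    port=443,
    user_name="example_user",
    visited_page="product page",
    dwell_time=timedelta(minutes=5),
    downloaded_file="insurance_contract.pdf",
    downloaded_size_bytes=250_000,
    operating_system="Windows",
    browser="Chrome",
)
```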
S2: Preprocessing: clean and format the collected data;
1. Data cleansing
Data cleansing is the process of removing or correcting erroneous, inconsistent or irrelevant information in a data set.
Examples: removing duplicate records: if the same operation of the same user is recorded multiple times, duplicate entries may be deleted.
Filling the missing value: for example, if some records lack port numbers, these missing values may be populated according to the protocol type (e.g., HTTP typically uses port 80).
Correcting the error value: for example, if the IP address format is incorrect (e.g., "192.300.1.1"), it may be marked as erroneous and corrected or deleted.
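The three cleansing examples above can be sketched in a few lines. This is an illustrative sketch only: the record layout (dictionaries with `user`, `operation`, `time`, `ip`, `protocol`, and `port` keys), the `DEFAULT_PORTS` table, and the choice to drop rather than correct invalid IP addresses are assumptions made for the example.

```python
import ipaddress

# Assumed lookup table used to fill in missing port numbers.
DEFAULT_PORTS = {"HTTP": 80, "HTTPS": 443}


def is_valid_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def clean(records):
    """Remove duplicates, drop malformed IPs, and fill missing ports."""
    seen = set()
    cleaned = []
    for rec in records:
        # The same operation by the same user recorded more than once.
        key = (rec["user"], rec["operation"], rec["time"])
        if key in seen:
            continue
        seen.add(key)
        # An address such as "192.300.1.1" fails validation and is dropped.
        if not is_valid_ip(rec["ip"]):
            continue
        rec = dict(rec)  # copy so the caller's data is not modified
        if rec.get("port") is None:
            rec["port"] = DEFAULT_PORTS.get(rec["protocol"])
        cleaned.append(rec)
    return cleaned
```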
2. Data conversion
Data conversion is the process of converting data into a format or structure suitable for analysis.
Examples: the units are unified: for example, all data transmission rates are converted to a unified unit, such as Mbps.
The time format is unified: all dates and times are converted to a unified format, such as "YYYY-MM-DD HH: MM: SS".
3. Data normalization
Data normalization is the process of converting numeric attributes of different ranges into similar ranges for subsequent analysis.
Examples: min-max normalization: for example, the packet size range of 0 to 1500 bytes is scaled to the range 0 to 1.
Z-Score normalization: for example, the user dwell time is converted to Z-Score in order to identify abnormal dwell times.
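Both normalization methods can be written out as a minimal sketch. The function names are illustrative, and using the population standard deviation for Z-Score is an assumption, since the text does not say which variant is intended.

```python
from statistics import mean, pstdev


def min_max(value, low, high):
    """Scale value from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)


def z_scores(values):
    """Convert values to Z-Scores using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so none deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

For instance, `min_max(750, 0, 1500)` maps a 750-byte packet to 0.5, and a dwell time with a large absolute Z-Score can be flagged for closer inspection.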
4. Data aggregation
Data aggregation is a process of combining multiple data points into a single data point, typically used to reduce the complexity of the data.
Examples: time period aggregation: for example, network traffic per minute is aggregated into traffic per hour.
User behavior aggregation: for example, all click actions within a single user's day are aggregated into a single record.
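Time period aggregation can be sketched as follows, assuming the per-minute samples arrive as `(timestamp, byte_count)` pairs. That input layout and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(per_minute):
    """Sum (timestamp, byte_count) samples into one total per hour."""
    hourly = defaultdict(int)
    for timestamp, byte_count in per_minute:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        hourly[hour] += byte_count
    return dict(hourly)
```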
5. Feature extraction
Feature extraction is the process of extracting useful information from raw data to facilitate subsequent analysis and modeling.
Examples: extracting the geographic position of the IP address: for example, geographic information of a country, a city, etc. is extracted from the IP address.
Extracting a theme of browsing behaviors: for example, a main topic or keyword is extracted from web page content accessed by a user.
Through the above preprocessing steps, the data will be cleaned, transformed, normalized, aggregated and feature extracted, providing accurate and consistent inputs for subsequent optimization analysis and decision execution.
S3: Identify malicious content using an improved SimHash algorithm, which includes optimizing a security weight factor with a whale algorithm;
S31: Segment the preprocessed network content into words and calculate the basic weight of each word.
The weighting of the word is a key step in many text analysis tasks, including the analysis of web content using SimHash algorithms. The weight may reflect the importance of the word in the text or the degree of association with a particular task. The following is a specific description and examples of how the weights are calculated:
Method 1: Term frequency-inverse document frequency (TF-IDF)
TF-IDF is a commonly used weight calculation method that combines the frequency of words in Text (TF) with the rarity of words in the whole document collection (IDF).
Examples:
Assume that we have an article about web security that contains 100 words in total, in which the word "attack" appears 10 times, and that only 100 articles in the entire document collection (e.g., 1000 articles) mention "attack".
Calculate TF (term frequency): TF = 10 / 100 = 0.1
Calculate IDF (inverse document frequency): IDF = ln(1000 / 100) ≈ 2.3
Calculate the TF-IDF weight:
TF-IDF = TF × IDF = 0.1 × 2.3 = 0.23
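The arithmetic of this example can be checked with a short sketch. It assumes the natural logarithm for IDF, which is what the value 2.3 implies, and the function names are illustrative.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of the document's words that are the given term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term):
    """Natural log of the ratio of all documents to those containing the term."""
    return math.log(total_docs / docs_with_term)


def tf_idf(term_count, total_terms, total_docs, docs_with_term):
    """Basic weight of a word as the product of TF and IDF."""
    return term_frequency(term_count, total_terms) * inverse_document_frequency(
        total_docs, docs_with_term
    )


# The numbers from the example: 10 of 100 words, 100 of 1000 documents.
weight = tf_idf(10, 100, 1000, 100)
print(round(weight, 2))  # 0.23
```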
method 2: theme-based weighting
If we know the topic or classification of text, we can assign weights according to the degree of association of words with topics.
Examples:
assuming we are analyzing an article about a firewall, we can assign a higher weight to the words associated with the firewall.
The word "firewall" weight: 5
The word "security" weight: 3
General vocabulary weights: 1
Method 3: weights based on expert knowledge
In some cases, expert knowledge may be relied upon to assign weights, particularly when the content of the analysis relates to a particular field or a particular pattern needs to be identified.
Examples:
Assuming we are analyzing an article about phishing attacks, we can assign higher weights to words related to phishing attacks based on expert knowledge.
The word "phishing" weight: 5
The word "fraud" weight: 4
General vocabulary weights: 1
By the method, the weight of the word can be calculated according to different requirements and scenes. These weights may be used in SimHash algorithm or other text analysis tasks to capture key information and patterns of text.
S32: Calculate the security weight factor: for each word, calculate its security weight factor and determine the parameters α and β.
The relevance function of the Security Weight Factor (SWF) is used to gauge the relevance of words to a particular security topic or malicious behavior. This degree of association can be calculated in a number of ways, some possible methods and specific formula expressions being as follows:
Method 1: Based on the frequency of occurrence of words in malicious content
If we have a set of known malicious content (e.g., malware descriptions, phishing website text, etc.), we can calculate the frequency of occurrence of words in these content as the degree of association.
Examples:
assume that the word "attack" occurs 50 times in 100 known malicious articles, and 200 times in the entire document collection (e.g., 1000 articles).
In the security weight factor, the i-th word is the input, α is a weight coefficient, β is a threshold, and the association function represents the degree of association between that word and malicious content;
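The formula for the degree of association is not preserved in this text. One plausible reading of Method 1, shown here purely as an assumption, is the ratio of a word's occurrences in known malicious content to its occurrences in the whole collection. The function name is hypothetical.

```python
def association_degree(count_in_malicious, count_in_all):
    """Assumed ratio: occurrences in malicious content / occurrences overall.

    This is an illustrative guess, not the formula from the application.
    """
    if count_in_all == 0:
        return 0.0
    return count_in_malicious / count_in_all


# With the counts from the "attack" example: 50 of 200 occurrences.
print(association_degree(50, 200))  # 0.25
```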
S321: Initialize the whale population: each whale represents a set of parameters {α, β};
S322: Define the fitness function: the fitness function evaluates the effect of each set of parameters based on the precision and recall metrics. Assume that we are optimizing two parameters: the weight coefficient α and the threshold β in the security weight factor (SWF). We can define the fitness function using the following metrics:
Accuracy: the proportion of all content, malicious and non-malicious, that is correctly identified.
Recall: the proportion of malicious content that is correctly identified.
Precision: the proportion of content identified as malicious that is actually malicious.
We can define the fitness function by combining these metrics. For example, we can use the F1 score, which is the harmonic mean of precision and recall:
F1(α, β) = 2 × Precision(α, β) × Recall(α, β) / (Precision(α, β) + Recall(α, β))
Here, Precision(α, β) represents the proportion of content identified as malicious that is actually malicious, calculated based on the parameters α and β, and Recall(α, β) represents the proportion of malicious content that is correctly identified, calculated based on the parameters α and β;
Assume that we have the following data:
The precision calculated using the parameters α=0.5 and β=0.3 is 0.8.
The recall calculated using the parameters α=0.5 and β=0.3 is 0.7.
Substituting these values into the F1 fitness function gives:
F1(0.5, 0.3) = 2 × 0.8 × 0.7 / (0.8 + 0.7) ≈ 0.747
This fitness value can be used in the whale algorithm to evaluate the effect of the parameter combination α=0.5 and β=0.3. The whale algorithm tries to find the parameter combination that maximizes the fitness function, and thus the optimal α and β, to improve the accuracy of identifying malicious content.
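The F1-based fitness can be sketched as below. The helper that derives precision and recall from true positive, false positive, and false negative counts is an addition for illustration. How those counts are obtained for a given α and β depends on a labeled data set that the text does not describe.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from detection counts on labeled content."""
    flagged = true_pos + false_pos
    malicious = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / malicious if malicious else 0.0
    return precision, recall


def f1_fitness(precision, recall):
    """Harmonic mean of precision and recall, used as the fitness value."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the example for alpha = 0.5 and beta = 0.3.
print(round(f1_fitness(0.8, 0.7), 3))  # 0.747
```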
S323: simulating whale predation behavior: continuously updating the position of whales by simulating the predation behavior of whales, and searching for the optimal parameter combination;
s324: finding an optimal solution: when the algorithm converges, the best whale found represents the bestAnd->;
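Steps S321 to S324 can be illustrated with a compact sketch of the standard whale optimization update rules (encircling the best whale, random search, and the spiral move). The text does not give the update equations, so using the standard ones is an assumption, as are the population size, the iteration count, the [0, 1] parameter bounds, and the spiral constant of 1. The toy fitness function stands in for the F1-based fitness, which would require labeled data.

```python
import math
import random


def whale_optimize(fitness, dim=2, n_whales=20, n_iter=100,
                   low=0.0, high=1.0, seed=0):
    """Maximize fitness over [low, high]^dim with a basic whale algorithm."""
    rng = random.Random(seed)
    # S321: each whale is one candidate parameter set, e.g. [alpha, beta].
    whales = [[rng.uniform(low, high) for _ in range(dim)]
              for _ in range(n_whales)]
    best = list(max(whales, key=fitness))
    for t in range(n_iter):
        a = 2.0 * (1 - t / n_iter)  # decreases linearly from 2 towards 0
        for i, x in enumerate(whales):
            A = 2 * a * rng.random() - a
            C = 2 * rng.random()
            if rng.random() < 0.5:
                # Encircle the best whale, or explore around a random one.
                ref = best if abs(A) < 1 else whales[rng.randrange(n_whales)]
                new = [ref[d] - A * abs(C * ref[d] - x[d]) for d in range(dim)]
            else:
                # Spiral move towards the best whale.
                l = rng.uniform(-1, 1)
                new = [abs(best[d] - x[d]) * math.exp(l)
                       * math.cos(2 * math.pi * l) + best[d]
                       for d in range(dim)]
            # Keep every parameter inside the allowed bounds.
            whales[i] = [min(high, max(low, v)) for v in new]
        # S324: remember the best whale seen so far.
        candidate = max(whales, key=fitness)
        if fitness(candidate) > fitness(best):
            best = list(candidate)
    return best


def toy_fitness(params):
    """Stand-in fitness that peaks at alpha = 0.5, beta = 0.3."""
    alpha, beta = params
    return -((alpha - 0.5) ** 2 + (beta - 0.3) ** 2)


best_alpha, best_beta = whale_optimize(toy_fitness)
```

In practice the fitness argument would wrap the full detection pipeline: compute the weights for a given α and β, classify labeled samples, and return the resulting F1 score.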
S33: Calculate the adjusted weight;
S34: Calculate the hash value;
S35: Compute the SimHash fingerprint: calculate the SimHash fingerprint of the network content by combining the adjusted weights with the hash values;
S36: Compare with known malicious attack parameters: compare the calculated SimHash fingerprint with the fingerprints of known malicious content, and mark the content as abnormal if the similarity exceeds a set threshold;
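Steps S34 to S36 can be sketched with a conventional weighted SimHash. The formulas for the adjusted weight and the fingerprint are not preserved in this text, so the sketch takes already adjusted weights as input and makes several assumptions: a 64-bit fingerprint, SHA-256 as the per-word hash, similarity defined as one minus the normalized Hamming distance, and a threshold of 0.9.

```python
import hashlib


def word_hash(word, bits=64):
    """Hash a word to a bits-wide integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_words, bits=64):
    """Weighted SimHash: each word votes on every bit with its weight."""
    totals = [0.0] * bits
    for word, weight in weighted_words.items():
        h = word_hash(word, bits)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """1.0 for identical fingerprints, 0.0 when every bit differs."""
    return 1 - hamming_distance(a, b) / bits


def is_abnormal(fingerprint, known_malicious, threshold=0.9, bits=64):
    """Flag content whose similarity to any known fingerprint exceeds threshold."""
    return any(similarity(fingerprint, k, bits) > threshold
               for k in known_malicious)


# Hypothetical adjusted weights for three words.
fp = simhash({"phishing": 5.0, "fraud": 4.0, "the": 1.0})
```

Words with larger adjusted weights dominate the per-bit votes, so content rich in security-relevant words lands closer to the fingerprints of known malicious content.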
S4: End.
Preferably, the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time, and session end time. The user behavior parameters include: user name, login time, login location, accessed website, dwell time, click behavior, search keywords, search results, and the type, size, and time of uploaded or downloaded files. The device information includes: device type, operating system, and browser version.
The SimHash algorithm can segment the login information, the interaction behavior, the search records and the like in the calculation process, and calculates the weight of the words. The following is an illustration of how the word segmentation and weight calculation can be performed for these different types of data:
1. Login information analysis
The login information may include login locations in text form, device information, etc. The word segmentation and weight calculation can be performed by the following steps:
Word segmentation: segment text content such as login locations and device information.
Weight calculation: each word is assigned a weight based on the user's historical login behavior. For example, unusual login locations may have higher weights.
2. Interactive behavior analysis
The interactive behavior may include text content such as chat, comments, etc. The word segmentation and weight calculation can be performed by the following steps:
Word segmentation: segment text content such as chat messages and comments.
Weight calculation: each word is assigned a weight based on the sensitivity of the content or similarity to known malicious behavior. For example, content containing sensitive words may have a higher weight.
3. Search record analysis
The search record includes text content of search keywords and search results. The word segmentation and weight calculation can be performed by the following steps:
Word segmentation: segment the text content of the search keywords and the search results.
Weight calculation: each word is assigned a weight based on the user's search history and interest preferences. For example, words that are highly relevant to the user's interests may have higher weights.
Through word segmentation and weight calculation, the SimHash algorithm can convert different types of data such as login information, interaction behaviors, search records and the like into fingerprints in a numerical form. The fingerprints can be used for subsequent applications such as similarity calculation, anomaly detection, recommendation systems and the like, and support is provided for network security protection and user experience optimization.
Preferably, the preprocessing module cleans and formats the collected data. Cleaning includes deleting duplicate entries when the same operation by the same user is recorded multiple times, and formatting includes normalizing the data with Z-Score.
Preferably, the association function represents the degree of association between the word and malicious content.
Example 2:
The application also provides a big data network security protection system, as shown in FIG. 2, comprising:
A data collection module: collects network traffic parameters, user behavior parameters, and device information;
The SimHash algorithm can be used to analyze various types of data, such as login information, interaction behavior, and search records. The following illustrates how the SimHash algorithm can be applied to these different types of data:
1. Login information analysis
The login information typically includes a user name, a password, a login time, a login location, and the like. SimHash may be used to detect abnormal login behavior.
Feature extraction: features are extracted from the login information, such as login location, login device, etc.
Weight calculation: each feature is assigned a weight, possibly based on historical login behavior of the user.
SimHash calculation: the SimHash algorithm is used to calculate the fingerprint of the login information.
Abnormality detection: comparing with the user's historical login fingerprint, if the similarity is below a certain threshold, it may be an abnormal login.
2. Interactive behavior analysis
The interaction may include chat, comment, share, etc. with other users. SimHash can be used to detect malicious or abnormal interactions.
Feature extraction: extract features such as keywords and topics from the interaction content.
Weight calculation: each feature is assigned a weight, possibly based on the sensitivity of the content or similarity to known malicious behavior.
SimHash calculation: the SimHash algorithm is used to compute the fingerprint of the interaction behavior.
Malicious content detection: if the similarity is above a certain threshold, it may be malicious content, compared to fingerprints of known malicious content.
3. Search record analysis
The search record includes search keywords and search results of the user. SimHash can be used for user behavior analysis and recommendation systems.
Feature extraction: features, such as topics, categories, etc., are extracted from the search keywords and results.
Weight calculation: each feature is assigned a weight, possibly based on the user's search history and interest preferences.
SimHash calculation: the SimHash algorithm is used to calculate the fingerprint of the search record.
Recommendation system: compare the fingerprint with the search fingerprints of other users to find similar users for the recommendation system.
In general, the flexibility and versatility of SimHash algorithm makes it applicable to various types of data analysis, including login information, interaction behavior, search records, and the like. Through proper feature extraction and weight calculation, simHash can be used for detecting abnormality, identifying malicious content, analyzing user behavior and the like, and provides powerful support for network security protection.
A preprocessing module: cleans and formats the collected data;
A module that identifies malicious content using the improved SimHash algorithm, which includes optimizing a security weight factor with a whale algorithm;
First, the preprocessed network content is segmented into words and the basic weight of each word is calculated;
Second, the security weight factor is calculated: for each word, calculate its security weight factor and determine the parameters α and β.
In the security weight factor, the i-th word is the input, α is a weight coefficient, β is a threshold, and the association function represents the degree of association between that word and malicious content;
Initialize the whale population: each whale represents a set of parameters {α, β};
Define the fitness function: the fitness function evaluates the effect of each set of parameters based on the precision and recall metrics.
Here, precision is the proportion of content identified as malicious that is actually malicious, and recall is the proportion of malicious content that is correctly identified, both calculated based on the parameters α and β;
Simulate whale predation behavior: continuously update the positions of the whales by simulating their predation behavior, searching for the optimal parameter combination;
Find the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;
Calculate the adjusted weight;
Calculate the hash value;
Compute the SimHash fingerprint: calculate the SimHash fingerprint of the network content by combining the adjusted weights with the hash values;
A comparison module for known malicious attack parameters: compares the calculated SimHash fingerprint with the fingerprints of known malicious content, and marks the content as abnormal if the similarity exceeds a set threshold;
S4: End.
The whale algorithm is a heuristic optimization algorithm that finds the best solution by simulating the predation behavior of whales. The following is a detailed description and formulation of how the whale algorithm is used to find the best parameter combination:
1. Initializing the whale population
First, we need to create a virtual population of "whales", each representing a set of parameters, such as the weight coefficient α and the threshold β in the security weight factor (SWF).
Whale_i = {α_i, β_i}
2. Defining fitness functions
The fitness function is used to evaluate the effect of each set of parameters. For example, we can use the F1 score as a fitness function:
Fitness(Whale_i) = 2 * Precision(α_i, β_i) * Recall(α_i, β_i) / (Precision(α_i, β_i) + Recall(α_i, β_i))
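As an illustration of this fitness function, the sketch below computes precision, recall, and the F1 score from predicted and actual labels. The function names and the boolean-list representation of labels are assumptions; how predictions are obtained from α and β is left out.

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision and recall for 'malicious' (True) predictions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_fitness(precision: float, recall: float) -> float:
    """F1 score; defined as 0.0 when both inputs are zero to avoid division by zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with one true positive, one false positive, and one false negative, precision and recall are both 0.5, so the fitness is 0.5.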
3. Simulating whale predation behavior
By simulating the predation behavior of whales, the position of whales is continuously updated, and the optimal parameter combination is found.
Example:
Suppose we have 5 whales, each representing a different pair of α and β. The best parameter combination can be found as follows:
Calculate the fitness of each whale:
Fitness(Whale_i), i = 1, 2, …, 5
Find the current best whale:
BestWhale = argmax_i Fitness(Whale_i)
Compute the new position of each whale:
NewPosition_i = Whale_i + A * (BestWhale − Whale_i)
where A is an adjustment factor that may be gradually reduced as the number of iterations increases.
Update the whale positions:
Whale_i = NewPosition_i
Repeat these steps until the algorithm converges or the maximum number of iterations is reached.
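The loop in this example can be sketched as below. It follows only the simplified update rule stated above, not the full whale optimization algorithm with its encircling and spiral phases. The random scaling of A, the initial range [0, 1) for α and β, the iteration count, and the function name are all assumptions made so that the sketch runs.

```python
import random


def whale_search(fitness, n_whales: int = 5, max_iter: int = 50, seed: int = 0):
    """Search for the (alpha, beta) pair with the highest fitness."""
    rng = random.Random(seed)
    # Each whale is one candidate (alpha, beta), drawn from [0, 1) by assumption.
    whales = [(rng.random(), rng.random()) for _ in range(n_whales)]
    best = max(whales, key=fitness)
    history = []
    for iteration in range(max_iter):
        shrink = 1.0 - iteration / max_iter  # reduces A as iterations progress
        moved = []
        for alpha, beta in whales:
            a = shrink * rng.random()  # adjustment factor A for this whale
            moved.append((alpha + a * (best[0] - alpha),
                          beta + a * (best[1] - beta)))
        whales = moved
        candidate = max(whales, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
        history.append(fitness(best))
    return best, history
```

The function returns the best pair found together with the best fitness after each iteration, which makes it easy to check whether the search has stopped improving.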
4. Finding an optimal solution
When the algorithm converges, the best whale found represents the optimal α and β.
{α_optimal, β_optimal} = BestWhale
Through this process, the whale algorithm can effectively find the optimal parameter combination to improve the accuracy of identifying malicious content. The process fully utilizes the global searching capability and the local searching capability of the whale algorithm, so that the parameter optimization is more accurate and robust.
Preferably, the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time, and session end time; the user behavior parameters include: user name, login time, login location, websites accessed, dwell time, click behavior, search keywords, search results, and the type, size, and time of uploaded or downloaded files; the device information includes: device type, operating system, and browser version.
Preferably, in the preprocessing module, the collected data is cleaned and formatted; the cleaning includes deleting duplicate entries created when the same operation by the same user is recorded multiple times, and the formatting includes normalizing the data with Z-Score.
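The two preprocessing operations can be sketched as follows. The representation of a record as a (user, operation) tuple and the use of the population standard deviation are assumptions; the application does not specify either.

```python
import statistics


def deduplicate(records: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop repeated (user, operation) records, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def z_score(values: list[float]) -> list[float]:
    """Normalize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so every normalized value is 0.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]
```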
Preferably, the function in the security weight factor represents the degree of association between the i-th word and malicious content.
The application provides a big data network security protection method and system that can achieve the following beneficial technical effects:
1. The application identifies malicious content with an improved SimHash algorithm, in which the security weight factor is optimized by the whale algorithm in order to judge network security. The SimHash algorithm and the whale algorithm are combined and applied to judging insurance network cases, and network security is identified and judged in light of the interaction characteristics of insurance data networks, which greatly improves the accuracy of detecting abnormal attacks on the insurance network and improves security.
2. The application calculates the adjusted weight from the base weight and the security weight factor, so that weight factors from multiple aspects are considered together. This enables automatic judgment of insurance network security, greatly enhances the security of insurance network data transactions, and improves the accuracy of judging network attack behavior by including these influencing factors in the calculation.
3. The application uses the whale algorithm to optimize the weight coefficient α and the threshold β and find their optimal values, which greatly improves the accuracy of the coefficients. In the security weight factor, i indexes the words, α is a weight coefficient, β is a threshold, and the function represents the correlation between the i-th word and malicious content. Through this optimization with the whale algorithm, the accuracy of the security weight factor is greatly improved, which in turn greatly improves the accuracy of judging attacks on the insurance network.
The foregoing has described in detail a method and system for protecting a big data network, wherein specific examples are employed to illustrate the principles and embodiments of the present application, and the above examples are only for aiding in understanding the core idea of the present application; also, as will be apparent to those skilled in the art in light of the present teachings, the present disclosure should not be limited to the specific embodiments and applications described herein.

Claims (10)

1. A big data network security protection method, characterized by comprising the following steps:
S1: Data collection module: collecting network traffic parameters, user behavior parameters, and device information;
S2: Preprocessing module: cleaning and formatting the collected data;
S3: Identifying malicious content with an improved SimHash algorithm, including optimizing the security weight factor with the whale algorithm;
S31: Segmenting the preprocessed network content into words and calculating the base weight of each word;
S32: Calculating the security weight factor: for each word, its security weight factor (SWF) is calculated, and the parameters α and β are determined;
where i indexes the words, α is a weight coefficient, β is a threshold, and the function in the SWF expression represents the correlation between the i-th word and malicious content;
S321: Initializing the whale population: each whale represents a set of parameters {α, β};
S322: Defining a fitness function: the fitness function evaluates the effect of each set of parameters based on precision and recall;
where Precision(α, β) denotes the proportion of content identified as malicious, calculated with parameters α and β, that is actually malicious, and Recall(α, β) denotes the proportion of malicious content that is correctly identified, calculated with parameters α and β;
S323: Simulating whale predation behavior: the positions of the whales are continuously updated by simulating the predation behavior of whales, in search of the optimal parameter combination;
S324: Finding the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;
S33: Calculating the adjusted weight from the base weight and the security weight factor;
S34: Calculating the hash value;
S35: Computing the SimHash fingerprint: the SimHash fingerprint of the network content is calculated by combining the adjusted weights and the hash values;
S36: Comparison with known malicious attack parameters: the calculated SimHash fingerprint is compared with known malicious attack parameters, and the content is marked as abnormal if the similarity exceeds a set threshold;
S4: End.
2. The big data network security protection method according to claim 1, wherein the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time, and session end time; the user behavior parameters include: user name, login time, login location, websites accessed, dwell time, click behavior, search keywords, search results, and the type, size, and time of uploaded or downloaded files; the device information includes: device type, operating system, and browser version.
3. The big data network security protection method according to claim 1, wherein in the preprocessing module the collected data is cleaned and formatted; the cleaning includes deleting duplicate entries created when the same operation by the same user is recorded multiple times, and the formatting includes normalizing the data with Z-Score.
4. The big data network security protection method according to claim 1, wherein the function in the security weight factor represents the degree of association between the i-th word and malicious content.
5. A big data network security protection system, comprising:
Data collection module: collecting network traffic parameters, user behavior parameters, and device information;
Preprocessing module: cleaning and formatting the collected data;
Malicious content identification module using the improved SimHash algorithm, including optimizing the security weight factor with the whale algorithm:
First, the preprocessed network content is segmented into words, and the base weight of each word is calculated;
Second, the security weight factor is calculated: for each word, its security weight factor (SWF) is calculated, and the parameters α and β are determined;
where i indexes the words, α is a weight coefficient, β is a threshold, and the function in the SWF expression represents the correlation between the i-th word and malicious content;
Initializing the whale population: each whale represents a set of parameters {α, β};
Defining a fitness function: the fitness function evaluates the effect of each set of parameters based on precision and recall;
where Precision(α, β) denotes the proportion of content identified as malicious, calculated with parameters α and β, that is actually malicious, and Recall(α, β) denotes the proportion of malicious content that is correctly identified, calculated with parameters α and β;
Simulating whale predation behavior: the positions of the whales are continuously updated by simulating the predation behavior of whales, in search of the optimal parameter combination;
Finding the optimal solution: when the algorithm converges, the best whale found represents the optimal α and β;
Calculating the adjusted weight from the base weight and the security weight factor;
Calculating the hash value;
Computing the SimHash fingerprint: the SimHash fingerprint of the network content is calculated by combining the adjusted weights and the hash values;
Comparison module for comparison with known malicious attack parameters: the calculated SimHash fingerprint is compared with known malicious attack parameters, and the content is marked as abnormal if the similarity exceeds a set threshold;
S4: End.
6. The big data network security protection system according to claim 5, wherein the network traffic parameters include: IP address, port number, protocol type, transmission rate, session start time, and session end time; the user behavior parameters include: user name, login time, login location, websites accessed, dwell time, click behavior, search keywords, search results, and the type, size, and time of uploaded or downloaded files; the device information includes: device type, operating system, and browser version.
7. The big data network security protection system according to claim 5, wherein in the preprocessing module the collected data is cleaned and formatted; the cleaning includes deleting duplicate entries created when the same operation by the same user is recorded multiple times, and the formatting includes normalizing the data with Z-Score.
8. The big data network security protection system according to claim 5, wherein the function in the security weight factor represents the degree of association between the i-th word and malicious content.
9. An electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the big data network security protection method according to any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, implements the steps of the big data network security protection method according to any one of claims 1 to 4.
CN202311454852.3A 2023-11-03 2023-11-03 Big data network safety protection method and system Active CN117176482B (en)

Publications (2)

Publication Number Publication Date
CN117176482A true CN117176482A (en) 2023-12-05
CN117176482B CN117176482B (en) 2024-01-09

Family

ID=88930331

Country Status (1)

Country Link
CN (1) CN117176482B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370975A (en) * 2023-12-08 2024-01-09 国任财产保险股份有限公司 Sql injection detection method and system based on deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108737423A (en) * 2018-05-24 2018-11-02 国家计算机网络与信息安全管理中心 Phishing website discovery method and system based on webpage key content similarity analysis
CN111967063A (en) * 2020-09-02 2020-11-20 开普云信息科技股份有限公司 Data tampering monitoring and identifying method and device based on multi-dimensional analysis, electronic equipment and storage medium thereof
CN116167002A (en) * 2023-01-30 2023-05-26 沈阳化工大学 Industrial control network anomaly detection method based on optimized random forest
CN116719798A (en) * 2023-04-26 2023-09-08 中国工业互联网研究院 Simhash text de-duplication method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
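Ruan Jiakun; Cai Yanguang; Cai Hao; Zhang Li: "Simhash redundant data detection algorithm based on the grey wolf algorithm", Journal of Dongguan University of Technology, no. 05 *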

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370975A (en) * 2023-12-08 2024-01-09 国任财产保险股份有限公司 Sql injection detection method and system based on deep learning
CN117370975B (en) * 2023-12-08 2024-03-26 国任财产保险股份有限公司 Sql injection detection method and system based on deep learning

Also Published As

Publication number Publication date
CN117176482B (en) 2024-01-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant