CN117473571B - Data information security processing method and system - Google Patents

Data information security processing method and system

Info

Publication number
CN117473571B
Authority
CN
China
Prior art keywords
event
time
features
association
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311491262.8A
Other languages
Chinese (zh)
Other versions
CN117473571A (en
Inventor
李荣耀
何建银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Deep Technology Information Technology Co ltd
Original Assignee
Guangdong Deep Technology Information Technology Co ltd
Filing date
Publication date
Application filed by Guangdong Deep Technology Information Technology Co ltd
Priority to CN202311491262.8A
Publication of CN117473571A
Application granted
Publication of CN117473571B
Legal status: Active
Anticipated expiration

Abstract

The application discloses a data information security processing method and system in the field of data processing. The method comprises: acquiring first risk identification information from service processing equipment; acquiring second risk identification information, comprising threat intelligence, a security knowledge base and a historical analysis model, from a cloud platform; preprocessing the collected first and second risk identification information; constructing time-associated, space-associated and sequence-associated features from the first risk identification information converted into a structured format; constructing threat intelligence indicator-of-compromise (IOC) features from the second risk identification information converted into the structured format, as professional features; and training a security risk association degree model based on the Apriori algorithm and the Pearson correlation coefficient with the constructed association features and professional features, to obtain the association between the first and second risk identification information. Addressing the low accuracy of risk identification in information security in the prior art, the application improves the accuracy of risk identification in heterogeneous data.

Description

Data information security processing method and system
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and system for securely processing data information.
Background
With the development of the Internet, cloud computing and big data, network security risks have become complex and changeable. Enterprise network environments face network attacks and security threats of diverse origin and type. Actively discovering and intelligently evaluating network security risk is therefore an important problem in safeguarding network security.
Traditional security risk assessment relies mainly on risk identification information from a single source, which leaves security blind spots and prevents a comprehensive assessment of a complex security environment.
In the related art, for example, CN115563657A provides a method, a system and a cloud platform for secure processing of data information, which include: obtaining at least one piece of second security risk identification information by performing security risk identification on each group of session security detection reports with a pre-configured risk identification decision algorithm; and using the determined at least one piece of first security risk identification information and the at least one piece of second security risk identification information to obtain decision analysis quality data of the pre-configured risk identification decision algorithm under a plurality of security risk identification links, the links being obtained by dividing the integrated risk identification items of the service processing equipment. However, that application relies mainly on a single data source and a preset model, and its risk identification accuracy leaves room for improvement.
Disclosure of Invention
1. Technical problem to be solved
Addressing the low accuracy of risk identification in information security in the prior art, the invention provides a data information security processing method and system that improve the accuracy of risk identification in heterogeneous data by building an association model between security events using, among other techniques, the Apriori algorithm and the Pearson correlation coefficient.
2. Technical solution
The object of the invention is achieved by the following technical solution.
One aspect of the embodiments of this specification provides a data information security processing method, comprising: acquiring, from service processing equipment, first risk identification information comprising device logs, monitoring data and alarm information.
Service processing equipment: servers, network devices, security devices and the like that run critical business systems and process business data. These devices generate logs, monitoring data and alarms during operation.
Device logs: server running logs, network connection logs, user operation logs and the like, which record the running state of the device, network connection information and user activity.
Monitoring data: CPU or memory usage monitoring, network traffic monitoring, security event monitoring and the like, which reflect the performance and security status of the device in real time.
Alarm information: intrusion detection alarms, DDoS attack alarms, database audit alarms and the like, generated by the security monitoring system when a security event occurs on the device.
Acquisition: device logs and monitoring data can be obtained through a log collection system and a monitoring system; alarm information can be obtained through a security information and event management (SIEM) system.
First risk identification information: taken together, the device logs, monitoring data and alarm information allow analysis of the running state of the equipment, network communication behavior, business operations and security attack events, and serve as input for risk identification and association analysis. Because device logs and similar sources give detailed, low-level status information about device and service operation, analyzing them improves the ability to identify potential risks.
Acquiring, from a cloud platform, second risk identification information comprising threat intelligence, a security knowledge base and a historical analysis model, wherein the security knowledge base is a structured knowledge base containing security event features and corresponding response schemes, and the historical analysis model is a security event matching model trained by machine learning.
Cloud platform: a platform that integrates multi-source security information and knowledge and provides rich security knowledge to support risk identification.
Threat intelligence: known indicators of compromise (IOCs), attack patterns, vulnerability information and the like, used to detect known security threats.
Security knowledge base: a structured store of security event features (attack means, impact and the like) and the corresponding handling schemes, which supports retrieving a response scheme by event matching.
Historical analysis model: a security event detection and matching model trained with a machine learning algorithm, which supports identifying and judging previously unknown events.
Acquisition: threat intelligence, the knowledge base and related information are obtained through the open interface of the cloud security platform.
Second risk identification information: known-threat detection, new-event matching and scheme recommendation performed with the cloud security knowledge and models serve as a complementary source for risk identification. The cloud security knowledge graph and AI models can detect more types of risk and complement the first risk identification information, which widens identification coverage.
Preprocessing the collected first and second risk identification information and converting both into a structured format.
Purpose: remove data noise, improve data quality, and convert formats to ease subsequent processing.
Data cleaning: filter out useless and abnormal data and correct erroneous data.
Data integration: merge data fields that come from different sources but are semantically related.
Data conversion: convert unstructured data (for example text and logs) into structured data (for example database tables).
Data normalization: rescale data with different ranges to a unified range.
De-duplication: delete duplicate and redundant data.
Structural conversion: extract feature fields from unstructured data according to a predefined data model and map them into a structured model (for example a table); for new features that cannot be mapped, update the data model using machine learning.
Effect: preprocessing removes noise and raises data quality, structural conversion lowers processing complexity and improves efficiency, and the standardized format eases feature extraction and model training, so that the security association model is built on high-quality structured input.
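As a non-limiting illustration, the cleaning, de-duplication, structural conversion and normalization steps can be sketched in Python as follows. The log layout, the `LOG_PATTERN` regular expression and the function name `preprocess` are assumptions made for this sketch and are not specified by the application.

```python
import re

# Hypothetical log layout (assumption): "<timestamp> <source IP> <destination port> <bytes>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<src_ip>\d+\.\d+\.\d+\.\d+) (?P<dst_port>\d+) (?P<bytes>\d+)$"
)


def preprocess(lines):
    """Clean, de-duplicate, structure and min-max normalize raw log lines."""
    seen = set()
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:            # data cleaning: drop malformed lines
            continue
        key = match.group(0)
        if key in seen:              # de-duplication: skip exact repeats
            continue
        seen.add(key)
        rec = match.groupdict()      # structural conversion: text -> named fields
        rec["dst_port"] = int(rec["dst_port"])
        rec["bytes"] = int(rec["bytes"])
        records.append(rec)
    if records:
        lo = min(r["bytes"] for r in records)
        hi = max(r["bytes"] for r in records)
        span = (hi - lo) or 1        # avoid division by zero when all values are equal
        for r in records:
            r["bytes_norm"] = (r["bytes"] - lo) / span   # normalization to [0, 1]
    return records
```

Min-max scaling is only one possible normalization; the application does not state which one is used.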
Constructing time-associated, space-associated and sequence-associated features from the first risk identification information converted into the structured format, and constructing threat intelligence IOC features from the second risk identification information converted into the structured format, as professional features.
Time-associated features: model the temporal relationship of events with a timestamp sequence, compute time interval differences to judge time correlation, count the temporal distribution of events with a sliding window, and fuse the different time features with a time-correlation learner.
Space-associated features: judge the spatial correlation of events with a spatial distance algorithm, identify clustering patterns with spatial autocorrelation analysis, and combine both in a spatial correlation analysis model.
Sequence-associated features: build event sequences in time order, use frequent sequence patterns to capture the ordering rules of events, and mine causal relationships between events with association rules.
Threat intelligence IOC features: extract IOC indicators from threat reports, generate combined IOC indicators through association analysis, verify the validity of the IOCs against network traffic, then encode the IOCs and build the IOC features with a feature selection algorithm.
The time, space and sequence association features describe event association along several dimensions, and the IOC features add professional threat intelligence knowledge; analyzing them jointly improves the comprehensiveness and accuracy of risk identification.
Training a security risk association degree model based on the Apriori algorithm and the Pearson correlation coefficient with the constructed association features and professional features, to obtain the association between the first and second risk identification information.
Association features: the time, space and sequence association features reflect implicit associations between events; the IOC professional features contribute domain expertise.
Apriori algorithm: learns frequent association patterns through association rules, mines potential associations between different events, and computes a confidence value to evaluate the reliability of each rule.
Pearson correlation coefficient: measures the degree of linear correlation between two variables, is used to judge the correlation between the numerical features of two events, and is combined with the confidence value to quantify the degree of association.
Association model: the input is the multidimensional association features of two events; the model is an ensemble learning model that fuses the Apriori algorithm and the Pearson correlation coefficient; the output is an association score between the events.
Apriori mines the potential association of events and the Pearson coefficient measures the correlation of numerical features; fusing the two gives a quantitative evaluation of association and a more accurate multi-source association judgment for risk identification.
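As a non-limiting illustration, the two ingredients of the association degree model can be sketched in Python as follows. The sketch covers only Apriori itemsets of size one and two, and the weighted-mean fusion in `association_score` is an assumption made for illustration: the application states that the two measures are fused in an ensemble model but does not give the fusion formula. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pairwise_rules(transactions, min_support):
    """Apriori limited to 1- and 2-itemsets; returns {(a, b): confidence of a -> b}."""
    n = len(transactions)
    item_count = {}
    for t in transactions:
        for item in t:
            item_count[item] = item_count.get(item, 0) + 1
    frequent = {i for i, c in item_count.items() if c / n >= min_support}
    pair_count = {}
    for t in transactions:
        # Apriori pruning: a pair can only be frequent if both members are frequent
        for pair in combinations(sorted(t & frequent), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = {}
    for (a, b), c in pair_count.items():
        if c / n >= min_support:
            rules[(a, b)] = c / item_count[a]
            rules[(b, a)] = c / item_count[b]
    return rules


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def association_score(confidence, r, weight=0.5):
    """Assumed fusion: weighted mean of rule confidence and |r|."""
    return weight * confidence + (1 - weight) * abs(r)
```

Here each transaction is a set of event types observed together (for example within one time window), and `x`, `y` are numerical features of the two events being compared.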
Generating a security scheme comprising resource configuration and monitoring strategies from the acquired association.
Acquired association: the degree of association between different security events.
Resource configuration strategy: determine high-risk core assets from the association of key events and raise the priority of monitoring and protection resources for those assets, for example by increasing the log collection volume or deploying a WAF.
Monitoring strategy: set the time range and focus of active monitoring according to the temporal pattern of associated events, for example monitoring for event B within 24 hours after event A occurs.
Scheme generation: an association analysis engine, combined with the security knowledge base, matches feasible response schemes according to the event association and generates resource configuration and monitoring strategy recommendations.
Because the scheme is generated from the association rules of the security events, it is more targeted than a general strategy, matches the actual event associations, and improves both the efficiency of security resource use and attack detection capability.
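As a non-limiting illustration of the monitoring strategy, the following Python sketch turns a mined rule of the form "after event A, watch for event B within a window" into a list of currently active watches. The rule table layout and the name `active_watches` are assumptions made for this sketch.

```python
def active_watches(events, rules, now):
    """Return (expected_event_type, deadline) for every watch still open at `now`.

    events: iterable of (timestamp_seconds, event_type)
    rules:  {trigger_type: (expected_type, window_seconds)}  (hypothetical layout)
    """
    watches = []
    for ts, event_type in events:
        rule = rules.get(event_type)
        if rule is None:
            continue
        expected, window = rule
        if ts <= now <= ts + window:     # the trigger is recent enough
            watches.append((expected, ts + window))
    return watches
```

For the example in the text, the rule table would contain an entry mapping event A to event B with a window of 86400 seconds.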
Further, constructing the time-associated, space-associated and sequence-associated features of the first risk identification information converted into the structured format, and constructing the threat intelligence IOC features of the second risk identification information converted into the structured format as professional features, comprises the following steps.
Constructing time-associated features, comprising a timestamp sequence, time intervals and sliding time window frequencies, using timestamp smoothing and counting methods. Time smoothing removes random errors from the time series and so provides higher-quality input for extracting time-associated features. The multi-granularity features describe the temporal relationship of events from several angles: the timestamp sequence preserves ordering information, which is the basis for judging temporal logic and temporal causality; the time interval gives the temporal distance between events and indicates how closely two events are related in time; and the sliding time window frequency reveals how events aggregate in time. Analyzing these features jointly lets them cross-check one another, which reduces the error rate of judging by a single time indicator and improves the accuracy of time-association judgment. The temporal association rules of events can therefore be modeled more accurately, hidden time correlations can be found, and risk identification and security management based on time association become more effective.
Constructing space-associated features, comprising the spatial distance between security events and their spatial clustering pattern, using a spatial distance algorithm and spatial autocorrelation analysis. The spatial distance algorithm computes the distance between events from spatial attributes such as coordinates and IP addresses, which shows whether events are spatially adjacent and how strongly they are spatially correlated. Spatial autocorrelation analysis detects clustering patterns in the spatial distribution of events and, being a statistical method, avoids subjective judgment of spatial clustering. Spatial distance reflects how close events are, the clustering pattern reflects their aggregation, and the two together give a comprehensive judgment of spatial correlation. In network security association analysis, these features can be used to judge whether attack sources at different locations are spatially correlated; in an IoT environment, they can be used to judge whether data collected by different sensors are spatially related, enabling spatial relationship analysis of physical events. Spatial feature engineering and spatial analysis thus add a spatial dimension to security management and risk identification and strengthen association analysis based on spatial data.
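As a non-limiting illustration, spatial distance and spatial autocorrelation can be combined in a single statistic. The Python sketch below computes global Moran's I with binary distance-based weights; the choice of Moran's I, of Euclidean distance and of the `radius` parameter are assumptions made for this sketch, since the application does not name a specific autocorrelation statistic.

```python
from math import dist


def morans_i(values, coords, radius):
    """Global Moran's I with binary weights: w_ij = 1 if i != j and
    the Euclidean distance between coords[i] and coords[j] is <= radius."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and dist(coords[i], coords[j]) <= radius:
                num += dev[i] * dev[j]
                w_sum += 1.0
    denom = sum(d * d for d in dev)
    if w_sum == 0 or denom == 0:
        return 0.0                      # no neighbours or no variation
    return (n / w_sum) * (num / denom)
```

Here `values` could be event counts per location. A positive result suggests that similar counts cluster in space, and a negative result suggests that they alternate.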
Constructing security event sequence features containing frequent sequence patterns using a sequence pattern mining algorithm, and constructing a causal chain of security events using an association rule algorithm. The sequence pattern mining algorithm analyzes the time order of events and finds frequent event sequence patterns, which capture the ordering among events and provide a basis for judging sequential causality. The association rule algorithm can discover implicit causal relationships between events directly from large volumes of event data. Cross-checking the event time order against the causal chain improves the accuracy of judging sequence associations. Both methods are data-driven, so ordering rules are found automatically from the data without a manually built sequence model. In network security analysis, the approach can identify multi-stage attack sequence patterns and the causal relationships between attack steps; in business system analysis, it can find order dependencies between business operations and abnormal operation sequences. Combining sequence patterns with association rules thus uncovers implicit sequence associations in event sequences.
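As a non-limiting illustration, the simplest frequent sequence patterns are ordered pairs "a occurs before b". The Python sketch below counts, over a set of event sequences, the fraction of sequences that contain each ordered pair and keeps those that reach a minimum support. It is a deliberately reduced form of sequence pattern mining (patterns of length two only), and the function name is hypothetical.

```python
def frequent_ordered_pairs(sequences, min_support):
    """Return {(a, b): support} for pairs where a occurs before b in a sequence.

    Support is the fraction of sequences that contain the ordered pair.
    """
    n = len(sequences)
    counts = {}
    for seq in sequences:
        pairs_in_seq = set()            # count each pair once per sequence
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    pairs_in_seq.add((a, b))
        for pair in pairs_in_seq:
            counts[pair] = counts.get(pair, 0) + 1
    return {p: c / n for p, c in counts.items() if c / n >= min_support}
```

Each input sequence is a time-ordered list of event types, for example the events of one session or one source IP.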
Constructing threat intelligence IOC features comprising source IP, destination IP and URL using NAT resolution and alarm association analysis. NAT resolution reverses the translation between internal and external network addresses, recovers the true source IP of an attack connection, and yields an accurate source-IP IOC feature. Alarm association analysis finds multiple security events related to the same source IP and so extends the set of extracted IOCs. Besides the source IP, multidimensional IOC indicators such as destination IP and malicious URL are constructed to describe the attack more completely. The IOC features can be applied in an intrusion detection system, for blacklist matching and blocking of known attack sources, and can also be submitted to a threat intelligence cloud platform for association analysis with global threat intelligence. Introducing IOC features brings real-time attack characteristics into the security defense system and makes risk identification and association analysis more targeted.
Specifically, NAT (network address translation) translates internal private IP addresses into external public IP addresses so that intranet hosts can access the Internet. Because of this translation, the recorded attack source IP is the translated public IP rather than the actual intranet source IP. NAT resolution traces back and restores the real IP address that existed before translation. When constructing threat intelligence IOCs, the real intranet IP address of the attack source is restored through NAT resolution instead of using the translated public IP directly, which would reduce the accuracy of the IOC features. Specific NAT resolution techniques include parsing NAT logs, inference from timing relationships, and identification based on traffic features. Applying NAT resolution to IOC construction improves the accuracy of the extracted source-IP IOC features, avoids false alarms caused by NAT translation, and makes the IOCs more effective in defense systems such as intrusion detection.
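As a non-limiting illustration of NAT resolution by parsing NAT logs, the Python sketch below looks up the private address that held a given public address and port at a given time. The `NatEntry` record layout and the function name `resolve_source` are assumptions made for this sketch; real NAT log formats differ between devices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NatEntry:
    """One NAT mapping taken from a NAT log (hypothetical layout)."""
    public_ip: str
    public_port: int
    private_ip: str
    start: float   # epoch seconds when the mapping was created
    end: float     # epoch seconds when the mapping expired


def resolve_source(nat_log, public_ip, public_port, ts) -> Optional[str]:
    """Return the private IP that was mapped to (public_ip, public_port) at time ts."""
    for entry in nat_log:
        if (entry.public_ip == public_ip
                and entry.public_port == public_port
                and entry.start <= ts <= entry.end):
            return entry.private_ip
    return None   # no mapping recorded for that address, port and time
```

The time check matters because the same public port is typically reused by different internal hosts over time.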
Wherein a security event is a record related to a change in the security state of a system or network, and the security event comprises a source IP and a destination port. The source IP and destination port are basic attributes of a network connection and important features for network security association analysis. The source IP identifies the initiator of a network event and can be used to judge whether attack sources are related; the destination port indicates the type of service under attack and can be used to judge whether targets are related. Using them as basic attributes of a security event conforms to the standard model of network security monitoring and supports association features in every dimension: for time association, the temporal distribution of source IP and destination port combinations can be examined; for space association, the source IP carries an inherent spatial attribute, so the spatial aggregation of source IPs can be determined; for sequence association, the time series pattern of source IP and destination port combinations can be analyzed.
Specifically, security events are records or data related to a change in the security state of a system or network. They may include: security alarms generated by network intrusion detection devices; security-related events recorded in operating system or application logs; audit logs of database access behavior generated by a database audit system; suspicious network connection records extracted by network traffic analysis tools; and abnormal user operation events generated by a terminal behavior analysis system. The main features of a security event may include the time of occurrence, the event type, the participating subject (user, IP address and the like), the target (the accessed asset) and the result (success or failure). Analyzing security events makes it possible to detect potential threats to the system or network and to take precautions and countermeasures early; security events are also an important data source for building security association analysis models.
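As a non-limiting illustration, a security event with the attributes listed above can be modeled as a small record type, and events can be grouped by the source IP and destination port combination for later time, space and sequence analysis. The field names and the helper `group_by_connection` are assumptions made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityEvent:
    """Minimal security event record (hypothetical field names)."""
    timestamp: float          # time of occurrence, epoch seconds
    event_type: str           # for example "scan" or "login_fail"
    source_ip: str            # initiator of the network event
    destination_port: int     # service that was targeted
    target: str = ""          # accessed asset, if known
    success: bool = False     # result of the action


def group_by_connection(events):
    """Group events by (source_ip, destination_port), each group in time order."""
    groups = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.timestamp):
        groups[(ev.source_ip, ev.destination_port)].append(ev)
    return dict(groups)
```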
Further, constructing the time-associated features comprises the following steps: acquiring the timestamps of security events and generating a timestamp sequence. The timestamp at which each event occurred is extracted from the security event, and the timestamps of the different events are sorted to form a time series. The time series records the events in order of occurrence and preserves the temporal logic among them; it is the basis for judging the order and causality of events and for finding time-associated event patterns. Computing time correlation directly from individual timestamps, by contrast, would lose the ordering information between events. The time series also provides the input for subsequent steps such as time smoothing and time interval extraction, so that the temporal distribution of events is learned more completely and security association analysis based on event timestamps becomes more effective.
Performing wavelet denoising and bilinear interpolation resampling on the timestamp sequence to obtain an equally spaced timestamp sequence. The original timestamp sequence may contain random perturbations or gaps and needs smoothing to improve its quality. Wavelet denoising removes random high-frequency noise from the time series, and bilinear interpolation resamples the series to produce equally spaced time samples and fill gaps. The smoothed series reflects the temporal distribution of events more accurately, reduces the influence of random errors on time-association judgment, and is convenient for subsequent statistics such as sliding time window frequencies.
Specifically, wavelet denoising uses the multi-scale decomposition of wavelet analysis to decompose the time series and extract its frequency components; the high-frequency components are shrunk or set to zero to suppress high-frequency noise; and the signal is then reconstructed. This removes random disturbance and makes the series smoother. Bilinear interpolation resampling sets a fixed sampling interval based on the original time series; at each time point that does not coincide with an original sample, the value is estimated by interpolating between the two adjacent actual sample points; the result is a new time series with a fixed time interval. This fills gaps in the series and makes it continuous at equal intervals. Combining the two steps removes random noise and yields an equally spaced series, which improves the quality of the time series and eases time correlation analysis.
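As a non-limiting illustration, the two smoothing steps can be sketched in plain Python. Several assumptions are made: the series is a list of sample times with one numeric value per sample (for example an event count), the wavelet is a single-level Haar transform with soft thresholding, and the interpolation between two adjacent samples of a one-dimensional series is ordinary linear interpolation. The application does not fix the wavelet, the number of levels or the threshold.

```python
from math import sqrt


def haar_denoise(values, threshold):
    """One-level Haar transform, soft-threshold the detail coefficients, invert."""
    if len(values) % 2:
        raise ValueError("this sketch needs an even number of samples")
    s = sqrt(2.0)
    out = []
    for i in range(0, len(values), 2):
        approx = (values[i] + values[i + 1]) / s     # low-frequency part
        detail = (values[i] - values[i + 1]) / s     # high-frequency part
        mag = max(abs(detail) - threshold, 0.0)      # soft thresholding
        detail = mag if detail >= 0 else -mag
        out.extend([(approx + detail) / s, (approx - detail) / s])
    return out


def resample_linear(times, values, step):
    """Estimate values on an equally spaced grid between times[0] and times[-1].

    `times` must be strictly increasing and contain at least two samples.
    """
    out = []
    j = 0
    k = 0
    while times[0] + k * step <= times[-1]:
        t = times[0] + k * step
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1                                   # move to the bracketing samples
        t0, t1 = times[j], times[j + 1]
        frac = (t - t0) / (t1 - t0)
        out.append((t, values[j] + frac * (values[j + 1] - values[j])))
        k += 1
    return out
```

With a threshold of zero the denoising step returns the input unchanged up to rounding, and with a large threshold each pair of samples is replaced by its mean.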
Calculating the time interval differences of the timestamp sequence to obtain the time interval feature. After the smoothed timestamp sequence is obtained, the time difference between adjacent time points is computed. The time interval directly reflects the temporal distance between events and shows whether two events are close in time. Because it needs no predefined time window, it can capture event time correlation over any time range, and different combinations of event types may use different interval ranges to judge their time correlation. Compared with raw timestamps, the interval is a more intuitive measure of temporal distance and of the strength of time correlation. It can also be combined with the timestamp sequence, which carries the ordering, for a comprehensive time-association judgment. Time interval analysis is thus an important complement to timestamp-based time-association analysis of security events and makes the mining of time-association patterns more targeted.
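As a non-limiting illustration, the time interval feature is the list of differences between adjacent timestamps, and a per-event-type threshold can then flag pairs that are close in time. The function names and the `max_gap` parameter are assumptions made for this sketch.

```python
def time_intervals(timestamps):
    """Differences between adjacent timestamps after sorting."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def close_in_time(intervals, max_gap):
    """Flag each adjacent pair whose interval does not exceed max_gap."""
    return [gap <= max_gap for gap in intervals]
```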
The number of security events in the time stamp sequence is counted with a sliding time window to obtain the time frequency feature. The sliding time window moves along the time series and dynamically counts the events that occur inside it. The window frequency reflects how densely events occur within a time range: a high-frequency window indicates that events are aggregated in time, and a low-frequency window indicates that they are sparse. A dynamic sliding window can adaptively focus on different time granularities and find frequency patterns over different time ranges. The frequency feature reflects the time correlation of events intuitively, and time-correlated events can be located directly through high-frequency windows. The window frequency does not require the range of time correlation to be defined in advance, which improves flexibility. Combined with the time stamp sequence and the time interval feature, it supports multi-dimensional judgment of time correlation. As a result, time correlation pattern mining on complex time series becomes more accurate and controllable, the quality of time correlation analysis of security events improves, and a multi-granularity analysis means is provided for time-stamp-based security correlation analysis. In summary, counting the time frequency with a sliding window is an important supplement to time correlation analysis. Specifically, to count events with the sliding time window, a sliding window of size w is set and the latest time stamp data are stored in it; when the oldest time stamp falls outside the range w, the window slides forward, the number of events in the current window is recorded, and a time window frequency sequence is finally obtained.
More specifically, in addition to the sequence-based sliding window, preferably, the time axis is segmented into fixed-length time segments based on the time segment sliding window, each segment serves as one sliding window, the number of events is counted, and the time segments are re-segmented when the window slides. Based on the sliding window of the data stream, the event data is used as stream input to perform online processing, a window with limited size is set, the latest events are stored, when a new event arrives, the oldest event is removed, and the window is updated. Based on a sliding window of a bitmap, whether an event occurs at each time point is recorded by using the bitmap, and the sliding window obtains the event number in a section through bit operation. Based on the sliding window of the probability data structure, the probability data structure such as Bloom filter is used for estimating the number of events in the sliding window, and the memory and speed of window calculation are improved through probability statistics.
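The data-stream variant of the sliding window can be sketched as follows. The sketch assumes a time-based window of width `width` that covers the half-open range (t - width, t] for each arriving event at time t; the function name and this boundary convention are illustrative choices, not taken from the application.

```python
from collections import deque


def window_frequencies(timestamps, width):
    """For each event, count the events inside the window (t - width, t]."""
    window = deque()
    counts = []
    for t in sorted(timestamps):
        window.append(t)
        # Evict time stamps that have fallen out of the window.
        while window[0] <= t - width:
            window.popleft()
        counts.append(len(window))
    return counts
```

Each time stamp is appended and removed at most once, so the whole frequency sequence is produced in a single pass over the sorted events.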
A recurrent neural network (RNN) model with an attention mechanism is constructed and used as the time correlation learner; the time interval features and time frequency features are input, and the time correlation features are output. Because of its inherent recurrent structure, a recurrent neural network can model the sequential logic of time series data, and its ability to encode historical information suits it to learning association patterns in temporal data. The attention mechanism automatically learns the importance weights of the different temporal features. Multidimensional time features, including time intervals and frequencies, are input to describe the time correlation of events comprehensively. The RNN structure extracts nonlinear associations between the temporal features, and the attention mechanism learns the feature weights. Finally, the likelihood that events within a given time range are associated is scored, which realizes end-to-end learning of time association. Compared with hand-written rules, learning time association with an RNN is more flexible and avoids manually defining a time association range. The extracted time correlation features provide time-dimension evidence for the subsequent security association analysis. As a result, the effect of time-stamp-based security event association can be improved and time association analysis becomes more intelligent. In summary, the application uses the advantages of the RNN and the attention mechanism to realize end-to-end time correlation learning, which can enhance the performance of time correlation analysis.
Further, constructing the spatial correlation feature includes the steps of: constructing a grid index or geographic hash-based spatial index based on a spatial database algorithm; the spatial database has support for spatial data types, such as points, lines, planes, etc. The grid index divides the space into a plurality of grids, and the index is established according to the grids where the objects are located. The geographic hash maps the spatial location to a hash value, approximately determining the spatial distance. The spatial index technology can be used for rapidly judging the proximity relation of the spatial objects. Support is provided for determining whether security event sources are spatially aggregated. The use of spatial index may reduce the temporal complexity of determining spatial correlation compared to linear scanning. The grid index can accurately judge the spatial distance between objects. The geographic hash is more efficient but has errors. And constructing a spatial index by combining the coordinate or geographic position information of the spatial object. Finally, the efficiency of judging the spatial association between the security events can be greatly improved. In summary, the application can remarkably improve the performance of space association judgment by using the index technology in the space database, so that the analysis result of the space association is more accurate. Specifically, the grid index divides the space area into a plurality of grids, each grid establishes the index of the space object, the grid where the space object is located is judged according to the coordinates of the space object, surrounding objects are quickly searched through the grid index, the granularity of the index can be controlled by adjusting the size of the grid, and the space association judgment in different ranges is realized. 
The geographic hash uses a hash function to map spatial coordinates to hash values, so that spatially close objects have close hash values. The distance between spatial objects is judged by comparing their hash values; the hash function can be customized to control the error of the spatial distance. Spatial association candidates are extracted quickly through the spatial index, which reduces unnecessary association calculation and improves the efficiency and effect of spatial association analysis. For example, a set of security events that includes source IP addresses or geographic coordinates is input, a grid index or geographic hash is constructed, the spatial distance between event sources is determined, and the security event pairs within the spatial distance are output as spatial association candidates.
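The grid-index route to spatial association candidates can be sketched as below. This is a minimal two-dimensional illustration: the function names, the `cell` size parameter, and the choice to treat points in the same or an adjacent cell as candidates are assumptions made for the example.

```python
from collections import defaultdict


def build_grid_index(points, cell):
    """Map each grid cell (ix, iy) to the ids of the points that fall inside it."""
    index = defaultdict(list)
    for pid, (x, y) in points.items():
        index[(int(x // cell), int(y // cell))].append(pid)
    return index


def candidate_pairs(points, cell):
    """Return id pairs whose points lie in the same or an adjacent grid cell."""
    index = build_grid_index(points, cell)
    pairs = set()
    for (ix, iy), ids in index.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() does not create missing cells in the defaultdict.
                for b in index.get((ix + dx, iy + dy), ()):
                    for a in ids:
                        if a < b:
                            pairs.add((a, b))
    return pairs
```

Only neighbouring cells are inspected, so distant events are never compared; a smaller `cell` narrows the association range and a larger one widens it.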
Based on the constructed spatial index, the spatial distance between security events is calculated using the Manhattan distance or the Chebyshev distance, both of which are common spatial distance metrics. The Manhattan distance sums the absolute coordinate differences of the two points over all dimensions and reflects the overall distance. The Chebyshev distance takes the maximum coordinate difference over all dimensions and reflects the largest single-dimension difference. With the spatial index, the coordinate differences between spatial objects can be obtained quickly and substituted into the distance formula to compute the spatial distance efficiently. The distance measure normalizes the value of the spatial distance, which makes it easier to set a distance threshold, and the threshold controls the range of spatial association of interest. Different scenarios may choose an appropriate distance calculation method, such as the Chebyshev distance or the Manhattan distance. As a result, the distance correlation between spatial objects can be judged quickly and flexibly, and the effect of spatial correlation analysis is improved. In summary, the application makes full use of the spatial index and selects a suitable distance measure, thereby improving the performance of spatial correlation analysis.
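The two distance measures, and their use with a threshold, can be written out directly. The helper `within_threshold` and its parameters are hypothetical names introduced for illustration.

```python
def manhattan(p, q):
    """Sum of the absolute coordinate differences over all dimensions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute coordinate difference over all dimensions."""
    return max(abs(a - b) for a, b in zip(p, q))


def within_threshold(p, q, threshold, metric=manhattan):
    """Judge two event locations as spatially associated under `metric`."""
    return metric(p, q) <= threshold
```

For the same pair of points the Chebyshev distance never exceeds the Manhattan distance, so a given threshold admits at least as many pairs under the Chebyshev measure.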
Specifically, a spatial clustering mode and spatial correlation among security events are judged by using a spatial autocorrelation algorithm; spatial autocorrelation analyzes the correlation and dependency of spatial object property values. Global spatial autocorrelation detects the aggregate pattern of the entire dataset. The local spatial autocorrelation identifies a local spatial cluster. The spatial weight matrix defines a spatially correlated decay pattern. And considering the event position coordinates, judging the spatial correlation of the event attribute values. Spatially correlated security event cluster patterns may be identified. And compared with the distance threshold judgment, the spatial correlation strength is measured more comprehensively. Correlation patterns with different spatial cluster ranges can be detected. Finally, the spatial correlation knowledge of the event can be learned more accurately, and the effect of the spatial correlation measure is improved. And a richer space association analysis means is provided for the security event association analysis based on the position information. In summary, the spatial autocorrelation technique can detect complex spatial correlation patterns, enhancing the spatial correlation analysis capability of security events. 
Specifically, the spatial autocorrelation algorithms in the present application are: the global Moran's I index, which detects the aggregation or dispersion pattern of the whole data set, where a value greater than 0 indicates positive correlation and a value less than 0 indicates negative correlation; the global Geary's C index, which takes spatial weights into account to detect the degree of aggregation and serves as a stronger global spatial autocorrelation index; the local Moran's I index, which detects local spatial clustering and spatial heterogeneity and identifies spatial "hot spots"; the local G coefficient, the local version of the Getis-Ord G index, which identifies local clusters in combination with the spatial weight matrix; and the chi-square spatial autocorrelation test, which tests the significance of spatial autocorrelation based on the chi-square statistic.
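As an illustration of the first index in the list, the sketch below computes the global Moran's I with its standard definition, I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2, where m is the mean of the attribute values and W is the sum of all spatial weights. The function name and the plain nested-list weight matrix are assumptions made for the example.

```python
def morans_i(values, weights):
    """Global Moran's I for attribute `values` and a square weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)
```

With four locations on a line and neighbour weights of 1, the values [1, 1, 5, 5] give a positive index (similar values are adjacent), while the alternating values [1, 5, 1, 5] give a negative one.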
According to the obtained spatial distance, spatial clustering mode and spatial correlation, a machine learning method is adopted to establish a spatial correlation analysis model so as to establish spatial correlation characteristics reflecting the spatial correlation of the security event; spatial distance, cluster mode and relevance are taken as input features. And performing spatial correlation mode learning by using a neural network, a tree model and other machine learning algorithms. The model structure may learn complex nonlinear relationships between spatial features. The attention mechanism may learn importance weights for different spatial features. The mapping relation between the spatial features and the security event association is learned end to end. And establishing a data-driven spatial association analysis model, and setting spatial association rules independently of manual work. A score or probability reflecting the likelihood of spatial correlation is output. And extracting the spatial correlation characteristics and providing spatial correlation judgment for subsequent integral security correlation analysis. And finally, a more intelligent and interpretable space association analysis process is realized. And improving the security event association analysis effect based on the position information. In summary, the application uses machine learning to perform space association mode modeling, so that better space association analysis effect can be obtained, and the intelligent level of analysis can be enhanced. In particular, spatial clustering refers to the phenomenon of clustering of spatially adjacent objects or events. Spatial clustering patterns refer to various features that reflect such clustering phenomena. For example, global spatial clustering refers to an overall pattern in which data sets are aggregated together as a whole. Local spatial clustering refers to a dense aggregation pattern of points within a local region in a dataset. 
Cluster areas of different densities also reflect different spatial clustering patterns, as do differences in the shape and extent of the clusters. Spatial clustering indices such as the Geary index can reflect spatial clustering patterns quantitatively. In security event analysis, the spatial clustering pattern reflects how event source IPs aggregate in their spatial distribution; diagonal, linear, and block distributions of source IPs, for example, are all different spatial clustering patterns. In summary, the spatial clustering pattern is a set of metrics that describe spatial aggregation features from multiple aspects, and analyzing it makes it possible to discover the various security event aggregation phenomena that exist in space. Specifically, the spatial correlation analysis model is a machine learning model for learning and modeling spatial correlations. Its input features are the spatial distance, the spatial clustering pattern, and the spatial correlation. Its output is the spatial correlation likelihood, that is, a score or probability reflecting the strength of the spatial correlation between events. The model structure may be a neural network, a decision tree, or a similar model that learns the complex mapping between inputs and outputs. Through model learning, patterns between spatial features and event associations can be discovered automatically, so spatial association rules do not need to be defined manually and more intelligent spatial association analysis is realized. The attention mechanism can learn the importance of the different spatial features. The model can autonomously learn spatial correlation knowledge from the data and make inferential predictions. Compared with the traditional method, a more accurate and interpretable spatial association analysis model can be established.
And finally outputting the spatial correlation characteristic to provide a spatial correlation measure for the whole security event correlation analysis. In conclusion, the spatial correlation analysis model realizes intelligent and interpretable spatial correlation learning, and improves analysis effect. More specifically, based on a spatial correlation model of a convolutional neural network, different convolutional kernels learn different spatial modes by utilizing local modes of spatial features of the convolutional layers; based on a spatial correlation model of the graph neural network, representing a spatial object as a graph structure, and learning spatial dependency relations of nodes on the graph; based on a space association model of a random forest, integrating a plurality of decision trees by the random forest, and learning complex interaction relations among space features; based on Xgboost spatial correlation model, xgboost realizes GBDT integration and learns spatial feature weights; based on a spatial correlation model of a Gaussian process, the spatial correlation is modeled by utilizing the Gaussian process, so that high-efficiency Bayesian spatial learning is realized.
And adopting a convolutional neural network of an attention mechanism to abstract and express the space association characteristics in multiple levels, and judging the space association. The local pattern of spatial features is learned using the local connection structure of the convolution layer. Different convolution kernels extract different representations of the spatial features, forming a multi-level spatial feature representation. The pooling operation obtains statistical features of the spatial features in different regions. The attention mechanism automatically learns the importance weights of the different spatial features. The spatial features are pooled through multi-layer convolution to form a high-level abstract spatial representation. And the full connection layer combines the attention weight to perform judgment and reasoning of the spatial association degree. The mapping relation between the spatial features and the spatial association is learned end to end. And establishing a data-driven spatial correlation analysis model without manual definition. And outputting a judging result or probability reflecting the space association possibility. Finally, a more intelligent and efficient space correlation analysis process is realized. In conclusion, the method and the device fully utilize the advantages of the convolutional neural network to perform space learning, and can obtain more accurate space correlation analysis effects. Specifically, multi-level abstraction and expression refers to: the original spatial distance, cluster index, etc. are the underlying spatial feature representations. By means of the convolution layer, the original features are converted by different convolution kernels into higher-level feature representations, which is an abstract process. Different convolution kernels learn different aspects of the spatial features, so that the feature expression is more comprehensive. 
The pooling layer performs compression statistics on the space characteristics, and the regional characteristics are reserved, so that the method is a characteristic extraction. Through multi-layer convolution pooling, the spatial features form layered abstract expressions reflecting different sides of spatial association. The bottom layer expression reflects fine spatial details, and the high layer expression reflects a global spatial distribution pattern. The multi-level feature expression characterizes spatial association from different granularity angles, so that the model is more comprehensive to understand. The attention mechanism learns the importance of different hierarchical expressions, enabling interpretable feature weighting. And finally, summarizing the data to a full connection layer to perform spatial association joint judgment. And the multi-level characteristic abstract and expression enable the convolutional neural network to better model the spatial association relation. In conclusion, the multi-level abstraction and expression fully exert the learning capability of the convolutional neural network, and the space correlation analysis effect is improved. Specifically, advanced features formed by rolling and pooling spatial features represent input fully connected layers. And integrating spatial features with different granularities by the full-connection layer to perform comprehensive judgment. The attention mechanism learns the importance weights of the different spatial features in the judgment. Parameters of the fully connected layer reflect complex mapping relationships between spatial features and spatial associations. Patterns between spatial feature combinations and spatial associations are automatically learned by end-to-end training. The rule of space association is not required to be defined manually, and intelligent judgment is realized. 
Finally, a numerical value or probability is output, which represents the possibility of spatial correlation corresponding to the input spatial features. The larger the value, the greater the likelihood of representing spatial correlation. A threshold may be set based on the output, giving a clear judgment of the spatial correlation. Compared with the traditional method, the judgment fuses more space information, and is more intelligent and accurate. And providing important space association basis for subsequent security event association analysis. In conclusion, the spatial correlation judgment is performed based on the neural network, so that a better judgment effect can be obtained, and the intelligent degree of analysis is enhanced.
Further, constructing the safety event sequence feature and constructing the safety event causal chain comprises the following steps: and constructing a structured safety event sequence according to the time stamp sequence by utilizing the preprocessed first risk identification information, wherein the safety event sequence comprises the following steps: a code field indicating the type of the event, an ID field indicating the object of the event, a time stamp field indicating the time of the event; and preprocessing the first risk identification information, and extracting key fields. The same type of event is combined into one code representing the event type. And reserving an independent ID of each event to represent an event target entity. The different time formats are standardized as uniform time stamps. The events are organized in a linear sequence structure according to the time stamp order. The sequence contains three fields, event type code, destination ID, and timestamp. The sequence reflects the time sequence of event evolution with time as a clue. The target ID concatenates the chain of events encountered by the same object. Event type coding reflects the security significance of different events. And a structured sequence is constructed, so that event correlation can be analyzed conveniently by using a sequence model. And extracting event time and object key information, and modeling the evolution relationship of the event. And finally, the method is used for enhancing safety management and improving the risk identification effect. In summary, the application makes full use of the result of the first risk identification by constructing the structured sequence, and provides an important premise for the subsequent security event association analysis.
A sequential pattern mining algorithm is used to obtain the frequent sequence patterns in the security event sequence and to construct candidate sequence features. The input is the preprocessed, structured security event sequence data. A sequential pattern mining algorithm such as PrefixSpan is applied to identify the event sequences that occur frequently in the data. Frequent sequence patterns reflect the association rules that exist between events, and the quality of the patterns is controlled by support and confidence. A sequence pattern contains more security association information than a single event. The sequence patterns serve as candidate features from which subsequent models learn potential event correlations. The sequence features do not need to be extracted manually, which reduces the feature engineering effort, and the candidate sequence patterns can explain the logical order between events. As a result, knowledge of the association between security events can be discovered more intelligently and efficiently, and the learning of complex event association rules is improved. In summary, the method builds candidate sequence association features from sequence patterns and lays a foundation for subsequent event association analysis. Specifically, in the present application, the sequential pattern mining algorithms include the following. The PrefixSpan algorithm is a classical, efficient sequential pattern mining algorithm that discovers frequent sequence patterns by recursively partitioning the sequence database. The SPADE algorithm is a sequential pattern mining method based on depth-first search that uses bitmap compression to improve efficiency. The SPAM algorithm is a sequential pattern mining method based on depth-first search and a bitmap representation, and it scales well on large data sets.
The CloSpan algorithm improves mining efficiency by using closed sequence patterns and can find longer sequence patterns. The BIDE algorithm represents sequences with compressed bit vectors and can find hidden sequence patterns. More specifically, a frequent sequence pattern is an event sequence pattern that occurs frequently in the event sequence data set and reflects a certain ordering rule between events. For example, the pattern "<intrusion detection, asset exception>" indicates that an intrusion detection event is often followed by an asset exception event. Frequent sequence patterns are screened by setting a minimum support threshold in advance. The support measures how frequently a pattern occurs in the data set, and the confidence measures the strength of the coupling between the preceding and following events. The events in a frequent sequence pattern are strongly related in order and carry rich association information, so the pattern reveals more complex dependencies between events than a single event does. Frequent sequence patterns may be input into the association analysis model as candidate features, which helps the model learn the logical order and association rules between events. In summary, frequent sequence patterns are the key output of sequential pattern mining and provide important support for subsequent security event association analysis.
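The idea of growing frequent sequence patterns over projected portions of the sequence database can be sketched in the style of PrefixSpan. This is a simplified illustration under stated assumptions: each sequence element is a single event label (no item sets), support is the number of sequences that contain the pattern as a subsequence, and the function name and sample labels are hypothetical.

```python
def prefixspan(sequences, min_support):
    """Return {pattern: support} for all frequent subsequence patterns."""
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence index, start position) pairs that
        # describe the suffixes remaining after matching `prefix`.
        counts = {}
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item, support in counts.items():
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            next_projected = []
            for sid, start in projected:
                seq = sequences[sid]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        # Keep the suffix after the first match only.
                        next_projected.append((sid, pos + 1))
                        break
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results
```

For three sequences in which "intrusion" is followed by "asset_anomaly" twice, a minimum support of 2 keeps the pattern ("intrusion", "asset_anomaly") and discards pairs seen only once.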
An association rule mining algorithm is applied to learn the causal relationships among events from the candidate sequence features and to generate an event causal chain. The input is the candidate sequence features generated by sequential pattern mining. An association rule mining algorithm such as Apriori is applied to learn the causal relationships between events from the candidate sequences; for example, "intrusion detection -> asset anomaly" reflects such a relationship. Strongly associated causal rules are extracted according to their confidence, and connecting causally related events builds a causal chain of event evolution that clearly reflects the logical relationship between events. Rule mining thus realizes automatic learning of event causal relationships without manually defining them. As a result, a deeper understanding of the law of event evolution is obtained and the level of security management is improved. In summary, the method and system learn event causal relationships through rule mining, which makes event association analysis more efficient and automatic. Specifically, several association rule mining algorithms can be used. The Apriori algorithm is a classical association rule mining algorithm that finds frequent item sets efficiently. The FP growth algorithm is an association rule mining algorithm based on stepwise pattern growth that does not need to generate a candidate set. The Eclat algorithm is an association rule mining method that uses depth-first search and set intersection. The Charm algorithm is an efficient association rule mining algorithm based on closed frequent item sets. The RuleGen algorithm is a method for learning association rules incrementally.
The CMRules algorithm is a classification association rule mining algorithm for data classification tasks. In summary, these algorithms can be used to generate association rules efficiently from the event sequence data set, to learn the causal relationships between events, and to construct an event evolution chain.
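The step from mined patterns to a causal chain can be illustrated with a small sketch. It assumes that pattern supports are already available as a dictionary keyed by event tuples, computes the confidence of a rule a -> b as support(a, b) / support(a), and chains the highest-confidence rules greedily. The function names, the greedy chaining strategy, and the sample labels are assumptions; the resulting rules are statistical associations that the text interprets as causal.

```python
def causal_rules(pattern_support, min_confidence):
    """Derive rules a -> b from two-event patterns that meet `min_confidence`."""
    rules = {}
    for pattern, support in pattern_support.items():
        if len(pattern) != 2:
            continue
        a, b = pattern
        base = pattern_support.get((a,))
        if base:
            confidence = support / base
            if confidence >= min_confidence:
                rules[(a, b)] = confidence
    return rules


def causal_chain(rules, start):
    """Follow the strongest rule from each event until no rule applies."""
    chain, seen, current = [start], {start}, start
    while True:
        options = [
            (conf, b) for (a, b), conf in rules.items()
            if a == current and b not in seen
        ]
        if not options:
            return chain
        _, current = max(options)
        chain.append(current)
        seen.add(current)
```

The `seen` set prevents the chain from looping when rules point back to an earlier event.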
Pre-calculation based on FP growth is adopted to reduce the generation times of candidate sequence patterns; the FP growth algorithm avoids generating a large number of candidate frequent item sets through the FP tree. And recursively increasing the frequent sequence mode on the FP tree, so as to avoid scanning the database for multiple times. Firstly, pre-calculating the frequency of single events and establishing a frequent event table. The scan database only counts frequent events, building a smaller FP tree. Recursive treeing reduces traversal times only for frequent events. Compared with Apriori, the generation of candidate modes is greatly reduced. Only sequences which are actually possible frequently are tested, so that the computational complexity is reduced. The efficiency of sequential pattern mining is improved. Longer, more complex frequent sequence patterns can be mined. Support is provided for constructing high quality candidate sequence features. In conclusion, the method and the device effectively reduce candidate pattern generation through the pre-calculation of the FP growth, and improve the mining efficiency and quality. Specifically, FP growth is a frequent pattern growth algorithm (Frequent Pattern Growth Algorithm), which is an important algorithm in association rule mining. The candidate frequent item set does not need to be generated, and repeated scanning of the database is avoided. The representation data set is compressed by constructing an FP tree. Frequent patterns are recursively grown on the FP tree. Step growth of frequent patterns is utilized to avoid generating a large number of infrequent candidate patterns. Compared with the Apriori algorithm, the candidate mode test times are greatly reduced. Longer frequent patterns can be efficiently mined. Step growth is realized by constructing a conditional pattern base and a conditional FP tree. 
The candidate modes of the whole set do not need to be generated in advance, and the memory requirement is reduced. The computational complexity is greatly reduced as a whole. The method is a frequent pattern mining method with high efficiency and strong expandability. In summary, FP growth avoids generating a large number of non-frequent candidate patterns by step growth of frequent patterns, which is an efficient frequent pattern mining algorithm.
An information gain evaluation index and a minimum support threshold are applied to select, from the candidate sequence patterns, the frequent sequence patterns whose information gain is higher than the threshold and whose support meets the requirement. The information gain of each candidate sequence pattern is calculated to estimate the amount of information it contains; a high information gain indicates that the pattern is able to distinguish samples. An information gain threshold is set to screen out the patterns with sufficient information. At the same time, the support of each sequence pattern in the data set is calculated; the support measures how frequently the sequence occurs, and a minimum support threshold ensures that the pattern occurs often enough. Combining the two conditions selects sequence patterns that are both frequent and informative: the support guarantees the statistical significance of a pattern, and the information gain guarantees its distinguishing capability. Such sequence patterns represent sample characteristics effectively as model input, and selecting high-quality sequence patterns allows a more accurate event correlation model to be trained. In summary, the method extracts high-quality frequent sequence patterns through these evaluation indices and provides higher-quality features for event correlation analysis. Specifically, information gain (Information Gain) is a feature selection method mainly used to evaluate how well a feature discriminates between classes. It is based on the information entropy, which measures the uncertainty of the data set D; the larger the information entropy, the more random the data. The data set D is partitioned into subsets Di according to the feature A, and the information gain of the feature A on the data set D is calculated as IG(D, A) = H(D) - H(D|A), where H(D) is the information entropy and H(D|A) is the conditional entropy.
The conditional entropy H(D|A) measures the uncertainty of the data subsets after partitioning by A. The larger the information gain, the more clearly the feature A separates the data set. Information gain can therefore be used as an evaluation index for feature selection: a large information gain indicates that the feature has a strong ability to distinguish classes. It is commonly used for feature selection in decision tree algorithms such as ID3, and can effectively select the features that matter most for classifying the target. In summary, the information gain evaluates the classification ability of a feature by calculating how much the uncertainty of the data is reduced by partitioning on that feature. More specifically, the minimum support threshold is determined as follows. A security event data set is collected, and the various types of events that occur are recorded. The number of occurrences of each event in the data set is counted, and the support of each event is calculated. The events are sorted by support. A support distribution diagram is drawn, and the interval in which the support values are relatively concentrated is marked. The event types of interest are determined in combination with business analysis. The lowest support among the event types of interest is selected as the overall minimum support threshold; the minimum or the median of the concentrated interval may also be selected as the threshold. The threshold is used to filter out infrequent events. It may be lowered appropriately to retain more long-tail events that may be relevant. The screening results under different thresholds are compared to find a suitable value, and the threshold is fine-tuned according to the effect of subsequent analysis. A reasonable minimum support threshold is finally determined that both ensures frequency and retains sufficient information.
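The screening by information gain and minimum support described above can be sketched as follows. This is a minimal illustration only: the function names, the pattern names and the 0/1 "pattern present in sample" encoding are assumptions made for the example and are not part of the application.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(D) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, A) = H(D) - H(D|A) for one discrete feature A."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # H(D|A): entropy of each subset Di, weighted by its share of the data set
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional

def select_patterns(pattern_hits, labels, min_support, min_gain):
    """Keep patterns whose support and information gain both reach the thresholds.

    pattern_hits maps a (hypothetical) pattern name to a 0/1 list that says
    whether sample i contains the pattern.
    """
    selected = []
    for name, hits in pattern_hits.items():
        support = sum(hits) / len(hits)
        gain = information_gain(hits, labels)
        if support >= min_support and gain >= min_gain:
            selected.append(name)
    return selected

labels = ["attack", "attack", "benign", "benign"]
pattern_hits = {
    "login_then_scan": [1, 1, 0, 0],  # frequent and perfectly discriminative
    "scan_then_idle": [1, 0, 1, 0],   # frequent but uninformative
    "rare_burst": [1, 0, 0, 0],       # too rare
}
print(select_patterns(pattern_hits, labels, min_support=0.4, min_gain=0.5))
```

Only the pattern that satisfies both thresholds is kept, which mirrors the requirement that a selected pattern be both frequent and informative.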
Constructing a security event association diagram based on sequence characteristics by using the selected frequent sequence mode; the input is a frequent sequence pattern that is filtered by information gain and support. Each sequence pattern represents a composite security event. The order among the events reflects the precedence relationship among the events. In the association graph, the events are represented by nodes, and the edges represent the order between the events. Building a weighted directed graph to represent event correlation knowledge. The weight is the support of the sequence mode and represents the association strength. The strongly connected portions of the graph reflect the particularly tightly-correlated regions of event aggregation. The association diagram intuitively displays the event relationship, and is convenient for manual analysis. And can also be used as the characteristic input of the subsequent event association model. The correlation law of the events is analyzed by means of a graph calculation method. Finally, accurate modeling of the security event association can be achieved. In summary, the application carries out visual expression on the associated knowledge of the frequent sequence mode of the event by constructing the association graph, so that the association graph is structured and can be used for subsequent modeling analysis.
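The construction of the weighted directed association graph can be sketched as below. The sketch assumes that each frequent sequence pattern is given as a tuple of event names with its support, and that an edge shared by several patterns keeps the largest support; both choices are assumptions for illustration, since the application does not fix them.

```python
from collections import defaultdict

def build_association_graph(frequent_sequences):
    """Build a weighted directed graph from frequent sequence patterns.

    frequent_sequences: list of (events, support), where events is a tuple of
    event names in order of occurrence. Consecutive events become a directed
    edge whose weight is the support of the pattern (assumption: the maximum
    is kept when several patterns share an edge).
    """
    edges = defaultdict(float)
    for events, support in frequent_sequences:
        for src, dst in zip(events, events[1:]):
            edges[(src, dst)] = max(edges[(src, dst)], support)
    return dict(edges)

def successors(edges, node):
    """Events that directly follow `node`, strongest association first."""
    out = [(dst, w) for (src, dst), w in edges.items() if src == node]
    return sorted(out, key=lambda item: item[1], reverse=True)

patterns = [
    (("port_scan", "brute_force", "privilege_escalation"), 0.3),
    (("port_scan", "brute_force"), 0.5),
]
graph = build_association_graph(patterns)
print(successors(graph, "port_scan"))
```

The resulting edge dictionary can be handed to a graph library or used directly as input for later feature learning.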
Representing the security event association graph as a knowledge graph, performing feature learning and fusion with a graph attention network (GAT), and outputting sequence association features that fuse the sequence features and the association rules. The security event association graph is expressed in knowledge graph form and comprises entity nodes and relationship edges. A neural network based on the graph attention mechanism, GAT, is applied to perform feature learning. GAT can automatically learn node features and edge features, and its attention mechanism can attend to neighboring nodes with different association strengths, so that the learned node features fuse the association information of their neighbors. The output is an event representation that fuses the sequence pattern features and the association rule features. GAT performs end-to-end feature learning directly on the graph, so no manual feature engineering is required, and as a graph neural network it can process graph-structured data in which the order of nodes is irrelevant. The result is a set of event sequence features that combine topological structure with association knowledge, which improves the effect of subsequent security event association analysis. In summary, the application obtains, through the graph neural network, security event features that fuse sequence and association knowledge, and thereby provides strong support for association analysis. Specifically, GAT (Graph Attention Network) is a graph neural network that learns feature representations for the nodes of graph-structured data. The degree of association between nodes is learned automatically by the attention mechanism: for each node, an attention weight is learned for each of its neighbor nodes.
The attention weight represents the relative strength of association between nodes. The neighbor node features are fused by weighting them according to the attention weights, which yields a node feature representation that incorporates the association information of the neighbors. The multi-head attention mechanism can learn different representation subspaces of the nodes. Information leakage is avoided by the self-attention mechanism. End-to-end feature learning can be performed directly on the graph, and no manual feature engineering is required. GAT can handle graph-structured data that is independent of node order. By focusing on the key correlations in a heterogeneous network, adaptive learning of features is achieved. In summary, GAT is an effective graph neural network that automatically learns node representations using an attention mechanism.
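The attention-weighted fusion of neighbor features can be illustrated with a single attention head written in plain Python. This is a sketch of the forward pass only, under stated assumptions: the weight matrix W and the attention vector a are passed in by hand rather than learned, each node attends to itself as well as to its neighbors, and the output nonlinearity and multi-head concatenation of a full GAT are omitted.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_layer(features, neighbors, W, a):
    """One single-head graph attention layer (forward pass only).

    features:  {node: feature vector}
    neighbors: {node: list of neighbor nodes}
    W:         weight matrix (out_dim x in_dim), hypothetical fixed values
    a:         attention vector of length 2 * out_dim, hypothetical fixed values
    Returns (new_features, attention), where attention[i][j] is the weight
    node i gives to node j.
    """
    z = {n: matvec(W, h) for n, h in features.items()}
    new_features, attention = {}, {}
    for i in features:
        nbrs = [i] + [j for j in neighbors.get(i, []) if j != i]
        # e_ij = LeakyReLU(a . [W h_i || W h_j])
        scores = [leaky_relu(sum(w * x for w, x in zip(a, z[i] + z[j]))) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[i] = dict(zip(nbrs, alphas))
        # weighted fusion of the transformed neighbor features
        new_features[i] = [
            sum(al * z[j][k] for al, j in zip(alphas, nbrs)) for k in range(len(z[i]))
        ]
    return new_features, attention

features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, attention = gat_layer(features, neighbors, identity, [0.0, 0.0, 1.0, 1.0])
print(attention["a"])
```

With the attention vector chosen here, node "a" gives more weight to neighbor "c" than to "b", because the score depends on the sum of the neighbor's features. In a trained model W and a would be learned end to end.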
Further, constructing threat intelligence features IOC includes the steps of: acquiring a threat report containing IOC indexes from a cloud platform; a cloud security platform is selected that provides threat intelligence analysis services. The cloud platform integrates threat data of various large security vendors and research institutions. Real-time monitoring of ADVANCED PERSISTENT THREATS (APT) is provided. Open API interfaces and SDK access are supported. And acquiring the access right of the read interface by using the API key authentication. And calling a related interface to acquire the latest threat report containing the IOC index. The IOC index includes IP address, domain name, file Hash, etc. Detailed technical analysis of the threat is provided in the report. And analyzing the report, extracting IOCs and constructing an index list. The IOC indicator may be imported into the security device for threat detection. And acquiring comprehensive and timely threat information by means of a cloud platform. Helping to improve threat awareness capabilities of the enterprise network. In summary, the application utilizes the threat analysis service of the cloud security platform to obtain the latest IOC index, which is helpful for improving the defending capability of enterprises.
Analyzing the obtained threat report by using an XML analyzer, and extracting an atomic IOC index in the report, wherein the atomic IOC index comprises an IP address, a domain name and a file hash; threat reports are often provided in XML format. The report file is parsed using an XML parser. By traversing the XML document tree, the tags of the IOC indicator are found. Atomic IOC tags include IP, domain name, hash, etc. The text content of each IOC tag is parsed. The text content is an atomic IOC index. The extracted atomic IOCs are stored in a database table. And constructing an IP address table, a domain name table and a Hash table. Different types of IOCs may also be stored in the IOC indicator database in a unified manner. And checking the analysis result and filtering invalid contents. And processing the continuously acquired report by adopting an incremental updating mode. The use of a parser may facilitate the rapid extraction of structured IOC content. And automatically obtaining a large number of threat information indexes. The latest threat information is effectively acquired, and the safety monitoring efficiency is improved. In summary, the present application uses an XML parser to efficiently and automatically obtain IOC indicators for threat reports on a large scale.
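The traversal of the XML document tree and the extraction of atomic IOC indexes can be sketched with the Python standard library. The tag names `ip`, `domain` and `hash` and the report layout are hypothetical, since real threat report schemas differ between providers. For reports from untrusted sources, a hardened XML parser would be a safer choice than the standard one used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names for atomic IOC indexes; real report schemas vary.
IOC_TAGS = ("ip", "domain", "hash")

def extract_atomic_iocs(xml_text):
    """Walk the XML tree and collect the text of every IOC tag, one set per type."""
    root = ET.fromstring(xml_text)
    tables = {tag: set() for tag in IOC_TAGS}
    for element in root.iter():
        kind = element.tag.lower()
        value = (element.text or "").strip()
        if kind in tables and value:  # skip empty or non-IOC elements
            tables[kind].add(value.lower())
    return tables

SAMPLE_REPORT = """<report>
  <indicator><ip>203.0.113.7</ip><domain>Bad.Example.com</domain></indicator>
  <indicator><hash>D41D8CD98F00B204E9800998ECF8427E</hash><ip> </ip></indicator>
</report>"""

iocs = extract_atomic_iocs(SAMPLE_REPORT)
print(iocs)
```

Using sets gives de-duplication for free, which suits the incremental update mode: indexes from a newly acquired report can simply be merged into the existing sets.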
Performing association analysis on the extracted atomic IOC indexes to generate combined IOC indexes; association rule mining is carried out on the extracted atomic IOC indexes. The Apriori algorithm may be employed for frequent item set and association rule analysis, and more efficient algorithms such as FP-growth may also be used. Combinations of atomic IOCs that frequently occur together are found. The quality of a combined IOC is evaluated with support and confidence: the support evaluates the statistical frequency with which the combination occurs, and the confidence evaluates the strength of the logical association within the combination. Combinations with higher support and confidence are selected as the aggregated combined IOC indexes. A combined IOC can describe an attack more fully. The generated combined IOCs are stored in a database. Through association analysis the combined IOC index set is expanded automatically, which improves the ability to detect continuously evolving new attacks. In summary, the application efficiently obtains more high-quality combined IOC indexes through association analysis. Specifically, IOC stands for Indicators of Compromise (literally "intrusion trace indicators" in the Chinese original). An atomic IOC index is the smallest unit of IOC and generally includes the following classes: IP address: the address of a malware control server, etc. Domain name: the domain name of a malicious website, a command and control server, etc. URL: a website related to attack activity. File hash: the MD5, SHA1 or other hash of a malware sample. Process name: the process name of the malware. Internet user ID: the user name used to perform the attack, etc. An atomic IOC index describes a single, detectable attack-related entity. It can accurately indicate and prove intrusion behavior that has occurred, and it is the basis for constructing and using threat intelligence. Attacker pattern analysis, association analysis and the like can be performed with atomic IOCs.
The atomic IOC is a core element for implementing intelligence-based threat detection. In summary, the atomic IOC index is an atomic-level abstract description of the key technical features of a threat, and is an important basis for intrusion detection and evidence collection.
Verifying the combined IOC index by utilizing the network traffic and the log data; network traffic and various security log data are collected. Traffic data includes North-South and East-West traffic. The log data includes logs of firewalls, IDSs, terminals, applications, etc. The combined IOC related entity is retrieved from the data. Such as IP, domain name, file hash, etc. Statistical correlations between different entities are analyzed. Such as resolution of IP and domain names. The time relationship in the traffic and log is compared. And judging whether the entity indexes of the combined IOC match with the data. It is verified whether the combined IOC is present in the actual environment. The number of verification hits for each combined IOC is counted. And evaluating the quality of the combined IOC according to the verification result. And verifying the combined IOC with high support degree, and further generating tag data. And (5) feeding back a verification result to optimize the IOC association analysis model. In summary, the application improves the quality and reliability of the combined IOC through multi-source heterogeneous data verification. North-South traffic: refers to traffic entering or leaving a data center, typically traffic generated by a user accessing the internet. This portion of the traffic needs to be monitored through boundary security devices such as firewalls, IPS, etc. East-West traffic: the traffic between data centers or enterprise internal systems is traffic generated by the communication of internal systems such as different application servers, databases and the like. This portion of the traffic is typically large but is not monitored by the boundary safety equipment. The reason for distinguishing these two types of traffic: north-South traffic interacts with the external environment and has higher security threats. The East-West traffic is relatively safer inside. 
The two types of traffic require different monitoring strategies and devices. Analyzing different traffic may more fully discover threats. Such as east-west traffic, may analyze intranet attack paths. Therefore, the two types of traffic are comprehensively utilized to construct a richer safe data source, and the verification of the combined IOC index is more accurate and comprehensive.
Performing one-hot encoding and vectorization on the verified combined IOC indexes to construct structured IOC features; the verified combined IOC indexes are extracted. Each combined IOC contains multiple atomic indexes. Each atomic index is one-hot encoded, that is, a binary vector is created for each atomic index. A multi-field categorical feature vector space is constructed, in which each type of IOC has its own field, such as an IP field, a domain name field and a hash field. Each field marks the presence of an IOC index bit by bit. The field vectors are then concatenated to form the encoded representation of the combined IOC, which gives a structured feature vector of fixed length. The vector is a numerical description of the composition of the IOC and can be input into machine learning and deep learning models, which are then used for association detection of attack behaviors. This feature engineering markedly improves the detection effect of the indexes. In summary, the application uses encoding and vectorization to convert textual IOCs into structured numerical features, which facilitates pattern recognition and security analysis.
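The per-field one-hot encoding and concatenation can be sketched as follows. The representation of a combined IOC as a dictionary from field name to a list of atomic values, and the alphabetical ordering of fields and values, are assumptions made so that the example yields a reproducible fixed-length vector.

```python
def build_vocabulary(combined_iocs):
    """Collect, for every field (ip, domain, hash, ...), the sorted list of known values."""
    fields = {}
    for ioc in combined_iocs:
        for field, values in ioc.items():
            fields.setdefault(field, set()).update(values)
    # Sort fields and values so that the bit positions are stable
    return {field: sorted(values) for field, values in sorted(fields.items())}

def encode(ioc, vocabulary):
    """Concatenate one binary presence vector per field into a fixed-length vector."""
    vector = []
    for field, values in vocabulary.items():
        present = set(ioc.get(field, ()))
        vector.extend(1 if value in present else 0 for value in values)
    return vector

combined_iocs = [
    {"ip": ["203.0.113.7"], "domain": ["bad.example.com"]},
    {"ip": ["198.51.100.9"], "hash": ["abc123"]},
]
vocabulary = build_vocabulary(combined_iocs)
vectors = [encode(ioc, vocabulary) for ioc in combined_iocs]
print(list(vocabulary), vectors)
```

Each vector has one bit per known atomic value, grouped by field, so every combined IOC maps to a vector of the same length regardless of how many atomic indexes it contains.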
Selecting IOC features with TFIDF weight greater than a threshold value and information gain greater than the threshold value from the structural IOC features obtained by encoding by applying a feature selection algorithm based on TFIDF and information gain; TFIDF weights are computed for each structured IOC feature. TFIDF evaluates the importance of features in different samples. Information gain for each feature is calculated for the classification/clustering target. Information gain assessment feature contribution to target discrimination. And setting a TFIDF weight threshold value, and filtering the features with smaller weights. An information gain threshold is set, and a larger gain characteristic is selected. Two threshold filtering yields a decision-making feature subset. Reducing redundancy and interference of irrelevant features. Key classification/clustering features are retained. And the dimension reduction improves the model efficiency. TFIDF screens for high frequency important features. The information gain selects a high discrimination feature. The two are combined to perform multi-angle feature selection. A subset of features that are more sensitive to IOC association detection is obtained. And the performance of subsequent modeling is improved. In summary, the application integrates TFIDF with information gain for structured feature selection, which can obtain more sensitive and effective features for security event detection.
Performing anomaly detection on the selected IOC features by using an isolation forest model, and filtering out invalid IOC indexes; a plurality of isolation trees are constructed, and each tree is trained on a sub-sample. Each tree uses a randomly selected subset of features. The anomaly score of each sample on each tree is calculated, and the results of all trees are averaged to give the final anomaly score. The average anomaly score of an anomalous sample is markedly higher. A reasonable threshold is set to detect anomalous samples. Isolation forest anomaly detection is carried out on the selected IOC feature samples, and samples with a high average anomaly score are filtered out. These samples may correspond to erroneous or invalid IOCs, so filtering them effectively reduces the interference of erroneous IOCs with subsequent modeling. Incremental training may be employed to take the time factor into account: newly added IOC samples continue to undergo isolation forest detection, and the IOC index set is dynamically adjusted and optimized. This improves the quality of IOC association modeling.
In summary, the application uses the isolation forest technique to detect IOC anomalies dynamically and to filter out false-alarm IOC indexes, thereby improving the effect of downstream analysis. Specifically, the isolation forest (Isolation Forest) model is an unsupervised anomaly detection algorithm. An isolation forest contains a plurality of isolation trees, and each tree is trained with randomly selected features. Each sample is split recursively from the root; anomalous samples are easier to isolate and therefore have shorter path lengths, whereas normal samples have longer splitting paths and are harder to isolate. The path lengths of a sample over all trees are averaged to obtain its average path length. The shorter the average path length, the more likely the sample is an outlier. A threshold is set, and a sample whose average path length is below the threshold is judged to be anomalous. The randomness of the multiple trees enhances the robustness of the detection. No samples labeled as normal or abnormal are needed, so the method can be used for unsupervised anomaly detection. The method is suitable for high-dimensional sparse data and is highly efficient. In summary, the isolation forest judges outliers by path length under recursive isolation, and can effectively detect different types of anomalous samples.
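The path-length principle can be illustrated with a small pure-Python isolation forest. This is a simplified sketch rather than a production implementation: a random feature is chosen at every split, the score follows the commonly used normalization 2^(-E[h(x)] / c(n)), and the parameter defaults and toy data are assumptions for the example.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful search among n points (score normalizer)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value until points are isolated."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    low = min(p[dim] for p in points)
    high = max(p[dim] for p in points)
    if low == high:  # cannot split on this feature
        return {"size": len(points)}
    split = rng.uniform(low, high)
    return {
        "dim": dim,
        "split": split,
        "left": build_tree([p for p in points if p[dim] < split], depth + 1, max_depth, rng),
        "right": build_tree([p for p in points if p[dim] >= split], depth + 1, max_depth, rng),
    }

def path_length(point, node, depth=0):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(point, node[child], depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score in (0, 1]; a shorter average path gives a score closer to 1 (more anomalous)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng) for _ in range(n_trees)]
    scores = []
    for p in points:
        mean_path = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2 ** (-mean_path / c_factor(size)))
    return scores

# Twenty clustered feature vectors plus one far-away vector standing in for an invalid IOC
points = [(x * 0.1, y * 0.1) for x in range(5) for y in range(4)] + [(50.0, 50.0)]
scores = anomaly_scores(points)
print(scores.index(max(scores)))
```

The far-away point is isolated after very few splits, so it receives the highest score and would be the first candidate to filter out once a threshold is chosen.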
And combining the filtered IOC indexes to construct threat information features. And integrating the IOC indexes subjected to verification, coding, selection and anomaly detection and filtration. These IOC indices are all high quality effective indices. IOC metrics are organized into respective sets according to different types. Such as IP address sets, domain name sets, HASH sets, etc. Inside each set, the occurrence frequency of each index can be counted. Only the index with higher frequency is reserved according to a certain threshold value. Each set is then encoded with one hotization. The codes of different sets are concatenated to form a sample. Each sample represents a combined IOC instance. All samples constitute a feature space and a training set. A portion of the sample may be marked to obtain labeled training data. And performing behavior detection modeling by using a machine learning method. And the behavior clustering analysis can also be performed by adopting an unsupervised learning method. The IOC after combination treatment has stronger distinguishing and expressing capacity. Extraction of attack patterns is facilitated, and unknown threats are found. In summary, the application constructs high-quality threat intelligence features by combining and integrating different types of effective IOCs, and provides important support for security monitoring and defense.
Further, acquiring the association between the first risk identification information and the second risk identification information includes the steps of: constructing an association rule matrix containing security events and risk results; security event data is collected, including information such as time, type and severity level. Risk inspection result data is collected, including assets, risk types, risk values and the like. Association analysis is carried out on the event data and the risk result data, and association rules are mined using an algorithm such as Apriori or FP-growth. The support and confidence of each rule are recorded. A set of association rules from security events to risk results is constructed and represented as an association matrix M. The horizontal axis of the matrix represents security events and the vertical axis represents risk results. The value M[i][j] represents the confidence of the rule from event i to result j. The matrix visually represents the association between events and risk results. Weaker association rules are filtered out by a confidence threshold, which yields a more accurate association matrix of how security events affect risk results. The matrix may be used to evaluate and predict the impact of events on asset risk, and the risk propagation path may also be analyzed through the matrix. In summary, the quantitative association between events and risk results is visually represented by constructing the association matrix.
Calculating the support and confidence of each association rule in the association rule matrix; traversing the event data set and the risk result data set. And counting the occurrence times of each event and result, and calculating the respective support degree. The support is defined as the number of occurrences/total number of records. For each association rule event X- > result Y. The number of simultaneous occurrences of X and Y is counted as n (X, Y). Calculating the support degree of the rule: n (X, Y)/total record number. Calculating the confidence of the rule: n (X, Y)/n (X). n (X) represents the number of records in which event X occurs. The support reflects the frequency with which the rule appears. The confidence reflects the likelihood that event X resulted in Y. And counting the support and confidence of all rules. And filling the result into the corresponding position of the association matrix. The support and confidence may also be used as additional attributes of the rule. Rules with higher confidence and support are preferentially selected. The correlation matrix is filtered through confidence. In summary, the application counts the support and confidence of each rule, builds an interpretable association matrix, and accurately reflects the association relationship between the event and the risk.
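The counting described above can be sketched as follows. The record format (one event and one risk result per record), the alphabetical ordering of the axes and the event names are assumptions for the example.

```python
from collections import Counter

def association_matrix(records, min_confidence=0.0):
    """Compute support and confidence for every rule event X -> result Y.

    records: list of (event, result) pairs, one pair per record.
    support[i][j]    = n(X, Y) / total number of records
    confidence[i][j] = n(X, Y) / n(X), set to 0.0 when below min_confidence
    """
    total = len(records)
    pair_counts = Counter(records)
    event_counts = Counter(event for event, _ in records)
    events = sorted(event_counts)
    results = sorted({result for _, result in records})
    support = [[0.0] * len(results) for _ in events]
    confidence = [[0.0] * len(results) for _ in events]
    for i, event in enumerate(events):
        for j, result in enumerate(results):
            n_xy = pair_counts[(event, result)]
            support[i][j] = n_xy / total
            conf = n_xy / event_counts[event]
            confidence[i][j] = conf if conf >= min_confidence else 0.0
    return events, results, support, confidence

records = (
    [("brute_force", "account_takeover")] * 3
    + [("brute_force", "none")]
    + [("port_scan", "none")] * 4
)
events, results, support, confidence = association_matrix(records, min_confidence=0.5)
print(events, results, confidence)
```

Here the weak rule from brute_force to "none" (confidence 0.25) is zeroed out by the confidence filter, while the rule from brute_force to account_takeover (confidence 0.75) is kept.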
Selecting, from the association rule matrix, strong association rules whose support and confidence exceed preset thresholds by using the Apriori algorithm; frequent item sets and association rules are mined using the Apriori algorithm. A minimum support min_sup and a minimum confidence min_conf are set. The data set is traversed to find the frequent item sets that satisfy min_sup, and strong association rules that satisfy min_conf are generated. The support and confidence of each rule are calculated, and the rules that meet the conditions are stored in a candidate rule set. Rule matching is then performed on the association matrix. If a rule satisfies both conditions, namely support > minimum support threshold and confidence > minimum confidence threshold, the rule is selected into the strong association rule set. These strong rules are marked in the association matrix, where they correspond to the high-confidence positions. The strong association by which events affect risk is thus represented visually. Strong rules help to focus on highly correlated event-risk combinations. The Apriori algorithm uses frequent items to mine stable association patterns, and valuable rules can be found in combination with the confidence constraint. In summary, the method combines Apriori with a confidence constraint to select strong association rules, and can accurately predict the influence of events on risk. Specifically, the Apriori algorithm is a classical association rule mining algorithm, mainly used for finding frequent item sets and association rules in large-scale data sets. The basic idea is as follows: all frequent item sets are found first, and each frequent item set must meet the minimum support threshold. Association rules are then generated from the frequent item sets, and each rule must meet the minimum confidence threshold.
Apriori follows the idea of first determining the frequent item sets and then generating association rules from them. Apriori uses the downward closure property to find frequent item sets iteratively: an item set can be frequent only if all of its subsets are frequent. This property is used in each iteration to avoid checking item sets that cannot be frequent, which improves efficiency. After all frequent item sets have been found, each one is checked to see whether it can generate an association rule with a confidence greater than the threshold. The frequent item sets and association rules that meet the conditions are output. In summary, the Apriori algorithm exploits the properties of frequent item sets and mines association rules efficiently through an iterative, level-by-level derivation strategy, which makes it a simple and effective association analysis algorithm.
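The level-by-level search with downward-closure pruning can be sketched in a few lines. This is a compact teaching version, not an optimized one, and it uses "greater than or equal to" for both thresholds; the transaction contents are made up for the example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frequent itemset (frozenset): support}, found level by level."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: s for c, s in ((c, support(c)) for c in current) if s >= min_sup}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Downward closure: every (k-1)-subset of a candidate must itself be frequent
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def strong_rules(frequent, min_conf):
    """Generate rules left -> right with confidence = sup(itemset) / sup(left)."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                left = frozenset(left)
                conf = sup / frequent[left]
                if conf >= min_conf:
                    rules.append((left, itemset - left, sup, conf))
    return rules

transactions = [
    {"scan", "brute", "leak"},
    {"scan", "brute"},
    {"scan", "brute", "leak"},
    {"scan"},
    {"phish"},
]
frequent = apriori(transactions, min_sup=0.4)
rules = strong_rules(frequent, min_conf=0.9)
for left, right, sup, conf in rules:
    print(sorted(left), "->", sorted(right), round(sup, 2), round(conf, 2))
```

Because every subset of a frequent item set is itself frequent, the lookup `frequent[left]` in the rule generation step always succeeds.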
Splitting the left-side event and the right-side event of each selected strong association rule into a plurality of fields, calculating the pearson correlation coefficients between the fields, and calculating the order of event occurrence times as a time weight; each selected strong association rule is split. The left-side event is split into multiple fields, such as event type and event level. The right-side risk result is also split into multiple fields, such as asset type and risk value. A pearson correlation coefficient is calculated for each combination of fields on the two sides of the rule to evaluate the strength of the numerical correlation between the fields. Timestamp information is extracted from the event data, and the time order of the left-side and right-side events is compared. If the left-side event occurs first, the time weight is set to 1; if the right-side event occurs first, the time weight is set to -1. The correlation matrix between the fields and the time weight are thus obtained. The correlation matrix reflects how the event fields match the risk fields, and the time weight represents the temporal logic by which the event affects the risk. Combining the correlation matrix and the time weight allows the mechanism by which events give rise to risk to be analyzed in depth, and provides support for risk prediction and asset association. In summary, the application fully examines the quantitative association within each rule, which helps to analyze the intrinsic relation by which events trigger risk. The left-side event of a strong association rule is a security event that causes risk, the right-side event is the corresponding risk result, and the security event and the risk result are each split into a plurality of fields; the risk result includes a risk level and a risk category. The left side of a strong association rule is a security event that causes a risk.
The security event is split into multiple fields, such as event type, event level, etc. To the right of the rule is the corresponding risk result. The risk results are split into a risk level field and a risk category field. The risk level may take on three levels, high/medium/low. The risk category represents an asset category, such as a server, a network, etc. And calculating pearson correlation coefficients between the fields at the left side and the right side of the rule. The event fields are analyzed for relevance to risk levels and categories. And simultaneously calculating the time sequence of the event and the risk result. The rule weights are calculated by integrating the time sequence and the correlation. And constructing a correlation model of the security event, the risk level and the category. The new event is evaluated for the likelihood that it will result in a different risk. And outputting the result according to the risk level and the asset class. The assessment results support risk stratification management and critical asset protection. In summary, the application clearly shows the meaning of the left side and the right side of the strong rule, and the thinning of the risk result field on the right side is helpful for in-depth analysis of the association between the event and the fine granularity risk.
Constructing a security risk association degree model, wherein the weight of each association rule in the model is determined by the time weight and the pearson correlation coefficient; a time weight and a correlation matrix are calculated for each strong association rule. The time weight reflects the temporal logic of the events, and the correlation matrix reflects the correlation between fields. The time weights are normalized to the range [-1, 1]. The absolute values of the correlation matrix are taken and normalized to the range [0, 1]. This gives the time weight Wt and the correlation matrix Wr of the rule. The comprehensive weight of the rule is calculated as W = α × Wt + (1 - α) × Wr, where α is the time weight coefficient, which controls the relative proportion of time and correlation. W thus combines time order with association strength. All strong rules and their weights W are integrated into the association degree model. For a newly detected event, the likelihood that it gives rise to each kind of risk is assessed according to the model, and the risks are returned together with their model association weights W. The risks are sorted by weight to obtain the evaluation result. By adjusting the parameter α, the relative effect of time order and correlation can be tuned dynamically. In summary, the application establishes a quantitative association model linking security events and risks, and evaluates the possibility that an event gives rise to a risk.
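The comprehensive weight and the ranking of risks for a new event can be sketched as below. The sketch assumes that the correlation matrix of a rule has already been reduced to a single number (for example the mean of its absolute entries), and the rule dictionary keys and example values are hypothetical.

```python
def rule_weight(time_weight, correlation, alpha=0.5):
    """W = alpha * Wt + (1 - alpha) * Wr, with Wt in [-1, 1] and Wr in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    wt = max(-1.0, min(1.0, time_weight))  # clamp the time weight to [-1, 1]
    wr = min(1.0, abs(correlation))        # absolute correlation, capped at 1
    return alpha * wt + (1 - alpha) * wr

def rank_risks(rules, event, alpha=0.5, top_k=3):
    """Return the top_k (risk, W) pairs for the given event, highest weight first."""
    scored = [
        (rule["risk"], rule_weight(rule["time_weight"], rule["correlation"], alpha))
        for rule in rules
        if rule["event"] == event
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

rules = [
    {"event": "brute_force", "risk": "account_takeover", "time_weight": 1, "correlation": 0.9},
    {"event": "brute_force", "risk": "data_leak", "time_weight": -1, "correlation": 0.8},
    {"event": "port_scan", "risk": "service_outage", "time_weight": 1, "correlation": 0.2},
]
ranked = rank_risks(rules, "brute_force", alpha=0.5)
print(ranked)
```

Raising alpha gives more influence to the time order, while lowering it lets the field correlation dominate the ranking.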
Calculating the association between the first risk identification information and the second risk identification information according to the constructed security risk association degree model; a first piece of risk identification information containing event content is input. The association weight between the event and each risk result is calculated according to the association degree model, and the list of the top k possibly related risks is returned. The same association degree calculation is performed on the second piece of risk information. This gives the risk result sets A and B associated with the two pieces of risk information respectively. The intersection C = A ∩ B is calculated. If C is not empty, the two pieces of risk information share associated risks and are associated with each other. The degree of association may be expressed by the relative size of the intersection, |C| / min(|A|, |B|). The larger |C| is, the stronger the association between the two pieces of risk information. If C is empty, the two pieces of risk information share no associated risk, and no association exists. Potential associations between pieces of risk identification information are thus mined through the association degree model. Risks triggered by the same event sequence are analyzed together, which improves the detection effect. In summary, based on the constructed association degree model, the application judges the quantitative association between two pieces of risk information by comparing the risk sets to which they are linked.
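The comparison of the two risk result sets reduces to a short set computation. The function name and the example risk names are assumptions; returning 0.0 when either set is empty is also an assumption, chosen to avoid a division by zero.

```python
def association_degree(risks_a, risks_b):
    """|C| / min(|A|, |B|) with C = A ∩ B; 0.0 means no common associated risk."""
    a, b = set(risks_a), set(risks_b)
    if not a or not b:
        return 0.0  # assumption: an empty risk set is treated as "no association"
    return len(a & b) / min(len(a), len(b))

first_info_risks = {"account_takeover", "data_leak", "ransomware"}
second_info_risks = {"data_leak", "ransomware"}
print(association_degree(first_info_risks, second_info_risks))  # 1.0
print(association_degree(first_info_risks, {"service_outage"}))  # 0.0
```

Dividing by the smaller set means that a small risk set fully contained in a larger one counts as fully associated.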
Further, calculating the order of event occurrence times comprises the following steps: setting a time weight adjustment factor wt that represents the time order in which the events occur; if the occurrence time t1 of the left-side event is earlier than the occurrence time t2 of the right-side event, setting wt to α, where α is a constant between 0 and 1; if the occurrence time t2 of the right-side event is earlier than the occurrence time t1 of the left-side event, setting wt to 1; the timestamps of the left-side and right-side events of each strong association rule are extracted, and their order is compared. If the left-side event time t1 is earlier than the right-side event time t2, the time weight adjustment factor is set to wt = α, where α is a constant in (0, 1), such as 0.8. In this case the left-side event occurs first, which conforms to the temporal logic of an event giving rise to a risk, and the value of α indicates the degree of positive effect of this time order. If the right-side event time t2 is earlier than the left-side event time t1, wt = 1 is set. This is not the normal case, and setting wt to its maximum value of 1 indicates a negative effect. wt thus varies with the time order: when the left-side event occurs first wt is smaller, and when the right-side event occurs first wt is 1. The weight W is calculated by combining wt with the pearson correlation coefficients. The setting of wt takes full account of the influence of event time order on the rule, and the proportion given to time order is controlled by adjusting the parameter α. In summary, the time weight adjustment factor wt represents the contribution of the time order to the rule, and the effect of the time factor can be regulated flexibly.
Calculating the product of the Pearson correlation coefficient between each pair of fields and the time weight adjustment factor wt to obtain an adjusted correlation coefficient that represents the correlation after the time factor is considered; and integrating the adjusted correlation coefficients of all fields to obtain the overall correlation of the event pair. Specifically, for each event field in the rule, the Pearson correlation coefficient P with the risk field is calculated separately, where P represents the linear correlation between the two fields. The time weight adjustment factor wt of the event pair is calculated. Each correlation coefficient is then multiplied by the factor, P' = wt × P, giving the adjusted correlation coefficient P', which reflects both the inter-field correlation and the time order. All P' in the rule are integrated as P_global = ΣP' / n, where n is the number of field pairs and P_global is the overall correlation of the rule. P_global integrates the association strength of every field and also accounts for the influence of event time on the rule. P_global is combined with the confidence to obtain the comprehensive weight of the rule, and rules with high weights enter the association model. Finally, a security risk association model with detailed field associations and accurate event time order is obtained, which facilitates assessing the overall relationship between events and the risks they lead to. In summary, the application constructs an association model with accurate event timing and fine-grained field relations by adjusting and integrating the correlation coefficients.
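The calculation of P' = wt × P and P_global = ΣP' / n can be illustrated with plain Python. This is only a sketch: the helper names `pearson` and `overall_relevance` and the sample field values are assumptions, and a field with zero variance is given a coefficient of 0 here as a convention the text does not specify.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumed convention for constant fields
    return cov / (sd_x * sd_y)


def overall_relevance(field_pairs, wt):
    """P_global: mean of the adjusted coefficients P' = wt * P over n field pairs."""
    adjusted = [wt * pearson(x, y) for x, y in field_pairs]
    return sum(adjusted) / len(adjusted)


# Two hypothetical (event field, risk field) pairs for one rule.
pairs = [
    ([1, 2, 3, 4], [2, 4, 6, 8]),   # perfectly correlated
    ([1, 2, 3, 4], [4, 3, 2, 1]),   # perfectly anti-correlated
]
print(overall_relevance(pairs[:1], wt=0.8))
print(overall_relevance(pairs, wt=0.8))
```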
Further, preprocessing the collected first risk identification information and second risk identification information and converting them into a structured format includes the following steps: setting up a database for storing the first risk identification information and the second risk identification information. Specifically, a relational database is created and a risk identification information table is created in it. The table contains the following fields: information ID, the primary key that uniquely identifies each piece of information; information content, the text content of the identification information; capture time, the timestamp at which the information was captured; and information source, the information source or detection system. Indexes are built on the ID and time fields to speed up queries. Each piece of risk identification information is stored in time order: the first risk identification information and the second risk identification information are inserted into the table according to detection time. More information can be recorded by adding fields such as region and keywords. The database may be a relational or non-relational database such as MySQL or MongoDB, and a suitable database product is selected according to the data volume. The risk identification information is stored and managed in structured form through the database, which provides data support for risk association analysis, model evaluation and the like. In summary, the database scheme can store multidimensional risk identification information, facilitates association analysis and retrieval, and improves the data processing capability of risk management.
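A table of this shape can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a self-contained stand-in for the MySQL or MongoDB deployment mentioned above, and the table name, column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE risk_identification_info (
        info_id      INTEGER PRIMARY KEY,  -- uniquely identifies each piece of information
        content      TEXT NOT NULL,        -- text content of the identification information
        capture_time TEXT NOT NULL,        -- ISO 8601 timestamp of capture
        source       TEXT NOT NULL         -- information source or detection system
    )
    """
)
# The primary key is indexed automatically; add an index for time queries.
conn.execute(
    "CREATE INDEX idx_capture_time ON risk_identification_info (capture_time)"
)

# Hypothetical records from both kinds of risk identification information.
rows = [
    ("Failed login from 10.0.0.5", "2023-11-01T08:05:00", "server_log"),
    ("Known malicious IP 203.0.113.7", "2023-11-01T08:00:00", "threat_intel"),
]
# Insert in order of detection time, as described above.
conn.executemany(
    "INSERT INTO risk_identification_info (content, capture_time, source) "
    "VALUES (?, ?, ?)",
    sorted(rows, key=lambda r: r[1]),
)

ordered = conn.execute(
    "SELECT source FROM risk_identification_info ORDER BY capture_time"
).fetchall()
print(ordered)
```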
Converting unstructured data in the database into structured data according to a predefined security risk data model, using algorithms comprising natural language processing and machine learning. Specifically, a security risk data model containing standard fields is predefined; the fields include time, place, participant, event type, risk type and the like. Natural language processing is performed on the unstructured text data in the database: entity words and keywords in the text are extracted using named entity recognition, and structured fields are extracted based on rule matching and word vector techniques. A machine learning model is used to classify and correct the extraction results; a sequence labeling model (SEQ2SEQ) is trained to correct unrecognized and incorrectly extracted fields. The extracted and corrected structured data is mapped onto the predefined data model, converted into the model format and recorded in a structured database. The extraction and error correction models are optimized iteratively to improve the quality of the structured conversion. The structured data can be used directly by algorithms for association analysis, risk identification and the like, so that the value of the large volume of unstructured data in the database is effectively exploited. In summary, the application realizes automatic conversion from unstructured data to structured data through NLP and machine learning algorithms, combined with a predefined data model so that the result is suitable for risk analysis algorithms.
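The rule-matching part of this conversion can be illustrated with a regular expression that maps a log line onto a few fields of a predefined data model. This sketch covers only rule matching, not named entity recognition or the SEQ2SEQ correction step; the `RiskRecord` class, the log layout and the field selection are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskRecord:
    """A reduced, hypothetical version of the predefined data model."""
    time: str
    participant: str
    event_type: str


# Assumed log layout: "<ISO time> <host> <EVENT_TYPE> <free text>"
LOG_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<participant>\S+)\s+"
    r"(?P<event_type>[A-Z_]+)\b"
)


def to_structured(line: str) -> Optional[RiskRecord]:
    """Return a RiskRecord for a matching line, or None if no rule applies."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # left for a later correction or relabeling step
    return RiskRecord(**match.groupdict())


record = to_structured("2023-11-01T08:05:00 web01 LOGIN_FAILED user=admin")
print(record)
```

Lines that return `None` are the natural input for the correction model described above.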
Establishing a data flow tracking mechanism that records the original input, the output and the running logs of data during preprocessing, wherein the preprocessing comprises log filtering, security risk feature extraction, data desensitization and format verification. Specifically, a data flow tracking system is constructed to record the whole data processing flow. A data processing pipeline is defined, including steps such as log filtering, feature extraction, desensitization and format verification. A logging component is added at the pipeline entry to record the original input data, and logging is added at the exit of each processing component to record its output data. Running metrics such as data volume and processing time are also recorded. A distributed log collection system, such as Flume, is used to collect the logs of each processing node. Log topics are created to store the original input logs and the output logs after processing. The log database uses Elasticsearch to support log retrieval and analysis, and a log query interface is provided to retrieve logs by condition. A tracking query interface displays how the data flows through each processing component. The log records support monitoring and auditing of the data processing, which helps optimize the data processing flow and ensure the processing effect. In summary, the application realizes full-link monitoring of the data processing by constructing a flow tracking and log management mechanism, and ensures the quality and security of the data processing.
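The idea of recording input, output and running metrics at every pipeline component can be sketched in a few lines. This in-memory example stands in for the Flume and Elasticsearch deployment described above; the component functions, the `run_pipeline` helper and the sample records are assumptions.

```python
import re
import time

IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def log_filter(records):
    """Keep only warning and error records."""
    return [r for r in records if "ERROR" in r or "WARN" in r]


def desensitize(records):
    """Mask IPv4-looking substrings."""
    return [IP_PATTERN.sub("<ip>", r) for r in records]


def format_check(records):
    """Drop empty records."""
    return [r for r in records if r.strip()]


def run_pipeline(records, steps):
    """Run each step and keep a trace of its input, output and elapsed time."""
    trace = []
    data = list(records)
    for step in steps:
        start = time.perf_counter()
        output = step(data)
        trace.append({
            "component": step.__name__,
            "input": list(data),
            "output": list(output),
            "elapsed_s": time.perf_counter() - start,
        })
        data = output
    return data, trace


raw = [
    "ERROR login failed from 192.168.1.10",
    "INFO heartbeat",
    "WARN disk usage high",
]
result, trace = run_pipeline(raw, [log_filter, desensitize, format_check])
for entry in trace:
    print(entry["component"], len(entry["input"]), "->", len(entry["output"]))
```

Because each trace entry keeps the original input of its component, a faulty component can later be re-run on exactly the data it mishandled.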
When a data processing error is detected, determining the faulty component according to the feedback log and re-executing the corresponding failed step using the corrected data. Specifically, processing logs are collected through the constructed data flow tracking mechanism. The logs are analyzed, and a processing error can be determined when the processed data volume is found to be abnormal or the processing time exceeds the timeout. The faulty processing component, such as the feature extraction component, is located according to the logs. The logs before and after processing are queried and the cause of the error is analyzed, for example features not being extracted because of a regular expression error. The business logic of the component is then modified, for example by correcting the regular expression. The original input of the erroneous data is queried, and the corrected component is used to reprocess the erroneous data. The output log is updated after processing to record the reprocessing result. Whether the error has been corrected is verified by comparing the results before and after reprocessing. Subsequent components are triggered to reprocess the data as needed. The correction process is recorded in order to optimize the flow and avoid repeated errors. Through rapid locating, correction and reprocessing, anomalies in the data processing are effectively resolved. In summary, through the constructed data processing tracking mechanism, the application can quickly find and repair processing errors and regenerate correct processing results, thereby ensuring the reliability of downstream analysis.
When new type data is detected, extracting features of the new type data using a feature extraction algorithm, training the data model using the new features, and converting the new type data into structured data using the trained data model. Specifically, the occurrence of new, unknown types of data is detected using the log system, and a certain number of samples of the new type data are collected. The new data is analyzed heuristically, and word vector features are extracted using a feature extraction algorithm such as Word2Vec. The field semantics and data distribution of the new data are analyzed. The data model is updated according to the analysis result, extending the field definitions and feature representations. More new type data is collected and its field information is labeled to construct a new training dataset. The sequence labeling model SEQ2SEQ is retrained with the new dataset, the training goal being to improve the structured conversion of the new type data. The new model is deployed in the processing pipeline and processes the new type data online. The logs are tracked continuously to monitor the model's effect, incorrectly labeled data is collected from the logs, and incremental training is performed to keep optimizing the model. The above process is repeated to continuously extend the model's capacity to process new types of data. In summary, the application realizes self-adaptation to and conversion of new type data through feature learning, customized model training and continuous optimization, and ensures the continued effectiveness of the structuring process. Specifically, new type data generally includes: data generated by newly added data sources, whose characteristics and distribution differ markedly from existing data; data generated by new events or scenarios, which appears in the original data sources but carries semantics the system has not seen;
data whose format or expression pattern has changed significantly, such as from unstructured to semi-structured; data containing elements absent from the system vocabulary, such as new words and named entities; data containing new domain knowledge, such as new business lines and professional vocabulary unfamiliar to the system; and data with other unprecedented semantic or grammatical features outside the recognition range of the model. New type data is thus data in the system logs or datasets that exhibits new features which the previous machine learning model has not seen or does not cover. By detecting and learning from the new data, the system can acquire new knowledge and thereby process the new data better.
Another aspect of the embodiments of the present disclosure further provides a data information security processing system, which executes a data information security processing method of the present disclosure, including: the first risk identification information acquisition module is used for acquiring first risk identification information comprising equipment logs, monitoring data and alarm information from the service processing equipment; the second risk identification information acquisition module is used for acquiring second risk identification information comprising threat information, a safety knowledge base and a historical analysis model from the cloud platform; the data preprocessing module is used for preprocessing the acquired first risk identification information and second risk identification information and converting the first risk identification information and the second risk identification information into a structural format; the feature construction module is used for constructing time-associated features, space-associated features and sequence-associated features of the first risk identification information converted into the structural format; constructing threat information features IOC of the second risk identification information converted into the structured format as professional features; the risk association model training module is used for training a safety risk association degree model based on an Apriori algorithm and pearson correlation coefficients by using the constructed association features and the professional features, and acquiring the association between the first risk identification information and the second risk identification information; and the security scheme generation module is used for generating a security scheme containing resource configuration and monitoring strategies by utilizing the acquired relevance.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
(1) By collecting diversified data information from service processing equipment and a cloud platform, including equipment logs, monitoring data, alarm information, threat information, a safety knowledge base, a historical analysis model and the like, comprehensive utilization of data of a plurality of information sources can provide more comprehensive safety information and context information, and the capabilities of risk detection and association analysis are enhanced, so that the accuracy of risk identification is improved;
(2) By preprocessing and converting the collected first risk identification information and second risk identification information into a structural format, subsequent feature construction and model training can be better performed. The unstructured data are converted into structured data, so that key features can be extracted and represented, the complexity of data processing is effectively reduced, the processing efficiency is improved, and the accuracy of risk identification is improved;
(3) By constructing time-associated features, space-associated features and sequence-associated features, and threat intelligence features IOC as professional features, security event-associated information of different dimensions can be captured. Through feature extraction and association analysis in time, space, sequence and other aspects, the inherent association between security events can be better revealed, so that the accuracy and precision of risk detection are improved.
Drawings
The present specification will be further described by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
FIG. 1 is an exemplary flow chart of a method of secure processing of data information according to some embodiments of the present disclosure;
FIG. 2 is an exemplary flow chart of data preprocessing shown in accordance with some embodiments of the present description;
FIG. 3 is an exemplary flow chart for extracting associated features according to some embodiments of the present description;
FIG. 4 is an exemplary flow chart for acquiring data associations according to some embodiments of the present disclosure;
FIG. 5 is an exemplary block diagram of a data information security processing system according to some embodiments of the present description.
Detailed Description
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
Fig. 1 is an exemplary flowchart of a data information security processing method according to some embodiments of the present specification, and as shown in fig. 1, a data information security processing method includes: s110, collecting first risk identification information from service processing equipment, wherein the first risk identification information comprises unstructured data such as equipment logs, monitoring data, alarm information and the like. S120, acquiring second risk identification information from the cloud security platform, wherein the second risk identification information comprises threat information, a security knowledge base, a historical analysis model and other structural information. Wherein the secure knowledge base is a predefined event-response template; the historical analysis model is a machine learning based security event matching model. S130, preprocessing the first risk identification information and the second risk identification information, including filtering, format checking, desensitizing and the like, and outputting structured data. S140, constructing time correlation features, space correlation features and sequence correlation features of the first risk identification information. And constructing threat information IOC characteristics of the second risk identification information. S150, training a security risk association degree model based on an Apriori algorithm and pearson correlation coefficients, and learning association rules between the first information and the second information. And evaluating the relevance and the degree of relevance of the newly detected security event and the known risk according to the relevance rule. S160, generating a security scheme corresponding to the risk according to the association degree, wherein the security scheme comprises resource allocation and monitoring strategies. 
And continuously optimizing the association characteristics, and incrementally learning the association degree model to adapt to new risks and business changes. By constructing the security association rule, intelligent analysis and processing of massive security events and risk information can be realized, risks can be effectively predicted, a targeted scheme can be formulated, and the security protection capability of the information system can be improved.
Specific example: device running logs are collected from the server; the logs contain information such as error codes, access IP and access time. Inbound and outbound traffic monitoring data is acquired from the firewall, including information such as source IP, destination IP, protocol and traffic size. Attack detection alarms are obtained from the IDS, including the vulnerability name, attack source, detection time and the like. The log, monitoring and alarm data constitute the first risk identification information. A set of known attack source IP addresses is acquired from a third-party threat intelligence platform as IOC indicators. An SQL injection attack feature set and the corresponding response scheme are acquired from the security knowledge base. DDoS attack traffic detection rules are obtained from the machine learning model. The IOCs, knowledge base and detection model constitute the second risk identification information. The first information and the second information are preprocessed and converted into a structured data format. Time-associated features, space-associated features and attack sequence features of the first information are constructed, and IOC indicator features of the second information are constructed. An association degree model is trained to find the association between SQL injection attacks and threat intelligence IPs. According to the association relation, a security scheme is generated that patches the server and monitors access traffic. Newly detected attack events are analyzed continuously and the security scheme is updated.
FIG. 2 is an exemplary flow chart of data preprocessing shown in some embodiments of the present description, as shown in FIG. 2, preprocessing the collected first risk identification information and second risk identification information, and converting to a structured format, comprising the steps of:
S111 first, the setting database stores the collected first risk identification information and second risk identification information. S112 then, according to a predefined security risk data model, unstructured data stored in the database is converted into a structured data format by applying algorithms comprising natural language processing and machine learning. S113, in the data preprocessing process, a tracking mechanism of data processing is established, original input data, output data and running logs in a preprocessing flow are recorded, and preprocessing comprises the steps of log filtering, safety risk feature extraction, data desensitization, format verification and the like. S114, if a data processing error is detected in preprocessing, determining an error component according to a feedback log, and re-executing the error step by using the corrected data. S115 additionally, if new type of data is detected, features of the new type of data are extracted using a feature extraction algorithm, the data model is trained using the extracted new features, and the new type of data is also converted into a structured format using the trained model. The application realizes the pretreatment of the collected multi-source heterogeneous risk identification information including equipment logs, monitoring data, alarm information, threat information and the like, and uniformly converts the multi-source heterogeneous risk identification information into a structured data format for the subsequent construction of association features and the training of a risk assessment model. Meanwhile, a data processing tracking mechanism is established, preprocessing errors can be positioned and corrected, adaptation to new types of data is realized, and robustness of a preprocessing flow is ensured.
Specific example: two tables are built in a MySQL database to store the original first risk identification information and second risk identification information respectively. Structured fields such as log time, device ID and log keywords are extracted from the unstructured device logs in the first information using NLP techniques. A BiLSTM model is trained to learn a semantic feature representation of the log text, which supports the identification of new logs. Event-scheme fields are extracted from the text knowledge base in the second information using the SEQ2SEQ model. Logging nodes are placed in the data processing pipeline to record the original input and the output of each component. Flume gathers the log data into Hadoop, and log retrieval and analysis are performed using Elasticsearch. Suppose it is found that the keyword extraction component mismatches the log text, resulting in missing keywords: the keyword extraction regular expression is corrected according to the logs and the erroneous logs are reprocessed. New Chinese-language log data is collected and new vocabulary features are extracted with Word2Vec. The BiLSTM model is incrementally trained using the new vocabulary features, and the new logs are reprocessed with the enhanced model to output structured data. The vocabulary is expanded continuously and the model is optimized so that more types of logs can be handled.
FIG. 3 is an exemplary flow chart of extracting associated features, as shown in FIG. 3, for constructing time-associated features, space-associated features, and sequence-associated features of first risk identification information converted into a structured format, according to some embodiments of the present description; constructing threat intelligence features IOC of the second risk identification information converted into the structured format, wherein the threat intelligence features IOC are professional features and comprise the following steps:
S131, constructing time-associated features: acquiring the timestamps of security events and generating a timestamp sequence; smoothing and resampling the timestamp sequence to obtain an equidistant timestamp sequence; calculating the time intervals of the timestamp sequence to obtain time interval features; counting the number of events in the timestamp sequence with a sliding time window to obtain time frequency features; learning the time-associated features with a recurrent neural network model using an attention mechanism. S132, constructing space-associated features: constructing a spatial index based on a spatial database algorithm; calculating the spatial distance between security events; judging the spatial clustering pattern using spatial autocorrelation analysis; establishing a spatial association analysis model to construct the space-associated features; learning the space-associated features with a convolutional neural network. S133, constructing sequence-associated features: constructing a time-ordered security event sequence; obtaining frequent sequence patterns with a sequential pattern mining algorithm; learning the causal relationships between events by association rule mining; learning the sequence-associated features with a graph attention network. S134, constructing threat intelligence features IOC: parsing threat reports and extracting atomic IOC indicators; generating and verifying combined IOC indicators; encoding, selecting and filtering the IOCs; constructing threat intelligence features including source IP, destination IP, URL and the like. By constructing multidimensional association features to represent security events, the accuracy of risk identification is improved.
Specific example: log timestamps are extracted from the device logs in the first information, and a time distribution feature is constructed with a 1-hour sliding window. The distances between the geographical coordinates of the logs are calculated and DBSCAN clustering is performed to generate spatial clustering features. Frequent sequence patterns of device restart logs and network connection failure logs are mined using the Apriori algorithm. For the IDS alarms in the second information, the public network IP of the attack source is resolved using NAT logs. The captured attack request messages are analyzed by association, and the attack source IP and the accessed URL are extracted as IOC indicators. An SQL injection attack event is defined, including the source IP and the target database port. The time, space, sequence and IOC features described above are constructed, and the resulting structured association features are stored in a feature library. The security risk model is then trained using these association features. The association features are enriched continuously, expanding the association dimensions of security events.
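Extracting atomic IOC indicators such as source IPs and URLs from report text can be sketched with regular expressions. The patterns, the function name `extract_atomic_iocs` and the sample report are assumptions for illustration; production IOC parsers handle many more indicator types and obfuscations.

```python
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def _unique(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_atomic_iocs(report):
    """Return de-duplicated IPv4 addresses and URLs found in the report text."""
    ips = [
        ip for ip in IP_RE.findall(report)
        if all(int(octet) <= 255 for octet in ip.split("."))
    ]
    urls = URL_RE.findall(report)
    return {"ip": _unique(ips), "url": _unique(urls)}


# Hypothetical report text using documentation-reserved addresses.
report = (
    "Requests from 203.0.113.7 targeted http://example.test/login.php?id=1 ; "
    "a second source 198.51.100.23 and again 203.0.113.7 were observed."
)
iocs = extract_atomic_iocs(report)
print(iocs)
```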
Specifically, the timestamps of the security events are obtained and a timestamp sequence is generated from them. Denoising and interpolating resampling are performed on the constructed timestamp sequence: a wavelet denoising method is used to suppress noise, and a bilinear interpolation method is used for fixed-interval resampling, yielding an equidistant timestamp sequence. The differences between consecutive timestamps of the resampled sequence are calculated to obtain time interval features representing time interval information. The timestamp sequence is counted with sliding time windows, and the number of security events in each window is calculated to obtain time frequency features representing time frequency information. A recurrent neural network model with an attention mechanism is constructed as the learner of the time-associated features. The extracted time interval features and time frequency features are taken as input, and through feature extraction at the recurrent layers and feature aggregation by the attention mechanism, time-associated features integrating time interval and time frequency are output. The resulting time-associated features, which reflect temporal correlation, are used as one of the features for training the security risk association model, which improves the model's sensitivity to time factors and enhances the accuracy of risk identification. The application thus generates time-associated features from timestamp information and constructs a security risk assessment model that is sensitive to time factors.
Specific example: network intrusion detection logs are collected and the timestamp of each detection log is extracted. Wavelet denoising is applied to the timestamps to reduce random jitter. Fixed-interval resampling is performed by bilinear interpolation to convert them into an equidistant time series. The differences of the time series are calculated to obtain the time interval features, which reflect the temporal distance between events. A 1-hour sliding time window is set, and the number of detection logs in each window is counted to obtain the time frequency features. A BiLSTM model with an attention mechanism is constructed and fed with the time interval and frequency features. The BiLSTM model outputs the time-associated features and learns time-related log patterns. The time-associated features, which fuse the temporal correlations, are input into the subsequent risk degree model. The time series and features are updated dynamically as new detection logs arrive. The BiLSTM model is optimized iteratively to improve the quality of the time-associated features. Time-associated features are constructed continuously to support the risk degree model in identifying more complex time-correlated attacks.
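The two hand-crafted inputs of the time model, interval features and sliding-window frequency features, can be computed as follows. The sketch deliberately omits the wavelet denoising, the interpolation step and the BiLSTM; the function names and sample timestamps are assumptions, and the window is advanced by a configurable step.

```python
def time_intervals(timestamps):
    """Differences between consecutive timestamps (in seconds), after sorting."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def window_counts(timestamps, window, step):
    """Number of events in each window [start, start + window), moved by step."""
    ordered = sorted(timestamps)
    if not ordered:
        return []
    counts = []
    start = ordered[0]
    while start <= ordered[-1]:
        end = start + window
        counts.append(sum(1 for t in ordered if start <= t < end))
        start += step
    return counts


# Hypothetical detection-log timestamps in seconds since the first log.
stamps = [0, 600, 1800, 4000, 7300]
intervals = time_intervals(stamps)
hourly = window_counts(stamps, window=3600, step=3600)
print(intervals)
print(hourly)
```

A step smaller than the window gives overlapping windows, which smooths the frequency feature at the cost of more values per sequence.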
Specifically, based on a spatial database algorithm, a spatial index based on grid index or geographic hash is constructed for quickly searching the spatial distance. Based on the constructed spatial index, the spatial distance between each security event is calculated, and an algorithm such as Manhattan distance or Chebyshev distance can be adopted. And judging the spatial clustering mode and the spatial correlation between the security events by using a spatial autocorrelation analysis algorithm. And establishing a spatial correlation analysis model by using a machine learning method according to the calculated spatial distance, the spatial clustering mode and the spatial correlation. And constructing a spatial correlation characteristic reflecting the spatial correlation of the security event through a spatial correlation analysis model. And the convolutional neural network adopting the attention mechanism performs multi-level abstraction and expression on the spatial correlation characteristics, and performs spatial correlation judgment. Finally, the spatial correlation characteristics of fusion spatial distance, clustering mode and correlation are obtained. According to the application, the space correlation characteristics are constructed by integrating the space factors, so that the space sensitive safety risk assessment model is trained, and the space correlation of risk identification is improved.
Specific example: network attack logs containing latitude and longitude coordinates are collected. The spatial coordinates are geohash-encoded using the Geohash algorithm to obtain the spatial index of each log. Based on the spatial index, the Manhattan distances between log coordinate points are calculated to obtain the spatial distance features. The spatial clustering pattern of the logs is analyzed using the DBSCAN clustering algorithm. The spatial autocorrelation is calculated using the local Moran's I statistic, which reflects regional aggregation. Local Moran's I is a statistical method for detecting aggregation patterns in spatial data: it measures the correlation between the value of a spatial unit and the values of its neighboring units to determine whether clusters of similar values exist around a single spatial unit. The spatial distance, clustering pattern and autocorrelation are taken as inputs and reduced in dimension using PCA. A convolutional neural network (CNN) model is constructed and fed with the dimension-reduced spatial features, and the CNN outputs abstract space-associated features. The spatial association of the attack logs is determined using the space-associated features. The spatial index and feature extraction are updated as new logs arrive. The CNN model is optimized iteratively to improve the accuracy of the spatial association judgment. The space-associated features are enriched continuously to support the risk model in detecting spatially correlated attacks.
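The spatial index and distance calculations can be illustrated with a simple grid index and the two distance measures named in the description. This is a sketch under stated assumptions: a fixed-size grid stands in for the Geohash encoding, coordinates are treated as plain planar numbers rather than geodesic positions, and all function names and sample points are introduced here.

```python
from collections import defaultdict


def manhattan(p, q):
    """Manhattan (L1) distance between two 2D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def chebyshev(p, q):
    """Chebyshev (L-infinity) distance between two 2D points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def grid_cell(point, cell_size):
    """Integer grid cell containing the point."""
    return (int(point[0] // cell_size), int(point[1] // cell_size))


def build_grid_index(points, cell_size):
    """Map each grid cell to the indices of the points that fall inside it."""
    index = defaultdict(list)
    for i, point in enumerate(points):
        index[grid_cell(point, cell_size)].append(i)
    return dict(index)


# Hypothetical (latitude, longitude) pairs attached to attack logs.
points = [(23.1, 113.2), (23.4, 113.6), (39.9, 116.4)]
index = build_grid_index(points, cell_size=1.0)
print(index)
print(manhattan((0, 0), (3, 4)), chebyshev((0, 0), (3, 4)))
```

Events sharing a cell are candidate neighbours, so distances only need to be computed within a cell and its adjacent cells instead of across all pairs.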
Specifically, the preprocessed first risk identification information is used to construct a structured security event sequence in time order, each entry comprising an event type field, an event target ID field and a timestamp field. A sequential pattern mining algorithm extracts frequent sequence patterns from the security event sequence and generates candidate sequence features. An association rule mining algorithm is applied to the candidate sequence features to learn the causal relationships between events and obtain a causal chain of security events. An FP-Growth-based pre-calculation reduces the number of times candidate sequence patterns are generated and so improves efficiency. Frequent sequence patterns with high information gain and sufficient support are selected from the candidates according to the information gain and a minimum support threshold. The selected frequent sequence patterns are used to construct a security event association graph based on sequence features. The association graph is represented as a knowledge graph, and a model based on a graph attention network performs feature learning and fusion. The final output is a sequence association feature that fuses the sequence features with the association rule information. Through sequence feature extraction, association rule learning and fusion by the graph attention network, sequence association features reflecting event order and causal relationships are constructed, which improves the ability to judge sequence associations between security events.
Specific example: security events, each with an event type, target ID and timestamp, are extracted from the preprocessed logs. The event sequence is constructed by sorting on the timestamp. Candidate frequent sequence patterns are calculated by support using the Apriori algorithm. The FP-Growth algorithm is applied as an optimization to reduce the number of candidate generations. The information gain of each candidate sequence is calculated, and frequent sequence patterns with high information gain are selected. Causal relationships between events are learned from the frequent sequences using an association rule algorithm. An event association graph is constructed from the frequent sequences, and a GAT network learns its feature representation. The GAT network outputs sequence features that fuse event order and association rules. As new events arrive, the association graph and the GAT model are updated. Sequence feature learning is optimized iteratively to improve association accuracy. The sequence features are applied to the security risk model to identify event associations and attack chains.
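The support-based candidate step can be illustrated with a level-wise, Apriori-style search. The sketch below is a simplification: it only considers contiguous subsequences, it omits the FP-Growth optimization, the information-gain filter and the GAT stage, and the event names and threshold are hypothetical.

```python
from collections import Counter

def frequent_patterns(sequences, min_support, max_len=3):
    """Return {pattern: support} for contiguous event patterns.

    Support is the fraction of sequences that contain the pattern.
    Level k only keeps candidates whose (k-1)-length prefix and suffix
    were frequent at the previous level (the Apriori pruning idea).
    """
    n = len(sequences)
    frequent = {}
    prev = None
    for k in range(1, max_len + 1):
        counts = Counter()
        for seq in sequences:
            seen = set()  # count each pattern at most once per sequence
            for i in range(len(seq) - k + 1):
                pat = tuple(seq[i:i + k])
                if prev is not None and (pat[:-1] not in prev or pat[1:] not in prev):
                    continue  # pruned: a sub-pattern is not frequent
                seen.add(pat)
            counts.update(seen)
        level = {p: c / n for p, c in counts.items() if c / n >= min_support}
        if not level:
            break
        frequent.update(level)
        prev = set(level)
    return frequent

# Hypothetical event-type sequences, already sorted by timestamp.
sequences = [
    ["scan", "login_fail", "login_ok", "exfil"],
    ["scan", "login_fail", "login_ok"],
    ["login_fail", "login_ok", "exfil"],
    ["scan", "patch"],
]
patterns = frequent_patterns(sequences, min_support=0.5)
```

With these inputs, the three-event pattern scan, login_fail, login_ok appears in two of the four sequences and survives the 0.5 threshold, while any pattern involving the rare "patch" event is pruned.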
Specifically, a threat report containing indicators of compromise (IOCs) is obtained from the cloud platform, and an XML parser parses the report and extracts its atomic IOCs, including IP addresses, domain names and file hashes. Association analysis is performed on the extracted atomic IOCs to generate combined IOCs. The validity of the combined IOCs is verified against network traffic and log data. The verified combined IOCs are one-hot encoded and vectorized to construct structured IOC features. A feature selection algorithm based on TF-IDF and information gain selects the IOC features whose weight and information gain both exceed a threshold. An isolation forest model performs anomaly detection on the selected IOC features and filters out invalid IOCs. The valid IOCs that remain after filtering are combined to construct structured threat information features. The result is a verified and optimized threat information feature IOC that is used to train the security risk assessment model. Through parsing, verification, filtering and optimization, valid IOCs are extracted from threat information reports and threat information features are constructed, which improves the accuracy of security risk identification.
Specific example: a threat report file containing IOCs is acquired from a third-party threat information platform. An XML parser extracts the atomic IOCs in the report, such as IP addresses, URLs and file hashes. Association analysis generates combined IOCs, such as IP + port or domain name + URI. The combined IOCs are validated against web logs and traffic data, and invalid IOCs are filtered out. The valid IOCs are one-hot encoded and converted into vector representations. TF-IDF weights are calculated for the encoded IOC features, and the features with high TF-IDF weight are selected according to information gain. Anomalous IOC noise is filtered out with Isolation Forest, and the valid IOC features that pass verification are retained. The selected IOC features are input into the risk model. The IOC extraction and processing flow is updated as new reports are imported, and IOC feature selection and processing are optimized continuously to improve the accuracy of the risk model.
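The combination, validation and one-hot encoding steps might look like the sketch below. It is an assumption-laden illustration: the IOCs are restricted to IP + port pairs, validation is reduced to a lookup against observed connections, the TF-IDF, information-gain and Isolation Forest stages are left out, and the addresses come from the reserved documentation ranges.

```python
from itertools import product

def combine_iocs(ips, ports):
    """Build combined IOCs of the form 'ip:port' from atomic indicators."""
    return [f"{ip}:{port}" for ip, port in product(ips, ports)]

def validate(combined, observed_connections):
    """Keep only the combined IOCs that were actually seen in traffic or logs."""
    observed = set(observed_connections)
    return [ioc for ioc in combined if ioc in observed]

def one_hot(iocs):
    """One-hot encode IOCs over a sorted vocabulary."""
    vocab = sorted(set(iocs))
    vectors = {}
    for position, ioc in enumerate(vocab):
        vec = [0] * len(vocab)
        vec[position] = 1
        vectors[ioc] = vec
    return vocab, vectors

# Hypothetical atomic IOCs from a threat report.
ips = ["192.0.2.10", "198.51.100.7"]
ports = [443, 8080]
# Hypothetical connections observed in traffic and log data.
observed = ["192.0.2.10:443", "198.51.100.7:8080", "203.0.113.5:22"]

combined = combine_iocs(ips, ports)
valid = validate(combined, observed)
vocab, vectors = one_hot(valid)
```

Only two of the four combined indicators are confirmed by the observed data, so the encoded feature space has two dimensions.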
FIG. 4 is an exemplary flowchart of acquiring a data association according to some embodiments of the present description. As shown in FIG. 4, acquiring the association between the first risk identification information and the second risk identification information includes the following steps:
S151: construct an association rule matrix containing security events and risk results.
S152: calculate the support and confidence of each association rule in the association rule matrix.
S153: use the Apriori algorithm to select from the matrix the strong association rules whose support and confidence exceed the thresholds, and split the left-side security event and the right-side risk result of each selected rule into a plurality of fields.
S154: calculate the Pearson correlation coefficients between the security event fields and the risk result fields, and calculate the order of the event occurrence times as a time weight.
S155: construct a security risk association model in which the weight of each association rule is determined by the time weight and the Pearson correlation coefficient.
S156: calculate the association between the first risk identification information and the second risk identification information according to the constructed model.
The left-side event of a strong association rule represents a security event that causes a risk, and the right-side result represents the corresponding risk level and risk category. This scheme constructs a security risk association degree model that combines time order with statistical correlation, so that the association among multi-source heterogeneous information can be evaluated effectively and the accuracy of risk identification is improved.
Specific example: an association rule matrix containing security events and risk results is constructed. The support and confidence of each rule are calculated. Strong association rules with high support and confidence are selected using the Apriori algorithm. The left-side and right-side events are split into fields, and the Pearson correlation coefficient is calculated for each pair of fields. The order of event times is calculated as a time weight. An association model is constructed in which the rule weight is determined by the Pearson coefficients and the time weight. The association between a newly detected security event and a risk result is calculated according to this model. The security event fields include source IP, port, vulnerability type and the like. The risk result fields include risk level, risk type and the like. For example, the degree of association between a network intrusion event and a data leakage risk is judged according to the correlation between the source IP and the port. When new data is imported, the association rule matrix and the model are updated. The association model is optimized iteratively to improve the accuracy of the association judgment.
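Support and confidence, the two measures used to pick strong association rules, can be computed as in the sketch below. The records, item names and thresholds are hypothetical, and the candidate rules are supplied by hand rather than generated by a full Apriori search.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    s_antecedent = support(transactions, antecedent)
    if s_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / s_antecedent

def strong_rules(transactions, candidates, min_support, min_confidence):
    """Keep candidate rules (left, right) that meet both thresholds."""
    kept = []
    for left, right in candidates:
        s = support(transactions, set(left) | set(right))
        c = confidence(transactions, left, right)
        if s >= min_support and c >= min_confidence:
            kept.append((left, right, s, c))
    return kept

# Hypothetical records: each set mixes security events and a risk result.
transactions = [
    {"port_scan", "sql_injection", "data_leak"},
    {"port_scan", "data_leak"},
    {"port_scan"},
    {"sql_injection", "data_leak"},
    {"phishing"},
]
candidates = [
    (("sql_injection",), ("data_leak",)),
    (("port_scan",), ("data_leak",)),
]
rules = strong_rules(transactions, candidates, min_support=0.3, min_confidence=0.8)
```

Both candidate rules have support 0.4, but only the rule from sql_injection to data_leak reaches the confidence threshold, because one port_scan record has no associated leak.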
Specifically, a time weight adjustment factor wt is set to indicate the order of the occurrence times of the events on the left and right sides of an association rule. If the occurrence time t1 of the left-side event is earlier than the occurrence time t2 of the right-side event, wt is set to α, where α is a constant between 0 and 1. If the occurrence time t2 of the right-side event is earlier than the occurrence time t1 of the left-side event, wt is set to 1. The Pearson correlation coefficient is calculated for each pair of fields and multiplied by the time weight adjustment factor wt to obtain an adjusted correlation coefficient, which represents the magnitude of the correlation once the time factor is taken into account. The adjusted correlation coefficients of all fields are then combined, by averaging or summing, to obtain the overall relevance of the event pair. The time weight adjustment factor thus accounts for the influence of event order on relevance, and the result is a security event relevance metric that integrates time order and statistical correlation. By using time information to guide the relevance analysis, the scheme improves the accuracy of security risk assessment.
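The calculation described above can be written compactly as follows. The symbols $r_k$, $r'_k$, $K$ and $R$ are introduced here only for illustration and do not appear in the original text; the averaging form of $R$ is shown, and the summing variant simply omits the factor $1/K$.

```latex
% Pearson correlation coefficient for the k-th pair of fields (x, y)
r_k = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Time weight adjustment factor
w_t =
\begin{cases}
  \alpha, & t_1 < t_2, \quad 0 < \alpha < 1 \\
  1,      & t_2 < t_1
\end{cases}

% Adjusted correlation coefficient and overall relevance over K field pairs
r'_k = w_t \, r_k, \qquad R = \frac{1}{K} \sum_{k=1}^{K} r'_k
```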
Specific example: a time weight adjustment factor wt is defined to represent the time order of events. For each rule in the association rule matrix, the occurrence times t1 and t2 of its two events are compared. If t1 < t2, that is, the left-side event occurs first, wt = 0.8 is set. If t1 > t2, that is, the right-side event occurs first, wt = 1 is set. The Pearson correlation coefficients between the fields of the two events are calculated and multiplied by wt to obtain adjusted correlation coefficients. The adjusted correlation coefficients of all fields are combined to calculate the overall relevance of the two events. For the same correlation magnitude, an event order that is consistent with the association rule is treated, once time order is taken into account, as indicating a stronger association. When new data is imported, the time-order weights are recalculated. The time-order adjustment makes the relevance model more accurate, and the fusion of time-order factors is optimized continuously to improve the model's ability to learn the order of events.
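The worked example can be reproduced with the short sketch below, which uses the stated values wt = 0.8 and wt = 1 and the averaging variant for the overall relevance. The field values are invented, and treating equal timestamps like the wt = 1 case is an assumption, since the text does not cover ties.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant field
    return cov / sqrt(var_x * var_y)

def time_weight(t1, t2, alpha=0.8):
    """wt = alpha when the left-side event is earlier, otherwise 1.

    Equal timestamps fall into the wt = 1 branch (an assumption).
    """
    return alpha if t1 < t2 else 1.0

def overall_relevance(field_pairs, t1, t2, alpha=0.8):
    """Average of the time-adjusted Pearson coefficients over all field pairs."""
    wt = time_weight(t1, t2, alpha)
    adjusted = [wt * pearson(x, y) for x, y in field_pairs]
    return sum(adjusted) / len(adjusted)

# Hypothetical numeric encodings of one event field and one risk field.
field_pairs = [([1, 2, 3, 4], [2, 4, 6, 8])]
left_first = overall_relevance(field_pairs, t1=100, t2=200)
right_first = overall_relevance(field_pairs, t1=200, t2=100)
```

With a perfectly correlated field pair, the overall relevance is 0.8 when the left-side event occurs first and 1.0 when the right-side event occurs first.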
FIG. 5 is an exemplary block diagram of a data information security processing system according to some embodiments of the present description. As shown in FIG. 5, a data information security processing system 200 comprises: a first risk identification information acquisition module 210, configured to acquire first risk identification information including equipment logs, monitoring data and alarm information from service processing equipment; a second risk identification information acquisition module 220, configured to acquire second risk identification information including threat information, a security knowledge base and a historical analysis model from a cloud platform; a data preprocessing module 230, configured to preprocess the collected first risk identification information and second risk identification information and convert them into a structured format; a feature construction module 240, configured to construct time-associated features, space-associated features and sequence-associated features of the first risk identification information, and to construct threat information features IOC of the second risk identification information as professional features; a risk association model training module 250, configured to train a security risk association degree model based on the Apriori algorithm and Pearson correlation coefficients by using the constructed association features and professional features, and to acquire the association between the first risk identification information and the second risk identification information; and a security scheme generation module 260, configured to generate a security scheme including a resource configuration and a monitoring policy according to the obtained association.
Through preprocessing, feature construction, association model training and security scheme generation for multi-source heterogeneous risk identification information, intelligent association and assessment of security risks based on big data analysis are realized, and the proactivity and effectiveness of security protection are improved. In a specific embodiment, the risk acquisition module acquires threat information reports, a security knowledge base and the like from a third-party cloud platform as the second risk identification information. The data preprocessing module cleans and parses the first and second risk information and converts them into structured data. The feature construction module extracts time, space and sequence association features from the first information and IOC threat information features from the second information. The risk association model module trains an association model using the Apriori algorithm and Pearson coefficients; the association features of the first and second information are input, and the association between them is calculated. According to this association, the security scheme module automatically generates a security policy scheme for resource isolation and network access control. Resource isolation containers are deployed to restrict abnormal applications from accessing the core database, and the access control system is configured to block network connections from suspicious IPs. The association model and the security policies are optimized iteratively, realizing automated, risk-driven analysis and response.

Claims (8)

1. A data information security processing method, characterized by comprising:
acquiring first risk identification information comprising equipment logs, monitoring data and alarm information from service processing equipment;
Acquiring second risk identification information comprising threat information, a safety knowledge base and a history analysis model from a cloud platform, wherein the safety knowledge base is a structured knowledge base comprising safety event characteristics and corresponding schemes, and the history analysis model is a safety event matching model based on machine learning training;
preprocessing the collected first risk identification information and second risk identification information, and converting the first risk identification information and the second risk identification information into a structured format;
constructing time-associated features, space-associated features and sequence-associated features of the first risk identification information converted into the structured format; constructing threat information features IOC of the second risk identification information converted into the structured format as professional features;
training a security risk association degree model based on an Apriori algorithm and Pearson correlation coefficients by using the constructed association features and professional features, to acquire the association between the first risk identification information and the second risk identification information;
generating a security scheme containing resource configuration and monitoring strategies by using the obtained relevance;
The acquiring of the association between the first risk identification information and the second risk identification information comprises the following steps:
Constructing an association rule matrix containing security events and risk results;
calculating the support and confidence of each association rule in the association rule matrix;
selecting a strong association rule with the support and confidence exceeding preset thresholds from the association rule matrix by using an Apriori algorithm;
splitting the left-side event and the right-side event of each selected strong association rule into a plurality of fields, respectively calculating the Pearson correlation coefficients between the fields, and calculating the order of the event occurrence times as a time weight;
constructing a security risk association degree model, wherein the weight of each association rule in the model is determined by the time weight and the Pearson correlation coefficient;
Calculating the relevance between the first risk identification information and the second risk identification information according to the constructed safety risk relevance model;
the left side event of the strong association rule is a security event causing risk, the right side event is a corresponding risk result, and the security event and the risk result are respectively split into a plurality of fields; the risk results include a risk level and a risk category;
calculating the sequence of event occurrence time comprises the following steps:
setting a time weight adjustment factor wt for representing the time sequence of occurrence of the event;
If the occurrence time t1 of the left event is earlier than the occurrence time t2 of the right event, then setting wt to α, where α is a constant between 0 and 1;
If the occurrence time t2 of the right event is earlier than the occurrence time t1 of the left event, setting wt to 1;
calculating the product of the Pearson correlation coefficient between each pair of fields and the time weight adjustment factor wt to obtain an adjusted correlation coefficient that represents the correlation after the time factor is considered;
and integrating the adjusted correlation coefficients of the fields to obtain the overall correlation of the event pairs.
2. The data information security processing method according to claim 1, wherein:
constructing the time-associated features, space-associated features and sequence-associated features of the first risk identification information converted into the structured format, and constructing the threat information features IOC of the second risk identification information converted into the structured format as professional features, comprises the following steps:
constructing time-associated features by using a timestamp smoothing and counting method, wherein the time-associated features comprise a timestamp sequence, a time interval and a sliding time window frequency;
constructing space-associated features comprising the spatial distance and the spatial clustering pattern of security events by using a spatial distance algorithm and spatial autocorrelation analysis;
constructing security event sequence features containing frequent sequence patterns by using a sequential pattern mining algorithm, and constructing a security event causal chain by using an association rule algorithm;
constructing threat information features IOC comprising source IP, destination IP and URL by utilizing NAT analysis and alarm association analysis;
wherein a security event represents a record related to a change in system or network security state, the security event comprising a source IP and a destination port.
3. The data information security processing method according to claim 2, wherein:
the construction of the time-related features includes the steps of:
Acquiring a time stamp of a security event and generating a time stamp sequence;
performing wavelet denoising and bilinear interpolation resampling on the time stamp sequence to obtain an equidistant time stamp sequence;
Calculating a time interval difference value of the time stamp sequence to obtain a time interval characteristic;
counting the number of security events in the time stamp sequence by adopting a sliding time window to obtain a time frequency characteristic;
and constructing a recurrent neural network model with an attention mechanism as a time association learner, inputting the time interval features and the time frequency features, and outputting the time-associated features.
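As an illustration of the time interval and sliding time window frequency features recited in claim 3, a minimal sketch follows. It is not part of the claim: the wavelet denoising, resampling and recurrent network steps are omitted, and the timestamps, window length and step are hypothetical.

```python
def time_intervals(timestamps):
    """Differences between consecutive timestamps (the time interval feature)."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]

def sliding_window_counts(timestamps, window, step):
    """Number of events in each window [start, start + window).

    Windows begin at the first timestamp and advance by `step` until the
    window start passes the last timestamp.
    """
    ts = sorted(timestamps)
    if not ts:
        return []
    counts = []
    start = ts[0]
    while start <= ts[-1]:
        counts.append(sum(1 for t in ts if start <= t < start + window))
        start += step
    return counts

# Hypothetical event timestamps in seconds.
timestamps = [0, 5, 7, 20, 21, 22, 60]
intervals = time_intervals(timestamps)
frequency = sliding_window_counts(timestamps, window=10, step=10)
```

Short intervals and high window counts both flag the bursts around 0 to 7 seconds and 20 to 22 seconds, which is the kind of signal the time association learner would receive as input.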
4. The data information security processing method according to claim 2, wherein:
The construction of the spatial correlation feature comprises the following steps:
constructing a grid index or geographic hash-based spatial index based on a spatial database algorithm;
Based on the constructed spatial index, calculating the spatial distance between each security event by adopting Manhattan distance or Chebyshev distance;
Judging a spatial clustering mode and spatial correlation among the security events by using a spatial autocorrelation algorithm;
According to the obtained spatial distance, spatial clustering mode and spatial correlation, a machine learning method is adopted to establish a spatial correlation analysis model so as to establish spatial correlation characteristics reflecting the spatial correlation of the security event;
And adopting a convolutional neural network of an attention mechanism to abstract and express the space association characteristics in multiple levels, and judging the space association.
5. The data information security processing method according to claim 2, wherein:
The construction of the safety event sequence features and the construction of the safety event causal chain comprises the following steps:
constructing a structured security event sequence in timestamp order by utilizing the preprocessed first risk identification information, wherein the security event sequence comprises: a code field indicating the type of the event, an ID field indicating the target of the event, and a timestamp field indicating the time of the event;
utilizing a sequential pattern mining algorithm to acquire frequent sequence patterns in the security event sequence and constructing candidate sequence features;
applying an association rule mining algorithm to learn the causal relationships among events from the candidate sequence features and generating an event causal chain;
adopting a pre-calculation based on FP-Growth to reduce the number of times candidate sequence patterns are generated;
applying an information gain evaluation index and a minimum support threshold, and selecting from the candidate sequence patterns the frequent sequence patterns whose information gain is higher than the threshold and whose support meets the requirement;
constructing a security event association graph based on sequence features by using the selected frequent sequence patterns;
and representing the security event association graph as a knowledge graph, carrying out feature learning and fusion by utilizing a graph attention network (GAT), and outputting sequence-associated features that fuse the sequence features and the association rules.
6. The data information security processing method according to claim 2, wherein:
the construction of threat intelligence feature IOC includes the steps of:
Acquiring a threat report containing IOC indexes from a cloud platform;
analyzing the obtained threat report by using an XML analyzer, and extracting an atomic IOC index in the report, wherein the atomic IOC index comprises an IP address, a domain name and a file hash;
Performing association analysis on the extracted atomic IOC indexes to generate combined IOC indexes;
verifying the combined IOC index by utilizing the network traffic and the log data;
performing one-hot encoding and vectorization on the verified combined IOC index to construct a structured IOC feature;
applying a feature selection algorithm based on TF-IDF and information gain to select, from the structured IOC features obtained by encoding, the IOC features whose TF-IDF weight is greater than a threshold and whose information gain is greater than a threshold;
performing anomaly detection on the selected IOC features by using an isolation forest model, and filtering invalid IOC indexes;
and combining the filtered IOC indexes to construct threat information features.
7. The data information security processing method according to claim 1, wherein:
preprocessing the acquired first risk identification information and second risk identification information, and converting the first risk identification information and the second risk identification information into a structured format comprises the following steps:
a database is arranged for storing the first risk identification information and the second risk identification information;
converting unstructured data in a database into structured data using a data model comprising natural language processing and machine learning algorithms according to predefined security risks;
Establishing a data flow tracking mechanism, recording original input, output and running logs of data in preprocessing, wherein the preprocessing comprises log filtering, security risk feature extraction, data desensitization and format verification;
when a data processing error is detected, determining the faulty component according to the fed-back log, and re-executing the corresponding failed step by using the corrected data;
when new type data is detected, extracting features of the new type data by utilizing a feature extraction algorithm, training a data model by utilizing the new features, and converting the new type data into structured data by utilizing the trained data model.
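A minimal sketch of the preprocessing steps named in claim 7 (log filtering, data desensitization, format verification and a trace of what happened to each record) is given below for illustration. It is not part of the claim: the field names, the retained log levels and the masking rule are assumptions, and the natural language processing and model retraining steps are not shown.

```python
import re

# Matches an IPv4 address and captures its first three octets.
IPV4 = re.compile(r"\b(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}\b")
REQUIRED_FIELDS = ("timestamp", "event_type", "source_ip")  # hypothetical schema

def desensitize(text):
    """Mask the last octet of any IPv4 address in the text."""
    return IPV4.sub(r"\1.x", text)

def is_valid(record):
    """Format verification: every required field must be present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)

def preprocess(records, keep_levels=("WARN", "ERROR")):
    """Return (structured records, trace of (index, outcome) per input record)."""
    structured, trace = [], []
    for i, record in enumerate(records):
        if record.get("level") not in keep_levels:
            trace.append((i, "filtered"))
            continue
        if not is_valid(record):
            trace.append((i, "invalid"))
            continue
        structured.append(dict(record, source_ip=desensitize(record["source_ip"])))
        trace.append((i, "ok"))
    return structured, trace

# Hypothetical raw log records.
records = [
    {"level": "INFO", "timestamp": "2023-11-10T07:59:00", "event_type": "heartbeat", "source_ip": "192.0.2.1"},
    {"level": "ERROR", "timestamp": "2023-11-10T08:00:00", "event_type": "login_fail", "source_ip": "192.0.2.44"},
    {"level": "WARN", "timestamp": "", "event_type": "scan", "source_ip": "198.51.100.9"},
]
structured, trace = preprocess(records)
```

The trace plays the role of the data flow tracking record: when a step fails, it shows which input was affected and at which stage, so that the step can be rerun on corrected data.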
8. A system based on the data information security processing method of any one of claims 1 to 7, comprising:
The first risk identification information acquisition module is used for acquiring first risk identification information comprising equipment logs, monitoring data and alarm information from the service processing equipment;
The second risk identification information acquisition module is used for acquiring second risk identification information comprising threat information, a safety knowledge base and a historical analysis model from the cloud platform;
The data preprocessing module is used for preprocessing the acquired first risk identification information and second risk identification information and converting the first risk identification information and the second risk identification information into a structured format;
The feature construction module is used for constructing time-associated features, space-associated features and sequence-associated features of the first risk identification information converted into the structured format; constructing threat information features IOC of the second risk identification information converted into the structured format as professional features;
The risk association model training module is used for training a security risk association degree model based on an Apriori algorithm and Pearson correlation coefficients by using the constructed association features and the professional features, and acquiring the association between the first risk identification information and the second risk identification information;
And the security scheme generation module is used for generating a security scheme containing resource configuration and monitoring strategies by utilizing the acquired relevance.
CN202311491262.8A 2023-11-10 Data information security processing method and system Active CN117473571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311491262.8A CN117473571B (en) 2023-11-10 Data information security processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311491262.8A CN117473571B (en) 2023-11-10 Data information security processing method and system

Publications (2)

Publication Number Publication Date
CN117473571A CN117473571A (en) 2024-01-30
CN117473571B true CN117473571B (en) 2024-05-14

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544519A (en) * 2022-10-20 2022-12-30 深圳供电局有限公司 Method for carrying out security association analysis on threat information of metering automation system
CN116436659A (en) * 2023-04-04 2023-07-14 浙江中烟工业有限责任公司 Quantitative analysis method and device for network security threat

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
浅析政务云信息安全监管;常凯;张雅菲;齐俊鹏;鲍春鸣;禹东山;;网信军民融合;20200229(02);全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240422

Address after: No. 13, Jinghuafu Store, Xijiang, Jianshe Road, Donghai Town, Lufeng City, Shanwei City, Guangdong Province, 516500

Applicant after: Guangdong Deep Technology Information Technology Co.,Ltd.

Country or region after: China

Address before: 266426, No. 575 Hongliuhe Road, Hongshiya Street, Huangdao District, Qingdao City, Shandong Province (formerly Room 305, Building 4, Wanghuang Road, Hongshiya City)

Applicant before: Qingdao Zhongqi Yingcai Group Business Management Co.,Ltd.

Country or region before: China

GR01 Patent grant