CN117729003A

CN117729003A - Threat information credibility analysis system and method based on machine learning

Info

Publication number: CN117729003A
Application number: CN202311696581.2A
Authority: CN
Inventors: 王小军; 吴悦婷; 廖秀聆; 赖孝友
Original assignee: Fujian Yunchuang Xin'an Information Technology Co ltd
Current assignee: Fujian Yunchuang Xin'an Information Technology Co ltd
Priority date: 2023-12-12
Filing date: 2023-12-12
Publication date: 2024-03-19

Abstract

The application relates to the technical field of intelligent analysis, and particularly discloses a threat information credibility analysis system and method based on machine learning. The collected information can thus be automatically analyzed for credibility, helping security teams or analysts better understand threat information and take timely response measures.

Description

Threat information credibility analysis system and method based on machine learning

Technical Field

The present disclosure relates to the field of intelligent analysis technologies, and more particularly, to a threat intelligence credible analysis system and method based on machine learning.

Background

In recent years, with the continuous development of information technology and network security technology, new network forms such as 5G communication and the Internet of Things, and new service modes such as online social networks, have emerged. These networks are convenient for people while also being open, heterogeneous, mobile and credible. However, while people enjoy these services, they also face the tremendous loss and damage caused by illegal network intrusion.

In the field of network security, the analysis and identification of threat intelligence is one of the important means of preventing and responding to network attacks. However, because intelligence is voluminous and varied, existing threat intelligence analysis methods generally suffer from problems such as insufficient identification accuracy, and no effective means of analyzing threat intelligence has yet been established in the field of network security.

Therefore, a threat intelligence credible analysis system and method based on machine learning are desired.

Disclosure of Invention

The present application has been made to solve the above technical problems. Embodiments of the application provide a threat information credibility analysis system and method based on machine learning, which first construct a network threat information library containing the latest malicious software means, attack modes, malicious domain names, malicious websites and malicious IP addresses, then extract context semantic association features of the information to be analyzed through deep learning, and finally compare the differences between those features and the features of the network threat information library to judge whether the information to be analyzed is credible. The collected information can thus be automatically analyzed for credibility, helping security teams or analysts better understand threat information and take timely response measures.

Accordingly, in accordance with one aspect of the present application, there is provided a machine learning based threat intelligence trust analysis system comprising:

the information acquisition module is used for constructing a network threat information library and acquiring information to be analyzed, wherein the network threat information library comprises the latest malicious software means, an attack mode, a malicious domain name, a malicious website and a malicious IP address;

the information to be analyzed semantic understanding module is used for passing the information to be analyzed through a context encoder comprising an embedding layer to obtain a plurality of information semantic feature vectors;

the Gaussian fusion module is used for fusing the plurality of information semantic feature vectors based on the Gaussian density map to obtain an information context semantic association feature matrix;

the multi-scale semantic association feature extraction module is used for enabling the information context semantic association feature matrix to obtain a multi-scale information semantic association feature matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers;

the network threat information characteristic matrix construction module is used for generating a network threat information characteristic matrix based on the network threat information library;

the comparison analysis module is used for carrying out order prior-based feature engineering parameterization fusion on the multi-scale information semantic association feature matrix and the network threat information feature matrix to obtain a classification feature matrix;

and the analysis result generation module is used for passing the classification feature matrix through a classifier to obtain a classification result, wherein the classification result represents a probability value that the information to be analyzed is credible.

In the threat intelligence credible analysis system based on machine learning, the information to be analyzed semantic understanding module comprises: a word segmentation unit, used for performing word segmentation on the information to be analyzed to obtain a plurality of information data items; an embedding unit, used for performing word embedding encoding on the text data of each information data item in the plurality of information data items by using a learnable embedding matrix of the word embedding layer, so as to obtain a plurality of word embedding vectors; a data integration unit, used for appending the numerical data of each information data item in the plurality of information data items to the tail of the corresponding word embedding vector to obtain a plurality of information embedding vectors; and a context coding unit, used for performing global context semantic coding on the plurality of information embedding vectors by using the Transformer-based BERT model of the context encoder, so as to obtain the plurality of information semantic feature vectors.

In the threat intelligence credible analysis system based on machine learning, the context coding unit comprises: a one-dimensional arrangement subunit, configured to arrange the plurality of information embedding vectors one-dimensionally to obtain an information global embedding vector; a self-attention generation subunit, configured to calculate the product between the information global embedding vector and the transpose of each of the plurality of information embedding vectors to obtain a plurality of self-attention correlation matrices; a standardized self-attention subunit, configured to standardize each self-attention correlation matrix in the plurality of self-attention correlation matrices to obtain a plurality of standardized self-attention correlation matrices; a weight generation subunit, configured to pass each standardized self-attention correlation matrix in the plurality of standardized self-attention correlation matrices through an activation function to obtain a plurality of probability values; and a weight applying subunit, configured to weight each information embedding vector in the plurality of information embedding vectors by the corresponding probability value in the plurality of probability values, so as to obtain the plurality of information semantic feature vectors.

In the threat intelligence credible analysis system based on machine learning, the Gaussian fusion module comprises: a Gaussian density map construction unit, used for constructing a Gaussian density map of the plurality of information semantic feature vectors according to the following fusion formula: $x \sim \mathcal{N}(\mu, \sigma)$, wherein μ represents the per-position mean vector of the plurality of information semantic feature vectors, the value at each position of σ represents the variance between the feature values at that position of the plurality of information semantic feature vectors, $\mathcal{N}$ represents the Gaussian density probability function, and x represents the variable of the Gaussian density map; and a Gaussian discretization unit, used for discretizing the Gaussian distribution at each position of the Gaussian density map to obtain the information context semantic association feature matrix.

In the threat intelligence credible analysis system based on machine learning, each mixed convolution layer of the convolutional neural network model comprises a first convolution branch structure, a second convolution branch structure, a third convolution branch structure and a fourth convolution branch structure arranged in parallel, and a multi-scale fusion structure connected with the first to fourth convolution branch structures, wherein the first convolution branch uses a first convolution kernel with a first size, the second convolution branch uses a second convolution kernel with the first size and a first dilation rate, the third convolution branch uses a third convolution kernel with the first size and a second dilation rate, and the fourth convolution branch uses a fourth convolution kernel with the first size and a third dilation rate.

In the threat intelligence credible analysis system based on machine learning, the multi-scale semantic association feature extraction module is used for: performing, with each mixed convolution layer of the convolutional neural network model, the following processing on the input data in the forward pass of that layer: performing multi-scale convolution coding on the input data to obtain a multi-scale convolution feature map; pooling the multi-scale convolution feature map along the channel dimension to generate a pooled feature map; and performing nonlinear activation on the pooled feature map to generate an activation feature map; wherein the output of the last layer of the convolutional neural network model comprising a plurality of mixed convolution layers is the multi-scale information semantic association feature matrix.

In the threat intelligence credible analysis system based on machine learning, the comparison analysis module is configured to: carrying out order prior-based feature engineering parameterization fusion on the multi-scale information semantic association feature matrix and the network threat information feature matrix by using the following fusion formula to obtain a classification feature matrix; wherein, the fusion formula is:

wherein M₁ represents the multi-scale information semantic association feature matrix, M₂ represents the network threat intelligence feature matrix, and M_c represents the classification feature matrix; the formula further involves the mean matrix between the multi-scale information semantic association feature matrix and the network threat information feature matrix, the exponential operation of a matrix exp(·), the base-2 logarithm log, and the position-wise addition and position-wise subtraction of matrices.

According to another aspect of the present application, there is provided a threat intelligence credibility analysis method based on machine learning, including:

constructing a network threat information library and acquiring information to be analyzed, wherein the network threat information library comprises the latest malicious software means, an attack mode, a malicious domain name, a malicious website and a malicious IP address;

passing the information to be analyzed through a context encoder comprising an embedding layer to obtain a plurality of information semantic feature vectors;

fusing the plurality of information semantic feature vectors based on the Gaussian density map to obtain an information context semantic association feature matrix;

the information context semantic association feature matrix is processed through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a multi-scale information semantic association feature matrix;

generating a network threat information feature matrix based on the network threat information library;

carrying out order prior-based feature engineering parameterization fusion on the multi-scale information semantic association feature matrix and the network threat information feature matrix to obtain a classification feature matrix;

and the classification feature matrix passes through a classifier to obtain a classification result, wherein the classification result is used for representing a probability value of credibility of information to be analyzed.

Compared with the prior art, the threat intelligence credible analysis system and method based on machine learning provided by the application first construct a network threat intelligence library containing the latest malicious software means, attack modes, malicious domain names, malicious websites and malicious IP addresses, then extract context semantic association features of the information to be analyzed through deep learning, and finally compare the differences between those features and the features of the network threat intelligence library to judge whether the information to be analyzed is credible. The collected information can thus be automatically analyzed for credibility, helping security teams or analysts better understand threat information and take timely response measures.

Drawings

The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.

FIG. 1 is a block diagram of a machine learning based threat intelligence trust analysis system in accordance with an embodiment of the application.

Fig. 2 is a schematic architecture diagram of a machine learning-based threat intelligence trust analysis system according to an embodiment of the application.

Fig. 3 is a block diagram of an intelligence semantic understanding module to be analyzed in a threat intelligence trusted analysis system based on machine learning according to an embodiment of the application.

Fig. 4 is a block diagram of a context encoding unit in a machine learning based threat intelligence trust analysis system in accordance with an embodiment of the application.

Fig. 5 is a block diagram of a gaussian fusion module in a machine learning based threat intelligence confidence analysis system in accordance with an embodiment of the application.

Fig. 6 is a flow chart of a machine learning based threat intelligence trust analysis method in accordance with an embodiment of the application.

Detailed Description

Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.

FIG. 1 is a block diagram of a machine learning based threat intelligence trust analysis system in accordance with an embodiment of the application. Fig. 2 is a schematic architecture diagram of a machine learning-based threat intelligence trust analysis system according to an embodiment of the application. As shown in fig. 1 and 2, a machine learning-based threat intelligence trust analysis system 100 according to an embodiment of the application includes: an information acquisition module 110, configured to construct a network threat information library and acquire information to be analyzed, where the network threat information library includes the latest malicious software means, attack modes, malicious domain names, malicious websites, and malicious IP addresses; an information to be analyzed semantic understanding module 120, configured to pass the information to be analyzed through a context encoder including an embedding layer to obtain a plurality of information semantic feature vectors; a Gaussian fusion module 130, configured to fuse the plurality of information semantic feature vectors based on a Gaussian density map to obtain an information context semantic association feature matrix; a multi-scale semantic association feature extraction module 140, configured to pass the information context semantic association feature matrix through a convolutional neural network model including a plurality of mixed convolutional layers to obtain a multi-scale information semantic association feature matrix; a network threat information feature matrix construction module 150, configured to generate a network threat information feature matrix based on the network threat information library; a comparison analysis module 160, configured to perform order prior-based feature engineering parameterization fusion on the multi-scale information semantic association feature matrix and the network threat information feature matrix to obtain a classification feature matrix; and an analysis result generation module 170, configured to pass the classification feature matrix through a classifier to obtain a classification result, where the classification result represents a probability value that the information to be analyzed is credible.
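
The last stage, in which module 170 turns the classification feature matrix into a credibility probability, can be illustrated with a minimal sketch. The text above does not specify the classifier, so the following assumes a single fully connected layer followed by a two-class softmax; the function name `classify`, the weights, and the example matrix are hypothetical and chosen only for illustration.

```python
import math

def classify(feature_matrix, weights, bias):
    """Flatten the matrix, apply one linear layer, and softmax the logits.

    ASSUMPTION: a linear layer plus softmax stands in for the unspecified
    classifier; class 0 is read as "credible", class 1 as "not credible".
    """
    flat = [value for row in feature_matrix for value in row]
    logits = [
        sum(w * x for w, x in zip(class_weights, flat)) + b
        for class_weights, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2x2 classification feature matrix and untrained weights.
features = [[0.2, 0.8], [0.5, 0.1]]
weights = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0]

p_credible, p_not_credible = classify(features, weights, bias)
print(f"probability credible: {p_credible:.3f}")
```

In practice the weights would be learned from labeled intelligence; the sketch only shows how a matrix becomes a probability value.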

In the threat information credible analysis system 100 based on machine learning, the information acquisition module 110 is configured to construct a network threat information library and acquire information to be analyzed, where the network threat information library includes the latest malicious software means, attack modes, malicious domain names, malicious websites, and malicious IP addresses. Network threat intelligence (Cyber Threat Intelligence) refers to network security related information collected, analyzed, and interpreted from various sources. Such information may include data on malware, hacking, cyber attacks, exploits, cyber crime organizations, and other threatening activities. Collecting cyber threat intelligence helps an organization understand current and potential cyber threats so that it can take appropriate measures to protect its information assets and network infrastructure.

The source of threat information may be divided into internal threat information, external threat information, and open source threat information. Internal threat intelligence refers to internal security data from security and compliance systems such as network security monitoring systems, intrusion detection systems (IDS/IPS), log records, event responses, etc. within an enterprise or organization that provide a record of threats and cyber attacks faced by the organization's internal network activity that may help the organization discover evidence of previously unrecognized internal or external threats. External threat intelligence refers to threat information provided from external sources regarding new threats, exploits, malware activity, and other cyber attacks, including public security announcements, vulnerability databases, hacking forums, social media, professional security research institutions, government institutions, other organizations or industry partners that cooperate with the organization, and the like. By sharing information about threat activity, attack trends, and defense strategies, the overall network security capabilities are enhanced. Open source threat intelligence refers to information from publicly accessible sources, such as websites on the internet, blogs, social media, news stories, and the like.

Because open source threat intelligence may come from anonymous users, unverified websites, social media posts, etc., a strict authentication process is not typically performed. Open source threat intelligence lacks specialized auditing and verification mechanisms compared to traditional intelligence sources, and thus may present spurious, misleading, or inaccurate content. Particularly in rapidly changing network environments, some intelligence may have failed or no longer be applicable due to the timeliness of the information. Moreover, the information of the open source threat intelligence may be tampered with by a malicious actor to propagate false information or mislead an analyst, which may cause the analyst to make a false decision based on the wrong information.

Although the open source threat intelligence is relatively less reliable, it can still be an important supplement to threat intelligence collection. When open source threat intelligence or other unverified intelligence is used, the accuracy and credibility of the information can be confirmed by comparing and verifying the information with other reliable sources. Based on the above, in the technical scheme of the application, a network threat information library is firstly constructed, wherein the network threat information library comprises the latest malicious software means, attack modes, malicious domain names, malicious websites and malicious IP addresses from reliable information sources. And then acquiring information to be analyzed.

In the threat intelligence credible analysis system 100 based on machine learning, the information to be analyzed semantic understanding module 120 is configured to pass the information to be analyzed through a context encoder including an embedding layer to obtain a plurality of information semantic feature vectors. To better understand and interpret the contextual semantic features of the information to be analyzed, the information to be analyzed is semantically encoded using a context encoder. It should be appreciated that the context encoder is a neural network model for processing text data. It is generally composed of two main components: an embedding layer (Embedding Layer) and a context encoder proper (Context Encoder). The embedding layer converts text data into a continuous vector representation: each word or character is mapped into a low-dimensional vector space that captures the semantic information and context of the vocabulary, so that discrete text data becomes continuous vectors suitable for subsequent calculation and processing. The context encoder proper is a neural network model that learns the context information of text data; it captures semantic relationships and context information in the text by semantically encoding it.

In one example of the present application, the context encoder is a Transformer-based BERT model. BERT is a pre-trained language model proposed by Google in 2018 that learns generic language representations by pre-training on a large-scale text corpus. The BERT model is trained in two stages. First, it is pre-trained on large-scale unlabeled text, learning generic language representations through the masked language model (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP) tasks. Then it is fine-tuned on the specific task, either by using the BERT model as a feature extractor or by adding task-specific layers on top of it. The core of the BERT model is the Transformer architecture, which consists of multiple encoder layers. Each encoder layer contains a multi-head self-attention mechanism and a feed-forward neural network. The self-attention mechanism allows the model to take all positions in the input sequence into account simultaneously during encoding, capturing global context information, and the stack of encoder layers encodes the input sequence with deep context.

In the technical scheme of the application, the Transformer-based BERT model takes the embedding vector sequence output by the embedding layer as input, semantically encodes it, and adaptively applies a different weight to each embedding vector based on the self-attention mechanism, generating a fixed-length vector that represents the semantic features of each data item and thereby obtaining the plurality of information semantic feature vectors.

Fig. 3 is a block diagram of an intelligence semantic understanding module to be analyzed in a threat intelligence trusted analysis system based on machine learning according to an embodiment of the application. As shown in fig. 3, the information to be analyzed semantic understanding module 120 includes: a word segmentation unit 121, configured to perform word segmentation on the information to be analyzed to obtain a plurality of information data items; an embedding unit 122, configured to perform word embedding encoding on the text data of each of the plurality of information data items by using a learnable embedding matrix of the word embedding layer, so as to obtain a plurality of word embedding vectors; a data integration unit 123, configured to append the numerical data of each of the plurality of information data items to the tail of the corresponding word embedding vector to obtain a plurality of information embedding vectors; and a context coding unit 124, configured to perform global context semantic coding on the plurality of information embedding vectors using the Transformer-based BERT model of the context encoder to obtain the plurality of information semantic feature vectors.
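
The embedding and data integration steps of units 122 and 123 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class `EmbeddingLayer`, the function `build_information_vectors`, the toy vocabulary, and the idea that each data item is a (token, numeric value) pair are all assumptions, and a randomly initialized matrix stands in for a learned one.

```python
import random

class EmbeddingLayer:
    """Toy learnable embedding matrix: one row per vocabulary word."""

    def __init__(self, vocab, dim, seed=0):
        rng = random.Random(seed)
        self.index = {word: i for i, word in enumerate(vocab)}
        # One extra row is reserved for out-of-vocabulary tokens.
        self.matrix = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(len(vocab) + 1)
        ]

    def embed(self, token):
        row = self.index.get(token, len(self.index))
        return list(self.matrix[row])

def build_information_vectors(items, layer):
    """Append each item's numeric value to the tail of its word embedding.

    ASSUMPTION: each information data item is a (text, number) pair.
    """
    return [layer.embed(text) + [float(value)] for text, value in items]

# Hypothetical data items: a token plus a numeric field such as a hit count.
vocab = ["malicious", "domain", "ip"]
layer = EmbeddingLayer(vocab, dim=4)
items = [("malicious", 0.9), ("domain", 3), ("unknown-token", 1)]
vectors = build_information_vectors(items, layer)
print(len(vectors), len(vectors[0]))
```

Each resulting vector has one more component than the embedding dimension, with the numeric field in the last position.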

Fig. 4 is a block diagram of a context encoding unit in a machine learning based threat intelligence trust analysis system in accordance with an embodiment of the application. As shown in fig. 4, the context encoding unit 124 includes: a one-dimensional arrangement subunit 1241, configured to arrange the plurality of information embedding vectors one-dimensionally to obtain an information global embedding vector; a self-attention generation subunit 1242, configured to calculate the product between the information global embedding vector and the transpose of each of the plurality of information embedding vectors to obtain a plurality of self-attention correlation matrices; a normalized self-attention subunit 1243, configured to normalize each self-attention correlation matrix in the plurality of self-attention correlation matrices to obtain a plurality of normalized self-attention correlation matrices; a weight generating subunit 1244, configured to pass each normalized self-attention correlation matrix in the plurality of normalized self-attention correlation matrices through an activation function to obtain a plurality of probability values; and a weight applying subunit 1245, configured to weight each of the plurality of information embedding vectors by the corresponding probability value in the plurality of probability values to obtain the plurality of information semantic feature vectors.
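
The five subunits of the context encoding unit can be traced in a short sketch. Several details are not fixed by the description and are assumed here: the product is taken as an outer product, normalization is a division by the square root of the global vector length, each matrix is reduced to a scalar by its mean, and the activation function is a softmax across data items. The function name `context_encode` is hypothetical.

```python
import math

def context_encode(vectors):
    """Weight each embedding vector by a self-attention probability."""
    # 1. One-dimensional arrangement: concatenate all vectors.
    global_vec = [x for vec in vectors for x in vec]
    scale = math.sqrt(len(global_vec))

    scores = []
    for vec in vectors:
        # 2. Outer product of the global vector with this vector's transpose.
        assoc = [[g * x for x in vec] for g in global_vec]
        # 3. Normalization (ASSUMPTION: scaling by sqrt of the length).
        normalized = [[a / scale for a in row] for row in assoc]
        # Reduce the matrix to one score (ASSUMPTION: mean of all entries).
        count = len(normalized) * len(vec)
        scores.append(sum(map(sum, normalized)) / count)

    # 4. Activation (ASSUMPTION: softmax over the per-item scores).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # 5. Apply each probability value as a weight on its vector.
    encoded = [[w * x for x in vec] for w, vec in zip(weights, vectors)]
    return encoded, weights

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded, weights = context_encode(embeddings)
print([round(w, 3) for w in weights])
```

A real BERT encoder uses learned query, key and value projections and multiple heads; this sketch follows only the simplified subunit description.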

In the threat intelligence credible analysis system 100 based on machine learning, the gaussian fusion module 130 is configured to fuse the plurality of intelligence semantic feature vectors based on a gaussian density chart to obtain an intelligence context semantic association feature matrix. Considering that the intelligence semantic feature vector of each individual intelligence data item may not completely capture the complex relationship of the intelligence global, the plurality of intelligence semantic feature vectors are further fused based on a gaussian density map. The gaussian density map (Gaussian Density Map) is a probability distribution map representing the strength of the association between different elements. It computes the similarity or association between elements based on a gaussian function (also known as normal distribution), with good mathematical properties and statistical properties, where the value of each element represents the strength of the association between that element and other elements. Through the construction of the Gaussian density diagram, semantic association among all information data items can be converted into probability distribution, and a more accurate basis is provided for subsequent analysis and decision. Specifically, first, the mean and variance between feature values of the respective positions of the plurality of informative semantic feature vectors are calculated, and a gaussian density map is constructed using the mean vector and the variance vector as inputs of a gaussian function, wherein each element represents the strength of association between two informative semantic feature vectors. Then, discretizing the Gaussian density map to generate a context semantic association feature matrix containing context semantic association information among the information data items.

Fig. 5 is a block diagram of a gaussian fusion module in a machine learning based threat intelligence confidence analysis system in accordance with an embodiment of the application. As shown in fig. 5, the Gaussian fusion module 130 includes: a Gaussian density map construction unit 131, configured to construct a Gaussian density map of the plurality of information semantic feature vectors according to the following fusion formula: $x \sim \mathcal{N}(\mu, \sigma)$, wherein μ represents the per-position mean vector of the plurality of information semantic feature vectors, the value at each position of σ represents the variance between the feature values at that position of the plurality of information semantic feature vectors, $\mathcal{N}$ represents the Gaussian density probability function, and x represents the variable of the Gaussian density map; and a Gaussian discretization unit 132, configured to discretize the Gaussian distribution at each position of the Gaussian density map to obtain the information context semantic association feature matrix.
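
A minimal sketch of the two Gaussian fusion units follows. The per-position mean and variance come directly from the description; the discretization method does not, so the sketch assumes that each position's Gaussian is discretized by drawing a fixed number of samples, which yields one matrix row per position. The names `gaussian_density_map` and `discretize` are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance

def gaussian_density_map(vectors):
    """Return the per-position mean and variance across all vectors."""
    columns = list(zip(*vectors))  # one column per feature position
    mu = [fmean(col) for col in columns]
    var = [pvariance(col) for col in columns]
    return mu, var

def discretize(mu, var, samples_per_position, seed=0):
    """Sample each position's Gaussian to build a feature matrix.

    ASSUMPTION: discretization is done by random sampling; the source
    does not say how the Gaussian at each position is discretized.
    """
    rng = random.Random(seed)
    return [
        # random.gauss expects a standard deviation, hence the sqrt.
        [rng.gauss(m, math.sqrt(v)) for _ in range(samples_per_position)]
        for m, v in zip(mu, var)
    ]

semantic_vectors = [[1.0, 2.0], [3.0, 2.0]]
mu, var = gaussian_density_map(semantic_vectors)
matrix = discretize(mu, var, samples_per_position=4)
print(mu, var)
```

A position where all vectors agree has zero variance, so every sample in its row equals the mean.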

In the threat intelligence credible analysis system 100 based on machine learning, the multi-scale semantic association feature extraction module 140 is configured to pass the information context semantic association feature matrix through a convolutional neural network model including a plurality of mixed convolutional layers to obtain a multi-scale information semantic association feature matrix. In order to extract and characterize the global semantic association features of the information at different scales, the information context semantic association feature matrix is further subjected to multi-scale feature mining through a convolutional neural network model. The convolutional neural network model is a deep learning model that can automatically learn and extract the important features in the information context semantic association feature matrix. In particular, in the technical scheme of the application, the convolutional neural network model comprises a plurality of mixed convolutional layers, each of which convolves the input data at different scales by using convolution kernels with different dilation rates. Through convolutional layers, pooling layers and activation functions stacked layer by layer, this structure extracts the features in the input data and gradually raises their level of abstraction. By applying a plurality of mixed convolution layers, information semantic association features are extracted at different levels of abstraction: the shallow convolution layers capture local semantic association features, and the deep convolution layers capture more abstract and global ones. This provides a richer and more diverse feature representation of the information, which improves the accuracy of analyzing the credibility of the information to be analyzed, while also reducing the dimension of the features and improving calculation efficiency.

Accordingly, in one specific example, each mixed convolutional layer of the convolutional neural network model includes a first convolutional branch structure, a second convolutional branch structure, a third convolutional branch structure and a fourth convolutional branch structure in parallel, and a multi-scale fusion structure connected to the first to fourth convolutional branch structures, wherein the first convolutional branch uses a first convolution kernel having a first size, the second convolutional branch uses a second convolution kernel having the first size and a first dilation rate, the third convolutional branch uses a third convolution kernel having the first size and a second dilation rate, and the fourth convolutional branch uses a fourth convolution kernel having the first size and a third dilation rate.
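A dilated kernel of size k with dilation rate d covers the same span as a dense kernel of size d·(k − 1) + 1, which is why branches that share one kernel size can still observe different scales. The sketch below computes that span for four branches. The kernel size 3 and the dilation rates 2, 3 and 4 are assumed values; the text only names them as the first size and the first to third dilation rates.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


# Assumed values for illustration only; dilation 1 means no dilation.
KERNEL_SIZE = 3
BRANCH_DILATIONS = {"first": 1, "second": 2, "third": 3, "fourth": 4}

spans = {
    name: effective_kernel_size(KERNEL_SIZE, dilation)
    for name, dilation in BRANCH_DILATIONS.items()
}
for name, span in spans.items():
    print(f"{name} branch covers a {span}x{span} window")
```

The number of weights per branch stays at k × k regardless of the dilation rate, so the wider windows come at no extra parameter cost.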

Specifically, the multi-scale semantic association feature extraction module 140 is configured to use each mixed convolutional layer of the convolutional neural network model to perform the following processing on the input data in the forward pass of the layer: performing multi-scale convolutional encoding on the input data to obtain a multi-scale convolution feature map; pooling the multi-scale convolution feature map along the channel dimension to generate a pooled feature map; and performing nonlinear activation on the pooled feature map to generate an activation feature map; wherein the output of the last layer of the convolutional neural network model including the plurality of mixed convolutional layers is the multi-scale intelligence semantic association feature matrix.
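The forward pass of one mixed convolutional layer can be sketched as follows. This is a minimal pure-Python illustration and not the implementation of the application: it assumes zero padding that preserves the input size, treats the outputs of the four branches as the channels of the multi-scale feature map, uses mean pooling along the channel dimension, and uses ReLU as the nonlinear activation. The names `dilated_conv2d` and `mixed_conv_layer` and the averaging kernels are hypothetical.

```python
def dilated_conv2d(matrix, kernel, dilation):
    """Zero-padded, size-preserving 2D convolution with a dilated square kernel."""
    rows, cols = len(matrix), len(matrix[0])
    k = len(kernel)
    half = (k // 2) * dilation  # distance from the center tap to the outermost tap
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr = r + i * dilation - half
                    cc = c + j * dilation - half
                    # Taps that fall outside the matrix read as zero padding.
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[i][j] * matrix[rr][cc]
            out[r][c] = acc
    return out


def mixed_conv_layer(matrix, kernels, dilations):
    """Multi-scale convolution, channel-wise mean pooling, then ReLU."""
    # Multi-scale convolutional encoding: one channel per branch.
    branches = [dilated_conv2d(matrix, k, d) for k, d in zip(kernels, dilations)]
    rows, cols = len(matrix), len(matrix[0])
    # Pooling along the channel dimension (assumed: mean over branches).
    pooled = [
        [sum(b[r][c] for b in branches) / len(branches) for c in range(cols)]
        for r in range(rows)
    ]
    # Nonlinear activation (assumed: ReLU).
    return [[max(0.0, v) for v in row] for row in pooled]


# Hypothetical 3x3 averaging kernels shared by the four branches.
AVERAGE_3X3 = [[1 / 9] * 3 for _ in range(3)]
features = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
activated = mixed_conv_layer(features, [AVERAGE_3X3] * 4, dilations=[1, 2, 3, 4])
```

Stacking several such layers, each consuming the output of the previous one, gives the plurality of mixed convolutional layers. In a trained model the kernel weights would be learned rather than fixed.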

In the above machine learning-based threat intelligence credibility analysis system 100, the cyber threat intelligence feature matrix construction module 150 is configured to generate a cyber threat intelligence feature matrix based on the cyber threat intelligence library. Here, the encoding process of the cyber threat intelligence feature matrix is consistent with that of the intelligence to be analyzed. That is, based on the cyber threat intelligence library, a cyber threat intelligence feature matrix carrying the multi-scale contextual semantic association features between the individual intelligence data items is generated. More specifically, the cyber threat intelligence library is first passed through a context encoder including an embedding layer to obtain a plurality of cyber threat intelligence semantic feature vectors. The plurality of cyber threat intelligence semantic feature vectors are then fused based on the Gaussian density map to obtain a cyber threat intelligence context semantic association feature matrix. Finally, the cyber threat intelligence context semantic association feature matrix is passed through a convolutional neural network model including a plurality of mixed convolutional layers to obtain the cyber threat intelligence feature matrix.

In the machine learning-based threat intelligence credibility analysis system 100, the comparison analysis module 160 is configured to perform order-prior-based feature engineering parameterization fusion on the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix to obtain a classification feature matrix. To capture the semantic association and change information between the intelligence to be analyzed and the cyber threat intelligence library, the two feature matrices are fused. Fusing the features of the intelligence to be analyzed with those of the threat intelligence captures the differences and changes between the intelligence to be analyzed and the threat intelligence contained in the cyber threat intelligence library, which provides more specific, fine-grained information for understanding the semantics and credibility of the intelligence to be analyzed.

In particular, in the technical solution of the application, the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix may be obtained through different feature extraction methods or from different data sources, so their feature dimensions may differ. For example, the multi-scale intelligence semantic association feature matrix may contain features at multiple scales, while the cyber threat intelligence feature matrix may contain only limited features. Directly fusing the two feature matrices can therefore cause a dimension mismatch, which reduces the feature density of the fused classification feature matrix. Likewise, the two matrices may use different feature representations. For example, the multi-scale intelligence semantic association feature matrix may be obtained through the context encoder and the convolutional neural network model, while the cyber threat intelligence feature matrix may be generated from the cyber threat intelligence library. Their feature representations may be inconsistent in aspects such as the semantic meaning, distribution and scale of the features. Directly fusing the two feature matrices can therefore also cause inconsistent feature representation, which again reduces the feature density of the fused classification feature matrix. To address the low feature density of the fused feature matrix, order-prior-based feature engineering parameterization fusion is performed on the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix to improve the quality and density of the feature representation.

To address the low feature density of the fused feature matrix, the technical solution of the application uses order-prior-based feature engineering parameterization to treat the low feature density as a structural imbalance, so that a structure optimization technique can be applied to increase the feature density of the fused feature matrix. Specifically, an order-prior-based feature engineering parameterization strategy is designed according to the form and characteristics of the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix, and feature values of different categories and dimensions are sorted and grouped according to a given order rule, which reduces information redundancy and noise interference in the fusion process. Furthermore, using the structure optimization technique, a support matrix serving as the mean is selected as an interaction seed, and point-wise correlation is grown from the interaction center toward the interaction end points, so that sparse matching of the feature matrices in dimension and scale is converted into dense matching. This increases the feature density of the fused classification feature matrix and improves the classification effect based on it.

Accordingly, in one specific example, the comparison analysis module 160 is configured to perform order-prior-based feature engineering parameterization fusion on the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix using a fusion formula to obtain a classification feature matrix; wherein the fusion formula is: [formula image not reproduced]

In the threat intelligence credible analysis system 100 based on machine learning, the analysis result generation module 170 is configured to pass the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to represent a probability value that the intelligence to be analyzed is credible. The classifier is a supervised learning method that classifies new unlabeled data by learning patterns and rules from labeled training data, learning associations between features in the training data and corresponding labels. The goal of the classifier is to assign it correctly into predefined categories based on the characteristics of the input data. Here, by inputting the classification feature matrix into a trained classifier, the classifier can classify the information to be analyzed according to the learned mode and rule, and give a result indicating the credibility. For example, the classifier may output a probability value indicating the probability that the intelligence to be analyzed belongs to a trusted category. In this way, the decision maker or the analyst can be helped to better evaluate the credibility of the information to be analyzed according to the classification result, and can be used as auxiliary information for subsequent decision making or further analysis.

In summary, the machine learning-based threat intelligence credibility analysis system according to the embodiment of the application has been described. First, a cyber threat intelligence library containing the latest malware techniques, attack modes, malicious domain names, malicious websites and malicious IP addresses is constructed. Context semantic association features are then extracted from the intelligence to be analyzed using deep learning, and these features are compared against the features of the cyber threat intelligence library to judge whether the intelligence to be analyzed is credible. In this way, the credibility of collected intelligence can be analyzed automatically, helping security teams or analysts better understand threat intelligence and take response measures in time.

Fig. 6 is a flow chart of a machine learning-based threat intelligence credibility analysis method according to an embodiment of the application. As shown in Fig. 6, the machine learning-based threat intelligence credibility analysis method according to the embodiment of the application includes the steps of: S110, constructing a cyber threat intelligence library and acquiring intelligence to be analyzed, wherein the cyber threat intelligence library includes the latest malware techniques, attack modes, malicious domain names, malicious websites and malicious IP addresses; S120, passing the intelligence to be analyzed through a context encoder including an embedding layer to obtain a plurality of intelligence semantic feature vectors; S130, fusing the plurality of intelligence semantic feature vectors based on the Gaussian density map to obtain an intelligence context semantic association feature matrix; S140, passing the intelligence context semantic association feature matrix through a convolutional neural network model including a plurality of mixed convolutional layers to obtain a multi-scale intelligence semantic association feature matrix; S150, generating a cyber threat intelligence feature matrix based on the cyber threat intelligence library; S160, performing order-prior-based feature engineering parameterization fusion on the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix to obtain a classification feature matrix; and S170, passing the classification feature matrix through a classifier to obtain a classification result, wherein the classification result represents a probability value that the intelligence to be analyzed is credible.
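The data flow of steps S120 to S170 can be sketched as a chain of function calls. The sketch only shows how the stages are wired together, including the point that the cyber threat intelligence library goes through the same encoding path as the intelligence to be analyzed. Every stage is passed in as a callable, and the function name `analyze_credibility`, its parameter names and the toy stand-in stages are assumptions that do not implement the actual models.

```python
from typing import Any, Callable


def analyze_credibility(
    intelligence: Any,
    threat_library: Any,
    encode: Callable[[Any], Any],              # S120: context encoder
    gaussian_fuse: Callable[[Any], Any],       # S130: Gaussian density map fusion
    extract_multiscale: Callable[[Any], Any],  # S140: mixed convolutional layers
    fuse: Callable[[Any, Any], Any],           # S160: order-prior-based fusion
    classify: Callable[[Any], float],          # S170: classifier
) -> float:
    """Chain S120 to S170; S110 (collecting both inputs) happens before the call."""

    def featurize(items: Any) -> Any:
        return extract_multiscale(gaussian_fuse(encode(items)))

    intel_features = featurize(intelligence)      # S120 to S140
    library_features = featurize(threat_library)  # S150, same encoding path
    return classify(fuse(intel_features, library_features))


# Toy stand-ins so the wiring can be exercised end to end.
score = analyze_credibility(
    intelligence=["suspicious.example", "port 4444"],
    threat_library=["bad.example", "port 4444", "dropper"],
    encode=lambda items: [float(len(s)) for s in items],
    gaussian_fuse=lambda vecs: sum(vecs) / len(vecs),
    extract_multiscale=lambda m: m,
    fuse=lambda a, b: abs(a - b),
    classify=lambda d: 1.0 / (1.0 + d),
)
```

Passing the stages as parameters keeps the orchestration independent of any particular encoder, network or classifier, so each stage can be replaced or tested in isolation.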

Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above machine learning-based threat intelligence credibility analysis method have been described in detail in the above description of the machine learning-based threat intelligence credibility analysis system with reference to Figs. 1 to 5, and thus repetitive descriptions thereof are omitted.

The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.

The block diagrams of the devices, apparatuses, equipment and systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment and systems may be connected, arranged and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."

It is also noted that in the apparatuses, devices and methods of the present application, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be regarded as equivalent solutions of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims

1. A machine learning based threat intelligence credibility analysis system, comprising:

2. The machine learning based threat intelligence credibility analysis system of claim 1, wherein the semantic understanding module for the intelligence to be analyzed comprises:

a word segmentation unit, configured to perform word segmentation on the intelligence to be analyzed to obtain a plurality of intelligence data items;

an embedding unit, configured to perform word embedding encoding on the text data of each of the plurality of intelligence data items using a learnable embedding matrix of the word embedding layer to obtain a plurality of word embedding vectors;

a data integration unit, configured to append the numerical data of each of the plurality of intelligence data items to the tail of the corresponding word embedding vector to obtain a plurality of intelligence embedding vectors;

and a context encoding unit, configured to perform global context semantic encoding on the plurality of intelligence embedding vectors using the transformer-based BERT model of the context encoder to obtain the plurality of intelligence semantic feature vectors.

3. The machine learning based threat intelligence credibility analysis system of claim 2, wherein the context encoding unit comprises:

a one-dimensional arrangement subunit, configured to arrange the plurality of intelligence embedding vectors in one dimension to obtain an intelligence global embedding vector;

a self-attention generation subunit, configured to calculate the product between the intelligence global embedding vector and the transpose vector of each of the plurality of intelligence embedding vectors to obtain a plurality of self-attention correlation matrices;

a standardized self-attention subunit, configured to standardize each of the plurality of self-attention correlation matrices to obtain a plurality of standardized self-attention correlation matrices;

a weight generation subunit, configured to pass each of the plurality of standardized self-attention correlation matrices through an activation function to obtain a plurality of probability values;

and a weight applying subunit, configured to weight each of the plurality of intelligence embedding vectors using the corresponding probability value as a weight to obtain the plurality of intelligence semantic feature vectors.

4. The machine learning based threat intelligence credibility analysis system of claim 3, wherein the Gaussian fusion module comprises:

a Gaussian density map construction unit for constructing a Gaussian density map of the plurality of intelligence semantic feature vectors according to the following fusion formula;

wherein the fusion formula is:

[formula image not reproduced]

wherein μ represents the per-position mean vector of the plurality of intelligence semantic feature vectors, the value at each position of σ represents the variance between the feature values at that position across the plurality of intelligence semantic feature vectors, the function symbol in the formula represents the Gaussian density probability function, and x represents the variable of the Gaussian density map;

and a Gaussian discretization unit for discretizing the Gaussian distribution at each position of the Gaussian density map to obtain the intelligence context semantic association feature matrix.

5. The machine learning based threat intelligence credibility analysis system of claim 4, wherein each mixed convolutional layer of the convolutional neural network model comprises a first convolutional branch structure, a second convolutional branch structure, a third convolutional branch structure and a fourth convolutional branch structure in parallel, and a multi-scale fusion structure connected to the first to fourth convolutional branch structures, wherein the first convolutional branch uses a first convolution kernel having a first size, the second convolutional branch uses a second convolution kernel having the first size and a first dilation rate, the third convolutional branch uses a third convolution kernel having the first size and a second dilation rate, and the fourth convolutional branch uses a fourth convolution kernel having the first size and a third dilation rate.

6. The machine learning based threat intelligence credibility analysis system of claim 5, wherein the multi-scale semantic association feature extraction module is configured to use each mixed convolutional layer of the convolutional neural network model to perform the following processing on the input data in the forward pass of the layer:

performing multi-scale convolutional encoding on the input data to obtain a multi-scale convolution feature map;

pooling the multi-scale convolution feature map along the channel dimension to generate a pooled feature map;

performing nonlinear activation on the pooled feature map to generate an activation feature map;

wherein the output of the last layer of the convolutional neural network model comprising the plurality of mixed convolutional layers is the multi-scale intelligence semantic association feature matrix.

7. The machine learning based threat intelligence credibility analysis system of claim 6, wherein the comparison analysis module is configured to: perform order-prior-based feature engineering parameterization fusion on the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix using the following fusion formula to obtain a classification feature matrix; wherein the fusion formula is:

[formula image not reproduced]

wherein M₁ represents the multi-scale intelligence semantic association feature matrix, M₂ represents the cyber threat intelligence feature matrix, the formula further uses the mean matrix between the multi-scale intelligence semantic association feature matrix and the cyber threat intelligence feature matrix, exp(·) represents the exponential operation of a matrix, log represents the base-2 logarithm, the two operator symbols in the formula represent per-position addition and per-position subtraction of matrices respectively, and M_c represents the classification feature matrix.

8. A machine learning based threat intelligence credibility analysis method, comprising:

9. The machine learning based threat intelligence credibility analysis method of claim 8, wherein passing the intelligence to be analyzed through a context encoder comprising an embedding layer to obtain a plurality of intelligence semantic feature vectors comprises:

performing word segmentation on the intelligence to be analyzed to obtain a plurality of intelligence data items;

performing word embedding encoding on the text data of each of the plurality of intelligence data items using a learnable embedding matrix of the word embedding layer to obtain a plurality of word embedding vectors;

appending the numerical data of each of the plurality of intelligence data items to the tail of the corresponding word embedding vector to obtain a plurality of intelligence embedding vectors;

and performing global context semantic encoding on the plurality of intelligence embedding vectors using the transformer-based BERT model of the context encoder to obtain the plurality of intelligence semantic feature vectors.

10. The machine learning based threat intelligence credibility analysis method of claim 9, wherein performing global context semantic encoding on the plurality of intelligence embedding vectors using the transformer-based BERT model of the context encoder to obtain the plurality of intelligence semantic feature vectors comprises:

arranging the plurality of intelligence embedding vectors in one dimension to obtain an intelligence global embedding vector;

calculating the product between the intelligence global embedding vector and the transpose vector of each of the plurality of intelligence embedding vectors to obtain a plurality of self-attention correlation matrices;

standardizing each of the plurality of self-attention correlation matrices to obtain a plurality of standardized self-attention correlation matrices;

passing each of the plurality of standardized self-attention correlation matrices through an activation function to obtain a plurality of probability values;

and weighting each of the plurality of intelligence embedding vectors using the corresponding probability value as a weight to obtain the plurality of intelligence semantic feature vectors.