CN113420127A - Threat information processing method, device, computing equipment and storage medium - Google Patents

Info

Publication number
CN113420127A
Authority
CN
China
Prior art keywords
intelligence
knowledge
target
threat
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110761075.1A
Other languages
Chinese (zh)
Inventor
王晓波
徐菲
郑然德
谢兰天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xin'an Tiantu Technology Co ltd
Original Assignee
Beijing Xin'an Tiantu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xin'an Tiantu Technology Co ltd
Priority to CN202110761075.1A
Publication of CN113420127A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a threat information processing method, apparatus, computing device and storage medium. The method comprises: acquiring unstructured data from threat intelligence source data; performing word segmentation on the unstructured data to obtain multiple items of intelligence knowledge; applying a natural language processing technique to each item of intelligence knowledge to obtain its semantic expression; and associating each item of intelligence knowledge with a known threat intelligence type according to the semantic expression of the item and the known threat intelligence types. This scheme increases the amount of threat intelligence that can be identified and thereby makes effective use of threat intelligence source data.

Description

Threat information processing method, device, computing equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of network security, in particular to a threat information processing method, a threat information processing device, computing equipment and a storage medium.
Background
With the continuous progress of information technology, securing large-scale cyberspace increasingly depends on the extraction, understanding, construction and sharing of threat intelligence. Threat intelligence can describe the specific attack vectors that attackers use against a particular industry or geographic region, and it provides a basis for threat-response decisions.
Threat intelligence source data includes unstructured data, semi-structured data and structured data. Unstructured data, such as emails, web pages and text, needs to be converted into natural language before it can be analyzed. In the related art, threat intelligence is generally extracted from unstructured data by training a neural network model on a labeled text data set and then using the trained model to identify threat intelligence entities.
Disclosure of Invention
The related art can identify only a small amount of threat intelligence and therefore cannot make effective use of threat intelligence source data. To address this problem, the embodiments of the invention provide a threat information processing method, apparatus, computing device and storage medium that increase the amount of threat intelligence identified and make effective use of threat intelligence source data.
In a first aspect, an embodiment of the present invention provides a threat intelligence processing method, including:
acquiring unstructured data in threat intelligence source data;
performing word segmentation processing on the unstructured data to obtain a plurality of intelligence knowledge;
semantic understanding is carried out on each intelligence knowledge by utilizing a natural language processing technology to obtain semantic expression of each intelligence knowledge;
and associating each intelligence knowledge with the known threat intelligence type respectively according to the semantic expression of each intelligence knowledge and the known threat intelligence type.
Preferably, the intelligence knowledge comprises: at least one of a word, a sentence, a paragraph, and a text.
Preferably, the semantic understanding of each intelligence knowledge by using the natural language processing technology to obtain the semantic expression of each intelligence knowledge includes:
determining a convolutional neural network which is trained in advance; the convolutional neural network is obtained by training by using sample intelligence knowledge and corresponding sample semantics as sample pairs;
and respectively calculating each intelligence knowledge by utilizing a hidden layer in the convolutional neural network to obtain a multidimensional characteristic vector corresponding to each intelligence knowledge, and determining the obtained multidimensional characteristic vector as semantic expression of the corresponding intelligence knowledge.
Preferably, the convolutional neural network is trained using:
obtaining a plurality of sample pairs, wherein each sample pair comprises sample intelligence knowledge and corresponding sample semantics;
according to a set characteristic dimension range, respectively determining each characteristic dimension belonging to the set characteristic dimension range as the characteristic dimension of a hidden layer in the convolutional neural network, and executing the next step after each characteristic dimension of the hidden layer in the convolutional neural network is determined;
for each sample pair, taking sample intelligence knowledge in the sample pair as the input of the convolutional neural network, taking sample semantics in the sample pair as the output of the convolutional neural network, and training the convolutional neural network;
determining recall rate and calculation rate respectively corresponding to each characteristic dimension corresponding to the convolutional neural network;
and determining a target feature dimension of a hidden layer in the convolutional neural network according to the recall rate and the calculation rate respectively corresponding to each feature dimension, and determining the convolutional neural network which is trained and corresponds to the target feature dimension as a final convolutional neural network.
Preferably, the associating each intelligence knowledge with a known threat intelligence type according to the semantic expression of each intelligence knowledge and the known threat intelligence type includes:
respectively obtaining semantic expression of each known threat intelligence type;
calculating the distance between each target intelligence knowledge and each known threat intelligence type in each intelligence knowledge according to the semantic expression of each intelligence knowledge and the semantic expression of each known threat intelligence type;
and determining a target threat intelligence type corresponding to the target intelligence knowledge according to the distance between the target intelligence knowledge and each known threat intelligence type, and associating the target intelligence knowledge to the target threat intelligence type.
Preferably, the determining a target threat intelligence type corresponding to the target intelligence knowledge according to a distance between the target intelligence knowledge and each known threat intelligence type includes:
determining whether, among the distances between the target intelligence knowledge and the known threat intelligence types, there is a target distance that is not greater than a set distance threshold; if so, determining the threat intelligence type corresponding to the smallest target distance as the target threat intelligence type; and if not, generating a threat intelligence type corresponding to the target intelligence knowledge, and determining the generated threat intelligence type as the target threat intelligence type.
Preferably, before the generating threat intelligence types corresponding to the target intelligence knowledge, the method further includes:
and determining whether the semantics of the target intelligence knowledge has intelligence significance according to the semantic expression of the target intelligence knowledge, if so, executing the generation of the threat intelligence type corresponding to the target intelligence knowledge, and if not, deleting the target intelligence knowledge.
In a second aspect, an embodiment of the present invention further provides a threat information processing apparatus, including:
the data acquisition unit is used for acquiring unstructured data in threat intelligence source data;
the word segmentation processing unit is used for carrying out word segmentation processing on the unstructured data to obtain a plurality of intelligence knowledge;
the semantic understanding unit is used for carrying out semantic understanding on each intelligence knowledge by utilizing a natural language processing technology to obtain semantic expression of each intelligence knowledge;
and the intelligence type association unit is used for associating each intelligence knowledge with the known threat intelligence type according to the semantic expression of each intelligence knowledge and the known threat intelligence type.
In a third aspect, an embodiment of the present invention further provides a computing device, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the method described in any embodiment of this specification.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer program causes the computer to execute the method described in any embodiment of the present specification.
The embodiments of the invention provide a threat information processing method, apparatus, computing device and storage medium. Word segmentation is used to obtain multiple items of intelligence knowledge from unstructured data. At this point each item is only raw text from the unstructured data, so a natural language processing technique is applied to obtain its semantic expression. The semantic expression lets a machine recognize and process the intelligence knowledge, so that the intelligence knowledge can be associated with a threat intelligence type. Because every item of intelligence knowledge is analyzed, the amount of threat intelligence identified increases and the threat intelligence source data is used effectively.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a threat intelligence processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a semantic understanding method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for associating information knowledge according to an embodiment of the present invention;
FIG. 4 is a diagram of a hardware architecture of a computing device according to an embodiment of the present invention;
FIG. 5 is a structural diagram of a threat information processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention, and based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative efforts belong to the scope of the present invention.
Massive threat intelligence source data comes from post-incident reviews of alarms or situations in which a user suffered a security attack, or from analysis reports produced by well-known security vendors or security technology teams. This scattered information needs to be comprehensively correlated to form useful threat intelligence. However, most users' post-incident reviews and analysis reports have no fixed format and contain a large amount of natural language description, and this unstructured natural language data often contains a large amount of key information.
In the related art, threat intelligence is generally extracted from unstructured data by training a neural network model on a labeled text data set and then using the trained model to identify threat intelligence entities. However, this approach can identify only a small amount of threat intelligence and cannot make effective use of threat intelligence source data.
To address this, word segmentation can be performed on the unstructured data in the threat intelligence source data, and each item of intelligence knowledge obtained from the segmentation can then be analyzed. This increases the amount of threat intelligence identified and makes effective use of the threat intelligence source data.
Specific implementations of the above concepts are described below.
Referring to fig. 1, an embodiment of the present invention provides a threat intelligence processing method, including:
Step 100, acquiring unstructured data in threat intelligence source data.
Step 102, performing word segmentation processing on the unstructured data to obtain a plurality of intelligence knowledge.
Step 104, performing semantic understanding on each intelligence knowledge by using a natural language processing technology to obtain a semantic expression of each intelligence knowledge.
Step 106, associating each intelligence knowledge with a known threat intelligence type according to the semantic expression of each intelligence knowledge and the known threat intelligence types.
In this embodiment of the invention, word segmentation is used to obtain multiple items of intelligence knowledge from unstructured data. At this point each item is only raw text from the unstructured data, so a natural language processing technique is applied to obtain its semantic expression. The semantic expression lets a machine recognize and process the intelligence knowledge, so that the intelligence knowledge can be associated with a threat intelligence type. Because every item of intelligence knowledge is analyzed, the amount of threat intelligence identified increases and the threat intelligence source data is used effectively.
The manner in which the various steps shown in fig. 1 are performed is described below.
First, in step 100, unstructured data in threat intelligence source data is obtained.
Threat intelligence is some evidence-based knowledge, including context, mechanism, label, meaning, and actionable advice, that is relevant to a threat or hazard that an asset is exposed to, and that can be used to provide information support for the asset-related subject's response to or handling decisions about the threat or hazard.
In most cases, the sources of the massive body of threat intelligence are multi-modal, with no uniform format or standard. The data structures of threat intelligence may include unstructured data, semi-structured data and structured data. Unstructured data, such as web pages, emails and documents, needs to be converted into natural language so that it can be recognized and processed by a machine.
Then, in step 102, the unstructured data is word segmented to obtain a plurality of intelligence knowledge.
In an embodiment of the present invention, the unstructured data may be segmented using a method based on string matching, a method based on understanding, a method based on statistics, or the like. Different segmentation granularities can be set to obtain segmentation results of different granularity.
In one embodiment of the present invention, it cannot be known in advance which information in the unstructured data constitutes threat intelligence. To make more effective use of the unstructured data in the threat intelligence source data, the intelligence knowledge may include at least one of a word, a sentence, a paragraph, and a text. For example, each word, each sentence, each paragraph, or even the whole text of the unstructured data can be treated as an item of intelligence knowledge, and each item is then screened further by subsequent processing.
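The multi-granularity idea can be sketched as follows. This is only a minimal illustration, assuming English text and simple regular-expression rules. The function name `segment_all_granularities` and the splitting rules are hypothetical and are not taken from the patent; Chinese text would need a dedicated word segmenter in place of the word regex.

```python
import re
from typing import Dict, List


def segment_all_granularities(text: str) -> Dict[str, List[str]]:
    """Split raw text into candidate items at four granularities.

    Assumption: paragraphs are separated by blank lines, sentences end with
    '.', '!' or '?', and words are runs of word characters that may be joined
    by '.' or '-' (so that domain names and IP addresses stay in one piece).
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    words = re.findall(r"\w+(?:[.\-]\w+)*", text)
    return {
        "text": [text.strip()],
        "paragraph": paragraphs,
        "sentence": sentences,
        "word": words,
    }


# Hypothetical sample report used only for illustration.
sample = (
    "Phishing mail delivers a loader. The loader contacts evil-domain.com.\n\n"
    "Operators later deploy ransomware."
)
candidates = segment_all_granularities(sample)
```

Every list in `candidates` would be passed on to the later screening steps, since at this stage nothing is known about which items carry threat intelligence.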
Next, in step 104, semantic understanding is performed on each intelligence knowledge by using a natural language processing technique to obtain a semantic expression of each intelligence knowledge.
Natural language processing is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Put simply, it uses computers to process, understand and use human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, and is often called computational linguistics.
In one embodiment of the present invention, referring to FIG. 2, the intelligence knowledge can be semantically understood in at least the following way:
Step 200, determining a convolutional neural network that has been trained in advance.
The convolutional neural network is obtained by training by using sample intelligence knowledge and corresponding sample semantics as sample pairs. Wherein the convolutional neural network comprises an input layer, a hidden layer, and an output layer.
In an embodiment of the present invention, the convolutional neural network may be trained as follows:
S1: Obtain a number of sample pairs, each including sample intelligence knowledge and the corresponding sample semantics. The sample semantics may be obtained by having a human read and interpret the sample intelligence knowledge.
S2: According to a set feature dimension range, determine each feature dimension within the range in turn as the feature dimension of the hidden layer in the convolutional neural network, and execute S3 each time the feature dimension of the hidden layer has been determined.
Word segmentation of unstructured data yields a large quantity of intelligence knowledge. If the hidden layer has too few feature dimensions, it may be unable to give effective and distinct semantic expressions to a large number of different items. If it has too many, the expressions of items with similar semantics may deviate considerably from one another. A feature dimension range, for example 10 to 15 dimensions, can therefore be set in advance.
The feature dimension of the hidden layer is then set to each value in the range in turn, for example 10, 11, 12, 13, 14 and 15.
S3: For each sample pair, take the sample intelligence knowledge in the sample pair as the input of the convolutional neural network, take the sample semantics in the sample pair as the output of the convolutional neural network, and train the convolutional neural network.
S4: Determine the recall rate and the calculation rate corresponding to each feature dimension of the convolutional neural network.
When the feature dimension of the hidden layer differs, the recall rate and calculation rate of the convolutional neural network trained with the samples obtained in S1 also differ. The calculation rate is the average rate at which the convolutional neural network outputs the semantics of each item of test intelligence knowledge input to it.
S5: Determine a target feature dimension of the hidden layer in the convolutional neural network according to the recall rate and the calculation rate corresponding to each feature dimension, and determine the trained convolutional neural network corresponding to the target feature dimension as the final convolutional neural network.
In an embodiment of the present invention, the hidden layer of the convolutional neural network is later used to compute the intelligence knowledge, so the accuracy of the result and the calculation rate need to be balanced. Among the feature dimensions whose recall rate satisfies a set condition, the convolutional neural network corresponding to the feature dimension with the fastest calculation rate may be determined as the final convolutional neural network. For example, the set condition is a recall rate greater than 99.5%.
Training the convolutional neural network with steps S1 to S5 allows the trained network to meet the required prediction accuracy while maintaining the prediction rate, which improves both the processing rate for unstructured data and the accuracy of the processing result.
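The selection rule in S5 can be sketched as below. This is a hedged illustration rather than the patent's implementation: the `DimensionResult` record, the function `select_target_dimension`, and all recall and rate figures are hypothetical. Only the rule itself (keep the dimensions whose recall exceeds the set condition, then pick the one with the fastest calculation rate) follows the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DimensionResult:
    dimension: int  # hidden-layer feature dimension that was trained (S2, S3)
    recall: float   # recall measured for that network (S4), in [0, 1]
    rate: float     # calculation rate (S4), e.g. items processed per second


def select_target_dimension(
    results: List[DimensionResult], min_recall: float = 0.995
) -> Optional[int]:
    """Return the fastest dimension whose recall exceeds min_recall (S5)."""
    eligible = [r for r in results if r.recall > min_recall]
    if not eligible:
        return None  # no candidate satisfies the recall condition
    return max(eligible, key=lambda r: r.rate).dimension


# Made-up measurements for illustration only.
measurements = [
    DimensionResult(dimension=10, recall=0.990, rate=950.0),
    DimensionResult(dimension=11, recall=0.996, rate=900.0),
    DimensionResult(dimension=12, recall=0.997, rate=860.0),
    DimensionResult(dimension=15, recall=0.998, rate=700.0),
]
target = select_target_dimension(measurements)
```

With these made-up numbers, dimension 10 is the fastest but is excluded by the recall condition, so the choice falls on the fastest of the remaining candidates.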
It should be noted that, in addition to the above training method, a feature dimension may be given to the hidden layer of the convolutional neural network, and then the convolutional neural network is trained, and when a satisfactory recall rate is reached, it is determined that the training of the convolutional neural network is completed.
Step 202, using the hidden layer in the convolutional neural network to compute each item of intelligence knowledge, obtaining a multidimensional feature vector for each item, and determining the obtained multidimensional feature vector as the semantic expression of the corresponding intelligence knowledge.
In an embodiment of the present invention, once the convolutional neural network has been trained, the parameter values of the functions in the hidden layer are fixed. The hidden layer can therefore be used to compute a multidimensional feature vector for the intelligence knowledge, and the dimension of this vector equals the target feature dimension.
The multidimensional feature vector comprises multiple elements, each of which expresses the semantics of the intelligence knowledge as a numerical value, and the element values computed for different items of intelligence knowledge differ. The more elements of the vector have the value 0, the simpler the semantics or the lower the intelligence significance.
By semantically understanding the intelligence knowledge with the method shown in FIG. 2 and expressing its semantics as a multidimensional feature vector, the semantic expression of each item can be recognized and processed by a machine, and the accuracy and speed of semantic understanding can be improved.
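To make the idea of reading a semantic vector off the hidden layer concrete, the sketch below runs a forward pass through a very small text CNN in pure Python. Everything about the architecture is an assumption made for illustration: the patent does not specify the embedding, filter sizes, pooling or activation, and the class name `TinyTextCNN` is hypothetical. The weights are random and untrained, so the output only demonstrates the shape of the result, not meaningful semantics.

```python
import random
from typing import List, Sequence


def _random_matrix(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class TinyTextCNN:
    """Embedding -> 1D convolution + ReLU -> max pooling -> hidden layer."""

    def __init__(self, vocab: Sequence[str], embed_dim: int = 8,
                 num_filters: int = 6, window: int = 2,
                 hidden_dim: int = 12, seed: int = 0) -> None:
        rng = random.Random(seed)  # stand-in for trained parameters
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.window = window
        self.embed_dim = embed_dim
        # The extra last row is used for tokens outside the vocabulary.
        self.embed = _random_matrix(len(self.vocab) + 1, embed_dim, rng)
        self.filters = _random_matrix(num_filters, window * embed_dim, rng)
        self.hidden = _random_matrix(hidden_dim, num_filters, rng)

    def semantic_vector(self, tokens: Sequence[str]) -> List[float]:
        unknown = len(self.vocab)
        rows = [self.embed[self.vocab.get(t, unknown)] for t in tokens]
        while len(rows) < self.window:  # zero-pad very short inputs
            rows.append([0.0] * self.embed_dim)
        pooled = []
        for f in self.filters:
            best = 0.0
            for start in range(len(rows) - self.window + 1):
                patch = [x for row in rows[start:start + self.window] for x in row]
                activation = max(0.0, sum(w * x for w, x in zip(f, patch)))
                best = max(best, activation)  # max-over-time pooling
            pooled.append(best)
        # Hidden-layer activations serve as the semantic expression.
        return [max(0.0, sum(w * x for w, x in zip(h, pooled))) for h in self.hidden]


model = TinyTextCNN(vocab=["phishing", "attack", "tactics"], hidden_dim=12)
vector = model.semantic_vector(["phishing", "attack"])
```

Here `hidden_dim` plays the role of the target feature dimension, so `vector` has 12 elements. In a trained network the same call would be made with the learned parameters in place of the seeded random ones.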
In addition to the semantic understanding of the intelligence knowledge in the manner shown in fig. 2, the intelligence knowledge may be semantically understood by other manners, for example, semantic expression may be performed in the form of a feature matrix, or the intelligence knowledge may be semantically understood by directly using a convolutional neural network and then output the semantics of the intelligence knowledge.
Finally, in step 106, each intelligence knowledge is associated with a known threat intelligence type according to the semantic expression of each intelligence knowledge and the known threat intelligence type.
The granularity of the threat intelligence types can be divided according to different requirements, for example, the threat intelligence types can include attack types, attack flows, attack tactics, attack subjects and the like.
In an embodiment of the present invention, in order to determine the threat intelligence type to which each intelligence knowledge belongs to associate the intelligence knowledge with the corresponding threat intelligence type, please refer to fig. 3, this step 106 can be implemented by one of the following ways:
Step 300, obtaining the semantic expression of each known threat intelligence type.
To determine accurately the threat intelligence type to which an item of intelligence knowledge belongs, the semantic expression of each known threat intelligence type in step 300 can be obtained with the method shown in FIG. 2. The semantic expressions of the intelligence knowledge and of the threat intelligence types are then produced by the hidden layer of the same convolutional neural network, which makes the distance calculated between them more accurate.
When the semantic expression of a known threat intelligence type is obtained, a feature word can be extracted from the type. For example, if the known threat intelligence type is attack tactics, "attack tactics" can be used as the feature word.
Step 302, calculating, according to the semantic expression of each item of intelligence knowledge and the semantic expression of each known threat intelligence type, the distance between each target intelligence knowledge among the intelligence knowledge and each known threat intelligence type.
In one embodiment of the invention, since the semantic expression may be a multidimensional feature vector, a multidimensional feature matrix, or the like, the distance between the target intelligence knowledge and a known threat intelligence type may be determined by calculating the distance between the multidimensional feature vectors or between the multidimensional feature matrices.
For example, if there are X items of intelligence knowledge and Y known threat intelligence types, Y distances are obtained for each item of intelligence knowledge, and X × Y distances are obtained for the X items.
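The X × Y count can be checked with a small sketch. The vectors, the keys and the function name `distance_table` are hypothetical, and the Euclidean distance from Python's standard `math.dist` is used here only as a placeholder metric.

```python
import math
from typing import Dict, List


def distance_table(
    knowledge_vectors: Dict[str, List[float]],
    type_vectors: Dict[str, List[float]],
) -> Dict[str, Dict[str, float]]:
    """Return one distance per (intelligence knowledge, threat type) pair."""
    return {
        item: {name: math.dist(vec, type_vec) for name, type_vec in type_vectors.items()}
        for item, vec in knowledge_vectors.items()
    }


# X = 3 items of intelligence knowledge and Y = 2 known types (made-up vectors).
knowledge = {"k1": [0.0, 0.0], "k2": [3.0, 4.0], "k3": [1.0, 1.0]}
known_types = {"attack type": [0.0, 0.0], "attack tactics": [3.0, 4.0]}
table = distance_table(knowledge, known_types)
total = sum(len(row) for row in table.values())
```

Each row of `table` holds Y distances for one item, so `total` equals X × Y, which is 6 for this input.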
When calculating the distance between two multidimensional feature vectors or two multidimensional feature matrices, the Euclidean distance, the cosine similarity, the Manhattan distance, the Chebyshev distance, and the like can be used.
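For reference, the four measures named above can be written in a few lines each. These are the standard textbook definitions applied to equal-length vectors, not a formula specific to this patent. Returning 0.0 for a zero-length vector in `cosine_similarity` is a convention chosen for this sketch.

```python
import math
from typing import Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p: Sequence[float], q: Sequence[float]) -> float:
    return max(abs(a - b) for a, b in zip(p, q))


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # convention for this sketch: undefined case treated as 0
    return dot / (norm_p * norm_q)
```

Note that cosine similarity is a similarity rather than a distance: parallel vectors such as `[1, 2]` and `[2, 4]` score close to 1 even though their Euclidean distance is not zero.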
In an embodiment of the present invention, the distance between two multidimensional feature vectors can be calculated using cosine similarity. The closer the result is to 1, the greater the correlation between the vectors, and the closer it is to 0, the smaller the correlation. Because cosine similarity focuses mainly on the difference in direction between vectors, deviations easily arise in the calculation. To reduce the deviation, the cosine similarity can be corrected using the Euclidean distance, so that the distance between the intelligence knowledge and a known threat intelligence type is calculated by the following formula:
Figure BDA0003149120620000091
where C characterizes the distance between an m-dimensional vector P and an m-dimensional vector Q, the vectors P and Q respectively characterize the semantic expressions of the intelligence knowledge and of the known threat intelligence type, |P_k| and |Q_k| are the modulus lengths of vector P and vector Q, respectively, and k = 1, 2, …, m.
The calculation formula only relates to parameters of two vectors, numerical values of elements in the vectors can be quickly substituted into the calculation formula, calculation efficiency is improved, and the cosine similarity is corrected by utilizing the Euclidean distance, so that the calculation result is more accurate.
Step 304, according to the distance between the target intelligence knowledge and each known threat intelligence type, determining the target threat intelligence type corresponding to the target intelligence knowledge, and associating the target intelligence knowledge with the target threat intelligence type.
In an embodiment of the present invention, since attack means are continuously changing and new threat intelligence types may exist in the obtained threat intelligence source data, this step 304 may include: determining whether a target distance which is not greater than a set distance threshold exists among the distances between the target intelligence knowledge and the known threat intelligence types; if so, determining the threat intelligence type corresponding to the smallest target distance as the target threat intelligence type; if not, generating a threat intelligence type corresponding to the target intelligence knowledge, and determining the generated threat intelligence type as the target threat intelligence type.
For example, suppose there are 5 known threat intelligence types. For the target intelligence knowledge, a distance to each of the 5 known threat intelligence types can be calculated, and the smaller the distance, the closer the target intelligence knowledge is to that threat intelligence type. However, the distances between the target intelligence knowledge and all known threat intelligence types may be very large, so a distance threshold is set to determine the threat intelligence type to which the target intelligence knowledge belongs. For example, if 3 of the 5 distances are not greater than the distance threshold, the threat intelligence type to which the target intelligence knowledge belongs is a known one, and the threat intelligence type corresponding to the smallest of the 3 target distances is determined as the threat intelligence type to which the target intelligence knowledge belongs. If all 5 distances are greater than the distance threshold, the target intelligence knowledge belongs to a new threat intelligence type, and a threat intelligence type corresponding to the target intelligence knowledge can therefore be generated.
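The threshold decision in the example above can be sketched as a small function. The function name, the dictionary layout and the `new_type_name` argument are hypothetical and chosen only for illustration; the type labels in the usage example are likewise invented.

```python
from typing import Dict, Tuple


def assign_threat_type(
    distances: Dict[str, float],
    threshold: float,
    new_type_name: str,
) -> Tuple[str, bool]:
    """Pick the nearest known type within the threshold, else signal a new type.

    distances maps each known threat intelligence type to its distance from
    the target intelligence knowledge. Returns (type name, is_new_type).
    """
    # Target distances: those not greater than the set distance threshold.
    candidates = {t: d for t, d in distances.items() if d <= threshold}
    if candidates:
        # The known type with the smallest target distance wins.
        nearest = min(candidates, key=candidates.get)
        return nearest, False
    # Every distance exceeds the threshold: treat it as a new type.
    return new_type_name, True


# Usage: 3 of the 5 distances are within the threshold; the smallest is chosen.
example = {"phishing": 0.2, "malware": 0.1, "botnet": 0.3, "ddos": 0.9, "apt": 0.8}
chosen, is_new = assign_threat_type(example, threshold=0.5, new_type_name="type-6")
```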
In an embodiment of the invention, the distance can be further corrected using weights in order to further improve the accuracy of the threat intelligence type to which the intelligence knowledge belongs. This takes into account that a word obtained by word segmentation is composed of several characters, a sentence is composed of words and/or characters, a paragraph is composed of at least one of sentences, words and characters, and a text is composed of at least one of paragraphs, sentences, words and characters. Therefore, when the distance for a word, sentence, paragraph or text is calculated, weights can be used to correct the distance and improve the accuracy of the distance calculation.
The following description takes a sentence as the intelligence knowledge. After word segmentation, the sentence yields a plurality of words and characters, and the distances between the known threat intelligence type and the sentence, each word in the sentence and each character in the sentence can be obtained according to step 302. Corresponding weights can then be assigned to the sentence, each word and each character, where the weights can be assigned in combination with expert knowledge, or assigned automatically according to weight types set by expert knowledge. For example, words are weighted more heavily than sentences and characters.
The distance between the sentence and the known threat intelligence type is then recalculated according to the weights assigned to the sentence and to each word and each character in the sentence, and the distance between the sentence and the known threat intelligence type is determined according to the recalculated distance.
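The text does not state how the weighted distances are combined. One plausible reading, shown here purely as an assumption, is a weighted average over the sentence, its words and its characters; the function name and the sample weights are illustrative.

```python
from typing import Iterable, Tuple


def weighted_distance(components: Iterable[Tuple[float, float]]) -> float:
    """Combine (distance, weight) pairs into one distance by weighted average.

    Assumption: a weighted average is used; the source only says the distance
    is recalculated according to the assigned weights.
    """
    pairs = list(components)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(distance * weight for distance, weight in pairs) / total_weight


# Usage: one sentence, two words, one character, each with a distance to the
# same known threat intelligence type. Words carry the largest weight here.
sentence_part = (0.6, 1.0)
word_parts = [(0.2, 2.0), (0.2, 2.0)]
character_parts = [(0.8, 1.0)]
recalculated = weighted_distance([sentence_part, *word_parts, *character_parts])
```

In this usage example the heavier word weights pull the recalculated distance toward the word distances, which is the effect the weighting is meant to have.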
In an embodiment of the present invention, some of the intelligence knowledge obtained by word segmentation may have no intelligence significance, for example auxiliary words and modal particles, and the distances between intelligence knowledge of this kind and the known threat intelligence types may be very large. To avoid generating a new threat intelligence type for words that have no intelligence significance, before generating the threat intelligence type corresponding to the target intelligence knowledge, the method may further include: determining, according to the semantic expression of the target intelligence knowledge, whether the semantics of the target intelligence knowledge has intelligence significance; if so, generating the threat intelligence type corresponding to the target intelligence knowledge; and if not, deleting the target intelligence knowledge.
When determining whether the semantics of the target intelligence knowledge has intelligence significance, the values of the elements in the multidimensional feature vector corresponding to the semantic expression of the target intelligence knowledge can be examined. For example, an element threshold is set according to the total number of elements in the multidimensional feature vector; if the number of elements whose value is 0 in the multidimensional feature vector corresponding to the target intelligence knowledge is greater than the element threshold, the target intelligence knowledge is determined not to have intelligence significance; otherwise, it is determined to have intelligence significance.
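A minimal sketch of this zero-count check follows. The way the element threshold is derived from the total number of elements (a fixed fraction, `zero_ratio`) is an assumption made for illustration, as are the function and parameter names.

```python
from typing import Sequence


def has_intelligence_significance(
    vector: Sequence[float],
    zero_ratio: float = 0.8,
) -> bool:
    """Return False when too many elements of the feature vector are 0.

    Assumption: the element threshold is zero_ratio times the total number
    of elements; the source only says it is set according to that total.
    """
    element_threshold = zero_ratio * len(vector)
    zero_count = sum(1 for value in vector if value == 0)
    # More zeros than the threshold means no intelligence significance.
    return zero_count <= element_threshold
```

For instance, with `zero_ratio=0.5` a five-element vector containing four zeros is rejected, because 4 is greater than the threshold of 2.5.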
A new threat intelligence type is generated for the target intelligence knowledge only when the target intelligence knowledge has intelligence significance, which ensures the accuracy of the threat intelligence types and improves the effective utilization of threat intelligence.
As shown in FIG. 4 and FIG. 5, an embodiment of the present invention provides a threat intelligence processing apparatus. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. From a hardware aspect, FIG. 4 is a hardware architecture diagram of the computing device in which the threat intelligence processing apparatus according to an embodiment of the present invention is located; in addition to the processor, memory, network interface and nonvolatile memory shown in FIG. 4, the computing device in which the apparatus is located may generally include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, as shown in FIG. 5, the apparatus, as a logical apparatus, is formed by the CPU of the computing device in which it is located reading the corresponding computer program from the nonvolatile memory into the memory and running it. The threat intelligence processing apparatus provided by the embodiment includes:
a data obtaining unit 501, configured to obtain unstructured data in threat intelligence source data;
a word segmentation processing unit 502, configured to perform word segmentation processing on the unstructured data to obtain a plurality of intelligence knowledge;
a semantic understanding unit 503, configured to perform semantic understanding on each intelligence knowledge by using a natural language processing technology to obtain a semantic expression of each intelligence knowledge;
an intelligence type association unit 504, configured to associate each intelligence knowledge with a known threat intelligence type according to the semantic expression of each intelligence knowledge and the known threat intelligence type.
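The four units listed above can be pictured as stages of one pipeline. The sketch below is a hypothetical arrangement only: the class name, field names and the toy stand-ins in the usage example are assumptions, and the real units would perform word segmentation, neural semantic understanding and distance-based association as described in the method embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class ThreatIntelligencePipeline:
    """Chains four callables that stand in for units 501 to 504."""

    obtain: Callable[[], str]            # data obtaining unit 501
    segment: Callable[[str], List[str]]  # word segmentation processing unit 502
    embed: Callable[[str], Vector]       # semantic understanding unit 503
    associate: Callable[[Vector], str]   # intelligence type association unit 504

    def run(self) -> Dict[str, str]:
        """Map each piece of intelligence knowledge to a threat intelligence type."""
        unstructured = self.obtain()
        pieces = self.segment(unstructured)
        return {piece: self.associate(self.embed(piece)) for piece in pieces}


# Usage with toy stand-ins (not real segmentation or semantic models).
pipeline = ThreatIntelligencePipeline(
    obtain=lambda: "phishing email",
    segment=str.split,
    embed=lambda piece: [float(len(piece))],
    associate=lambda vec: "long" if vec[0] > 5 else "short",
)
result = pipeline.run()
```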
In one embodiment of the invention, the intelligence knowledge comprises: at least one of a word, a sentence, a paragraph, and a text.
In an embodiment of the present invention, the semantic understanding unit 503 is specifically configured to determine a convolutional neural network trained in advance; the convolutional neural network is obtained by training by using sample intelligence knowledge and corresponding sample semantics as sample pairs; and respectively calculating each intelligence knowledge by utilizing a hidden layer in the convolutional neural network to obtain a multidimensional characteristic vector corresponding to each intelligence knowledge, and determining the obtained multidimensional characteristic vector as semantic expression of the corresponding intelligence knowledge.
In one embodiment of the invention, the convolutional neural network is trained using:
obtaining a plurality of sample pairs, wherein each sample pair comprises sample intelligence knowledge and corresponding sample semantics;
according to a set characteristic dimension range, respectively determining each characteristic dimension belonging to the set characteristic dimension range as the characteristic dimension of a hidden layer in the convolutional neural network, and executing the next step after each characteristic dimension of the hidden layer in the convolutional neural network is determined;
for each sample pair, taking sample intelligence knowledge in the sample pair as the input of the convolutional neural network, taking sample semantics in the sample pair as the output of the convolutional neural network, and training the convolutional neural network;
determining recall rate and calculation rate respectively corresponding to each characteristic dimension corresponding to the convolutional neural network;
and determining a target feature dimension of a hidden layer in the convolutional neural network according to the recall rate and the calculation rate respectively corresponding to each feature dimension, and determining the convolutional neural network which is trained and corresponds to the target feature dimension as a final convolutional neural network.
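The final selection step can be sketched as follows. The source does not say how recall rate and calculation rate are traded off; the rule used here (the fastest dimension whose recall reaches a minimum, otherwise the dimension with the highest recall) is an assumption, and all names and sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DimensionResult:
    """Measured quality of one trained network with a given hidden-layer dimension."""

    dimension: int
    recall: float  # recall rate on held-out samples
    rate: float    # calculation rate, e.g. samples processed per second


def select_target_dimension(results: Sequence[DimensionResult], min_recall: float) -> int:
    """Choose the target feature dimension from per-dimension measurements."""
    if not results:
        raise ValueError("no candidate feature dimensions were evaluated")
    eligible = [r for r in results if r.recall >= min_recall]
    if eligible:
        # Among dimensions that are accurate enough, prefer the fastest.
        return max(eligible, key=lambda r: r.rate).dimension
    # Nothing meets the recall floor: fall back to the most accurate.
    return max(results, key=lambda r: r.recall).dimension


# Usage with made-up measurements for three candidate dimensions.
measurements = [
    DimensionResult(dimension=64, recall=0.80, rate=1000.0),
    DimensionResult(dimension=128, recall=0.91, rate=700.0),
    DimensionResult(dimension=256, recall=0.93, rate=300.0),
]
target = select_target_dimension(measurements, min_recall=0.90)
```

The network trained with the returned dimension would then be kept as the final convolutional neural network.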
In an embodiment of the present invention, the intelligence type associating unit 504 is specifically configured to obtain semantic expressions of each known threat intelligence type; calculating the distance between each target intelligence knowledge and each known threat intelligence type in each intelligence knowledge according to the semantic expression of each intelligence knowledge and the semantic expression of each known threat intelligence type; and determining a target threat intelligence type corresponding to the target intelligence knowledge according to the distance between the target intelligence knowledge and each known threat intelligence type, and associating the target intelligence knowledge to the target threat intelligence type.
In an embodiment of the present invention, the intelligence type associating unit 504, when determining the target threat intelligence type corresponding to the target intelligence knowledge according to the distance between the target intelligence knowledge and each known threat intelligence type, is specifically configured to determine whether a target distance not greater than a set distance threshold exists among the distances between the target intelligence knowledge and the known threat intelligence types; if so, to determine the threat intelligence type corresponding to the smallest target distance as the target threat intelligence type; and if not, to generate a threat intelligence type corresponding to the target intelligence knowledge and determine the generated threat intelligence type as the target threat intelligence type.
In an embodiment of the present invention, the intelligence type associating unit 504 is further configured to determine whether the semantics of the target intelligence knowledge has intelligence significance according to the semantic expression of the target intelligence knowledge, if so, perform the generating of the threat intelligence type corresponding to the target intelligence knowledge, and if not, delete the target intelligence knowledge.
It is to be understood that the illustrated configuration of the embodiments of the present invention is not to be construed as a specific limitation on a threat intelligence processing apparatus. In other embodiments of the invention, a threat intelligence processing apparatus may include more or fewer components than shown, or some components may be combined, some components may be separated, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the content of information interaction, execution process, and the like among the modules in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.
The embodiment of the invention also provides a computing device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the threat intelligence processing method in any embodiment of the invention.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, causes the processor to execute a threat intelligence processing method according to any of the embodiments of the present invention.
Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion module connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion module to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
The embodiments of the invention have at least the following beneficial effects:
1. In one embodiment of the invention, word segmentation is used to obtain a plurality of intelligence knowledge from unstructured data. At this point the intelligence knowledge is only the original characters of the unstructured data, so a natural language processing technology is used to perform semantic understanding on it and obtain its semantic expression. With the semantic expression, the intelligence knowledge can be identified and processed by a machine, and the intelligence knowledge can therefore be associated with threat intelligence types according to the semantic expression. Because each intelligence knowledge is analyzed, the amount of threat intelligence identified can be increased and the threat intelligence source data can be used effectively.
2. In one embodiment of the invention, characters, words, sentences, segments and texts obtained after word segmentation processing is carried out on unstructured data are used as intelligence knowledge, and then each intelligence knowledge is screened respectively, so that the probability of missing the intelligence knowledge can be reduced, and the effective utilization of unstructured data in threat intelligence source data can be improved.
3. In one embodiment of the invention, a suitable feature dimension is selected for the hidden layer of the convolutional neural network in order to choose the final trained convolutional neural network, so that the trained network satisfies the prediction accuracy while maintaining the prediction rate, thereby improving both the rate at which unstructured data is processed and the accuracy of the processing result.
4. In one embodiment of the invention, the hidden layer in the convolutional neural network is used for carrying out semantic understanding on the intelligence knowledge, and the semantics of the intelligence knowledge is expressed in a multi-dimensional characteristic vector mode, so that not only can the semantic expression of each intelligence knowledge be identified and processed by a machine, but also the accuracy and the understanding rate of the semantic understanding of the intelligence knowledge can be improved.
5. In one embodiment of the invention, when the distance between the intelligence knowledge and the known threat intelligence type is calculated, cosine similarity is adopted for distance calculation, and Euclidean distance is utilized for correcting the calculation result, so that the accuracy of the calculation result can be improved.
6. In one embodiment of the invention, intelligence knowledge without intelligence significance is deleted, so that new threat intelligence types are generated only for intelligence knowledge that has intelligence significance. This avoids generating new threat intelligence types for words without intelligence significance, ensures the accuracy of the threat intelligence types, and improves the effective utilization of threat intelligence.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other similar elements in a process, method, article, or apparatus that comprises the element.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A threat intelligence processing method, comprising:
acquiring unstructured data in threat intelligence source data;
performing word segmentation processing on the unstructured data to obtain a plurality of intelligence knowledge;
semantic understanding is carried out on each intelligence knowledge by utilizing a natural language processing technology to obtain semantic expression of each intelligence knowledge;
and associating each intelligence knowledge with the known threat intelligence type respectively according to the semantic expression of each intelligence knowledge and the known threat intelligence type.
2. The method of claim 1, wherein the intelligence knowledge comprises: at least one of a word, a sentence, a paragraph, and a text.
3. The method of claim 1, wherein the semantic understanding of each intelligence knowledge using natural language processing techniques to obtain the semantic representation of each intelligence knowledge comprises:
determining a convolutional neural network which is trained in advance; the convolutional neural network is obtained by training by using sample intelligence knowledge and corresponding sample semantics as sample pairs;
and respectively calculating each intelligence knowledge by utilizing a hidden layer in the convolutional neural network to obtain a multidimensional characteristic vector corresponding to each intelligence knowledge, and determining the obtained multidimensional characteristic vector as semantic expression of the corresponding intelligence knowledge.
4. The method of claim 3, wherein the convolutional neural network is trained using:
obtaining a plurality of sample pairs, wherein each sample pair comprises sample intelligence knowledge and corresponding sample semantics;
according to a set characteristic dimension range, respectively determining each characteristic dimension belonging to the set characteristic dimension range as the characteristic dimension of a hidden layer in the convolutional neural network, and executing the next step after each characteristic dimension of the hidden layer in the convolutional neural network is determined;
for each sample pair, taking sample intelligence knowledge in the sample pair as the input of the convolutional neural network, taking sample semantics in the sample pair as the output of the convolutional neural network, and training the convolutional neural network;
determining recall rate and calculation rate respectively corresponding to each characteristic dimension corresponding to the convolutional neural network;
and determining a target feature dimension of a hidden layer in the convolutional neural network according to the recall rate and the calculation rate respectively corresponding to each feature dimension, and determining the convolutional neural network which is trained and corresponds to the target feature dimension as a final convolutional neural network.
5. The method of claim 1, wherein associating each intelligence knowledge with a known threat intelligence type based on the semantic representation of each intelligence knowledge and the known threat intelligence type comprises:
respectively obtaining semantic expression of each known threat intelligence type;
calculating the distance between each target intelligence knowledge and each known threat intelligence type in each intelligence knowledge according to the semantic expression of each intelligence knowledge and the semantic expression of each known threat intelligence type;
and determining a target threat intelligence type corresponding to the target intelligence knowledge according to the distance between the target intelligence knowledge and each known threat intelligence type, and associating the target intelligence knowledge to the target threat intelligence type.
6. The method of claim 5, wherein determining a target threat intelligence type corresponding to the target intelligence knowledge based on a distance between the target intelligence knowledge and each known threat intelligence type comprises:
determining whether a target distance which is not greater than a set distance threshold exists in the distances between the target intelligence knowledge and the known threat intelligence types; if so, determining the threat intelligence type corresponding to the target distance with the minimum distance as the target threat intelligence type; and if not, generating a threat intelligence type corresponding to the target intelligence knowledge, and determining the generated threat intelligence type as the target threat intelligence type.
7. The method of claim 6, further comprising, prior to said generating threat intelligence types corresponding to said target intelligence knowledge:
and determining whether the semantics of the target intelligence knowledge has intelligence significance according to the semantic expression of the target intelligence knowledge, if so, executing the generation of the threat intelligence type corresponding to the target intelligence knowledge, and if not, deleting the target intelligence knowledge.
8. A threat intelligence processing apparatus, comprising:
the data acquisition unit is used for acquiring unstructured data in threat intelligence source data;
the word segmentation processing unit is used for carrying out word segmentation processing on the unstructured data to obtain a plurality of intelligence knowledge;
the semantic understanding unit is used for carrying out semantic understanding on each intelligence knowledge by utilizing a natural language processing technology to obtain semantic expression of each intelligence knowledge;
and the intelligence type association unit is used for associating each intelligence knowledge with the known threat intelligence type according to the semantic expression of each intelligence knowledge and the known threat intelligence type.
9. A computing device comprising a memory having stored therein a computer program and a processor that, when executing the computer program, implements the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.
CN202110761075.1A 2021-07-06 2021-07-06 Threat information processing method, device, computing equipment and storage medium Pending CN113420127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110761075.1A CN113420127A (en) 2021-07-06 2021-07-06 Threat information processing method, device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110761075.1A CN113420127A (en) 2021-07-06 2021-07-06 Threat information processing method, device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113420127A true CN113420127A (en) 2021-09-21

Family

ID=77720314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110761075.1A Pending CN113420127A (en) 2021-07-06 2021-07-06 Threat information processing method, device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113420127A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662576A (en) * 2023-07-26 2023-08-29 北京天云海数技术有限公司 Association method and association system for security vulnerabilities and laws and regulations

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902297A (en) * 2019-02-13 2019-06-18 北京航空航天大学 A kind of threat information generation method and device
CN111552855A (en) * 2020-04-30 2020-08-18 北京邮电大学 Network threat information automatic extraction method based on deep learning
CN111581355A (en) * 2020-05-13 2020-08-25 杭州安恒信息技术股份有限公司 Method, device and computer storage medium for detecting subject of threat intelligence
WO2021017614A1 (en) * 2019-07-31 2021-02-04 平安科技(深圳)有限公司 Threat intelligence data collection and processing method and system, apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination