CN117743925A - Threat report classification method, threat report classification device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN117743925A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311755761.3A
Other languages
Chinese (zh)
Inventor
刘微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202311755761.3A priority Critical patent/CN117743925A/en
Publication of CN117743925A publication Critical patent/CN117743925A/en


Abstract

The application provides a threat report classification method, a threat report classification apparatus, an electronic device, and a storage medium. The method comprises: extracting features from a threat report to be classified to obtain word embeddings of the words contained in each sentence of the threat report; determining a first information word of each sentence, and determining a first context word according to the first information word, wherein the first information word represents the topic of the corresponding sentence; and obtaining the threat category of each paragraph in the threat report according to a first average word embedding of the first context words of that paragraph. By determining the information words and context words in the threat report and classifying each paragraph based on its context words, the embodiments of the application make the classification more fine-grained and improve classification accuracy.

Description

Threat report classification method, threat report classification device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a threat report classification method, apparatus, electronic device, and storage medium.
Background
In recent years, the number of network attacks, including Advanced Persistent Threats (APTs), has increased rapidly. Most APT attacks can evade detection and use complex intrusion routes to carry out potentially destructive long-term attack campaigns. To address this problem, many security operators, engineers, and researchers have turned their attention to threat intelligence, which involves collecting vulnerability and threat information and analyzing and organizing it for easy access. With threat intelligence, defenders hope to predict future attacks from existing ones and to estimate the related actions across different attacks. This requires comprehensive analysis of many pieces of threat intelligence. To raise network security awareness, organizations often share attack analysis information in the form of security reports. For example, the MITRE ATT&CK lifecycle framework describes attacks in terms of tactics, techniques, and procedures (TTPs). To cope with the growing number of network attacks, defenders need to understand threats and the corresponding risks accurately and in a timely manner.
Existing threat classification methods generally extract features from threat intelligence and classify it based on the extracted features; such methods suffer from low classification accuracy.
Disclosure of Invention
An embodiment of the application aims to provide a threat report classification method, a threat report classification device, electronic equipment and a storage medium, which are used for improving the accuracy of threat report classification.
In a first aspect, an embodiment of the present application provides a threat report classification method, including:
extracting features of threat reports to be classified to obtain word embedding of words contained in each sentence in the threat reports;
determining a first information word of each sentence, and determining a first context word according to the first information word; the first information word is used for representing the theme of the corresponding sentence;
and obtaining the threat category of the corresponding paragraph according to the first average word embedding of the first context word of each paragraph in the threat report.
By determining the information words and context words in the threat report and classifying each paragraph of the report based on its context words, the embodiments of the application make the classification more fine-grained and improve classification accuracy.
In any embodiment, feature extraction is performed on the threat report to obtain word embedding of words contained in each sentence in the threat report, including:
parsing the threat report to obtain the sentences contained in the threat report;
and extracting features of the input sentences by using the BERT model to obtain word embedding of the words in each sentence.
According to the embodiment of the application, the feature extraction is carried out on the threat report by utilizing the BERT model, so that the features of words contained in each sentence in the threat report are obtained, and a basis is provided for the subsequent classification of the threat report.
In any embodiment, determining the first information word of each sentence includes:
querying the score of each word in each sentence, wherein the score of a word is calculated in advance from training threat reports;
the first information word of each sentence is determined based on the score.
In the embodiments of the application, the first information word of each sentence is determined according to the score of each word. Because the first information word is the word that best expresses the topic of the sentence, using it in the subsequent classification process reduces the amount of computation and improves classification accuracy.
In any embodiment, determining the first context word according to the first information word includes:
for each sentence, if the sentence contains a first information word, the first information word is used as a first context word of the sentence;
and if the sentence does not contain a first information word, the first information word of the preceding sentence is used as the first context word of the sentence.
A sentence may contain no information word. So that every sentence has a corresponding word expressing its topic, and because two adjacent sentences describe the same topic with high probability, the information word of the preceding sentence can be used as the context word of the current sentence. In this way, each sentence has a corresponding context word.
In any embodiment, the method further comprises:
acquiring a training sample, wherein the training sample comprises a plurality of training threat reports and tactical classification of each paragraph corresponding to each training threat report;
extracting features of the training threat reports to obtain word embedding of words corresponding to each sentence in each training threat report;
according to the training threat reports, calculating a score corresponding to each word;
determining a second information word of each sentence according to the score;
determining a second context word according to the second information word of each sentence;
calculating a second average word embedding for the second context word of each paragraph;
training the classifier according to the second average word embedding and tactical classification of the paragraph to obtain a trained classifier.
According to the embodiment of the application, the threat report is classified through the trained classifier, the threat types of each paragraph of the threat report can be analyzed, and the accuracy of threat report classification is improved.
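The final training step can be illustrated with a minimal sketch. The text does not say which kind of classifier is trained, so the nearest-centroid classifier below is only a stand-in chosen for brevity; the class name `NearestCentroidClassifier` and its methods are hypothetical, and the input vectors stand for the second average word embeddings of labeled paragraphs.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


class NearestCentroidClassifier:
    """Stand-in classifier (an assumption, not the classifier of the application)."""

    def __init__(self) -> None:
        self.centroids: Dict[str, List[float]] = {}

    def fit(self, embeddings: Sequence[Sequence[float]], labels: Sequence[str]):
        # Accumulate the sum of the paragraph embeddings for each tactical category.
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = defaultdict(int)
        for vec, label in zip(embeddings, labels):
            if label not in sums:
                sums[label] = [0.0] * len(vec)
            sums[label] = [a + b for a, b in zip(sums[label], vec)]
            counts[label] += 1
        # The centroid of a category is the mean of its paragraph embeddings.
        self.centroids = {
            label: [x / counts[label] for x in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, vec: Sequence[float]) -> str:
        # Return the category whose centroid is closest in Euclidean distance.
        return min(
            self.centroids,
            key=lambda label: math.dist(self.centroids[label], vec),
        )
```

In practice any classifier that maps a fixed-length vector to a tactical category could take this place.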
In any embodiment, obtaining the threat category of each paragraph according to the first average word embedding of the first context word of the corresponding paragraph in the threat report includes:
inputting the first average word embedding of the first context words into a classifier to obtain the threat category of the corresponding paragraph output by the classifier.
According to the embodiment of the application, the first average word embedding of the first context words for representing the characteristics of each paragraph of the threat report is analyzed through the classifier, so that the accuracy of determining the threat category of each paragraph of the threat report is improved.
In any embodiment, calculating the score for each word from the plurality of training threat reports includes:
calculating the score corresponding to each word according to a scoring formula [formula not reproduced in the source text];
wherein score(w) is the score of word w; p(w) is the probability of word w occurring in the plurality of training threat reports; p(w|k) is the probability of word w occurring in all training threat reports describing tactical category k; and K is the set of tactical categories.
In the embodiments of the application, the score of each word is calculated in advance from a plurality of training threat reports, so that the scores can be used directly when a threat report is actually classified; scores calculated from a plurality of training threat reports are also more reasonable.
In a second aspect, embodiments of the present application provide a threat report classification apparatus, including:
the feature extraction module is used for extracting features of the threat reports to be classified to obtain word embedding of words contained in each sentence in the threat reports;
the word determining module is used for determining a first information word of each sentence and determining a first context word according to the first information word; the first information word is used for representing the theme of the corresponding sentence;
and the classification module is used for obtaining the threat category of the corresponding paragraph according to the first average word embedding of the first context word of each paragraph in the threat report.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory complete communication with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer readable storage medium comprising:
the non-transitory computer-readable storage medium stores computer instructions that cause the computer to perform the method of the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a threat report classification method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a classifier training method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a threat report classification apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the technical solutions of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical solutions of the present application, and thus are only examples, and are not intended to limit the scope of protection of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions.
In the description of the embodiments of the present application, the technical terms "first," "second," etc. are used merely to distinguish between different objects and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated, a particular order or a primary or secondary relationship. In the description of the embodiments of the present application, the meaning of "plurality" is two or more unless explicitly defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In the description of the embodiments of the present application, the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
In the description of the embodiments of the present application, the term "plurality" refers to two or more (including two), and similarly, "plural sets" refers to two or more (including two), and "plural sheets" refers to two or more (including two).
In the description of the embodiments of the present application, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured" and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally formed; or may be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the embodiments of the present application will be understood by those of ordinary skill in the art according to the specific circumstances.
Currently, there are many methods for identifying the type of threat intelligence. For example, document features and information security element features of the threat intelligence are extracted, and the threat type is classified through information security element extraction, construction of relationships between information security elements, feature engineering, neural network models, and the like. However, the prior art does not classify the attack intention of threat intelligence. The embodiments of the application provide a threat report classification method that extracts features from the threat report to be classified, obtains the first context word of each sentence, and classifies the threat category of each paragraph in the threat report based on the average word embedding of the first context words, thereby improving the accuracy of threat report classification.
It can be understood that the model training method and the threat report classification method provided by the embodiments of the application can be applied to electronic devices, which include terminals and servers. A terminal may be a smartphone, a tablet computer, a personal digital assistant (PDA), or the like; a server may be an application server or a Web server. In addition, the model training method and the threat report classification method may be executed by the same terminal device or by different terminal devices.
For ease of understanding, the application scenarios of the model training method and the threat report classification method provided by the embodiments of the application are described below, taking a terminal device as the execution body as an example.
FIG. 1 is a schematic flowchart of a threat report classification method according to an embodiment of the application. As shown in FIG. 1, the method includes:
Step 101: Extract features from the threat report to be classified to obtain word embeddings of the words contained in each sentence of the threat report.
The threat report to be classified may be a document, crawled from the network, whose attack intention is to be determined; for example, it may be a cyber threat intelligence (CTI) report or other text, which is not specifically limited in the embodiments of the application.
Feature extraction is performed on each sentence in the threat report to be classified to obtain the word embeddings of the words contained in each sentence. It is understood that a word embedding may be represented as a vector.
Step 102: Determine a first information word of each sentence, and determine a first context word according to the first information word, wherein the first information word represents the topic of the corresponding sentence.
In a specific implementation, the first information word is a word that can represent the topic of the sentence in which it appears, that is, a word with a distinct meaning. In the embodiments of the application, scores are determined in advance for the words that may appear in threat reports; for each word of each sentence in the threat report, its score is queried, and the first information word is then determined according to the scores. The higher the score, the better the corresponding word represents the topic of the sentence. After the first information word of a sentence is obtained, it is used as the first context word of that sentence. In practice, it may happen that no score can be found for any word in a sentence, which indicates that the sentence has no first information word. In this case, it is considered that two adjacent sentences describe the same topic with high probability; for example, if the topic of the current sentence is command and control, the next sentence is likely to discuss command and control as well. Therefore, the first context word of the immediately preceding sentence can be used as the first context word of the sentence.
Thus, the first context word of each sentence can also be understood as a word that characterizes the topic of the corresponding sentence.
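The selection and fallback rule of Step 102 can be sketched as follows. This is a minimal illustration under stated assumptions: the text only says the information word is determined "according to the score", so picking the single highest-scoring word is an assumption, and the function names `pick_information_word` and `context_words` are hypothetical.

```python
from typing import Dict, List, Optional


def pick_information_word(sentence: List[str], scores: Dict[str, float]) -> Optional[str]:
    """Return the highest-scoring word of the sentence, or None if no word has a score.

    Choosing the single top-scoring word is an assumption of this sketch.
    """
    scored = [word for word in sentence if word in scores]
    if not scored:
        return None
    return max(scored, key=lambda word: scores[word])


def context_words(sentences: List[List[str]], scores: Dict[str, float]) -> List[Optional[str]]:
    """Assign one context word to each sentence, in report order."""
    result: List[Optional[str]] = []
    previous: Optional[str] = None
    for sentence in sentences:
        info = pick_information_word(sentence, scores)
        # Fall back to the preceding sentence's context word when the
        # sentence has no information word of its own.
        current = info if info is not None else previous
        result.append(current)
        previous = current
    return result
```

A first sentence without any scored word has no predecessor to fall back on, so this sketch leaves its context word as `None`.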
Step 103: Obtain the threat category of each paragraph in the threat report according to the first average word embedding of the first context words of that paragraph.
The threat report consists of at least one paragraph, and each paragraph contains at least one sentence. After the first context word of each sentence in the threat report is obtained, the set of first context words of the sentences contained in a paragraph is taken as the first context words of that paragraph. The word embeddings of the first context words of each paragraph are averaged to obtain the first average word embedding. The first average word embedding is input into a pre-trained classifier to obtain the threat category of the corresponding paragraph output by the classifier.
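The averaging in Step 103 amounts to an element-wise mean over the embeddings of a paragraph's context words. The sketch below assumes each embedding is a plain list of floats of equal length; the function name `average_embedding` is hypothetical.

```python
from typing import List, Sequence


def average_embedding(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the context-word embeddings of one paragraph."""
    if not vectors:
        raise ValueError("paragraph has no context-word embeddings")
    dim = len(vectors[0])
    if any(len(vec) != dim for vec in vectors):
        raise ValueError("all embeddings must have the same dimension")
    count = len(vectors)
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]
```

The resulting vector has the same dimension as a single word embedding, so paragraphs of any length map to a fixed-size classifier input.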
It will be appreciated that the union of the threat categories of the individual paragraphs may serve as the threat category of the threat report as a whole. In addition, the execution order of the steps may be adjusted according to actual requirements.
By determining the information words and context words in the threat report and classifying each paragraph of the report based on its context words, the embodiments of the application make the classification more fine-grained and improve classification accuracy.
On the basis of the foregoing embodiment, the feature extraction of the threat report to obtain word embedding of words contained in each sentence in the threat report includes:
parsing the threat report to obtain the sentences contained in the threat report;
and extracting features of the input sentences by using the BERT model to obtain word embedding of the words in each sentence.
In a specific implementation, because the threat report is an article, it may be split into sentences to facilitate subsequent analysis, yielding the sentences contained in the threat report. Sentence splitting may be performed according to punctuation marks, for example periods and semicolons. Sentences with more than three words are then selected from the resulting sentences, and longer sentences may be divided into several shorter ones so that not too many words are truncated by the maximum input length during subsequent processing. Note that the word-count threshold in the selection rule may be set according to the actual situation; for example, it may be set to select sentences of more than four words. In addition, a sentence exceeding a preset number of words may be regarded as a longer sentence; for example, sentences of more than 10 words.
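The splitting and filtering described above might look like the following sketch. The thresholds 3 and 10 come from the examples in the text; how a longer sentence is divided is not specified, so cutting it into fixed-size chunks is an assumption, and the function name `split_report` is hypothetical.

```python
import re
from typing import List

MIN_WORDS = 3   # keep only sentences with more than this many words
MAX_WORDS = 10  # sentences longer than this are divided into shorter ones


def split_report(text: str, min_words: int = MIN_WORDS,
                 max_words: int = MAX_WORDS) -> List[List[str]]:
    """Split a report into word lists, one per retained (short) sentence."""
    sentences: List[List[str]] = []
    # Break on periods, semicolons, and similar end-of-sentence punctuation.
    for raw in re.split(r"[.;!?]+", text):
        words = raw.split()
        if len(words) <= min_words:
            continue  # too short to carry a topic; dropped
        # Assumption: longer sentences are cut into chunks of max_words words.
        for start in range(0, len(words), max_words):
            sentences.append(words[start:start + max_words])
    return sentences
```

Splitting on every period is a simplification: tokens such as IP addresses or file names with extensions would be broken apart, so a real pipeline would likely need a more careful sentence splitter.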
After the sentences are obtained, so that each subsequent module can recognize each sentence, a head identifier, for example a [CLS] tag, may be added at the first position of the input sequence, and a tail identifier, for example a [SEP] tag, may be added at the end of each sentence. The sentences are separated by [SEP], yielding individual tagged sentences. Note that other identifiers may also be used as the head and tail identifiers, as long as sentences can be distinguished from one another, which is not specifically limited in the embodiments of the application.
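The tagging scheme can be sketched in a few lines. The function name `tag_sentences` is hypothetical, and the sketch works on already tokenized sentences rather than on the input format of any particular BERT implementation.

```python
from typing import List


def tag_sentences(sentences: List[List[str]]) -> List[str]:
    """Build one token sequence: [CLS] first, then each sentence followed by [SEP]."""
    tokens = ["[CLS]"]  # head identifier at the first position of the input sequence
    for sentence in sentences:
        tokens.extend(sentence)
        tokens.append("[SEP]")  # tail identifier closing each sentence
    return tokens
```

For example, `tag_sentences([["a", "b"], ["c"]])` returns `["[CLS]", "a", "b", "[SEP]", "c", "[SEP]"]`.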
A sentence sequence is generated according to the positions of the sentences in the threat report, and the text processing module processes the sentence sequence to obtain representation vectors of the tagged sentences.
The sentence sequence can be expressed as an ordered sequence of sentences x_1, x_2, ..., x_n, where x_i is the i-th sentence in the text.
Inputting the sentence sequence into a BERT model, and extracting features of the input sentences by the BERT model to obtain word embedding of words contained in each sentence.
According to the embodiment of the application, the feature extraction is carried out on the threat report by utilizing the BERT model, so that the features of words contained in each sentence in the threat report are obtained, and a basis is provided for the subsequent classification of the threat report.
On the basis of the above embodiment, determining the first information word of each sentence includes:
querying the score of each word in each sentence; the score of the word is obtained by calculation according to the training threat report;
the first information word of each sentence is determined based on the score.
In a specific implementation, a plurality of training threat reports are obtained, and the score corresponding to each word contained in the training threat reports is calculated according to a scoring formula [formula not reproduced in the source text],
where score(w) is the score of word w; p(w) is the probability of word w occurring in the plurality of training threat reports, which can be obtained by dividing the number of occurrences of word w in the training threat reports by the total number of words; p(w|k) is the probability of word w occurring in all training threat reports describing tactical category k, which can be obtained by dividing the number of occurrences of word w in the training threat reports describing k by the total number of words in those reports; and K is the set of tactical categories.
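The two probabilities that feed the score can be computed by simple counting, as in the sketch below. The formula that combines them into score(w) does not appear in this text, so the sketch stops at p(w) and p(w|k). It assumes that the "total vocabulary" means the total number of word occurrences and that each training unit arrives as a hypothetical `(words, tactic)` pair; the function name `word_probabilities` is also hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def word_probabilities(
    labeled_texts: List[Tuple[List[str], str]],
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Return (p_w, p_w_given_k) estimated from labeled training text.

    p_w[w]            = occurrences of w in all texts / total word count
    p_w_given_k[k][w] = occurrences of w in texts labeled k / word count of those texts
    """
    overall: Counter = Counter()
    per_tactic: Dict[str, Counter] = {}
    for words, tactic in labeled_texts:
        overall.update(words)
        per_tactic.setdefault(tactic, Counter()).update(words)

    total = sum(overall.values())
    p_w = {word: count / total for word, count in overall.items()}

    p_w_given_k: Dict[str, Dict[str, float]] = {}
    for tactic, counts in per_tactic.items():
        tactic_total = sum(counts.values())
        p_w_given_k[tactic] = {
            word: count / tactic_total for word, count in counts.items()
        }
    return p_w, p_w_given_k
```

A word that is much more frequent under one tactical category than overall, such as a hypothetical "c2" under command and control, would have p(w|k) well above p(w), which matches the stated role of information words.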
The set of tactical categories comprises: reconnaissance, resource development, initial access, execution, persistence, privilege escalation, defense evasion, credential access, discovery, lateral movement, collection, command and control, exfiltration, and impact.
Reconnaissance: An attacker gathers useful information before intruding into an enterprise in order to plan future actions. Reconnaissance involves the attacker actively or passively collecting information that can be used to target the attack. Such information may include details of the victim organization, its infrastructure, or its employees. The attacker may also use this information to assist other stages of the attack lifecycle, such as using the collected information to plan and perform initial access, to determine the scope of action and target priorities after the intrusion, or to facilitate further reconnaissance.
Resource development: An attacker establishes resources to support future operations. Resource development involves the attacker creating, purchasing, or stealing resources that can be used to target the attack. Such resources include infrastructure, accounts, or capabilities. The attacker may also use these resources in other stages of the attack lifecycle, such as using purchased domain names for command and control, phishing from email accounts for initial access, or stealing code-signing certificates for defense evasion.
Initial access: In general, initial access means that an attacker establishes a foothold in the enterprise environment. Based on the information collected before the intrusion, the attacker can use different techniques to achieve initial access. For example, the attacker may attack with a spear-phishing attachment. The attachment may exploit some type of vulnerability to achieve this level of access, for example through PowerShell or other scripting techniques. If execution succeeds, the attacker can employ other tactics and techniques to achieve the final objective.
Execution: Of all the tactics adopted by attackers, execution is the most widely used. Attackers choose this tactic when using off-the-shelf malware, ransomware, or APT attacks. For malware to take effect, it must run, which gives defenders an opportunity to block or detect it. However, not all malicious executables can be easily found with antivirus software. In addition, command-line interfaces and PowerShell are very useful to attackers; many fileless malware families use one or both of these techniques.
Persistence: Once an attacker has achieved persistent access, the computer can be reinfected or its existing connection maintained even if operations personnel take measures such as restarting the machine or changing credentials. For example, registry run keys and startup folders are the most commonly used techniques; they are executed each time the computer starts. An attacker may likewise achieve persistence when a commonly used application such as a Web browser or Microsoft Office is launched. Of all ATT&CK tactics, persistence is among those that deserve the most attention.
And (3) rights promotion: ATT & CK proposes that "should focus on preventing attack tools from running in the early stages of the active chain and re-focus on identifying subsequent malicious behavior". This means that a deep defense needs to be exploited to prevent infection with viruses, such as the peripheral defense system of the terminal or the application whitelist. For rights promotion beyond the ATT & CK range, the prevention is to use a reinforcement baseline on the terminal. Another approach to deal with rights promotion is audit log records. When attackers employ some of the techniques in rights promotion, they often leave spider silk trails, exposing their purpose. Particularly for the log of the host side, all operation and maintenance commands of the server need to be recorded so as to facilitate evidence collection and real-time audit.
Defense circumvention: refers to techniques used by an attacker to avoid being discovered by defenses throughout the attack. Techniques to defend against bypass use include offloading/disabling security software or obfuscating/encrypting data and scripts. An attacker can also exploit and misuse trusted processes to hide and disguise malware. One interesting point of this tactic is that some malware (e.g., luxury software) is not of any concern for defensive bypassing. Their only goal is to execute once on the device and then be discovered as soon as possible. Some technologies may trick anti-virus (AV) products into failing to detect them at all or bypass the application of whitelisting techniques.
Credential access: any attacker who intrudes into an enterprise wants to remain stealthy to some degree, and wants to steal as many credentials as possible. Credentials can of course be cracked by brute force, but that kind of attack is too noisy. There are many examples of attackers stealing password hashes and then passing the hash or cracking it offline. Of all the information that can be stolen, attackers prefer plaintext passwords, which may be stored in a plaintext file, a database, or even the registry. A very common behavior is for an attacker to break into a system, steal the local password hashes, and crack the local administrator password. The simplest way to counter credential access is to use complex passwords; a combination of upper-case and lower-case letters, digits, and special characters is recommended so that passwords are difficult to crack. Finally, the usage of valid accounts needs to be monitored, because in many cases data leakage occurs through valid credentials.
Discovery: includes the techniques an attacker uses to obtain information about the system and the internal network. These techniques help the attacker observe the environment and get oriented before deciding how to act. The attacker can use them to explore what can be controlled and what lies near the point of entry, and can use the information already obtained to further the purpose of the attack. An attacker can also use native operating system tools to collect information after the intrusion.
Lateral movement: after exploiting a vulnerability in a single system, an attacker typically attempts to move laterally within the network. Even ransomware that targets a single system tries to move laterally in the network to find other targets. An attacker will typically first establish a foothold and then move through the various systems, looking for higher access rights, in order to reach the final goal. For mitigating and detecting lateral movement, appropriate network segmentation can reduce the risk to a large extent.
Collection: refers to the techniques an attacker uses to gather information, and the sources from which information relevant to the attacker's objective is gathered. Typically, the next step after collecting data is to steal it. Common target sources include various drive types, browsers, audio, video, and email. Common collection methods include capturing screenshots and keyboard input. Enterprises can use the techniques in this tactic to learn more about how malware handles data within the organization. An attacker may attempt to steal user information, including what is on the screen, what the user types, what the user discusses, and the user's appearance.
Command and control: consists of the techniques an attacker uses to communicate with compromised systems within the victim network. An attacker typically avoids discovery by mimicking normal, expected traffic. Depending on the victim's network architecture and defensive capability, the attacker can establish command and control with different levels of stealth in a variety of ways. Most malware now uses command and control to some extent. Through the command and control server, an attacker can receive data and tell the malware which instructions to execute next. In every form of command and control, the attacker accesses the network from a remote location. Understanding what happens on the network is therefore critical to countering these techniques effectively.
Exfiltration: includes the techniques an attacker uses to steal data from the user's network. After gaining access, the attacker searches everywhere for relevant data and then starts to exfiltrate it, although not all malware reaches this stage. When an attacker steals data over the network, especially large amounts of data (such as a customer database), a network intrusion detection or prevention system helps identify when the data is being transmitted.
Impact: the attacker attempts to manipulate, interrupt, or destroy the enterprise's systems and data. Techniques for "impact" include destroying or tampering with data. In some cases the business process looks normal, but the data may already have been tampered with by the attacker. Attackers may use these techniques to accomplish their ultimate goal or to provide cover for a confidentiality breach.
By calculating the score of each word in the training threat reports, scores can be obtained for the important words that may appear in a threat report, and the words are stored together with their scores. In subsequent applications, the score of each word can then be obtained by a direct query. It can be understood that, after the score of each word has been calculated, words with lower scores can be removed so that only the higher-scoring words and their scores are retained.
After the scores of the words contained in each sentence are obtained, the first information word of each sentence can be determined according to the scores. For example, every word whose score is greater than a preset score may be used as a first information word, or only the top few words among those whose score is greater than the preset score may be used.
In the embodiments of the present application, the first information word of each sentence is determined according to the score of each word. Because the first information words are the words that best express the topic of the sentence, using them in the subsequent classification process reduces the amount of computation and improves classification accuracy.
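As a non-limiting illustration, the selection of information words by score can be sketched as follows. The function name `select_information_words`, the parameters `min_score` and `top_n`, and the sample scores are hypothetical and are not taken from the application; words without a stored score are assumed to be uninformative.

```python
def select_information_words(sentence_words, scores, min_score, top_n=None):
    """Return the words of one sentence whose stored score exceeds min_score.

    Words missing from `scores` are treated as uninformative (assumption).
    If top_n is given, only the top_n highest-scoring candidates are kept.
    """
    # De-duplicate while keeping the original word order.
    unique_words = list(dict.fromkeys(sentence_words))
    candidates = [w for w in unique_words
                  if scores.get(w, float("-inf")) > min_score]
    # Highest score first; the sort is stable for equal scores.
    candidates.sort(key=lambda w: scores[w], reverse=True)
    return candidates if top_n is None else candidates[:top_n]


# Hypothetical score table obtained from training threat reports.
scores = {"registry": 2.1, "startup": 1.7, "persistence": 3.0, "the": 0.01}
sentence = ["the", "registry", "run", "keys", "enable", "persistence"]

print(select_information_words(sentence, scores, min_score=1.0))
print(select_information_words(sentence, scores, min_score=1.0, top_n=1))
```

Both variants described above are covered: with `top_n=None` every word above the preset score is kept, and with a value for `top_n` only the highest-scoring few are kept.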
On the basis of the above embodiment, fig. 2 is a schematic flowchart of a classifier training method according to an embodiment of the present application, as shown in fig. 2, where the method includes:
Step 201: acquiring training samples, wherein the training samples comprise a plurality of training threat reports and the tactical classification of each paragraph in each training threat report; each paragraph in a training threat report may correspond to a plurality of tactical classifications, and the specific tactical classifications are described in the above embodiment.
Step 202: extracting features of the training threat reports to obtain word embeddings of the words in each sentence of each training threat report; the feature extraction method is the same as that used for the threat report in the foregoing embodiment and is not described again here.
Step 203: calculating a score corresponding to each word according to the training threat reports; for the method of calculating the score of each word, refer to the above embodiment, which is not described again here.
Step 204: determining a second information word of each sentence according to the score; the method may be the same as the method for determining the first information word of each sentence in the threat report in the above embodiment and is not described again here.
Step 205: determining a second context word according to the second information word of each sentence; the method may be the same as the method for determining the first context word of each sentence in the threat report in the above embodiment and is not described again here.
Step 206: calculating a second average word embedding of the second context words of each paragraph; the second average word embedding is obtained by summing the word embeddings of the second context words of the paragraph and taking the average.
Step 207: training the classifier according to the second average word embedding and the tactical classification of the paragraph to obtain a trained classifier. The second average word embedding is input into the classifier, the classifier outputs a prediction result, and the internal parameters of the classifier are optimized by back-propagation using the prediction result and the tactical classification of the corresponding paragraph, thereby training the classifier. The classifier can be constructed as a single-layer feedforward network with a sigmoid activation function. Because it is a multi-label classifier, if the difference between the probabilities of the highest and second-highest tactical labels is within 10%, the text is given both tactical labels. During model training, all threat reports are labeled with ATT&CK tactics.
In the embodiments of the present application, the threat report is classified by the trained classifier, so that the threat category of each paragraph of the threat report can be analyzed and the accuracy of threat report classification is improved.
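As a non-limiting illustration, a single-layer feedforward classifier with a sigmoid activation function and the two-label decision rule can be sketched in plain Python as follows. The class name, the learning rate, the number of epochs, the binary cross-entropy loss, the toy two-dimensional embeddings, and the reading of "within 10%" as an absolute probability difference of 0.10 are all assumptions made for the sketch and are not specified in the application.

```python
import math
import random


def sigmoid(z):
    # Clamp the input to avoid overflow in math.exp for extreme values.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class SingleLayerSigmoidClassifier:
    """One weight vector and bias per tactic label, sigmoid output per label."""

    def __init__(self, dim, n_labels, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(dim)]
                  for _ in range(n_labels)]
        self.b = [0.0] * n_labels

    def predict_proba(self, x):
        return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + b)
                for row, b in zip(self.w, self.b)]

    def fit(self, xs, ys, lr=0.5, epochs=500):
        # Stochastic gradient descent on binary cross-entropy (assumed loss).
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                probs = self.predict_proba(x)
                for k, (p, target) in enumerate(zip(probs, y)):
                    err = p - target  # gradient w.r.t. the pre-activation
                    self.b[k] -= lr * err
                    for i, xi in enumerate(x):
                        self.w[k][i] -= lr * err * xi


def pick_labels(probs, labels, margin=0.10):
    """Return the top label, plus the runner-up if it is within `margin`."""
    ranked = sorted(zip(probs, labels), reverse=True)
    chosen = [ranked[0][1]]
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] <= margin:
        chosen.append(ranked[1][1])
    return chosen


# Toy paragraph embeddings (hypothetical) with one tactic label each.
labels = ["persistence", "lateral-movement"]
xs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ys = [[1, 0], [1, 0], [0, 1], [0, 1]]

clf = SingleLayerSigmoidClassifier(dim=2, n_labels=len(labels))
clf.fit(xs, ys)
print(pick_labels(clf.predict_proba([1.0, 0.0]), labels))
```

In practice the inputs would be the average word embeddings of the context words of each paragraph rather than the toy vectors used here.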
Fig. 3 is a schematic structural diagram of a threat report classification apparatus provided in an embodiment of the application, where the apparatus may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus corresponds to the embodiment of the method of fig. 1 described above, and is capable of performing the steps involved in the embodiment of the method of fig. 1, and specific functions of the apparatus may be referred to in the foregoing description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy. The device comprises: a feature extraction module 301, a word determination module 302, and a classification module 303, wherein:
the feature extraction module 301 is configured to perform feature extraction on a threat report to be classified, so as to obtain word embedding of words contained in each sentence in the threat report;
the word determining module 302 is configured to determine a first information word of each sentence, and determine a first context word according to the first information word; the first information word is used for representing the theme of the corresponding sentence;
the classification module 303 is configured to obtain a threat category of each paragraph according to a first average word embedding of the first context word of the corresponding paragraph in the threat report.
On the basis of the above embodiment, the feature extraction module 301 is specifically configured to:
analyzing the threat report to obtain sentences contained in the threat report;
and extracting features of the input sentences by using the BERT model to obtain word embedding of the words in each sentence.
Based on the above embodiment, the word determining module 302 is specifically configured to:
querying the score of each word in each sentence; the score of the word is obtained by calculation according to the training threat report;
and determining the first information word of each sentence according to the score.
Based on the above embodiment, the word determining module 302 is specifically configured to:
for each sentence, if the sentence contains a first information word, the first information word is used as a first context word of the sentence;
and if the sentence does not contain a first information word, the first information word of the preceding sentence is used as the first context word of the sentence.
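As a non-limiting illustration, the determination of context words can be sketched as follows. The function name `context_words` is hypothetical. The sketch reads the rule literally: a sentence without information words takes the information words of the immediately preceding sentence, and a first sentence without information words is assumed to get an empty list, a case the application does not address.

```python
def context_words(info_words_per_sentence):
    """Return the context words of every sentence in reading order.

    A sentence that has information words uses them directly; a sentence
    without any takes the information words of the preceding sentence.
    """
    result = []
    for i, words in enumerate(info_words_per_sentence):
        if words:
            result.append(list(words))
        elif i > 0:
            result.append(list(info_words_per_sentence[i - 1]))
        else:
            result.append([])  # assumption: nothing to fall back on
    return result


# Hypothetical information words for four consecutive sentences.
paragraph = [["phishing"], [], ["registry", "persistence"], []]
print(context_words(paragraph))
```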
On the basis of the above embodiment, the apparatus further includes a model training module for:
acquiring a training sample, wherein the training sample comprises a plurality of training threat reports and tactical classification of each paragraph corresponding to each training threat report;
Extracting features of the training threat reports to obtain word embedding of words corresponding to each sentence in each training threat report;
calculating the score corresponding to each word according to the training threat reports;
determining a second information word of each sentence according to the score;
determining a second context word according to the second information word of each sentence;
calculating a second average word embedding for the second context word of each paragraph;
training the classifier according to the second average word embedding and tactical classification of the paragraph to obtain a trained classifier.
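As a non-limiting illustration, the average word embedding of a paragraph can be computed by summing the embeddings of its context words and dividing by their number. The function name `average_embedding` and the three-word, two-dimensional example are hypothetical; real BERT embeddings have far more dimensions.

```python
def average_embedding(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one context-word embedding is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


# Hypothetical embeddings of the three context words of one paragraph.
paragraph_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_embedding(paragraph_vectors))  # [3.0, 4.0]
```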
Based on the above embodiment, the model training module is specifically configured to:
calculating the score corresponding to each word according to the formula [formula image missing from the source text];
wherein score(w) is the score of word w; p(w) is the probability of word w occurring in the plurality of training threat reports; p(w|k) is the probability of word w occurring in all training threat reports that describe tactical class k; and K is the set of tactical classes.
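As a non-limiting illustration, the quantities p(w) and p(w|k) can be estimated from labeled training reports as shown below. The scoring formula of the application is not reproduced in this text, so the sketch substitutes an assumed stand-in, the maximum over tactical classes k of log(p(w|k) / p(w)); this is not necessarily the formula of the application. The function name `word_scores`, the use of report-level frequencies as probabilities, and the sample reports are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_scores(reports):
    """Score words from (words, tactics) pairs, one pair per training report.

    p(w)   = fraction of all reports that contain w
    p(w|k) = fraction of reports labeled with tactic k that contain w
    score  = max over k of log(p(w|k) / p(w))  (assumed stand-in formula)
    """
    n_reports = len(reports)
    doc_count = Counter()               # reports containing each word
    class_total = Counter()             # reports per tactic
    class_count = defaultdict(Counter)  # reports per tactic containing word
    for words, tactics in reports:
        vocab = set(words)
        doc_count.update(vocab)
        for k in set(tactics):
            class_total[k] += 1
            class_count[k].update(vocab)

    scores = {}
    for w, count in doc_count.items():
        p_w = count / n_reports
        ratios = [class_count[k][w] / class_total[k] / p_w for k in class_total]
        best = max((r for r in ratios if r > 0), default=None)
        if best is not None:
            scores[w] = math.log(best)
    return scores


# Hypothetical training reports: (words, tactic labels).
reports = [
    (["registry", "startup", "attacker"], ["persistence"]),
    (["credential", "hash", "attacker"], ["credential-access"]),
    (["registry", "attacker"], ["persistence"]),
]
for word, score in sorted(word_scores(reports).items()):
    print(word, round(score, 3))
```

Under this stand-in, a word that appears in every report regardless of tactic (such as "attacker" in the sample) scores zero, while a word concentrated in one tactic scores higher.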
Based on the above embodiment, the classification module 303 is specifically configured to:
inputting the first average word embedding into the classifier to obtain the threat category of the corresponding paragraph output by the classifier.
Fig. 4 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present application, as shown in fig. 4, where the electronic device includes: a processor (processor) 401, a memory (memory) 402, and a bus 403; wherein,
the processor 401 and the memory 402 communicate with each other through the bus 403;
the processor 401 is configured to call the program instructions in the memory 402 to perform the methods provided in the above method embodiments, for example: extracting features of a threat report to be classified to obtain word embeddings of the words contained in each sentence in the threat report; determining a first information word of each sentence, and determining a first context word according to the first information word, the first information word being used to represent the topic of the corresponding sentence; and obtaining the threat category of each paragraph according to a first average word embedding of the first context words of the corresponding paragraph in the threat report.
The processor 401 may be an integrated circuit chip with signal processing capability. The processor 401 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 402 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example: extracting features of a threat report to be classified to obtain word embeddings of the words contained in each sentence in the threat report; determining a first information word of each sentence, and determining a first context word according to the first information word, the first information word being used to represent the topic of the corresponding sentence; and obtaining the threat category of each paragraph according to a first average word embedding of the first context words of the corresponding paragraph in the threat report.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example: extracting features of a threat report to be classified to obtain word embeddings of the words contained in each sentence in the threat report; determining a first information word of each sentence, and determining a first context word according to the first information word, the first information word being used to represent the topic of the corresponding sentence; and obtaining the threat category of each paragraph according to a first average word embedding of the first context words of the corresponding paragraph in the threat report.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in another form.
Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A threat report classification method, comprising:
extracting features of threat reports to be classified to obtain word embedding of words contained in each sentence in the threat reports;
determining a first information word of each sentence, and determining a first context word according to the first information word; the first information word is used for representing the theme of the corresponding sentence;
and obtaining the threat category of each paragraph according to a first average word embedding of the first context word of the corresponding paragraph in the threat report.
2. The method of claim 1, wherein the feature extraction of the threat report to obtain word embeddings of words contained in each sentence in the threat report comprises:
analyzing the threat report to obtain sentences contained in the threat report;
and extracting features of the input sentences by using the BERT model to obtain word embedding of the words in each sentence.
3. The method of claim 1, wherein said determining the first information word for each sentence comprises:
querying the score of each word in each sentence; the score of the word is obtained by calculation according to the training threat report;
and determining the first information word of each sentence according to the score.
4. The method of claim 1, wherein said determining a first context word from said first information word comprises:
for each sentence, if the sentence contains a first information word, the first information word is used as a first context word of the sentence;
and if the sentence does not contain a first information word, taking the first information word of the preceding sentence as the first context word of the sentence.
5. The method according to any one of claims 1-4, further comprising:
acquiring a training sample, wherein the training sample comprises a plurality of training threat reports and tactical classification of each paragraph corresponding to each training threat report;
extracting features of the training threat reports to obtain word embedding of words corresponding to each sentence in each training threat report;
calculating the score corresponding to each word according to the training threat reports;
determining a second information word of each sentence according to the score;
determining a second context word according to the second information word of each sentence;
calculating a second average word embedding for the second context word of each paragraph;
Training the classifier according to the second average word embedding and tactical classification of the paragraph to obtain a trained classifier.
6. The method of claim 5, wherein calculating a score for each word from the plurality of training threat reports comprises:
calculating the score corresponding to each word according to the formula [formula image missing from the source text];
wherein score(w) is the score of word w; p(w) is the probability of word w occurring in the plurality of training threat reports; p(w|k) is the probability of word w occurring in all training threat reports that describe tactical class k; and K is the set of tactical classes.
7. The method of claim 5, wherein the obtaining threat categories for each paragraph based on the first average word embedding for the first context word for the corresponding paragraph in the threat report comprises:
inputting the first average word embedding into the classifier to obtain the threat category of the corresponding paragraph output by the classifier.
8. A threat report classification apparatus, comprising:
the feature extraction module is used for extracting features of the threat reports to be classified to obtain word embedding of words contained in each sentence in the threat reports;
The word determining module is used for determining a first information word of each sentence and determining a first context word according to the first information word; the first information word is used for representing the theme of the corresponding sentence;
and the classification module is used for obtaining the threat category of the corresponding paragraph according to the first average word embedding of the first context word of each paragraph in the threat report.
9. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory complete communication with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-7.
CN202311755761.3A 2023-12-19 2023-12-19 Threat report classification method, threat report classification device, electronic equipment and storage medium Pending CN117743925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311755761.3A CN117743925A (en) 2023-12-19 2023-12-19 Threat report classification method, threat report classification device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311755761.3A CN117743925A (en) 2023-12-19 2023-12-19 Threat report classification method, threat report classification device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117743925A true CN117743925A (en) 2024-03-22

Family

ID=90277032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311755761.3A Pending CN117743925A (en) 2023-12-19 2023-12-19 Threat report classification method, threat report classification device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117743925A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination