CN116737926A - Method, device, equipment and storage medium for classifying threat information text - Google Patents



Publication number
CN116737926A
Authority
CN
China
Prior art keywords
text
threat
classification result
threat information
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310671897.XA
Other languages
Chinese (zh)
Inventor
陆佳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202310671897.XA priority Critical patent/CN116737926A/en
Publication of CN116737926A publication Critical patent/CN116737926A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Abstract

The disclosure provides a threat intelligence text classification method, device, equipment and storage medium. The method comprises: inputting the title text of a first threat intelligence text into a title classification model and outputting a first classification result corresponding to the first threat intelligence text; inputting the body text of the first threat intelligence text into a body classification model and outputting a second classification result corresponding to the first threat intelligence text; and determining a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result. The embodiments of the disclosure thus classify the title text and the body text of the first threat intelligence text separately, and determine the text classification result from the first classification result corresponding to the title text and the second classification result corresponding to the body text, thereby significantly improving the accuracy of threat intelligence text classification.

Description

Method, device, equipment and storage medium for classifying threat information text
Technical Field
The disclosure relates to the technical field of network security, and in particular relates to a threat information text classification method, device, equipment and storage medium.
Background
In the field of network security, new advanced persistent threat (APT) attack events occur frequently, and threat attacks can make full use of multiple propagation channels such as the network and email, which poses new challenges for network security threat protection. Network security personnel therefore need to actively collect newly emerging attack threat intelligence and classify it, so that they can discover new attack threat intelligence in a more timely manner and thereby strengthen network security defense.
Current threat intelligence text classification methods mainly analyze the body text content of the threat intelligence text, and their accuracy needs to be improved.
Therefore, a new threat intelligence text classification method is needed that can significantly improve the accuracy of threat intelligence text classification.
Disclosure of Invention
In order to solve the technical problems, an embodiment of the present disclosure provides a method for classifying threat intelligence text.
In a first aspect, the present disclosure provides a method for classifying threat intelligence text, the method comprising:
inputting the title text of a first threat intelligence text in a text set to be classified into a trained title classification model, and, after classification processing by the title classification model, outputting a first classification result corresponding to the first threat intelligence text; wherein the first threat intelligence text comprises the title text and a body text, and the first classification result represents, from the dimension of the title text, the threat intelligence text type to which the first threat intelligence text belongs;
inputting the body text of the first threat intelligence text into a trained body classification model, and, after classification processing by the body classification model, outputting a second classification result corresponding to the first threat intelligence text; wherein the second classification result represents, from the dimension of the body text, the threat intelligence text type to which the first threat intelligence text belongs;
determining a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result; wherein the text classification result represents the threat intelligence text type to which the first threat intelligence text belongs.
In an alternative embodiment, before the inputting the title text of the first threat intelligence text in the text set to be classified into the trained title classification model, the method further includes:
if it is determined that a second threat intelligence text in the text set to be classified does not match any of the preset matching rules, determining the second threat intelligence text as the first threat intelligence text, and executing the step of inputting the title text of the first threat intelligence text in the text set to be classified into the trained title classification model; wherein the preset matching rules are set based on target intelligence source information and/or type keywords corresponding to at least one threat intelligence text type.
In an optional implementation manner, the rules in the preset matching rules have a corresponding relationship with the threat intelligence text types, and the method further includes:
if the second threat information text is successfully matched with the first rule in the preset matching rules, determining the threat information text type corresponding to the first rule as a text classification result corresponding to the second threat information text.
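The rule pre-check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the rule table, its field names (`source`, `keywords`, `type`), the example source identifier, and the first-match-wins policy are all assumptions made for the example.

```python
from typing import List, Optional

# Hypothetical rule table. Each rule carries a source identifier and/or
# type keywords, plus the threat intelligence text type it maps to.
RULES: List[dict] = [
    {"source": "vendor-advisories.example", "keywords": [], "type": "vulnerability"},
    {"source": None, "keywords": ["daily security report"], "type": "security daily report"},
]


def match_rules(source: str, title: str, rules: List[dict] = RULES) -> Optional[str]:
    """Return the type of the first matching rule, or None when no rule matches.

    A None result means the text falls through to the title/body models.
    """
    lowered = title.lower()
    for rule in rules:
        source_ok = rule["source"] is not None and rule["source"] == source
        keyword_ok = any(k.lower() in lowered for k in rule["keywords"])
        if source_ok or keyword_ok:
            return rule["type"]
    return None
```

A text for which `match_rules` returns a type is classified directly by the rule; only unmatched texts are passed on as the "first threat intelligence text".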
In an alternative embodiment, before the inputting the body text of the first threat intelligence text into the trained body classification model, the method further includes:
if the first classification result corresponding to the first threat intelligence text does not include a target threat intelligence text type, executing the step of inputting the body text of the first threat intelligence text into the trained body classification model.
In an alternative embodiment, the method further comprises:
if the first classification result corresponding to the first threat information text is determined to comprise any target threat information text type, the first classification result is directly determined to be a text classification result corresponding to the first threat information text.
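The title-first gating in the two embodiments above can be sketched as below. The set of target types, the callable model interfaces, and the function name `classify` are hypothetical; the source does not say which types count as target types.

```python
from typing import Callable, List, Set

# Hypothetical set of target types; the source leaves the choice open.
TARGET_TYPES: Set[str] = {"APT", "vulnerability"}


def classify(
    title: str,
    body: str,
    title_model: Callable[[str], List[str]],
    body_model: Callable[[str], List[str]],
    target_types: Set[str] = TARGET_TYPES,
) -> List[str]:
    """Use the title result directly when it hits a target type; otherwise
    run the body model as well and combine both results."""
    first = title_model(title)
    if any(t in target_types for t in first):
        return list(first)  # body model is skipped entirely
    second = body_model(body)
    # Order-preserving union of the two results.
    return list(dict.fromkeys([*first, *second]))
```

Skipping the body model when the title already yields a target type avoids the more expensive pass over the long body text.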
In an optional implementation manner, the determining, based on the first classification result and the second classification result, a text classification result corresponding to the first threat intelligence text includes:
merging and deduplicating the first classification result and the second classification result to obtain the text classification result corresponding to the first threat intelligence text.
In an optional implementation manner, after determining the text classification result corresponding to the first threat intelligence text, the method further includes:
judging whether the number of text types in the text classification result corresponding to the first threat intelligence text is greater than a preset number;
if the number of text types in the text classification result is greater than the preset number, judging whether the text classification result contains the non-threat information text type or the other type;
if the text classification result contains the non-threat information text type or the other type, deleting the non-threat information text type and the other type from the text classification result.
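A minimal sketch of this post-filtering step follows. The preset number (`max_types=3`) and the exact label strings are assumptions chosen for the example.

```python
from typing import List, Sequence


def prune_result(
    types: Sequence[str],
    max_types: int = 3,  # hypothetical preset number
    removable: Sequence[str] = ("non-threat information", "other"),
) -> List[str]:
    """Drop the catch-all labels only when the result holds more types
    than the preset number and actually contains one of them."""
    if len(types) > max_types and any(t in removable for t in types):
        return [t for t in types if t not in removable]
    return list(types)
```

For instance, `prune_result(["APT", "vulnerability", "phishing", "other"])` drops `"other"` because four types exceed the preset of three, whereas `prune_result(["APT", "other"])` is returned unchanged.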
In a second aspect, the present disclosure provides a threat intelligence text classification apparatus, the apparatus comprising:
a first output module, used for inputting the title text of a first threat intelligence text in a text set to be classified into a trained title classification model, and outputting a first classification result corresponding to the first threat intelligence text after classification processing by the title classification model; wherein the first threat intelligence text comprises the title text and a body text, and the first classification result represents, from the dimension of the title text, the threat intelligence text type to which the first threat intelligence text belongs;
a second output module, used for inputting the body text of the first threat intelligence text into a trained body classification model, and outputting a second classification result corresponding to the first threat intelligence text after classification processing by the body classification model; wherein the second classification result represents, from the dimension of the body text, the threat intelligence text type to which the first threat intelligence text belongs;
a first determining module, used for determining a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result; wherein the text classification result represents the threat intelligence text type to which the first threat intelligence text belongs.
In a third aspect, the present disclosure provides a computer readable storage medium having instructions stored therein, which when run on a terminal device, cause the terminal device to implement the above-described method.
In a fourth aspect, the present disclosure provides threat intelligence text classification equipment, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above-described method when executing the computer program.
In a fifth aspect, the present disclosure provides a computer program product comprising computer programs/instructions which when executed by a processor implement the above-described method.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has at least the following advantages:
The embodiment of the disclosure provides a method for classifying threat intelligence text, in which the title text of a first threat intelligence text is input into a trained title classification model, which outputs a first classification result corresponding to the first threat intelligence text; the body text of the first threat intelligence text is input into a trained body classification model, which outputs a second classification result corresponding to the first threat intelligence text; and a text classification result corresponding to the first threat intelligence text is determined based on the first classification result and the second classification result. The embodiment of the disclosure thus classifies the title text and the body text of the first threat intelligence text separately, and determines the text classification result from the first classification result corresponding to the title text and the second classification result corresponding to the body text, thereby significantly improving the accuracy of threat intelligence text classification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a method for classifying threat intelligence text provided by an embodiment of the disclosure;
FIG. 2 is a schematic structural diagram of a title classification model according to an embodiment of the disclosure;
FIG. 3 is a flow chart of another method for classifying threat intelligence text provided by an embodiment of the disclosure;
FIG. 4 is a schematic structural diagram of a classification device for threat intelligence text according to an embodiment of the disclosure;
FIG. 5 is a schematic structural diagram of classification equipment for threat intelligence text according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
In the field of network security, new advanced persistent threat (APT) attack events occur frequently, and threat attacks can make full use of multiple propagation channels such as the network and email, which poses new challenges for network security threat protection. Network security personnel therefore need to actively collect newly emerging attack threat intelligence and classify it, so that they can discover new attack threat intelligence in a more timely manner and thereby strengthen network security defense.
Current threat intelligence text classification methods mainly analyze the body text content of the threat intelligence text, and their accuracy needs to be improved.
Therefore, the embodiment of the disclosure provides a method for classifying threat intelligence text, in which the title text of a first threat intelligence text is input into a trained title classification model, which outputs a first classification result corresponding to the first threat intelligence text; the body text of the first threat intelligence text is input into a trained body classification model, which outputs a second classification result corresponding to the first threat intelligence text; and a text classification result corresponding to the first threat intelligence text is determined based on the first classification result and the second classification result. The embodiment of the disclosure thus classifies the title text and the body text of the first threat intelligence text separately, and determines the text classification result from both results, thereby significantly improving the accuracy of threat intelligence text classification.
Based on this, the embodiment of the disclosure provides a method for classifying threat information texts, referring to fig. 1, which is a flowchart of the method for classifying threat information texts provided by the embodiment of the disclosure, where the method specifically includes:
S101: Input the title text of the first threat intelligence text into a trained title classification model, and, after classification processing by the title classification model, output a first classification result corresponding to the first threat intelligence text.
The first threat intelligence text comprises the title text and the body text, and the first classification result represents, from the dimension of the title text, the threat intelligence text type to which the first threat intelligence text belongs.
The text set to be classified comprises threat intelligence texts crawled based on a preset intelligence source list, the preset intelligence source list comprises intelligence source identifiers, and each threat intelligence text comprises a title text and a body text.
The threat intelligence text classification method provided by the embodiment of the disclosure can be applied to a client or a server, for example, the client can comprise a client deployed on a smart phone, a client deployed on a tablet computer and the like.
In practical application, the first threat intelligence text can be obtained by crawling based on a preset intelligence source list; specifically, it can be a threat intelligence text published by an intelligence source and crawled using the information of each intelligence source in the preset list. An intelligence source is a source or channel through which a user obtains intelligence, and can include news websites, social media, research institutions and other platforms that routinely publish network security related information. The threat intelligence text can include text related to network security threat intelligence as well as text unrelated to it, crawled from platforms such as security news media, intelligence research institutions or open source intelligence communities. Text unrelated to network security threat intelligence can be assigned the non-threat information text type by the subsequent classification method, which further improves the accuracy of threat intelligence text classification.
In the embodiment of the present disclosure, the first threat information text may specifically include a title text and a body text, and since the title text may be used to characterize the main content of the first threat information text, the embodiment of the present disclosure may determine, based on the title text, a threat information text type to which the first threat information text belongs; in addition, since the body text is a detailed description of the first threat intelligence text, embodiments of the disclosure may also determine, based on the body text, a threat intelligence text type to which the first threat intelligence text belongs.
As can be seen, after the first threat intelligence text including the title text and the body text is acquired, the embodiment of the disclosure can subsequently classify the threat intelligence text based on text information in the two dimensions of title text and body text.
In an alternative embodiment, before the title text of the first threat intelligence text is input into the title classification model, data preprocessing may further be performed on the title text and the body text of the first threat intelligence text, respectively; the data preprocessing can include word segmentation, stop word removal, punctuation removal and other processing of the title text and the body text.
In practical application, for threat intelligence text in Chinese, a third-party library can be used to perform word segmentation on the title text and the body text, respectively.
In practical application, the acquired threat intelligence text contains a large number of repeated punctuation marks and a large number of stop words that contribute nothing to the meaning of the threat intelligence text, such as "yes" and "nor". Removing these punctuation marks and stop words can improve the generalization capability of the title classification model and the body classification model, and facilitates the subsequent classification of the title text and the body text by those models.
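The preprocessing steps above (word segmentation, stop word removal, punctuation removal) can be illustrated with a small standard-library sketch. The stop word list is a hypothetical placeholder, and the regular-expression tokenizer stands in for the third-party Chinese word segmentation library, which the source does not name.

```python
import re
from typing import List

# Hypothetical stop word list; a real deployment would load a curated one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}


def preprocess(text: str) -> List[str]:
    """Tokenize, drop punctuation, and remove stop words.

    re.findall(r"\w+") keeps only runs of word characters, so punctuation
    (including repeated marks such as "!!!") disappears during tokenization.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The same function would be applied to the title text and the body text separately before each is passed to its model.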
In the embodiment of the disclosure, threat intelligence text types may include APT, vulnerability, phishing, security daily report, security report, non-threat information and the like. In practical application, the threat intelligence text types may be set based on specific service requirements; for example, 14 text types may be set, such as APT, DDoS, malware, ransomware, phishing, vulnerability, attack activity, other, security daily report, technology sharing, data leakage and non-threat information, which is not limited in the embodiment of the disclosure.
In the embodiment of the disclosure, the title classification model is used for classifying the title text of the input first threat information text so as to output a first classification result corresponding to the first threat information text; the first threat information text may be any threat information text in the text set to be classified, and the first classification result corresponding to the first threat information text is used for representing the threat information text type to which the first threat information text belongs from the dimension of the title text.
In practical applications, the title classification model needs to be trained before being applied. In the embodiment of the disclosure, the title classification model is trained on title text samples and their corresponding threat intelligence text types, and the trained title classification model is used for classifying the title text of threat intelligence text.
S102: Input the body text of the first threat intelligence text into a trained body classification model, and, after classification processing by the body classification model, output a second classification result corresponding to the first threat intelligence text.
The second classification result represents, from the dimension of the body text, the threat intelligence text type to which the first threat intelligence text belongs.
In the embodiment of the disclosure, the body classification model is used for classifying the body text of the input first threat intelligence text to output the second classification result corresponding to the first threat intelligence text.
In practical application, the structure of the body classification model may be set to be consistent with the structure of the title classification model. It should be noted that the body classification model is trained on body text samples and their corresponding threat intelligence text types, and is used for classifying the body text of threat intelligence text.
S103: Determine a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result.
The text classification result is used for representing the threat information text type to which the first threat information text belongs.
In an alternative embodiment, after the first classification result corresponding to the title text of the first threat intelligence text is obtained from the title classification model, and the second classification result corresponding to the body text of the first threat intelligence text is obtained from the body classification model, the first classification result and the second classification result are merged and deduplicated to obtain the text classification result corresponding to the first threat intelligence text.
In the embodiment of the disclosure, the text classification result corresponding to the first threat intelligence text represents the threat intelligence text type to which the first threat intelligence text belongs from the two dimensions of title text and body text.
For example, assume that the first classification result corresponding to the title text of the first threat intelligence text is APT and vulnerability, and the second classification result corresponding to the body text is vulnerability and phishing; after the two results are merged and deduplicated, the text classification result corresponding to the first threat intelligence text is APT, vulnerability and phishing.
It should be noted that, in the embodiment of the disclosure, the text classification result corresponding to the first threat intelligence text may include a plurality of types, so that the result covers, as far as possible, all threat intelligence text types related to the first threat intelligence text, thereby further improving the accuracy of threat intelligence text classification.
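The merge-and-deduplicate step can be written in a few lines. This is a sketch assuming each classification result is a list of type labels; preserving first-seen order is a design choice of the example, not something the source specifies.

```python
from typing import List, Sequence


def merge_results(first: Sequence[str], second: Sequence[str]) -> List[str]:
    """Union of the two results, keeping the first occurrence of each type."""
    return list(dict.fromkeys([*first, *second]))


# The example from the text: title -> APT, vulnerability; body -> vulnerability, phishing.
merged = merge_results(["APT", "vulnerability"], ["vulnerability", "phishing"])
# merged == ["APT", "vulnerability", "phishing"]
```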
In the method for classifying threat intelligence text provided by the embodiment of the disclosure, the title text of the first threat intelligence text is input into a trained title classification model, which outputs a first classification result corresponding to the first threat intelligence text; the body text of the first threat intelligence text is input into a trained body classification model, which outputs a second classification result corresponding to the first threat intelligence text; and a text classification result corresponding to the first threat intelligence text is determined based on the first classification result and the second classification result. The embodiment of the disclosure thus classifies the title text and the body text separately and determines the text classification result from both results, thereby significantly improving the accuracy of threat intelligence text classification.
In the embodiment of the disclosure, the title classification model may include a Bidirectional Encoder Representations from Transformers (BERT) model and a convolutional neural network based text classification (TextCNN) model. The BERT model converts abstract characters into vectors that can be operated on mathematically, so that the text information in the text set to be classified can be better extracted; the TextCNN model extracts key information from the text set to be classified, so that the threat intelligence text can be better classified.
Based on the above, the embodiment of the disclosure provides a title classification model (hereinafter referred to as title BERT-TextCNN classification model) implemented based on the BERT model and the TextCNN model, as shown in fig. 2, which is a schematic structural diagram of the title classification model provided by the embodiment of the disclosure.
In the embodiment of the disclosure, the title BERT-TextCNN classification model mainly comprises a BERT text representation layer, a TextCNN convolution layer and a classification layer, where $x_1, x_2, \ldots, x_n$ is the preprocessed title text data of the first threat intelligence text, $E_1, E_2, \ldots, E_n$ are the word, sentence and position embedding representations of $x_1, x_2, \ldots, x_n$, and $T_1, T_2, \ldots, T_n$ are the output vectors with rich semantic features obtained by passing $E_1, E_2, \ldots, E_n$ through the Transformer encoder structure.
The BERT text representation layer encodes the input text of length $n$. The title text $x_1, x_2, \ldots, x_n$ of the first threat intelligence text is input to the BERT text representation layer, which first builds a word-vector embedding representation of the input: specifically, it makes word, sentence and position embedding representations of $x_1, x_2, \ldots, x_n$ and marks the sentence with "[CLS]" and "[SEP]" to obtain the vectors $E_1, E_2, \ldots, E_n$. The Transformer encoder structure then converts $E_1, E_2, \ldots, E_n$ into the output vectors $T_1, T_2, \ldots, T_n$ with rich semantic features. Finally, the output vectors $T_1, T_2, \ldots, T_n$ are used as the input to the TextCNN convolution layer.
The TextCNN convolution layer works as follows: the embedding layer corresponding to the title text is input to the convolution layer, which extracts a text feature vector for each piece of text data; the text feature vectors are max-pooled; the pooled feature vectors are concatenated; the concatenated vector is reduced in dimension; and finally the reduced vector is fed to an activation function, which determines the first classification result corresponding to the first threat intelligence text.
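To make the convolution, max-pooling, concatenation, dimension-reduction and activation sequence concrete, here is a dependency-free sketch of a TextCNN forward pass over already-computed token vectors. It is illustrative only: the ReLU inside the convolution, the sigmoid output with a 0.5 threshold (chosen because a text may receive several types), and all weights are assumptions, since the source names only "an activation function".

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def conv_max_pool(tokens: Sequence[Vector], kernel: Sequence[Vector], bias: float) -> float:
    """Slide one filter over the token vectors, apply ReLU, then max-pool.

    Assumes len(tokens) >= len(kernel), i.e. the text is at least as long
    as the filter is wide.
    """
    width = len(kernel)
    features = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        total = bias + sum(
            w * x
            for k_row, t_row in zip(kernel, window)
            for w, x in zip(k_row, t_row)
        )
        features.append(max(0.0, total))  # ReLU (assumed)
    return max(features)  # max-over-time pooling


def textcnn_forward(
    tokens: Sequence[Vector],
    kernels: Sequence[Sequence[Vector]],
    biases: Sequence[float],
    dense_w: Sequence[Vector],
    dense_b: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[List[int], List[float]]:
    """Return (indices of predicted types, per-type scores)."""
    # One pooled value per filter; collecting them is the concatenation step.
    pooled = [conv_max_pool(tokens, k, b) for k, b in zip(kernels, biases)]
    scores = []
    for row, b in zip(dense_w, dense_b):
        # Dense layer maps the pooled features down to one logit per type.
        logit = b + sum(w * p for w, p in zip(row, pooled))
        scores.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid (assumed)
    labels = [i for i, s in enumerate(scores) if s >= threshold]
    return labels, scores
```

In the architecture described above, `tokens` would be the BERT output vectors $T_1, \ldots, T_n$, and filters of several widths would be used so that features of different n-gram sizes are captured.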
In practical application, in the process of training the title classification model, after the first classification result corresponding to the first threat intelligence text is determined through the above process, keywords can additionally be used to judge whether the title text contains certain specific threat intelligence types. For example, if the title text contains "APT", APT can be added to the first classification result, and if the title text contains "DDoS", DDoS can be added to the first classification result, further improving the accuracy of threat intelligence text classification.
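The keyword-assisted step can be sketched as follows. The keyword table contains only the two examples given in the text, and whole-word, case-insensitive matching is an assumption of this sketch (it prevents "APT" from matching inside words such as "chapter").

```python
import re
from typing import Dict, List, Sequence

# Keyword -> type table, limited to the two examples named in the text.
KEYWORD_TYPES: Dict[str, str] = {"APT": "APT", "DDoS": "DDoS"}


def add_keyword_types(title: str, result: Sequence[str]) -> List[str]:
    """Append a type when its keyword appears in the title as a whole word."""
    out = list(result)
    for keyword, type_name in KEYWORD_TYPES.items():
        pattern = rf"\b{re.escape(keyword)}\b"
        if re.search(pattern, title, re.IGNORECASE) and type_name not in out:
            out.append(type_name)
    return out
```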
In practical applications, the structure of the body classification model may be set to be consistent with that of the title classification model; it can be understood by referring to the schematic structural diagram of the title classification model in fig. 2 and is not described again here.
It should be noted that, because the BERT model limits the length of its input text, the body text of the first threat intelligence text must be segmented into a plurality of segmented texts before it is processed by the body BERT-TextCNN classification model, with certain associated information still retained between the segmented texts. After processing by the BERT text representation layer, an embedded layer is obtained for each segmented text; these embedded layers are then spliced to obtain an embedded layer for the whole body text, which is used as the input of the TextCNN convolution layer for subsequent processing.
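One way to segment a long body text and splice the per-segment embeddings is sketched below. The disclosure does not say how associated information is retained between segments; overlapping windows are an assumption made here, as are the function names and the defaults (510 assumes the common 512-token BERT limit minus the two markers).

```python
def split_with_overlap(tokens, max_len=510, overlap=50):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, an assumed way of keeping
    some context between segments.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    segments = []
    start = 0
    while True:
        segments.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return segments


def concat_embeddings(segment_embeddings):
    """Splice per-segment embedding sequences into one sequence for TextCNN."""
    merged = []
    for embedding in segment_embeddings:
        merged.extend(embedding)
    return merged
```

Each segment would be passed through the BERT text representation layer separately, and `concat_embeddings` then joins the outputs in their original order.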
Based on the above method embodiment, in order to further improve accuracy of threat intelligence text classification, the disclosed embodiment further provides a threat intelligence text classification method, and referring to fig. 3, a flowchart of another threat intelligence text classification method provided by the disclosed embodiment is provided, where the method includes:
S301: Acquire a text set to be classified.
The text set to be classified comprises threat intelligence texts crawled based on a preset intelligence source list, wherein each threat intelligence text comprises a title text and a body text.
S302: judging whether the second threat information text in the text set to be classified is successfully matched with any rule in preset matching rules, and if not, executing S303; if so, S306 is performed.
In the embodiment of the disclosure, the second threat intelligence text may be any threat intelligence text in the text set to be classified.
In practical applications, while classifying threat intelligence texts, the relevant personnel studied the characteristics of various threat intelligence text data and found that some threat intelligence texts belong to types such as security daily report, security weekly report or security report, which are difficult for an ordinary text-based classification model to judge. The embodiment of the disclosure therefore sets preset matching rules in advance and uses them to judge these types, further improving the accuracy of threat intelligence text classification.
In the embodiment of the disclosure, the preset matching rules may be set based on target intelligence source information and/or type keywords corresponding to at least one threat intelligence text type. The target intelligence source information may include the website address corresponding to the target intelligence source; a type keyword corresponding to a threat intelligence text type may be, for example, "daily security consultation" for the security daily report type.
For example, for intelligence source A, the title texts of a series of threat intelligence texts it publishes start with "daily safety newsletter" and the body text is a collection of the previous day's news, so rule 1 in the preset matching rules can be set as follows: if the intelligence source information of a threat intelligence text is determined to be A and its title text contains the type keyword "daily safety newsletter", the threat intelligence text type of that text can be determined to be "security daily report"; if not, the threat intelligence text is determined not to match rule 1.
For another example, for intelligence source B, the title texts of the threat intelligence texts it publishes include the phrase "vulnerability security monthly report", so rule 2 in the preset matching rules may be set as follows: if the intelligence source information of a threat intelligence text is determined to be B and its title text contains the type keywords "security monthly report" and "vulnerability", the threat intelligence text types of that text can be determined to be "security report" and "vulnerability". It should be noted that a threat intelligence text of the "security report" type in the embodiments of the present disclosure refers to a report whose summary period is one month or longer.
In practical applications, preset matching rules can be set jointly over the dimensions of target intelligence source information, URL information, title text and body text to judge the threat intelligence text type of a threat intelligence text. Specifically, the rules can be set by combining the characteristics of the target intelligence source with those of the threat intelligence texts it publishes; in view of the principle of intelligence source confidentiality, the details are not described herein.
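The two example rules above can be represented as data and matched as in the following sketch. The `Rule` class, the `match_rule` function and the English strings are illustrative renderings of the rule 1 and rule 2 examples, not the actual rule set, which the disclosure does not publish. Matching here uses only the source and title keywords; URL and body text conditions are left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    source: str                # intelligence source identifier
    keywords: Tuple[str, ...]  # type keywords that must all appear in the title
    types: Tuple[str, ...]     # threat intelligence text types assigned on a match


# Illustrative encodings of rule 1 and rule 2 (strings are assumed renderings).
RULES = (
    Rule("A", ("daily safety newsletter",), ("security daily report",)),
    Rule("B", ("security monthly report", "vulnerability"),
         ("security report", "vulnerability")),
)


def match_rule(source: str, title: str,
               rules: Sequence[Rule] = RULES) -> Optional[Rule]:
    """Return the first rule whose source and title keywords all match, else None."""
    for rule in rules:
        if source == rule.source and all(k in title for k in rule.keywords):
            return rule
    return None
```

A return value of `None` corresponds to the "not successfully matched" branch, in which the text is passed on to the classification models.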
S303: Determine the second threat intelligence text as the first threat intelligence text, input the title text of the first threat intelligence text into a trained title classification model, and output a first classification result corresponding to the first threat intelligence text after classification processing by the title classification model.
The first classification result is used for representing the threat information text type of the first threat information text from the dimension of the title text.
In the embodiment of the disclosure, if it is determined that a second threat intelligence text in a text set to be classified is not successfully matched with any one of preset matching rules, the second threat intelligence text is determined to be a first threat intelligence text.
In the embodiment of the disclosure, after determining the second threat information text as the first threat information text, the title text of the first threat information text is input into a trained title classification model, and after classification processing is performed by the title classification model, a first classification result corresponding to the first threat information text is output.
As can be seen, when it is determined that any rule of the second threat information text and the preset matching rule is not successfully matched, the embodiment of the disclosure determines the second threat information text as the first threat information text, inputs the title text of the first threat information text to the title classification model, outputs the first classification result of the first threat information text, and can characterize the threat information text type to which the first threat information text belongs from the dimension of the title text.
S304: Input the body text of the first threat intelligence text into a trained body classification model, and output a second classification result corresponding to the first threat intelligence text after classification processing by the body classification model.
The second classification result is used for representing the threat intelligence text type of the first threat intelligence text from the dimension of the body text.
In an optional implementation manner, after the first classification result corresponding to the first threat intelligence text is obtained, whether the body text of the first threat intelligence text needs to be classified subsequently can be decided by judging whether the first classification result includes a target threat intelligence text type. In the embodiment of the disclosure, when the first classification result is determined to include any target threat intelligence text type, the first classification result is directly determined as the text classification result corresponding to the first threat intelligence text, and the body text does not need to be classified, which improves the efficiency of threat intelligence text classification.
In the embodiment of the disclosure, the target threat intelligence text types may include security daily report, security weekly report and security report. When the first classification result is determined to include any of these types, the first classification result may be directly determined as the text classification result corresponding to the first threat intelligence text, thereby improving the efficiency of threat intelligence text classification.
In another alternative embodiment, if it is determined that the first classification result corresponding to the first threat information text does not include the target threat information text type, the body text of the first threat information text is input to the trained body classification model, and after classification processing is performed by the body classification model, a second classification result corresponding to the first threat information text is output, so that a text classification result corresponding to the first threat information text is determined conveniently based on the second classification result and the first classification result.
S305: Determine a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result.
The text classification result is used for representing the threat information text type to which the first threat information text belongs.
Therefore, the embodiment of the disclosure can classify the title text and the body text of the first threat intelligence text separately, and then determine the text classification result corresponding to the first threat intelligence text based on the first classification result corresponding to the title text and the second classification result corresponding to the body text, thereby remarkably improving the accuracy of threat intelligence text classification.
S306: Determine the threat intelligence text type corresponding to the first rule as the text classification result corresponding to the second threat intelligence text.
In the embodiment of the disclosure, if the second threat information text is successfully matched with the first rule in the preset matching rules, determining the threat information text type corresponding to the first rule as a text classification result corresponding to the second threat information text.
In the embodiment of the present disclosure, the matching process between the second threat information text and the preset matching rule may specifically refer to the setting portion of the preset matching rule in the foregoing embodiment, and the embodiment of the present disclosure is not described herein in detail.
In the embodiment of the disclosure, each rule in the preset matching rules corresponds to one or more threat intelligence text types. For example, the threat intelligence text type corresponding to rule 1 in the above embodiment is "security daily report", and the threat intelligence text types corresponding to rule 2 are "security report" and "vulnerability". Therefore, in practical applications, if the second threat intelligence text is determined to match a first rule in the preset matching rules, the threat intelligence text type corresponding to the first rule may be determined as the text classification result corresponding to the second threat intelligence text, based on the correspondence between the first rule and the threat intelligence text type.
Therefore, when the second threat intelligence text is successfully matched with a first rule in the preset matching rules, the embodiment of the disclosure directly determines the threat intelligence text type corresponding to the first rule as the text classification result corresponding to the second threat intelligence text, and can quickly and accurately judge whether the second threat intelligence text belongs to the security daily report, security weekly report or security report type, thereby improving the efficiency of threat intelligence text classification. In addition, the embodiment of the disclosure can classify the title text and the body text of the first threat intelligence text separately, and then determine the text classification result corresponding to the first threat intelligence text based on the first classification result corresponding to the title text and the second classification result corresponding to the body text, thereby remarkably improving the accuracy of threat intelligence text classification.
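The flow of S302 to S306, including the optional shortcut on target types, can be sketched as one function. The rule matcher and the two models are passed in as callables because their implementations are outside this sketch; the dictionary keys, the type strings and the function name are hypothetical. Combining the two results by merging and removing duplicates follows the merging sub-module described later in the text.

```python
# Assumed English names for the target threat intelligence text types.
TARGET_TYPES = frozenset(
    {"security daily report", "security weekly report", "security report"}
)


def classify_text(item, match_rule, title_model, body_model,
                  target_types=TARGET_TYPES):
    """Classify one threat intelligence text.

    item        -- dict with "title" and "body" keys (hypothetical layout)
    match_rule  -- callable(item) -> sequence of types, or None if no rule matches
    title_model -- callable(title) -> list of types (first classification result)
    body_model  -- callable(body)  -> list of types (second classification result)
    """
    # S302 / S306: a matching rule decides the result directly.
    rule_types = match_rule(item)
    if rule_types is not None:
        return list(rule_types)
    # S303: classify from the dimension of the title text.
    first = list(title_model(item["title"]))
    # Optional shortcut: skip the body model when a target type is present.
    if any(t in target_types for t in first):
        return first
    # S304: classify from the dimension of the body text.
    second = list(body_model(item["body"]))
    # S305: merge and de-duplicate, keeping first-seen order.
    return list(dict.fromkeys(first + second))
```

Passing the models as callables keeps the control flow testable with simple stand-ins before any trained model is available.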
In practical applications, the first classification result and the second classification result may each include the non-threat intelligence text type or other types. After the text classification result corresponding to the first threat intelligence text is determined based on the first and second classification results, the non-threat intelligence text type may therefore coexist with a specific threat intelligence text type in that result, which is contradictory. For this reason, the text classification result corresponding to the first threat intelligence text needs further processing after it has been determined.
In the embodiment of the disclosure, after determining a text classification result corresponding to a first threat information text, firstly judging whether the number of text types in the text classification result corresponding to the first threat information text is greater than a preset number, and if the number of text types in the text classification result is greater than the preset number and the text classification result does not contain non-threat information text types, directly outputting the text classification result. Wherein the setting of the preset number may be determined based on the threat intelligence text type, for example, the preset number may be set to 2 assuming that the threat intelligence text type includes a non-threat intelligence text type and other types.
For example, assuming that the text classification result corresponding to the first threat intelligence text includes malware, vulnerability and APT, it is obvious that after the judgment, the number of text types in the text classification result may be determined to be 3 (greater than or equal to a preset number of 2), and the text classification result does not include the non-threat intelligence text types and other types, so that the malware, vulnerability and APT may be directly determined as the text classification result corresponding to the first threat intelligence text.
In an alternative embodiment, if the number of text types in the text classification result is determined to be greater than the preset number, and the text classification result contains a non-threat intelligence text type or other types, the non-threat intelligence text type or other types are deleted from the text classification result.
For example, assuming that the text classification result corresponding to the first threat information text includes a vulnerability and a non-threat information text type, it is obvious that after the judgment, the number of the text types in the text classification result can be determined to be 2 (greater than or equal to a preset number of 2), and the text classification result includes the non-threat information text type, at this time, the non-threat information text type needs to be deleted, and the classification result after the non-threat information text type is deleted, namely, the vulnerability is determined to be the text classification result corresponding to the first threat information text.
For example, assuming that the text classification result corresponding to the first threat intelligence text includes APT, malware and other types, it is obvious that after the judgment, the number of the text types in the text classification result can be determined to be 3 (greater than or equal to the preset number of 2), and the text classification result includes the other types, at this time, the other types need to be deleted, and the classification result after deleting the other types, that is, the APT and the malware are determined to be the text classification result corresponding to the first threat intelligence text.
Therefore, after determining the text classification result corresponding to the first threat information text, the embodiment of the disclosure can further improve the accuracy of the text classification result by further judging and processing the text classification result.
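The post-processing described above can be sketched as follows. Note one assumption: the description says "greater than a preset number" in places, while its worked examples apply the check to a count of 2 with a preset number of 2 ("greater than or equal to"); this sketch follows the worked examples and uses `>=`. The type strings and the function name are hypothetical.

```python
# Assumed labels for the two placeholder types.
NON_THREAT = "non-threat intelligence"
OTHER = "other"


def postprocess(types, preset_number=2, placeholders=(NON_THREAT, OTHER)):
    """Drop placeholder types when the result holds enough types.

    Uses ">=" to agree with the worked examples (an assumption; see lead-in).
    The case where every type is a placeholder is not covered by the source
    and simply yields an empty list here.
    """
    if len(types) >= preset_number:
        return [t for t in types if t not in placeholders]
    return list(types)
```

With the three worked examples, the results are malware, vulnerability and APT unchanged; vulnerability alone after removing the non-threat type; and APT and malware after removing the other type.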
Based on the above method embodiment, the disclosure further provides a threat intelligence text classification apparatus, referring to fig. 4, which is a schematic structural diagram of a threat intelligence text classification apparatus provided in the embodiment of the disclosure, where the apparatus includes:
the first output module 401 is configured to input a title text of a first threat intelligence text in the text set to be classified into a trained title classification model, and output a first classification result corresponding to the first threat intelligence text after classification processing by the title classification model; the first threat intelligence text comprises the title text and a body text, and the first classification result is used for representing the threat intelligence text type of the first threat intelligence text from the dimension of the title text;
the second output module 402 is configured to input the body text of the first threat intelligence text into a trained body classification model, and output a second classification result corresponding to the first threat intelligence text after classification processing by the body classification model; the second classification result is used for representing the threat intelligence text type of the first threat intelligence text from the dimension of the body text;
a first determining module 403, configured to determine a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result; the text classification result is used for representing the threat information text type to which the first threat information text belongs.
In an alternative embodiment, the apparatus further comprises:
the second determining module is configured to determine the second threat intelligence text as the first threat intelligence text if it is determined that the second threat intelligence text in the text set to be classified is not successfully matched with any one of the preset matching rules, and perform the step of inputting the title text of the first threat intelligence text in the text set to be classified into a trained title classification model; the preset matching rule is set for a type keyword corresponding to the target information source information and/or at least one threat information text type.
In an optional implementation manner, the rules in the preset matching rules have a corresponding relation with the threat intelligence text types, and the device further includes:
and the third determining module is used for determining the threat information text type corresponding to the first rule as a text classification result corresponding to the second threat information text if the second threat information text is successfully matched with the first rule in the preset matching rules.
In an alternative embodiment, the apparatus further comprises:
and the execution module is used for executing the step of inputting the body text of the first threat intelligence text into the trained body classification model if the first classification result corresponding to the first threat intelligence text does not comprise the target threat intelligence text type.
In an alternative embodiment, the apparatus further comprises:
and the fourth determining module is used for directly determining the first classification result as a text classification result corresponding to the first threat information text if the first classification result corresponding to the first threat information text comprises any target threat information text type.
In an alternative embodiment, the first determining module includes:
and the merging processing sub-module is used for merging and de-duplicating the first classification result and the second classification result to obtain a text classification result corresponding to the first threat information text.
In an alternative embodiment, the apparatus further comprises:
the first judging module is used for judging whether the number of the text types in the text classification result corresponding to the first threat information text is larger than a preset number;
the second judging module is used for judging whether the text classification result contains a non-threat information text type or other types if the number of the text types in the text classification result is larger than the preset number;
and the deleting module is used for deleting the non-threat intelligence text type and other types from the text classification result if the text classification result contains the non-threat intelligence text type or other types.
In the threat intelligence text classification apparatus provided by the embodiment of the disclosure, the title text of the first threat intelligence text is input into a trained title classification model, and a first classification result corresponding to the first threat intelligence text is output after classification processing by the title classification model; the body text of the first threat intelligence text is input into a trained body classification model, and a second classification result corresponding to the first threat intelligence text is output after classification processing by the body classification model; and a text classification result corresponding to the first threat intelligence text is determined based on the first classification result and the second classification result. Therefore, the embodiment of the disclosure can classify the title text and the body text of the first threat intelligence text separately, and then determine the text classification result based on the first classification result corresponding to the title text and the second classification result corresponding to the body text, thereby remarkably improving the accuracy of threat intelligence text classification.
In addition to the above method and apparatus, the embodiments of the present disclosure further provide a computer readable storage medium, where instructions are stored, when the instructions are executed on a terminal device, to cause the terminal device to implement the threat intelligence text classification method according to the embodiments of the present disclosure.
The disclosed embodiments also provide a computer program product comprising computer programs/instructions which, when executed by a processor, implement the threat intelligence text classification method of the disclosed embodiments.
In addition, the embodiment of the disclosure further provides a device for classifying threat intelligence text, which is shown in fig. 5, and may include:
a processor 501, a memory 502, an input device 503 and an output device 504. The number of processors 501 in the threat intelligence text classification apparatus may be one or more, one processor being exemplified in fig. 5. In some embodiments of the present disclosure, the processor 501, memory 502, input device 503, and output device 504 may be connected by a bus or other means, with bus connections being exemplified in fig. 5.
The memory 502 may be used to store software programs and modules, and the processor 501 performs various functional applications and data processing of the threat intelligence text classification apparatus by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for at least one function, and the like. In addition, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The input device 503 may be used to receive entered numeric or character information and to generate signal inputs related to user settings and function control of the threat intelligence text classification apparatus.
In particular, in this embodiment, the processor 501 loads executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, so as to implement the various functions of the above-mentioned threat intelligence text classification device.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of classifying threat intelligence text, the method comprising:
inputting a title text of a first threat intelligence text in a text set to be classified into a trained title classification model, and outputting a first classification result corresponding to the first threat intelligence text after classification processing by the title classification model; the first threat intelligence text comprises the title text and a body text, and the first classification result is used for representing the threat intelligence text type of the first threat intelligence text from the dimension of the title text;
inputting the body text of the first threat intelligence text into a trained body classification model, and outputting a second classification result corresponding to the first threat intelligence text after classification processing by the body classification model; the second classification result is used for representing the threat intelligence text type of the first threat intelligence text from the dimension of the body text;
Determining a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result; the text classification result is used for representing the threat information text type to which the first threat information text belongs.
2. The method of claim 1, wherein before entering the headline text of the first threat intelligence text in the set of texts to be classified into the trained headline classification model, further comprising:
if it is determined that the second threat intelligence text in the text set to be classified is not successfully matched with any one of the preset matching rules, determining the second threat intelligence text as the first threat intelligence text, and executing the step of inputting the title text of the first threat intelligence text in the text set to be classified into a trained title classification model; the preset matching rule is set for a type keyword corresponding to the target information source information and/or at least one threat information text type.
3. The method for classifying threat intelligence text according to claim 2, wherein the rule in the preset matching rule has a correspondence relationship with the threat intelligence text type, the method further comprising:
If the second threat information text is successfully matched with the first rule in the preset matching rules, determining the threat information text type corresponding to the first rule as a text classification result corresponding to the second threat information text.
4. The method of claim 1, wherein before entering the body text of the first threat intelligence text into the trained body classification model, further comprising:
and if the first classification result corresponding to the first threat information text does not comprise the target threat information text type, executing the step of inputting the body text of the first threat information text into a trained body classification model.
5. The method of claim 4, further comprising:
if the first classification result corresponding to the first threat information text is determined to comprise any target threat information text type, the first classification result is directly determined to be a text classification result corresponding to the first threat information text.
6. The method for classifying threat intelligence text of claim 1, wherein determining a text classification result corresponding to the first threat intelligence text based on the first classification result and the second classification result comprises:
And merging and deduplicating the first classification result and the second classification result to obtain a text classification result corresponding to the first threat information text.
7. The method for classifying threat intelligence text according to claim 1, wherein after determining the text classification result corresponding to the first threat intelligence text, further comprises:
judging whether the number of text types in a text classification result corresponding to the first threat information text is larger than a preset number or not;
if the number of the text types in the text classification result is determined to be larger than the preset number, judging whether the text classification result contains non-threat information text types or other types;
and deleting the non-threat intelligence text type and other types from the text classification result if the text classification result is determined to contain the non-threat intelligence text type or other types.
8. A threat intelligence text classification apparatus, the apparatus comprising:
the first output module is used for inputting the title text of the first threat intelligence text in the text set to be classified into a trained title classification model, and outputting a first classification result corresponding to the first threat intelligence text after classification processing by the title classification model; the first threat intelligence text comprises the title text and a body text, and the first classification result is used for representing the threat intelligence text type of the first threat intelligence text from the dimension of the title text;
the second output module is used for inputting the body text of the first threat intelligence text into a trained body classification model, and outputting a second classification result corresponding to the first threat intelligence text after classification processing by the body classification model; the second classification result is used for representing the threat intelligence text type of the first threat intelligence text from the dimension of the body text;
the first determining module is used for determining a text classification result corresponding to the first threat information text based on the first classification result and the second classification result; the text classification result is used for representing the threat information text type to which the first threat information text belongs.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to implement the method of any of claims 1-7.
10. A threat intelligence text classification apparatus, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-7 when executing the computer program.
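
The way the title-level and body-level results are combined in claims 5 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the label strings `non_threat` and `other`, the function names, the default preset number of 1, and the toy keyword "models" are all hypothetical stand-ins. The sketch also assumes that the claim 5 check happens before the body model is consulted and that the claim 7 pruning applies on either path, which the claims shown here do not state explicitly.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical label names: the claims mention a "non-threat information text
# type" and "other types" but do not say how they are encoded.
NON_THREAT = "non_threat"
OTHER = "other"

Classifier = Callable[[str], List[str]]


def merge_dedup(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Claim 6: merge both results and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys([*first, *second]))


def prune(labels: List[str], preset_number: int) -> List[str]:
    """Claim 7: when there are more labels than preset_number, delete the
    non-threat and other labels if either of them is present."""
    if len(labels) > preset_number and (NON_THREAT in labels or OTHER in labels):
        return [label for label in labels if label not in (NON_THREAT, OTHER)]
    return labels


def classify(title: str, body: str, title_model: Classifier,
             body_model: Classifier, target_types: Set[str],
             preset_number: int = 1) -> List[str]:
    """Combine the title-level and body-level results for one document."""
    first = title_model(title)
    if any(label in target_types for label in first):
        # Claim 5: a target type found from the title alone is used directly.
        result = list(first)
    else:
        # Claim 6: otherwise fuse the title-level and body-level results.
        result = merge_dedup(first, body_model(body))
    # Assumption: pruning (claim 7) is applied after either path.
    return prune(result, preset_number)


# Toy stand-ins for the trained models, for illustration only.
def toy_title_model(title: str) -> List[str]:
    return ["vulnerability"] if "CVE" in title else [OTHER]


def toy_body_model(body: str) -> List[str]:
    return ["malware", OTHER] if "trojan" in body else [NON_THREAT]


if __name__ == "__main__":
    targets = {"vulnerability"}
    print(classify("New CVE disclosed", "details", toy_title_model, toy_body_model, targets))
    print(classify("Weekly roundup", "a trojan spreads", toy_title_model, toy_body_model, targets))
```

In this sketch a result consisting only of the non-threat or other labels is left alone as long as it does not exceed the preset number, so a document is never left without a label merely because it is benign; once the preset number is exceeded, those labels are deleted as claim 7 describes.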

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310671897.XA CN116737926A (en) 2023-06-07 2023-06-07 Method, device, equipment and storage medium for classifying threat information text

Publications (1)

Publication Number Publication Date
CN116737926A (en) 2023-09-12

Family

ID=87914456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310671897.XA Pending CN116737926A (en) 2023-06-07 2023-06-07 Method, device, equipment and storage medium for classifying threat information text

Country Status (1)

Country Link
CN (1) CN116737926A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190334942A1 (en) * 2018-04-30 2019-10-31 Microsoft Technology Licensing, Llc Techniques for curating threat intelligence data
CN110968687A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Method and device for classifying texts
CN113468339A (en) * 2021-06-24 2021-10-01 北京明略软件系统有限公司 Label extraction method, system, electronic device and medium based on knowledge graph
CN113672734A (en) * 2021-08-23 2021-11-19 倪显虎 Long text classification method based on deep learning composite model
CN115827864A (en) * 2022-12-06 2023-03-21 企查查科技有限公司 Processing method for automatic classification of bulletins

Similar Documents

Publication Publication Date Title
Li et al. A stacking model using URL and HTML features for phishing webpage detection
WO2020244066A1 (en) Text classification method, apparatus, device, and storage medium
Buber et al. NLP based phishing attack detection from URLs
CN107992764B (en) Sensitive webpage identification and detection method and device
RU2704531C1 (en) Method and apparatus for analyzing semantic information
CN110929145B (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN110909531B (en) Information security screening method, device, equipment and storage medium
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
Gaglani et al. Unsupervised WhatsApp fake news detection using semantic search
CN111899089A (en) Enterprise risk early warning method and system based on knowledge graph
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
Kejriwal et al. On detecting urgency in short crisis messages using minimal supervision and transfer learning
CN112989348A (en) Attack detection method, model training method, device, server and storage medium
CN102831116A (en) Method and system for document clustering
Samonte Polarity analysis of editorial articles towards fake news detection
CN109284465B (en) URL-based web page classifier construction method and classification method thereof
CN116089732B (en) User preference identification method and system based on advertisement click data
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
CN116737926A (en) Method, device, equipment and storage medium for classifying threat information text
CN116055067A (en) Weak password detection method, device, electronic equipment and medium
CN107491530B (en) Social relationship mining analysis method based on file automatic marking information
CN113472686B (en) Information identification method, device, equipment and storage medium
CN115883111A (en) Phishing website identification method and device, electronic equipment and storage medium
CN109063117B (en) Network security blog classification method and system based on feature extraction
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination