WO2021227831A1 - Topic detection method and device for threat intelligence, and computer storage medium - Google Patents

Topic detection method and device for threat intelligence, and computer storage medium

Info

Publication number
WO2021227831A1
WO2021227831A1 PCT/CN2021/089290 CN2021089290W WO2021227831A1 WO 2021227831 A1 WO2021227831 A1 WO 2021227831A1 CN 2021089290 W CN2021089290 W CN 2021089290W WO 2021227831 A1 WO2021227831 A1 WO 2021227831A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
threat intelligence
topic
detected
candidate
Prior art date
Application number
PCT/CN2021/089290
Other languages
English (en)
French (fr)
Inventor
范如
范渊
Original Assignee
杭州安恒信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州安恒信息技术股份有限公司
Publication of WO2021227831A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of information security technology, and in particular to a topic detection method and device for threat intelligence, and a computer storage medium.
  • In new-generation network defense systems, open source information is frequently processed. However, open source information includes threatening material such as Internet vulnerabilities, malicious viruses, and hacking tools, and anyone can obtain, use, and spread this open source threat information over the Internet, which has a major impact on the security of open source information on the Internet. At the same time, typical network defense systems vectorize large-scale text data inefficiently, mine the semantics of high-dimensional vectors poorly, and cannot detect open source Internet threats in time.
  • The embodiments of the present application provide a topic detection method and device for threat intelligence, and a computer storage medium, so as to at least solve the problem in the related art that threat topics cannot be discovered in time.
  • In a first aspect, an embodiment of the present application provides a topic detection method for threat intelligence, including: crawling threat intelligence text to be detected from a preset data source; extracting a set of candidate words from the threat intelligence text to be detected and extracting multiple key features from the set of candidate words; fusing the multiple key features to obtain text features of the threat intelligence text to be detected; and using a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into an existing topic or a new topic according to its text features.
  • extracting a set of candidate words from the threat intelligence text to be detected includes:
  • Preprocessing the threat intelligence text to be detected to obtain the candidate word set, where the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
  • extracting the keyword features from the candidate word set includes:
  • Extracting keywords from the candidate word set, determining the term frequency and inverse document frequency of each keyword, determining the weight value of the keyword according to the term frequency and the inverse document frequency, and determining the keyword features according to the weight values of the keywords.
  • extracting the topic word features from the candidate word set includes:
  • Candidate topic words are extracted from the candidate word set, the similarity between the candidate topic words and preset label category words is calculated, the weight values of the candidate topic words are determined according to the similarity, and the topic word features are determined according to the weight values of the candidate topic words.
  • extracting the entity features from the candidate word set includes: identifying entity candidate words in the candidate word set and deleting entity candidate words with preset parts of speech from the entity candidate words to obtain the entity features.
  • a hierarchical clustering algorithm is used to cluster the threat intelligence text to be detected into existing topics or new topics according to the text features of the threat intelligence text to be detected.
  • when the similarity between the text features of the threat intelligence text to be detected and the text features of an existing topic at the current level is higher than the corresponding preset threshold, the threat intelligence text to be detected is classified into a topic at the next level below the current level.
  • the method further includes:
  • the method further includes:
  • an embodiment of the present application provides a threat intelligence subject detection device, including:
  • the acquisition module is used to crawl the threat intelligence text to be detected from the preset data source;
  • the extraction module is used to extract a set of candidate words from the threat intelligence text to be detected and to extract multiple key features from the set of candidate words, where the key features include: keyword features, topic word features, and/or entity features;
  • the fusion module is used to fuse the multiple key features to obtain the text features of the threat intelligence text to be detected;
  • the processing module is configured to use a hierarchical clustering algorithm to cluster the threat information text to be detected into existing topics or newly added topics according to the text characteristics of the threat information text to be detected.
  • the acquisition module includes:
  • the preprocessing unit is used to preprocess the threat intelligence text to be detected to obtain the candidate word set; the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
  • the extraction module includes:
  • the first extraction unit is configured to extract keywords from the candidate word set, determine the term frequency and inverse document frequency of the keywords, determine the weight values of the keywords according to the term frequency and the inverse document frequency, and determine the keyword features according to the weight values of the keywords.
  • the extraction module further includes:
  • the second extraction unit is configured to extract candidate topic words from the candidate word set, calculate the similarity between the candidate topic words and preset label category words, and determine the weight value of the candidate topic words according to the similarity, And determine the topic word feature according to the weight value of the candidate topic word.
  • the extraction module further includes:
  • the third extraction unit is used to identify entity candidate words from the candidate word set, delete entity candidate words with preset parts of speech from the entity candidate words, and obtain the entity characteristics.
  • the processing module includes:
  • the first judgment unit is used to judge whether the similarity between the text features of the threat intelligence text to be detected and the text features of an existing topic at the current level in an existing topic cluster is higher than the preset threshold corresponding to that existing topic at the current level;
  • the first classification unit is configured to, when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is higher than the preset threshold, classify the threat intelligence text to be detected into a topic at the next level below the current level.
  • the device further includes:
  • the first processing module is configured to, when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is not higher than the preset threshold, add a new topic under the current level and classify the threat intelligence text to be detected into the new topic.
  • the device further includes:
  • the second processing module is used to select a baseline threat intelligence text from multiple threat intelligence texts clustered into the same existing topic, calculate the similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and use the average of the resulting similarities as the preset threshold corresponding to that existing topic.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the topic detection method for threat intelligence described in the first aspect is implemented.
  • The topic detection method and device for threat intelligence and the computer storage medium crawl the threat intelligence text to be detected from a preset data source, extract a set of candidate words from the threat intelligence text to be detected and extract multiple key features from the set of candidate words, fuse the multiple key features to obtain text features of the threat intelligence text to be detected, and use a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into existing topics or new topics according to those text features. This solves the problem in the related art that threat topics cannot be discovered in time, and enables efficient and accurate detection and extraction of threat topics from massive document data.
  • Fig. 1 is a flowchart of a topic detection method for threat intelligence according to an embodiment of the present application.
  • Fig. 2 is a flowchart of the preprocessing of the threat intelligence text to be detected in an embodiment of the present application.
  • Fig. 3 is a flowchart of key feature extraction in an embodiment of the present application.
  • Fig. 4 is a framework diagram of threat topic detection according to an embodiment of the present application.
  • Fig. 5 is a flowchart of threat topic discovery and tracking in an embodiment of the present application.
  • Fig. 6 is a structural diagram of a topic detection device for threat intelligence according to an embodiment of the present application.
  • Fig. 7 is an internal structure diagram of a computer device according to an embodiment of the present application.
  • Terms such as "connected" and "coupled" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect.
  • plurality refers to two or more.
  • And/or describes the association relationship of the associated objects, which means that there can be three kinds of relationships. For example, “A and/or B” can mean: A alone exists, A and B exist at the same time, and B exists alone.
  • the character “/” generally indicates that the associated objects before and after are in an “or” relationship.
  • the terms “first”, “second”, “third”, etc. involved in this application merely distinguish similar objects, and do not represent a specific order for the objects.
  • CTI: Cyber Threat Intelligence
  • The Word2Vec model is a class of neural network models that, given an unlabeled corpus, represents the words in the corpus with word vectors capable of expressing semantic information.
  • Word vectors trained by the Word2Vec model contain the semantic information of words and can reflect the linear relationships between words.
  • The LDA topic model is a document topic generation model.
  • FIG. 1 is a flowchart of a threat intelligence subject detection method according to an embodiment of the present application. As shown in Fig. 1, the process includes the following steps:
  • Step S101 Crawling the threat intelligence text to be detected from the preset data source.
  • In this embodiment, the threat intelligence text to be detected is crawled from preset data sources by monitoring a set of websites and collecting, in real time, article and comment data from security information platforms of a specific category, such as the forums, news pages, and blogs of security websites.
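  • As a concrete illustration of this collection step, the following is a minimal sketch (not part of the application itself) that polls a list of security news or blog pages with the Python requests and BeautifulSoup libraries; the URL and the CSS structure assumed below are placeholders, since the application does not name its data sources.

```python
# Minimal illustrative crawler: poll preset security sources and collect article text.
# The URL and the <article> selector are placeholders, not sources named in the application.
import requests
from bs4 import BeautifulSoup

PRESET_SOURCES = [
    "https://example-security-blog.invalid/latest",  # hypothetical forum/news/blog endpoint
]

def crawl_pending_texts():
    texts = []
    for url in PRESET_SOURCES:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable sources
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assume each article body sits in an <article> tag; adjust per site.
        for article in soup.find_all("article"):
            texts.append(article.get_text(separator=" ", strip=True))
    return texts
```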
  • Step S102 Extract a set of candidate words from the threat intelligence text to be detected, and extract a variety of key features from the set of candidate words, where the key features include: keyword features, topic word features, and/or entity features.
  • Step S103 Fusion of multiple key features to obtain the text features of the threat intelligence text to be detected.
  • Step S104 Using a hierarchical clustering algorithm, cluster the threat intelligence text to be detected into an existing topic or a new topic according to the text characteristics of the threat intelligence text to be detected.
  • Through the above steps S101 to S104, the threat intelligence text to be detected is crawled from the preset data source; a candidate word set is extracted from the threat intelligence text to be detected, and multiple key features are extracted from the candidate word set; the multiple key features are fused to obtain the text features of the threat intelligence text to be detected; and a hierarchical clustering algorithm clusters the threat intelligence text to be detected into existing topics or new topics based on those text features. This solves the problems in the related art of low efficiency in processing massive document data, imprecise mining of high-dimensional data features, and failure to discover threat topics in time: the key features extracted from the threat intelligence text to be detected are fused into the text features of the text, the topic type of the threat intelligence text to be detected is judged by clustering on those features, massive document data is processed efficiently, high-dimensional data features are mined accurately, and threat topics are discovered in time.
  • the extraction of the candidate word set from the threat intelligence text to be detected in step S102 is implemented by the following steps:
  • Step S102-1 Preprocessing the threat intelligence text to be detected to obtain a set of candidate words; the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
  • Fig. 2 is a flowchart of the preprocessing of the threat intelligence text to be detected. As shown in Fig. 2, the preprocessing of the threat intelligence text to be detected is further elaborated as follows:
  • the preprocessing includes the following steps:
  • The embodiment of the present application uses the Natural Language Toolkit (NLTK), a natural language processing (NLP) library, for part-of-speech tagging.
  • NLTK: Natural Language Toolkit
  • NLP: natural language processing
  • Because words with parts of speech such as determiners, cardinal numbers, and quantifiers are unrelated to the topic of an article, the preprocessing stage removes words with parts of speech such as "determiner (DT)", "adverb (RB)", "cardinal number (CD)", "coordinating conjunction (CC)", "existential there (EX)", "preposition or subordinating conjunction (IN)", "adjective, comparative (JJR)", "modal auxiliary (MD)", "WH-determiner (WDT)", and "WH-pronoun (WP)". The remaining words are then lemmatized, so that words with the same base form but different inflections are restored to their common form.
  • the retained words are used as candidate words for key feature extraction.
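  • A minimal sketch of such a preprocessing pass with NLTK is given below. The tag list mirrors the one above; the function name, the English-only stop word list, and the decision to drop non-alphabetic tokens are assumptions of this illustration rather than details stated in the application.

```python
# Illustrative preprocessing: stop word/punctuation removal, lowercasing,
# POS tagging and filtering, and lemmatization with NLTK
# (the required nltk.download(...) resources are assumed to be installed).
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

REMOVED_TAGS = {"DT", "RB", "CD", "CC", "EX", "IN", "JJR", "MD", "WDT", "WP"}
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def extract_candidate_words(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens
              if t not in STOP_WORDS and t not in string.punctuation and t.isalpha()]
    tagged = nltk.pos_tag(tokens)
    kept = [word for word, tag in tagged if tag not in REMOVED_TAGS]
    # Lemmatize so that different inflections map to one base form.
    return [LEMMATIZER.lemmatize(w) for w in kept]
```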
  • A keyword feature extraction method optimized on the basis of term frequency-inverse document frequency (TF-IDF).
  • TF-IDF is a method for reducing the dimensionality of, or extracting, text features.
  • Its main idea is that the more frequently a word appears in an article, the higher its term frequency (TF), and the more rarely the word appears in other articles, the higher its inverse document frequency (IDF); a word with both properties is highly distinctive.
  • The inverse document frequency (IDF) of each word in an article is otherwise a fixed value, while the threat intelligence data set changes dynamically.
  • The inverse document frequency over a dynamic data set cannot be represented well by a fixed IDF set.
  • The existing TF calculation does not consider the influence of stop words, the part of speech of a word, or the position of a word in the text on the word's weight; as a result, many topic words of an article are misjudged as non-keywords when the algorithm is applied directly.
  • The TF-IDF calculation used in this application extracts keywords as follows: first, words that cannot be article keywords, such as stop words, quantifiers, and determiners, are removed from the text, and the remaining words are used as keyword candidates.
  • The TF-IDF algorithm used in this application also considers the position of a word in the article; words in different positions have different importance, for example, a word in the title is more important than a word in the body.
  • The improved TF used in this application is based on the position of the candidate keyword. The formula is as follows:
  • TF(t,d) in the formula represents the probability of t appearing in document d
  • T represents the set of words in the title
  • C represents the set of words in the text
  • α represents the weight of words appearing in the title; the larger α is, the more important a word in the title is.
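  • The formula itself is published as an image and is not reproduced in this text. One position-weighted form consistent with the variable definitions above (an assumption of this edit, not necessarily the exact published equation) is:

$$\mathrm{TF}(t,d)=\frac{\alpha\, n_T(t)+n_C(t)}{\sum_{t'\in d}\bigl(\alpha\, n_T(t')+n_C(t')\bigr)}$$

where $n_T(t)$ and $n_C(t)$ denote the counts of $t$ in the title word set $T$ and the body word set $C$ of document $d$.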
  • the calculation of the IDF part is optimized into an incremental IDF method to solve the problem that the IDF value does not change dynamically with the data set.
  • the formula used is as follows:
  • N_c represents the total number of documents in the database in the current time period.
  • n(t, c) represents the number of documents in the database that contain the word t. Because the data set in the database changes dynamically, this application changes the weights of the candidate words by letting N_c and n(t, c) vary dynamically with time.
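  • This formula is likewise published as an image; one common incremental form consistent with the definitions above (again an assumption rather than the published equation) is:

$$\mathrm{IDF}(t,c)=\log\frac{N_c}{n(t,c)+1}$$

recomputed as $N_c$ and $n(t,c)$ grow with the document database over time.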
  • The document topic generation model (LDA topic model) is used to extract the topics of an article. However, because the candidate topic words extracted by the LDA topic model are fairly broad, high-frequency words are likely to appear among the candidate topic words even though they do not reflect the topic of the article well, so the topic feature words need to be further filtered from the candidate topic words.
  • Because the topic words of an article are closely related to the article's category, this application calculates the similarity between the candidate topic word vectors and the label category word vectors, uses this similarity as a weight coefficient for the candidate topic words, recalculates the candidate topic word weights, and filters the topic words accordingly.
  • FIG. 3 is a flowchart of key feature extraction. As shown in FIG. 3, in some of the embodiments, the extraction of keyword features from the candidate word set in step S102 is implemented by the following steps:
  • Step S102-2 Extract keywords from the set of candidate words, determine the word frequency and inverse document frequency of the keywords, determine the weight value of the keyword according to the word frequency and the inverse document frequency, and determine the keyword feature according to the weight value of the keyword.
  • The keyword feature extraction method in this embodiment is as follows: first, the TF method that considers word position is used to compute the term frequency of the keyword candidates in the article; then, the incremental IDF method is used to compute the inverse document frequency of the keyword candidates; finally, the weights of the keyword candidates are computed with the TF-IDF method described above, and the keyword features of the article are extracted.
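  • A compact sketch of this keyword scoring, using the hedged TF and incremental IDF forms given earlier (the function and parameter names, and the default value of alpha, are illustrative assumptions):

```python
import math
from collections import Counter

def keyword_weights(title_words, body_words, doc_count, doc_freq, alpha=2.0):
    """Position-weighted TF multiplied by an incrementally updated IDF.

    doc_count: total documents currently in the database (N_c).
    doc_freq:  dict mapping word -> number of documents containing it (n(t, c)).
    alpha:     extra weight for words that appear in the title.
    """
    title_counts, body_counts = Counter(title_words), Counter(body_words)
    raw = {t: alpha * title_counts[t] + body_counts[t]
           for t in set(title_words) | set(body_words)}
    total = sum(raw.values()) or 1.0
    weights = {}
    for term, count in raw.items():
        tf = count / total
        idf = math.log(doc_count / (doc_freq.get(term, 0) + 1))
        weights[term] = tf * idf
    return weights  # the top-weighted terms serve as the article's keyword features
```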
  • the extraction of topic word features from the candidate word set in step S102 is implemented by the following steps:
  • Step S102-3 Extract candidate topic words from the candidate word set, calculate the similarity between the candidate topic words and the preset label category words, determine the weight value of the candidate topic words according to the similarity, and determine the topic according to the weight value of the candidate topic words Word characteristics.
  • The topic word feature extraction method in this embodiment is as follows: first, the LDA model is used to extract candidate topic words for the article; then, the Word2Vec model from natural language processing (NLP) is used to train word vectors for the articles; next, the similarity between the candidate topic word vectors and the label category word vectors is calculated and used as a coefficient on the candidate topic word weights, the weight values of the candidate topic words are updated, and the weights of candidate topic words that are highly similar to the label categories are increased; finally, the weights of candidate topic words with certain parts of speech or contained in the title are increased, and the topic word features of the article are extracted.
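  • A minimal sketch of this re-weighting using the gensim LDA and Word2Vec implementations is shown below; the label category words, the model hyperparameters, the use of the document's dominant LDA topic, and the title boost factor are placeholders for choices the application does not specify.

```python
# Illustrative topic-word feature extraction: LDA candidates re-weighted by
# Word2Vec similarity to label category words (hyperparameters are placeholders).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

def topic_word_features(docs_tokens, doc_tokens, label_words,
                        num_topics=10, topn=20, title_words=frozenset(), boost=1.5):
    dictionary = Dictionary(docs_tokens)
    corpus = [dictionary.doc2bow(toks) for toks in docs_tokens]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    w2v = Word2Vec(sentences=docs_tokens, vector_size=100, min_count=1)

    # Candidate topic words: top terms of the document's dominant LDA topic.
    bow = dictionary.doc2bow(doc_tokens)
    doc_topics = lda.get_document_topics(bow, minimum_probability=0.0)
    topic_id = max(doc_topics, key=lambda x: x[1])[0]

    features = {}
    for word, lda_weight in lda.show_topic(topic_id, topn=topn):
        if word not in w2v.wv:
            continue
        sims = [w2v.wv.similarity(word, lw) for lw in label_words if lw in w2v.wv]
        if not sims:
            continue
        weight = lda_weight * max(sims)   # similarity to label categories as coefficient
        if word in title_words:
            weight *= boost               # boost title words, per the description
        features[word] = weight
    return features
```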
  • the extraction of entity features from the candidate word set in step S102 is implemented by the following steps:
  • The entity feature extraction method is as follows: first, the person, location, and organization entities of each article are obtained; then, entities with certain parts of speech are removed, because such entity words (entity words with preset parts of speech) cannot be the entity features required by the embodiment of the present application; finally, the entity features of the article are extracted.
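  • A minimal sketch of this step with NLTK's named-entity chunker is shown below; the set of excluded part-of-speech tags is a placeholder for the "preset parts of speech" that the application leaves unspecified.

```python
# Illustrative entity feature extraction: keep PERSON / GPE / ORGANIZATION entities,
# then drop entity words whose POS tag is in a preset exclusion set.
import nltk

KEPT_LABELS = {"PERSON", "GPE", "ORGANIZATION"}
EXCLUDED_POS = {"JJ", "CD"}          # placeholder for the "preset parts of speech"

def entity_features(text):
    entities = []
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        for chunk in nltk.ne_chunk(tagged):
            if hasattr(chunk, "label") and chunk.label() in KEPT_LABELS:
                words = [(w, t) for w, t in chunk.leaves() if t not in EXCLUDED_POS]
                if words:
                    entities.append(" ".join(w for w, _ in words))
    return entities
```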
  • In the embodiments of the application, after the multiple key features are extracted from the candidate word set, a feature fusion method merges the extracted keyword features, topic word features, and entity features to obtain the key features of the article, and an article feature vector constructed from these key features is used as the input of the hierarchical topic clustering.
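  • The application does not spell out the fusion operator. One simple reading, sketched below, merges the weighted term sets into a single normalized sparse vector over a shared vocabulary; the flat entity weight and the equal treatment of the three feature types are assumptions of this illustration.

```python
# Illustrative feature fusion: merge keyword, topic-word and entity weights
# into one article feature vector over a shared vocabulary.
import numpy as np

def fuse_features(keyword_w, topic_w, entity_terms, vocabulary, entity_weight=1.0):
    index = {term: i for i, term in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary))
    for source in (keyword_w, topic_w):          # weighted dicts from earlier sketches
        for term, weight in source.items():
            if term in index:
                vec[index[term]] += weight
    for term in entity_terms:                    # entities contribute a flat weight
        if term in index:
            vec[index[term]] += entity_weight
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec           # normalized input for clustering
```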
  • the hierarchical clustering algorithm is adopted in step S104, and the threat intelligence text to be detected is clustered into existing topics or newly added topics according to the text characteristics of the threat intelligence text to be detected through the following steps:
  • Step S104-1 Determine whether the similarity between the text feature of the threat intelligence text to be detected and the text feature of the existing topic at the current level in the existing topic cluster is higher than the preset threshold corresponding to the existing topic at the current level;
  • Step S104-2 When the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is higher than the preset threshold, the threat intelligence text to be detected is classified into a topic at the next level below the current level.
  • the hierarchical clustering algorithm used in this application is an improved hierarchical clustering model that uses vector product similarity on the hierarchical clustering model based on the centroid linkage method.
  • the hierarchical clustering algorithm used in this application performs topic clustering on the threat intelligence text to be detected, including topic clustering and topic tracking.
  • Topic clustering mainly uses the clustering model to cluster the text features of the threat intelligence texts to be detected in each time period, yielding real-time topics; topic tracking compares the similarity of the real-time topics with existing topics and identifies new topics and the continuation of existing topics in real time.
  • The topic detection method for threat intelligence of this application calculates the similarity between the currently clustered topic and the existing topics and selects the existing topic with the highest similarity to the current topic. If that similarity is greater than the set threshold, the text of the current topic is merged into the existing topic cluster; otherwise, a new topic cluster is created and inserted into the database.
  • The basis for selecting topic seed articles is to compute the similarity between the text features of a given article and the text features of all articles in the topic, take the average of those similarities, and select the top N articles with the highest averages as the topic seed articles; that is, if an article is highly related to all articles in the topic, it is a seed article. The similarity between the text features of all seed articles of the existing topic and the text features of the seed articles of the currently clustered topic is then calculated, and the arithmetic mean of these similarities is taken as the similarity between the two topics.
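  • A sketch of this seed-article selection and topic-to-topic similarity, assuming cosine similarity over the fused feature vectors (N and the similarity measure are parameters the text leaves open):

```python
# Illustrative seed selection: an article whose average similarity to all other
# articles in the topic is highest is treated as a seed article.
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def seed_articles(article_vectors, n_seeds=3):
    avg_sims = []
    for i, v in enumerate(article_vectors):
        sims = [cosine(v, w) for j, w in enumerate(article_vectors) if j != i]
        avg_sims.append(sum(sims) / len(sims) if sims else 0.0)
    order = np.argsort(avg_sims)[::-1][:n_seeds]
    return [article_vectors[i] for i in order]

def topic_similarity(existing_vectors, current_vectors, n_seeds=3):
    # Arithmetic mean of pairwise seed-to-seed similarities between the two topics.
    pairs = [cosine(a, b)
             for a in seed_articles(existing_vectors, n_seeds)
             for b in seed_articles(current_vectors, n_seeds)]
    return sum(pairs) / len(pairs)
```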
  • The centroid linkage method mentioned above is a hierarchical clustering algorithm that takes the topic centroid vector as the topic vector, where each dimension of the topic centroid vector is the average of the corresponding dimension of the feature vectors of all articles in the topic.
  • The centroid linkage method measures topic vector similarity using cosine similarity and Euclidean distance.
  • However, a small cosine angle between topic vectors, that is, a high topic vector similarity, does not mean that the texts (articles) of the two topics are all similar to one another; measuring inter-topic similarity with centroid linkage therefore cannot ensure that the similarity between the articles of the merged new topic stays within an acceptable range.
  • To ensure that the articles of the merged new topic (which is an existing topic relative to the currently clustered topic) remain similar within an acceptable range, that is, to ensure that all articles of the existing topic selected for similarity comparison are highly related to that topic, the hierarchical clustering algorithm of this application applies a vector product method to compute the relevance of all articles within the existing topic before selecting the existing topic for the similarity calculation, so that after the vector product operation all articles of the existing topic are related to the topic.
  • For example, when the merged new topic (an existing topic relative to the currently clustered topic) is related to multiple article vectors,
  • after the vector product operation the merged new topic is proportional to the vector product of those article vectors.
  • As long as the new topic is proportional to the vector product of the article vectors it contains, all articles characterizing the merged new topic are highly related.
  • The hierarchical clustering algorithm used in this application keeps the advantage of centroid linkage, namely the low computational complexity of measuring the similarity of two topics, while the vector product operation mitigates the "aggregation effect" that centroid linkage tends to produce when computing topic vector similarity.
  • The aggregation effect refers to the tendency, as the number of topic merges grows, for a topic to accumulate a large number of articles while some articles within the topic have low similarity to one another.
  • The method further implements the following step. Step S105 When the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is not higher than the preset threshold, a new topic is added under the current level and the threat intelligence text to be detected is classified into the new topic.
  • Through step S105, when the hierarchical classification algorithm classifies the text features of the threat intelligence text to be detected by topic, a top-down procedure clusters and classifies the text layer by layer down to a new or existing topic at the lowest level.
  • Step S106 Select a baseline threat intelligence text from the multiple threat intelligence texts clustered into the same existing topic, calculate the similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and use the average of the resulting similarities as the preset threshold corresponding to that existing topic.
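  • Steps S104 to S106 can be read together as the small tracking loop sketched below, which reuses the cosine and topic_similarity helpers from the previous sketch. The per-topic threshold follows the baseline-average rule of step S106; the data structures, the choice of the first article as baseline, and the default threshold for a brand-new topic are assumptions of this illustration.

```python
# Illustrative topic tracking: merge a clustered topic into its most similar
# existing topic if the similarity clears that topic's threshold, else create
# a new topic. Thresholds follow step S106 (mean similarity to a baseline text).
def topic_threshold(article_vectors, baseline_index=0):
    baseline = article_vectors[baseline_index]
    sims = [cosine(baseline, v)
            for i, v in enumerate(article_vectors) if i != baseline_index]
    return sum(sims) / len(sims) if sims else 0.5   # 0.5 default is an assumption

def track_topic(current_vectors, existing_topics):
    """existing_topics: list of dicts {"articles": [vec, ...], "threshold": float}."""
    best, best_sim = None, -1.0
    for topic in existing_topics:
        sim = topic_similarity(topic["articles"], current_vectors)
        if sim > best_sim:
            best, best_sim = topic, sim
    if best is not None and best_sim > best["threshold"]:
        best["articles"].extend(current_vectors)     # continuation of an existing topic
        best["threshold"] = topic_threshold(best["articles"])
        return best
    new_topic = {"articles": list(current_vectors),
                 "threshold": topic_threshold(current_vectors)}
    existing_topics.append(new_topic)                # emerging topic
    return new_topic
```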
  • Fig. 4 is a framework diagram of threat topic detection in an embodiment of the present application.
  • Fig. 5 is a flowchart of threat topic discovery and tracking. As shown in Figs. 4 and 5, the topic detection process for threat intelligence in this embodiment of the application also performs the following specific steps:
  • 1. The latest security-domain data is fetched at regular intervals; the data is preprocessed; the three feature extraction methods described above (keyword feature extraction, topic word feature extraction, and entity feature extraction) are used to extract the text features (the multiple key features); finally, feature fusion is applied to obtain the text feature vector (the text features of the threat intelligence text to be detected).
  • 2. The clustering model performs topic clustering to accurately discover real-time threat topic clusters, where the objects being clustered are the text features of the threat intelligence texts to be detected.
  • 3. The similarity between each threat topic discovered in real time and the existing topics in the topic clusters is compared, and the existing topic with the highest similarity to the current topic is selected; if the similarity is above the set threshold, the identified topic is merged with the existing topic and inserted into the existing topic library; if it is below the set threshold, the real-time threat topic is inserted into the topic library as an emerging topic.
  • 4. Outdated topics are retired, that is, topic clusters that have not been updated for more than N days are deleted from the database, reducing the server storage burden.
  • This embodiment also provides a topic detection device for threat intelligence. The device is used to implement the above embodiments and optional implementations, and what has already been described is not repeated.
  • As used below, the terms "module", "unit", "sub-unit", and the like can refer to a combination of software and/or hardware that implements predetermined functions.
  • Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
  • Fig. 6 is a structural diagram of a threat intelligence subject detection device according to an embodiment of the present application. As shown in Fig. 6, the device includes:
  • the obtaining module 61 is used to crawl the threat intelligence text to be detected from the preset data source;
  • the extraction module 62, coupled with the acquisition module 61, is used to extract a set of candidate words from the threat intelligence text to be detected and to extract multiple key features from the set of candidate words, where the key features include: keyword features, topic word features, and/or entity features;
  • the fusion module 63 coupled with the extraction module 62, is used to fuse a variety of key features to obtain the text features of the threat intelligence text to be detected;
  • the processing module 64 is coupled to the fusion module 63, and is configured to use a hierarchical clustering algorithm to cluster the threat information text to be detected into existing topics or newly added topics according to the text characteristics of the threat information text to be detected.
  • the acquisition module 61 includes:
  • the preprocessing unit is used to preprocess the threat intelligence text to be detected to obtain a set of candidate words; the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
  • the extraction module 62 includes:
  • the first extraction unit, coupled with the preprocessing unit, is used to extract keywords from the candidate word set, determine the term frequency and inverse document frequency of the keywords, determine the weight values of the keywords according to the term frequency and inverse document frequency, and determine the keyword features according to the weight values of the keywords.
  • the extraction module 62 further includes:
  • the second extraction unit coupled with the preprocessing unit, is used to extract candidate topic words from the candidate word set, calculate the similarity between the candidate topic words and the preset label category words, determine the weight value of the candidate topic words according to the similarity, and Determine the topic word features according to the weight value of the candidate topic word.
  • the extraction module 62 further includes:
  • the third extraction unit coupled with the preprocessing unit, is used to identify entity candidate words from the candidate word set, delete entity candidate words with preset parts of speech from the entity candidate words, and obtain entity features.
  • the processing module 64 includes:
  • the first judgment unit, coupled with the fusion module 63, is used to judge whether the similarity between the text features of the threat intelligence text to be detected and the text features of an existing topic at the current level in an existing topic cluster is higher than the preset threshold corresponding to that existing topic;
  • the first classification unit, coupled to the first judgment unit, is used to classify the threat intelligence text to be detected into a topic at the next level below the current level when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is higher than the preset threshold.
  • the device further includes:
  • the first processing module, coupled to the fusion module 63, is used to add a new topic under the current level and classify the threat intelligence text to be detected into the new topic when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is not higher than the preset threshold.
  • the device further includes:
  • the second processing module, coupled with the processing module 64, is used to select a baseline threat intelligence text from multiple threat intelligence texts clustered into the same existing topic, calculate the similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and use the average of the resulting similarities as the preset threshold corresponding to that existing topic.
  • each of the above-mentioned modules may be functional modules or program modules, which may be implemented by software or hardware.
  • each of the foregoing modules may be located in the same processor; or each of the foregoing modules may also be located in different processors in any combination.
  • FIG. 7 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present application.
  • the computer device may include a processor 71 and a memory 72 storing computer program instructions.
  • The aforementioned processor 71 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • CPU central processing unit
  • ASIC Application Specific Integrated Circuit
  • the memory 72 may include a large-capacity memory for text or instructions.
  • The memory 72 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these.
  • the storage 72 may include removable or non-removable (or fixed) media.
  • the memory 72 may be internal or external to the text processing device.
  • the memory 72 is a non-volatile (Non-Volatile) memory.
  • the memory 72 includes a read-only memory (Read-Only Memory, ROM for short) and a random access memory (Random Access Memory, RAM for short).
  • The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these.
  • PROM Programmable ROM
  • EPROM Erasable Programmable Read-Only Memory
  • FLASH Flash memory
  • The RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode dynamic random access memory (FPM DRAM), extended data out dynamic random access memory (EDO DRAM), synchronous dynamic random access memory (SDRAM), or the like.
  • SRAM Static Random-Access Memory
  • DRAM Dynamic Random Access Memory
  • FPMDRAM fast page Mode Dynamic Random Access Memory
  • EDO DRAM Extended Data Out Dynamic Random Access Memory
  • SDRAM Synchronous Dynamic Random Access Memory
  • the memory 72 may be used to store or cache various text files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 71.
  • the processor 71 reads and executes computer program instructions stored in the memory 72 to implement any one of the threat intelligence subject detection methods in the foregoing embodiments.
  • the computer device may further include a communication interface 73 and a bus 70.
  • the processor 71, the memory 72, and the communication interface 73 are connected through the bus 70 and complete mutual communication.
  • the communication interface 73 is used to implement communication between various modules, devices, units and/or devices in the embodiments of the present application.
  • the communication interface 73 can also implement text communication with other components such as external devices, image/text acquisition devices, text libraries, external storage, and image/text processing workstations.
  • the bus 70 includes hardware, software, or both, and couples the components of the computer device to each other.
  • the bus 70 includes but is not limited to at least one of the following: a text bus (Data Bus), an address bus (Address Bus), a control bus (Control Bus), an expansion bus (Expansion Bus), and a local bus (Local Bus).
  • By way of example and not limitation, the bus 70 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Where appropriate, the bus 70 may include one or more buses. Although the embodiments of the present application describe and show a specific bus, the present application contemplates any suitable bus or interconnect.
  • the computer device can execute the subject detection method of threat intelligence in the embodiment of the present application based on the acquired threat intelligence text to be detected, thereby realizing the subject detection method of threat intelligence described in conjunction with FIG. 1.
  • the embodiment of the present application may provide a computer-readable storage medium for implementation.
  • the computer-readable storage medium stores computer program instructions; when the computer program instructions are executed by the processor, any one of the threat intelligence subject detection methods in the foregoing embodiments is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A topic detection method and device for threat intelligence, and a computer storage medium. The topic detection method for threat intelligence includes: crawling threat intelligence text to be detected from a preset data source; extracting a set of candidate words from the threat intelligence text to be detected, and extracting multiple key features from the set of candidate words; fusing the multiple key features to obtain text features of the threat intelligence text to be detected; and using a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into an existing topic or a new topic according to the text features of the threat intelligence text to be detected.

Description

Topic detection method and device for threat intelligence, and computer storage medium
Related Application
This application claims priority to Chinese patent application No. 202010402752.6, filed on May 13, 2020 and entitled "Topic detection method and device for threat intelligence, and computer storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of information security technology, and in particular to a topic detection method and device for threat intelligence, and a computer storage medium.
Background
As targeted and sophisticated network attacks gradually increase, early single-point detection and defense techniques have difficulty effectively analyzing the coordination of network attacks and the stage an attack has reached. As the threat environment keeps changing and attackers' methods become more advanced, security personnel need to prevent, detect, and respond to threats more effectively. Proper use of Cyber Threat Intelligence (CTI) can mitigate network threats to a certain extent. As a new generation of network defense system, threat intelligence can perceive the endless stream of security incidents and various APT attacks in time and provide prevention and defense measures against all kinds of attacks.
In new-generation network defense systems, open source information is frequently processed. However, open source information includes threatening material such as Internet vulnerabilities, malicious viruses, and hacking tools, and anyone can obtain, use, and spread this open source threat information over the Internet, which has a major impact on the security of open source information on the Internet. At the same time, typical network defense systems vectorize large-scale text data inefficiently, mine the semantics of high-dimensional vectors poorly, and cannot detect open source Internet threats in time.
At present, no effective solution has been proposed for the problems in the related art of low efficiency in processing massive document data, imprecise mining of high-dimensional data features, and failure to discover threat topics in time.
Summary of the Invention
The embodiments of the present application provide a topic detection method and device for threat intelligence, and a computer storage medium, so as to at least solve the problem in the related art that threat topics cannot be discovered in time.
In a first aspect, an embodiment of the present application provides a topic detection method for threat intelligence, including:
crawling threat intelligence text to be detected from a preset data source;
extracting a set of candidate words from the threat intelligence text to be detected, and extracting multiple key features from the set of candidate words, where the key features include: keyword features, topic word features, and/or entity features;
fusing the multiple key features to obtain text features of the threat intelligence text to be detected;
using a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into an existing topic or a new topic according to the text features of the threat intelligence text to be detected.
In some of the embodiments, extracting the set of candidate words from the threat intelligence text to be detected includes:
preprocessing the threat intelligence text to be detected to obtain the set of candidate words; the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
In some of the embodiments, extracting the keyword features from the set of candidate words includes:
extracting keywords from the set of candidate words, determining the term frequency and inverse document frequency of the keywords, determining the weight values of the keywords according to the term frequency and the inverse document frequency, and determining the keyword features according to the weight values of the keywords.
In some of the embodiments, extracting the topic word features from the set of candidate words includes:
extracting candidate topic words from the set of candidate words, calculating the similarity between the candidate topic words and preset label category words, determining the weight values of the candidate topic words according to the similarity, and determining the topic word features according to the weight values of the candidate topic words.
In some of the embodiments, extracting the entity features from the set of candidate words includes:
identifying entity candidate words in the set of candidate words and deleting entity candidate words with preset parts of speech from the entity candidate words to obtain the entity features.
In some of the embodiments, using a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into an existing topic or a new topic according to its text features includes:
judging whether the similarity between the text features of the threat intelligence text to be detected and the text features of an existing topic at the current level in an existing topic cluster is higher than a preset threshold corresponding to that existing topic at the current level;
when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is higher than the preset threshold, classifying the threat intelligence text to be detected into a topic at the next level below the current level.
In some of the embodiments, the method further includes:
when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is not higher than the preset threshold, adding a new topic under the current level and classifying the threat intelligence text to be detected into the new topic.
In some of the embodiments, the method further includes:
selecting a baseline threat intelligence text from multiple threat intelligence texts clustered into the same existing topic, calculating the similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and using the average of the resulting similarities as the preset threshold corresponding to that existing topic.
In a second aspect, an embodiment of the present application provides a topic detection device for threat intelligence, including:
an acquisition module, configured to crawl threat intelligence text to be detected from a preset data source;
an extraction module, configured to extract a set of candidate words from the threat intelligence text to be detected and to extract multiple key features from the set of candidate words, where the key features include: keyword features, topic word features, and/or entity features;
a fusion module, configured to fuse the multiple key features to obtain text features of the threat intelligence text to be detected;
a processing module, configured to use a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into an existing topic or a new topic according to its text features.
In some of the embodiments, the acquisition module includes:
a preprocessing unit, configured to preprocess the threat intelligence text to be detected to obtain the set of candidate words; the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
In some of the embodiments, the extraction module includes:
a first extraction unit, configured to extract keywords from the set of candidate words, determine the term frequency and inverse document frequency of the keywords, determine the weight values of the keywords according to the term frequency and the inverse document frequency, and determine the keyword features according to the weight values of the keywords.
In some of the embodiments, the extraction module further includes:
a second extraction unit, configured to extract candidate topic words from the set of candidate words, calculate the similarity between the candidate topic words and preset label category words, determine the weight values of the candidate topic words according to the similarity, and determine the topic word features according to the weight values of the candidate topic words.
In some of the embodiments, the extraction module further includes:
a third extraction unit, configured to identify entity candidate words in the set of candidate words and delete entity candidate words with preset parts of speech from the entity candidate words to obtain the entity features.
In some of the embodiments, the processing module includes:
a first judgment unit, configured to judge whether the similarity between the text features of the threat intelligence text to be detected and the text features of an existing topic at the current level in an existing topic cluster is higher than a preset threshold corresponding to that existing topic;
a first classification unit, configured to classify the threat intelligence text to be detected into a topic at the next level below the current level when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is higher than the preset threshold.
In some of the embodiments, the device further includes:
a first processing module, configured to add a new topic under the current level and classify the threat intelligence text to be detected into the new topic when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is not higher than the preset threshold.
In some of the embodiments, the device further includes:
a second processing module, configured to select a baseline threat intelligence text from multiple threat intelligence texts clustered into the same existing topic, calculate the similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and use the average of the resulting similarities as the preset threshold corresponding to that existing topic.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the topic detection method for threat intelligence described in the first aspect is implemented.
Compared with the related art, the topic detection method and device for threat intelligence and the computer storage medium provided by the embodiments of the present application crawl threat intelligence text to be detected from a preset data source; extract a set of candidate words from the threat intelligence text to be detected and extract multiple key features from the set of candidate words; fuse the multiple key features to obtain text features of the threat intelligence text to be detected; and use a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into an existing topic or a new topic according to its text features. This solves the problem in the related art that threat topics cannot be discovered in time, and enables efficient and accurate discovery and extraction of threat topics from massive document data.
The details of one or more embodiments of the present application are set forth in the following drawings and description to make the other features, objectives, and advantages of the present application more concise and understandable.
Brief Description of the Drawings
The drawings described here are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a topic detection method for threat intelligence according to an embodiment of the present application.
Fig. 2 is a flowchart of the preprocessing of the threat intelligence text to be detected in an embodiment of the present application.
Fig. 3 is a flowchart of key feature extraction in an embodiment of the present application.
Fig. 4 is a framework diagram of threat topic detection in an embodiment of the present application.
Fig. 5 is a flowchart of threat topic discovery and tracking in an embodiment of the present application.
Fig. 6 is a structural diagram of a topic detection device for threat intelligence according to an embodiment of the present application.
Fig. 7 is an internal structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described and explained below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without creative effort fall within the scope of protection of the present application.
Obviously, the drawings in the following description are only some examples or embodiments of the present application; for a person of ordinary skill in the art, the present application can also be applied to other similar scenarios based on these drawings without creative effort. In addition, it can be understood that although the effort made in such a development process may be complex and lengthy, for a person of ordinary skill in the art related to the content disclosed in the present application, some design, manufacturing, or production changes made on the basis of the technical content disclosed in the present application are only conventional technical means and should not be understood as indicating that the content disclosed in the present application is insufficient.
Reference to an "embodiment" in the present application means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor does it refer to an independent or alternative embodiment that is mutually exclusive with other embodiments. A person of ordinary skill in the art understands, explicitly and implicitly, that the embodiments described in the present application may be combined with other embodiments where no conflict arises.
Unless otherwise defined, the technical or scientific terms used in the present application shall have the ordinary meaning understood by a person with general skill in the technical field to which the present application belongs. Words such as "a", "an", "one", and "the" used in the present application do not indicate a quantitative limitation and may indicate the singular or the plural. The terms "include", "comprise", "have", and any variations thereof used in the present application are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules (units) is not limited to the listed steps or units, but may also include steps or units that are not listed, or may also include other steps or units inherent to such a process, method, product, or device. Words such as "connected", "linked", and "coupled" used in the present application are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Multiple" in the present application means two or more. "And/or" describes the association relationship of associated objects, indicating that three relationships can exist; for example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The terms "first", "second", "third", and so on in the present application merely distinguish similar objects and do not represent a specific ordering of the objects.
The various techniques described in the present application can be used in various information network security systems and network defense systems. Before the embodiments of the present application are described, the following technical terms are explained:
Cyber Threat Intelligence (CTI) refers to the task of collecting evidence-based knowledge, including background, mechanisms, indicators, implications, and actionable advice about existing or potential threats and risks, which can be used to make decisions in response to a threat or risk.
The Word2Vec model is a class of neural network models that, given an unlabeled corpus, represents the words in the corpus with word vectors capable of expressing semantic information; the word vectors trained by the Word2Vec model contain the semantic information of words and can reflect the linear relationships between words.
The LDA topic model is a document topic generation model.
This embodiment provides a topic detection method for threat intelligence. Fig. 1 is a flowchart of a topic detection method for threat intelligence according to an embodiment of the present application; as shown in Fig. 1, the process includes the following steps:
Step S101: Crawl the threat intelligence text to be detected from a preset data source.
In this embodiment, crawling the threat intelligence text to be detected from the preset data source is done by monitoring a set of websites and collecting, in real time, article and comment data from security information platforms of a specific category, such as the forums, news pages, and blogs of security websites.
Step S102: Extract a set of candidate words from the threat intelligence text to be detected, and extract multiple key features from the set of candidate words, where the key features include: keyword features, topic word features, and/or entity features.
Step S103: Fuse the multiple key features to obtain the text features of the threat intelligence text to be detected.
Step S104: Use a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into an existing topic or a new topic according to its text features.
Through the above steps S101 to S104, the threat intelligence text to be detected is crawled from the preset data source; a set of candidate words is extracted from the threat intelligence text to be detected, and multiple key features are extracted from the set of candidate words; the multiple key features are fused to obtain the text features of the threat intelligence text to be detected; and a hierarchical clustering algorithm clusters the threat intelligence text to be detected into existing topics or new topics based on those text features. This solves the problems in the related art of low efficiency in processing massive document data, imprecise mining of high-dimensional data features, and failure to discover threat topics in time: the key features extracted from the threat intelligence text to be detected are fused into the text features of the text, the topic type of the threat intelligence text to be detected is judged by clustering on those text features, massive document data is processed efficiently, high-dimensional data features are mined accurately, and threat topics are discovered in time.
The embodiments of the present application are described and explained below through optional embodiments.
In some of the embodiments, extracting the set of candidate words from the threat intelligence text to be detected in step S102 is implemented through the following step:
Step S102-1: Preprocess the threat intelligence text to be detected to obtain the set of candidate words; the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
Fig. 2 is a flowchart of the preprocessing of the threat intelligence text to be detected. As shown in Fig. 2, the preprocessing of the threat intelligence text to be detected is further elaborated as follows:
Using the words obtained by the preprocessing operations as the candidate words for feature extraction improves the efficiency and effect of feature extraction.
The preprocessing includes the following steps:
1. Part-of-speech tagging
A part-of-speech tagging method is used to tag the words in the article and remove words whose parts of speech cannot be topic words of the article, reducing the dimensionality of the article and improving the efficiency and effect of key feature extraction. The embodiment of the present application uses the Natural Language Toolkit (NLTK) from natural language processing (NLP) for part-of-speech tagging.
2. Traditional/simplified and special character conversion
Traditional/simplified and special character conversion, that is, lemmatization: because words with the same base form often appear in different forms in crawled articles, such words are usually treated as two different words in the subsequent feature extraction stage, for example the traditional and simplified variants of the same character, or specially encoded characters, which weakens the semantic and statistical contribution of such words. Different forms of the same original word therefore need to be converted into a single unified form through lemmatization.
3. Feature candidate word extraction
Because words with parts of speech such as determiners, cardinal numbers, and quantifiers are unrelated to the topic of the article, the preprocessing stage removes words with parts of speech such as "determiner (DT)", "adverb (RB)", "cardinal number (CD)", "coordinating conjunction (CC)", "existential there (EX)", "preposition or subordinating conjunction (IN)", "adjective, comparative (JJR)", "modal auxiliary (MD)", "WH-determiner (WDT)", and "WH-pronoun (WP)". The remaining words are then lemmatized, restoring words with the same base form but different inflections to their common form.
The words retained after the above part-of-speech tagging, traditional/simplified and special character conversion, and feature candidate word extraction are used as the candidate words for key feature extraction.
Before the following embodiments are elaborated, the following related techniques are explained:
1. A keyword feature extraction method optimized on the basis of term frequency-inverse document frequency (TF-IDF).
TF-IDF is a method for reducing the dimensionality of, or extracting, text features. Its main idea is that the more frequently a word appears in an article, the higher its term frequency (TF), and the more rarely the word appears in other articles, the higher its inverse document frequency (IDF), which means the word is highly distinctive. The inverse document frequency (IDF) of each word in an article is otherwise a fixed value, whereas the threat intelligence data set changes dynamically, and the inverse document frequency over a dynamic data set cannot be represented well by a fixed IDF set.
The existing TF calculation does not consider the influence on a word's weight of stop words in the article, the part of speech of the word, or the position of the word in the text; as a result, many topic words of an article are misjudged as non-keywords when the algorithm is applied directly. The TF-IDF calculation used in this application extracts keywords as follows: first, words that cannot be article keywords, such as stop words, quantifiers, and determiners, are removed from the text, and the remaining words are used as keyword candidates. At the same time, the TF-IDF algorithm used in this application also considers the position of a word in the article; words in different positions have different importance, for example a word in the title is more important than a word in the body. The improved, position-based TF formula used in this application is as follows:
Figure PCTCN2021089290-appb-000001
In the formula, TF(t, d) represents the probability of t appearing in document d, T represents the set of words in the title, C represents the set of words in the body, and α represents the weight of words appearing in the title; the larger α is, the more important a word in the title is.
In addition, this application optimizes the IDF part of the calculation into an incremental IDF method to solve the problem that the IDF value does not change dynamically with the data set. The formula used is as follows:
Figure PCTCN2021089290-appb-000002
In the formula, N_c represents the total number of documents in the database in the current time period, and n(t, c) represents the number of documents in the database that contain the word t. Because the data set in the database changes dynamically, this application changes the weights of the candidate words by letting N_c and n(t, c) vary dynamically with time.
2. This application uses the document topic generation model (LDA topic model) to extract the topics of an article. However, because the candidate topic words extracted by the LDA topic model are fairly broad, high-frequency words are likely to appear among the candidate topic words even though they cannot reflect the topic of the article well, so the topic feature words need to be further filtered from the candidate topic words. At the same time, because the topic words of an article are closely related to the article's category, this application calculates the similarity between the candidate topic word vectors and the label category word vectors, uses this similarity as a weight coefficient for the candidate topic words, recalculates the candidate topic word weights, and filters the topic words.
Fig. 3 is a flowchart of key feature extraction. As shown in Fig. 3, in some of the embodiments, extracting the keyword features from the set of candidate words in step S102 is implemented through the following step:
Step S102-2: Extract keywords from the set of candidate words, determine the term frequency and inverse document frequency of the keywords, determine the weight values of the keywords according to the term frequency and the inverse document frequency, and determine the keyword features according to the weight values of the keywords.
It should be noted that the keyword feature extraction method in this embodiment is as follows: first, the TF method that considers word position is used to compute the term frequency of the keyword candidates in the article; then, the incremental IDF method is used to compute the inverse document frequency of the keyword candidates in the article; finally, the weights of the keyword candidates in the article are computed with the TF-IDF method described above, and the keyword features of the article are extracted.
In some of the embodiments, extracting the topic word features from the set of candidate words in step S102 is implemented through the following step:
Step S102-3: Extract candidate topic words from the set of candidate words, calculate the similarity between the candidate topic words and preset label category words, determine the weight values of the candidate topic words according to the similarity, and determine the topic word features according to the weight values of the candidate topic words.
It should be noted that the topic word feature extraction method in this embodiment is as follows: first, the LDA model is used to extract candidate topic words for the article; then, the Word2Vec model from natural language processing (NLP) is used to train word vectors for the articles; next, the similarity between the candidate topic word vectors and the label category word vectors is calculated and used as a coefficient on the candidate topic word weights, the weight values of the candidate topic words are updated, and the weights of the candidate topic words that are highly similar to the label categories are increased; finally, the weights of candidate topic words with certain parts of speech or contained in the title are increased, and the topic word features of the article are extracted.
In some of the embodiments, extracting the entity features from the set of candidate words in step S102 is implemented through the following step:
Identify entity candidate words in the set of candidate words and delete entity candidate words with preset parts of speech from the entity candidate words to obtain the entity features.
It should be noted that in this embodiment the entity feature extraction method is as follows: first, the person, location, and organization entities of each article are obtained; then, entities with certain parts of speech are removed, because such entity words (entity words with preset parts of speech) cannot be the entity features required by the embodiment of the present application; finally, the entity features of the article are extracted.
In the embodiments of the application, after the multiple key features are extracted from the set of candidate words, a feature fusion method merges the extracted keyword features, topic word features, and entity features to obtain the key features of the article; then, an article feature vector constructed from the key features is used as the input of the hierarchical topic clustering.
In some of the embodiments, using a hierarchical clustering algorithm in step S104 to cluster the threat intelligence text to be detected into an existing topic or a new topic according to its text features is implemented through the following steps:
Step S104-1: Judge whether the similarity between the text features of the threat intelligence text to be detected and the text features of an existing topic at the current level in an existing topic cluster is higher than the preset threshold corresponding to that existing topic at the current level;
Step S104-2: When the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is higher than the preset threshold, classify the threat intelligence text to be detected into a topic at the next level below the current level.
It should be noted that the hierarchical clustering algorithm used in this application is an improved hierarchical clustering model that applies vector product similarity on top of a hierarchical clustering model based on the centroid linkage method.
The topic clustering performed on the threat intelligence text to be detected by the hierarchical clustering algorithm used in this application includes topic clustering and topic tracking. Topic clustering mainly uses the clustering model to cluster the text features of the threat intelligence texts to be detected in each time period, yielding real-time topics; topic tracking compares the similarity of the real-time topics with the existing topics and identifies new topics and the continuation of events in existing topics in real time.
The topic detection method for threat intelligence of this application calculates the similarity between the currently clustered topic and the existing topics and selects the existing topic with the highest similarity to the current topic; if that similarity is greater than the set threshold, the text of the current topic is merged into the existing topic cluster, otherwise a new topic cluster is created and inserted into the database.
The similarity between the current topic and an existing topic cluster is calculated as follows:
First, topic seed articles are selected for each topic. The basis for selecting topic seed articles is to compute the similarity between the text features of a given article and the text features of all articles in the topic, take the average of those similarities, and select the top N articles with the highest averages as the topic seed articles; that is, if an article is highly related to all articles in the topic, it is a seed article. The similarity between the text features of all seed articles of an existing topic and the text features of the seed articles of the currently clustered topic is then calculated, and the arithmetic mean of these similarities is taken as the value of the similarity between the two topics.
The centroid linkage method mentioned above is a hierarchical clustering algorithm that takes the topic centroid vector as the topic vector, where each dimension of the topic centroid vector is the average of the corresponding dimension of the feature vectors of all articles in the topic; the centroid linkage method measures topic vector similarity using cosine similarity and Euclidean distance. However, a small cosine angle between topic vectors, that is, a high topic vector similarity, does not mean that the texts (articles) of the two topics are all similar to one another, so measuring inter-topic similarity with centroid linkage cannot ensure that the similarity between the articles of the merged new topic stays within an acceptable range.
To ensure that the articles of the merged new topic (which is an existing topic relative to the currently clustered topic) remain similar within an acceptable range, that is, to ensure that all articles of the existing topic selected for similarity comparison with the currently clustered topic are highly related to that existing topic, the hierarchical clustering algorithm of this application applies a vector product method to compute the relevance of all articles within the existing topic before selecting the existing topic for the similarity calculation, so that after the vector product operation all articles of the existing topic are related to the topic. For example, when the merged new topic (an existing topic relative to the currently clustered topic) is related to multiple article vectors, then after the vector product operation the merged new topic is proportional to the vector product of those article vectors; as long as the new topic is proportional to the vector product of the article vectors it contains, all articles characterizing the merged new topic are highly related.
The hierarchical clustering algorithm used in this application keeps the advantage of centroid linkage, namely the low computational complexity of measuring the similarity of two topics, while the vector product operation mitigates the "aggregation effect" that centroid linkage tends to produce when computing topic vector similarity. The aggregation effect refers to the tendency, as the number of topic merges grows, for a topic to accumulate a large number of articles while some articles within the topic have low similarity to one another.
In some of the embodiments, the method further implements the following step. Step S105: When the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is not higher than the preset threshold, add a new topic under the current level and classify the threat intelligence text to be detected into the new topic.
Through step S105, when the hierarchical classification algorithm classifies the text features of the threat intelligence text to be detected by topic, a top-down procedure clusters and classifies the text layer by layer down to a new or existing topic at the lowest level.
In some of the embodiments, the method further implements the following step. Step S106: Select a baseline threat intelligence text from the multiple threat intelligence texts clustered into the same existing topic, calculate the similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and use the average of the resulting similarities as the preset threshold corresponding to that existing topic.
Fig. 4 is a framework diagram of threat topic detection in an embodiment of the present application; Fig. 5 is a flowchart of threat topic discovery and tracking. As shown in Figs. 4 and 5, the topic detection process for threat intelligence in this embodiment of the application also performs the following specific steps:
1. The latest security-domain data is fetched at regular intervals; the data is then preprocessed; the three feature extraction methods described above (keyword feature extraction, topic word feature extraction, and entity feature extraction) are used to extract the text features (the multiple key features); finally, feature fusion is applied to the text features to obtain the text feature vector (the text features of the threat intelligence text to be detected).
2. The clustering model is used for topic clustering to accurately discover real-time threat topic clusters, where the objects being clustered are the text features of the threat intelligence texts to be detected.
3. The similarity between each threat topic discovered in real time and the existing topics in the topic clusters is compared, and the existing topic with the highest similarity to the current topic is selected; if the similarity is higher than the set threshold, the identified topic is merged with the existing topic and inserted into the existing topic library; if the similarity is lower than the set threshold, the threat topic identified in real time is inserted into the topic library as an emerging topic.
4. Outdated topics are retired, that is, topic clusters that have not been updated for more than N days are deleted from the database, reducing the server storage burden.
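The retirement of outdated topics in step 4 is straightforward; a minimal sketch is given below, where the storage layout (a list of dicts with a last_updated timestamp) and the default value of N are assumptions of this illustration, since the application leaves N unspecified.

```python
# Illustrative retirement of stale topics: drop topic clusters whose last update
# is more than N days old.
from datetime import datetime, timedelta

def retire_stale_topics(topics, n_days=30, now=None):
    """topics: list of dicts, each with a 'last_updated' datetime field."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=n_days)
    return [t for t in topics if t["last_updated"] >= cutoff]
```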
This embodiment also provides a topic detection device for threat intelligence. The device is used to implement the above embodiments and optional implementations, and what has already been described is not repeated. As used below, the terms "module", "unit", "sub-unit", and the like can refer to a combination of software and/or hardware that implements predetermined functions. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Fig. 6 is a structural diagram of a topic detection device for threat intelligence according to an embodiment of the present application. As shown in Fig. 6, the device includes:
an acquisition module 61, used to crawl the threat intelligence text to be detected from the preset data source;
an extraction module 62, coupled to the acquisition module 61, used to extract a set of candidate words from the threat intelligence text to be detected and to extract multiple key features from the set of candidate words, where the key features include: keyword features, topic word features, and/or entity features;
a fusion module 63, coupled to the extraction module 62, used to fuse the multiple key features to obtain the text features of the threat intelligence text to be detected;
a processing module 64, coupled to the fusion module 63, used to apply a hierarchical clustering algorithm to cluster the threat intelligence text to be detected into existing topics or new topics according to its text features.
In some of the embodiments, the acquisition module 61 includes:
a preprocessing unit, used to preprocess the threat intelligence text to be detected to obtain the set of candidate words; the preprocessing includes at least one of the following: deduplication, stop word deletion, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
In some of the embodiments, the extraction module 62 includes:
a first extraction unit, coupled to the preprocessing unit, used to extract keywords from the set of candidate words, determine the term frequency and inverse document frequency of the keywords, determine the weight values of the keywords according to the term frequency and the inverse document frequency, and determine the keyword features according to the weight values of the keywords.
In some of the embodiments, the extraction module 62 further includes:
a second extraction unit, coupled to the preprocessing unit, used to extract candidate topic words from the set of candidate words, calculate the similarity between the candidate topic words and preset label category words, determine the weight values of the candidate topic words according to the similarity, and determine the topic word features according to the weight values of the candidate topic words.
In some of the embodiments, the extraction module 62 further includes:
a third extraction unit, coupled to the preprocessing unit, used to identify entity candidate words in the set of candidate words and delete entity candidate words with preset parts of speech from the entity candidate words to obtain the entity features.
In some of the embodiments, the processing module 64 includes:
a first judgment unit, coupled to the fusion module 63, used to judge whether the similarity between the text features of the threat intelligence text to be detected and the text features of an existing topic at the current level in an existing topic cluster is higher than the preset threshold corresponding to that existing topic;
a first classification unit, coupled to the first judgment unit, used to classify the threat intelligence text to be detected into a topic at the next level below the current level when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is higher than the preset threshold.
In some of the embodiments, the device further includes:
a first processing module, coupled to the fusion module 63, used to add a new topic under the current level and classify the threat intelligence text to be detected into the new topic when the similarity between the text features of the threat intelligence text to be detected and the text features of the existing topic at the current level is not higher than the preset threshold.
In some of the embodiments, the device further includes:
a second processing module, coupled to the processing module 64, used to select a baseline threat intelligence text from multiple threat intelligence texts clustered into the same existing topic, calculate the similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and use the average of the resulting similarities as the preset threshold corresponding to that existing topic.
It should be noted that each of the above modules may be a functional module or a program module and may be implemented in software or in hardware. For modules implemented in hardware, the above modules may be located in the same processor, or the above modules may be located in different processors in any combination.
In addition, the topic detection method for threat intelligence of the embodiments of the present application described in conjunction with Fig. 1 may be implemented by a computer device. Fig. 7 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present application.
The computer device may include a processor 71 and a memory 72 storing computer program instructions.
Specifically, the processor 71 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 72 may include a large-capacity memory for text or instructions. By way of example and not limitation, the memory 72 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, the memory 72 may include removable or non-removable (or fixed) media. Where appropriate, the memory 72 may be internal or external to the text processing device. In a particular embodiment, the memory 72 is a non-volatile memory. In a particular embodiment, the memory 72 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these. Where appropriate, the RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode dynamic random access memory (FPM DRAM), extended data out dynamic random access memory (EDO DRAM), synchronous dynamic random access memory (SDRAM), or the like.
The memory 72 may be used to store or cache various text files that need to be processed and/or used for communication, as well as the computer program instructions executed by the processor 71.
The processor 71 reads and executes the computer program instructions stored in the memory 72 to implement any one of the topic detection methods for threat intelligence in the above embodiments.
In some of the embodiments, the computer device may further include a communication interface 73 and a bus 70. As shown in Fig. 7, the processor 71, the memory 72, and the communication interface 73 are connected through the bus 70 and communicate with one another.
The communication interface 73 is used to implement communication between the modules, apparatuses, units, and/or devices in the embodiments of the present application. The communication interface 73 can also implement text communication with other components such as external devices, image/text acquisition devices, text libraries, external storage, and image/text processing workstations.
The bus 70 includes hardware, software, or both, and couples the components of the computer device to one another. The bus 70 includes but is not limited to at least one of the following: a text bus (Data Bus), an address bus (Address Bus), a control bus (Control Bus), an expansion bus (Expansion Bus), and a local bus (Local Bus). By way of example and not limitation, the bus 70 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Where appropriate, the bus 70 may include one or more buses. Although the embodiments of the present application describe and show a specific bus, the present application contemplates any suitable bus or interconnect.
The computer device can execute the topic detection method for threat intelligence of the embodiments of the present application based on the acquired threat intelligence text to be detected, thereby implementing the topic detection method for threat intelligence described in conjunction with Fig. 1.
In addition, in combination with the topic detection method for threat intelligence in the above embodiments, an embodiment of the present application may provide a computer-readable storage medium for implementation. The computer-readable storage medium stores computer program instructions; when the computer program instructions are executed by a processor, any one of the topic detection methods for threat intelligence in the above embodiments is implemented.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should be regarded as falling within the scope of this specification.
The above embodiments only express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the patent application. It should be noted that a person of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

  1. A topic detection method for threat intelligence, characterized in that the method comprises:
    crawling threat intelligence text to be detected from a preset data source;
    extracting a candidate word set from the threat intelligence text to be detected, and extracting multiple kinds of key features from the candidate word set, wherein the key features comprise: keyword features, topic word features and/or entity features;
    fusing the multiple kinds of key features to obtain a text feature of the threat intelligence text to be detected;
    clustering, by using a hierarchical clustering algorithm, the threat intelligence text to be detected into an existing topic or a newly added topic according to the text feature of the threat intelligence text to be detected.
  2. The topic detection method for threat intelligence according to claim 1, wherein extracting the candidate word set from the threat intelligence text to be detected comprises:
    preprocessing the threat intelligence text to be detected to obtain the candidate word set, wherein the preprocessing comprises at least one of the following: deduplication, stop word removal, punctuation removal, case conversion, part-of-speech tagging and filtering, and lemmatization.
  3. The topic detection method for threat intelligence according to claim 1, wherein extracting the keyword features from the candidate word set comprises:
    extracting keywords from the candidate word set, determining a term frequency and an inverse document frequency of the keywords, determining weight values of the keywords according to the term frequency and the inverse document frequency, and determining the keyword features according to the weight values of the keywords.
  4. The topic detection method for threat intelligence according to claim 1, wherein extracting the topic word features from the candidate word set comprises:
    extracting candidate topic words from the candidate word set, computing a similarity between the candidate topic words and preset label category words, determining weight values of the candidate topic words according to the similarity, and determining the topic word features according to the weight values of the candidate topic words.
  5. The topic detection method for threat intelligence according to claim 1, wherein extracting the entity features from the candidate word set comprises:
    identifying entity candidate words from the candidate word set, and removing entity candidate words of preset parts of speech from the entity candidate words to obtain the entity features.
  6. The topic detection method for threat intelligence according to claim 1, wherein clustering, by using a hierarchical clustering algorithm, the threat intelligence text to be detected into an existing topic or a newly added topic according to the text feature of the threat intelligence text to be detected comprises:
    judging whether a similarity between the text feature of the threat intelligence text to be detected and a text feature of an existing topic at a current level in an existing topic cluster is higher than a preset threshold corresponding to the existing topic at the current level;
    in a case where the similarity between the text feature of the threat intelligence text to be detected and the text feature of the existing topic at the current level is higher than the preset threshold, classifying the threat intelligence text to be detected into a topic at the next level below the current level.
  7. The topic detection method for threat intelligence according to claim 6, wherein the method further comprises:
    in a case where the similarity between the text feature of the threat intelligence text to be detected and the text feature of the existing topic at the current level is not higher than the preset threshold, adding a newly added topic under the current level, and classifying the threat intelligence text to be detected into the newly added topic.
  8. The topic detection method for threat intelligence according to claim 6, wherein the method further comprises:
    selecting a baseline threat intelligence text from multiple threat intelligence texts clustered into the same existing topic, separately computing a similarity between each of the multiple threat intelligence texts and the baseline threat intelligence text, and taking an average of the obtained similarities as the preset threshold corresponding to the same existing topic.
  9. A topic detection apparatus for threat intelligence, characterized by comprising:
    an acquisition module, configured to crawl threat intelligence text to be detected from a preset data source;
    an extraction module, configured to extract a candidate word set from the threat intelligence text to be detected, and extract multiple kinds of key features from the candidate word set, wherein the key features comprise: keyword features, topic word features and/or entity features;
    a fusion module, configured to fuse the multiple kinds of key features to obtain a text feature of the threat intelligence text to be detected;
    a processing module, configured to cluster, by using a hierarchical clustering algorithm, the threat intelligence text to be detected into an existing topic or a newly added topic according to the text feature of the threat intelligence text to be detected.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the topic detection method for threat intelligence according to any one of claims 1 to 8 is implemented.
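Claims 2 and 5 above describe the preprocessing and entity filtering steps in procedural terms. As a complement to the sketches in the description, the fragment below shows, by way of illustration and not limitation, one simplified way those steps could look in Python. The stop-word list, the whitespace tokenisation, and the part-of-speech tags are stand-ins invented for this example; a real implementation would normally delegate tagging and lemmatization to an NLP library.

```python
# Illustrative sketch only; the stop-word list and tokenisation are simplified stand-ins.
import string

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "for", "is", "are"}


def preprocess(text: str) -> list[str]:
    """Produce a candidate word set: punctuation removal, lower-casing, stop-word
    removal and de-duplication (a subset of the preprocessing steps; POS filtering
    and lemmatisation are left to an NLP library)."""
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    seen, candidates = set(), []
    for token in text.split():
        if token in STOP_WORDS or token in seen:
            continue
        seen.add(token)
        candidates.append(token)
    return candidates


def filter_entities(entity_candidates: list[tuple[str, str]],
                    excluded_pos=frozenset({"PRON", "DET"})) -> list[str]:
    """Drop entity candidates whose (assumed, pre-computed) part-of-speech tag is excluded."""
    return [word for word, pos in entity_candidates if pos not in excluded_pos]


if __name__ == "__main__":
    print(preprocess("A new ransomware exploits the SMB vulnerability, and the ransomware spreads fast."))
    print(filter_entities([("WannaCry", "PROPN"), ("it", "PRON")]))
```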
PCT/CN2021/089290 2020-05-13 2021-04-23 Topic detection method and apparatus for threat intelligence, and computer storage medium WO2021227831A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010402752.6A CN111581355B (zh) 2020-05-13 2020-05-13 Topic detection method and apparatus for threat intelligence, and computer storage medium
CN202010402752.6 2020-05-13

Publications (1)

Publication Number Publication Date
WO2021227831A1 true WO2021227831A1 (zh) 2021-11-18

Family

ID=72122889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/089290 WO2021227831A1 (zh) 2020-05-13 2021-04-23 Topic detection method and apparatus for threat intelligence, and computer storage medium

Country Status (2)

Country Link
CN (1) CN111581355B (zh)
WO (1) WO2021227831A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065767A (zh) * 2021-11-29 2022-02-18 北京航空航天大学 Threat intelligence classification and evolution relationship analysis method
CN115658879A (zh) * 2022-12-29 2023-01-31 北京天际友盟信息技术有限公司 Automated threat intelligence text clustering method and system
CN116431814A (zh) * 2023-06-06 2023-07-14 北京中关村科金技术有限公司 Information extraction method and apparatus, electronic device, and readable storage medium
CN117093951A (zh) * 2023-10-16 2023-11-21 北京安天网络安全技术有限公司 Threat intelligence merging method and apparatus, electronic device, and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581355B (zh) * 2020-05-13 2023-07-25 杭州安恒信息技术股份有限公司 Topic detection method and apparatus for threat intelligence, and computer storage medium
CN112202818B (zh) * 2020-12-01 2021-03-09 南京中孚信息技术有限公司 Network traffic intrusion detection method and system incorporating threat intelligence
CN112733542B (zh) * 2021-01-14 2022-02-08 北京工业大学 Topic detection method and apparatus, electronic device, and storage medium
CN113191123A (zh) * 2021-04-08 2021-07-30 中广核工程有限公司 Indexing method and apparatus for engineering design archive information, and computer device
CN113420127A (zh) * 2021-07-06 2021-09-21 北京信安天途科技有限公司 Threat intelligence processing method and apparatus, computing device, and storage medium
CN115687960B (zh) * 2022-12-30 2023-07-11 中国人民解放军61660部队 Text clustering method for open-source security intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170237752A1 (en) * 2016-02-11 2017-08-17 Honeywell International Inc. Prediction of potential cyber security threats and risks in an industrial control system using predictive cyber analytics
CN108399194A (zh) * 2018-01-29 2018-08-14 中国科学院信息工程研究所 Cyber threat intelligence generation method and system
CN110008311A (zh) * 2019-04-04 2019-07-12 北京邮电大学 Product information security risk monitoring method based on semantic analysis
CN110413864A (zh) * 2019-08-06 2019-11-05 南方电网科学研究院有限责任公司 Network security intelligence collection method, apparatus, device, and storage medium
CN111581355A (zh) * 2020-05-13 2020-08-25 杭州安恒信息技术股份有限公司 Topic detection method and apparatus for threat intelligence, and computer storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150057497A (ko) * 2013-11-19 2015-05-28 서울시립대학교 산학협력단 Hierarchical tree-based topic detection method and system for online text documents
US10049148B1 (en) * 2014-08-14 2018-08-14 Medallia, Inc. Enhanced text clustering based on topic clusters
CN104516947B (zh) * 2014-12-03 2017-08-22 浙江工业大学 Chinese microblog sentiment analysis method fusing explicit and implicit features
CN106682095B (zh) * 2016-12-01 2019-11-08 浙江大学 Graph-based topic descriptor prediction and ranking method
CN107368856B (zh) * 2017-07-25 2021-10-19 深信服科技股份有限公司 Malware clustering method and apparatus, computer apparatus, and readable storage medium
CN111201566A (zh) * 2017-08-10 2020-05-26 费赛特实验室有限责任公司 Spoken communication device and computing architecture for processing data and outputting user feedback, and related methods
CN109299266B (zh) * 2018-10-16 2019-11-12 中国搜索信息科技股份有限公司 Text classification and extraction method for breaking-news events in Chinese news
CN109858018A (zh) * 2018-12-25 2019-06-07 中国科学院信息工程研究所 Entity recognition method and system for threat intelligence
CN110177114B (зh) * 2019-06-06 2021-07-13 腾讯科技(深圳)有限公司 Network security threat indicator identification method, device, apparatus, and computer-readable storage medium
CN110717049B (zh) * 2019-08-29 2020-12-04 四川大学 Threat intelligence knowledge graph construction method for text data

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065767A (zh) * 2021-11-29 2022-02-18 北京航空航天大学 Threat intelligence classification and evolution relationship analysis method
CN114065767B (zh) * 2021-11-29 2024-05-14 北京航空航天大学 Threat intelligence classification and evolution relationship analysis method
CN115658879A (зh) * 2022-12-29 2023-01-31 北京天际友盟信息技术有限公司 Automated threat intelligence text clustering method and system
CN116431814A (zh) * 2023-06-06 2023-07-14 北京中关村科金技术有限公司 Information extraction method and apparatus, electronic device, and readable storage medium
CN116431814B (zh) * 2023-06-06 2023-09-05 北京中关村科金技术有限公司 Information extraction method and apparatus, electronic device, and readable storage medium
CN117093951A (zh) * 2023-10-16 2023-11-21 北京安天网络安全技术有限公司 Threat intelligence merging method and apparatus, electronic device, and storage medium
CN117093951B (zh) * 2023-10-16 2024-01-26 北京安天网络安全技术有限公司 Threat intelligence merging method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN111581355A (zh) 2020-08-25
CN111581355B (zh) 2023-07-25

Similar Documents

Publication Publication Date Title
WO2021227831A1 (zh) Topic detection method and apparatus for threat intelligence, and computer storage medium
WO2022022045A1 (zh) Knowledge-graph-based text comparison method, apparatus, device, and storage medium
EP2092419B1 (en) Method and system for high performance data metatagging and data indexing using coprocessors
Aggarwal et al. Classification of fake news by fine-tuning deep bidirectional transformers based language model
Urvoy et al. Tracking web spam with html style similarities
US10078632B2 (en) Collecting training data using anomaly detection
WO2016180268A1 (zh) Text aggregation method and apparatus
Wu et al. A phishing detection system based on machine learning
US20130086096A1 (en) Method and System for High Performance Pattern Indexing
WO2020000717A1 (zh) Web page classification method and apparatus, and computer-readable storage medium
Li et al. Bursty event detection from microblog: a distributed and incremental approach
US10198504B2 (en) Terms for query expansion using unstructured data
Vani et al. Investigating the impact of combined similarity metrics and POS tagging in extrinsic text plagiarism detection system
WO2020052547A1 (zh) Method and apparatus for recognizing new spam words in short messages, and electronic device
WO2019085332A1 (zh) Financial data analysis method, application server, and computer-readable storage medium
CN106569989A (zh) Deduplication method and apparatus for short texts
Wang et al. Cyber threat intelligence entity extraction based on deep learning and field knowledge engineering
Hu et al. Cross-site scripting detection with two-channel feature fusion embedded in self-attention mechanism
Lu et al. Domain-oriented topic discovery based on features extraction and topic clustering
CN111985212A (zh) Text keyword recognition method and apparatus, computer device, and readable storage medium
Mishra et al. A novel approach to capture the similarity in summarized text using embedded model
Pandey et al. Detecting predatory behaviour from online textual chats
Bhoj et al. LSTM powered identification of clickbait content on entertainment and news websites
CN111488452A (zh) Web page tampering detection method, detection system, and related device
Zhang et al. A refined method for detecting interpretable and real-time bursty topic in microblog stream

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21803919

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21803919

Country of ref document: EP

Kind code of ref document: A1