WO2023061304A1 - Big data-based threat intelligence early warning text analysis method and system - Google Patents

Big data-based threat intelligence early warning text analysis method and system

Info

Publication number
WO2023061304A1
Authority
WO
WIPO (PCT)
Prior art keywords
tokenset
participle
text
value
dtp
Prior art date
Application number
PCT/CN2022/124189
Other languages
English (en)
French (fr)
Inventor
张鹏
伍军
周晓健
朱志华
谢礼炮
黎婷婷
Original Assignee
广东机电职业技术学院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广东机电职业技术学院
Publication of WO2023061304A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the invention belongs to the technical fields of big data analysis, word processing, and early warning of network security threats, and in particular relates to a text analysis method and system for threat intelligence early warning based on big data.
  • the purpose of the present invention is to propose a big data-based threat intelligence early warning text analysis method and system to solve one or more technical problems in the prior art, and at least provide a beneficial choice or create conditions.
  • the present invention provides a big data-based threat intelligence early warning text analysis method and system, which obtains network data texts through web crawler technology and segments them to obtain a word segment set, performs numerical processing on each word segment in the set to obtain its corresponding word-segment numerical signal, calculates the sequence connection degree between word segments from those signals, screens out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments, and sends the early warning text to the screen of the client for display.
  • a method for analyzing threat intelligence early warning text based on big data includes the following steps:
  • S500 Calculate and screen out the word segments with abnormal sequence connection with other word segments as threat intelligence word segments, combine the obtained threat intelligence word segments into early warning text, and send the early warning text to the screen of the client for display.
  • the method for obtaining the network data text through web crawler technology is: obtain web page text from Internet social media through web crawler technology, save the obtained web page text as string data, and take the saved text file as the network data text
  • the set formed by all network data texts is the set Txtset; the variable t denotes the sequence number of a network data text in Txtset, and the network data text with sequence number t in Txtset is denoted Txtset(t).
  • the method of segmenting the network data text to obtain the word segment set is: use a Chinese word segmentation algorithm to segment all the string data stored in the network data texts, record the resulting string fragments as word segments, remove repeated word segments, and record the set of word segments as the word segment set; the word segment set is a set whose elements are distinct and ordered.
  • the method of performing numerical processing on each word segment in the word segment set to obtain its corresponding word-segment numerical signal is:
  • the word segment set is denoted Tokenset; the variable n is the number of elements in Tokenset, the variable i is the sequence number of an element in Tokenset, i ∈ [1,n], and Tokenset(i) is the element with sequence number i in Tokenset
  • the binary numbers converted from the hexadecimal national standard code of each character in the string of the word segment Tokenset(i) are added together, and the result is the binary string bistr(i); bistr(i) is thus the binary string of the word segment Tokenset(i), the element with sequence number i in Tokenset
  • the binary string bistr(i) is a string consisting of the characters "0" and "1"
  • the result of Dtp(tv(i)) is an array of the same size as the array tv(i), and the element with sequence number t in the result is tv(i)_t*cos(π*(t/n)); the array Dtp(tv(i)) is denoted Dtp(i) and is the word-segment numerical signal corresponding to the word segment Tokenset(i) with sequence number i in Tokenset; the length of the word-segment numerical signal array is v, and the sequence number of an element in it is t
  • the set of the word-segment numerical signals corresponding to the word segments in Tokenset is Dtpset, and the sequence number of the array Dtp(i) in Dtpset is i.
  • the method of calculating the sequence connection degree between word segments from their word-segment numerical signals is: the set of word segments obtained by segmenting the string data stored in the network data text Txtset(t) is taken as Tokenset_t; the set of all Tokenset_t obtained by segmenting each network data text Txtset(t) in Txtset is denoted Tokensets; every element of Tokenset_t also exists in Tokenset; the variable s is the sequence number of an element in Tokenset_t, and the element with sequence number s is denoted Tokenset_t(s); the variable k is the total number of elements in Tokenset_t, s ∈ [1,k]
  • the word-segment numerical signal in Dtpset of the word segment Tokenset_t(s), obtained through its sequence number s, is denoted Dtp(s);
  • a set Seqset is created; the number of elements in Seqset equals the number of elements in Txtset, the sequence number of an element in Seqset is t, the element with sequence number t is denoted Seqset(t), and each element Seqset(t) is an ordered set;
  • S401 start the program; obtain Tokenset_t; obtain the element Seqset(t) whose sequence number is t in the set Seqset, and make the elements in Seqset(t) empty;
  • Rel() is the function that calculates the degree of association between the word-segment numerical signals of two word segments; Rel(Dtp(s), Dtp(s2)), the degree of association between Dtp(s) and Dtp(s2), is denoted Rel(s, s2)
  • the element with sequence number t in the array Dtp(s) is Dtp(s)(t), and the element with sequence number t in the array Dtp(s2) is Dtp(s2)(t)
  • the calculation formula of Rel(s,s2) is as follows:
  • Rel(s, s2, k) is the degree of association between Tokenset_t(s) and the s2-th to k-th word segments in the set Tokenset_t; s3 is the summation variable, s3 ∈ [s2,k];
  • the obtained value is the value of the threshold, and the value of u is set as the value of the threshold;
  • record the obtained sequence connection degree Seq(Tokenset_t(s)) as Seq_t_s and add it to the set Seqset(t) as the element with sequence number s in Seqset(t);
  • S4051 judge whether the value of s is greater than or equal to k, if so, go to S4052, if otherwise, increase the value of s by 1; go to S403;
  • the element Seqset(t) with sequence number t in Seqset corresponds to the set of word segments Tokenset_t obtained by segmenting the network data text Txtset(t) with sequence number t in Txtset; each element of Seqset(t) is the sequence connection degree of the word segment with the same sequence number.
  • whether the sequence connection degree of a word segment is abnormal is judged as follows: for any word segment Tokenset_t(s) in the set of word segments Tokenset_t of any network data text Txtset(t), if the value of its sequence connection degree Seq(Tokenset_t(s)) in Tokenset_t is greater than the value of that word segment in the other elements of the set Tokensets, the network data text Txtset(t) is abnormal
  • the set of the elements of Tokensets other than Tokenset_t is Cu(Tokenset_t), Tokenset_t` ∈ Cu(Tokenset_t), and t` is the sequence number of an element in Cu(Tokenset_t)
  • the present invention also provides a big data-based threat intelligence early warning text analysis system
  • the system comprises a processor, a memory, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the big data-based threat intelligence early warning text analysis method are implemented
  • the system can run on computing devices such as desktop computers, notebooks, palmtop computers, and cloud data centers.
  • a system on which it can run may include, but is not limited to, a processor, a memory, and a server cluster.
  • the processor executes the computer program, which runs in the following units of the system:
  • a web crawler unit, used to obtain network data texts through web crawler technology
  • a word segmentation unit, used to segment the network data texts to obtain a word segment set
  • a numerical processing unit, used to perform numerical processing on each word segment in the word segment set to obtain the corresponding word-segment numerical signal of each word segment;
  • a sequence connection degree calculation unit, used to calculate the sequence connection degree between word segments from their word-segment numerical signals;
  • a threat intelligence screening unit, used to screen out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments, combine the obtained threat intelligence word segments into an early warning text, and send it to the screen of the client for display.
  • the beneficial effect of the present invention is: it provides a big data-based threat intelligence early warning text analysis method and system that obtains network data texts through web crawler technology and segments them to obtain a word segment set, performs numerical processing on each word segment in the set to obtain its corresponding word-segment numerical signal, calculates the sequence connection degree between word segments from those signals, and screens out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments
  • the early warning text is sent to the screen of the client for display, which achieves the beneficial effect of analyzing and quickly displaying potential semantic risks based on network data.
  • Fig. 1 shows a flow chart of a text analysis method for threat intelligence early warning based on big data
  • Figure 2 shows a system structure diagram of a threat intelligence early warning text analysis system based on big data.
  • Figure 1 is a flow chart of the big data-based threat intelligence early warning text analysis method according to the present invention; the method and system according to an embodiment of the present invention are described below with reference to Figure 1.
  • the present invention proposes a threat intelligence early warning text analysis method based on big data, and the method specifically includes the following steps:
  • S500 Calculate and screen out the word segments with abnormal sequence connection with other word segments as threat intelligence word segments, combine the obtained threat intelligence word segments into early warning text, and send the early warning text to the screen of the client for display.
  • the method for obtaining the network data text through web crawler technology is: obtain web page text from Internet social media through web crawler technology, save the obtained web page text as string data, and take the saved text file as the network data text; the set formed by all network data texts is the set Txtset, the variable t denotes the sequence number of a network data text in Txtset, and the network data text with sequence number t in Txtset is denoted Txtset(t).
  • the web crawler technology includes any one of focused web crawler, incremental web crawler, and deep web crawler.
  • the method of segmenting the network data text to obtain the word segment set is: use a Chinese word segmentation algorithm (any one of the reverse maximum matching (RMM) algorithm, the bidirectional maximum matching method, or an N-gram model) to segment the string data stored in all network data texts, record the resulting string fragments as word segments, remove repeated word segments, and record the set of word segments as the word segment set.
  • the method of performing numerical processing on each word segment in the word segment set to obtain its corresponding word-segment numerical signal is:
  • the word segment set is denoted Tokenset; the variable n is the number of elements in Tokenset, the variable i is the sequence number of an element in Tokenset, i ∈ [1,n], and Tokenset(i) is the element with sequence number i in Tokenset
  • the binary numbers converted from the hexadecimal national standard code of each character in the string of the word segment Tokenset(i) are added together, and the result is taken as the binary string bistr(i)
  • for example, the hexadecimal national standard code of the character "只" is D6BB, and the binary value of D6BB is 1101011010111011
  • adding it to the binary value 1101001010111011 of the other character of the word segment gives 1101001010111011 + 1101011010111011 = 11010100101110110
  • 11010100101110110 is converted into the string "11010100101110110", which is bistr(i)
  • each character of bistr(i) is then separated out to form the array [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0], where the comma "," is the separator
  • variable bistr(i) represents the binary character of the element whose sequence number is i in the collection Tokenset, that is, the token Tokenset(i), and the binary character bistr(i) is a character string composed of a character "0" and a character "1".
  • v denotes the string length of the binary string bistr(i); bistr(i) is split character by character into a v-dimensional array of v values that are each 0 or 1; the array obtained for each word segment Tokenset(i) is denoted tv(i), the set of the arrays tv(i) of all word segments is denoted tvset, and the sequence number of tv(i) in tvset is i; the sequence number of an element in the array tv(i) is t, t ∈ [1,v], and the element with sequence number t in tv(i) is tv(i)_t; the function Dtp() is the function that processes an array
  • Dtp(tv(i)) means processing the array tv(i) with the function Dtp(); π is the circle constant and cos() is the cosine function
  • the result of Dtp(tv(i)) is an array of the same size as the array tv(i), and the element with sequence number t in the result is tv(i)_t*cos(π*(t/n)); the array Dtp(tv(i)) is denoted Dtp(i) and is the word-segment numerical signal corresponding to the word segment Tokenset(i) with sequence number i in Tokenset; since word-segment numerical signals correspond one-to-one with word segments, their number and sequence numbers match those of the word segment arrays, so the length of the signal array is also v and the sequence number of an element in it is also t; the set of the word-segment numerical signals of the word segments in Tokenset is Dtpset, and the sequence number of the array Dtp(i) in Dtpset is i.
  • the obtained Rel(s, s2, k) is the degree of association between Tokenset_t(s) and the s2-th to k-th word segments in the set Tokenset_t;
  • the obtained value is the value of the threshold u1, and u is set to the value of the threshold u1;
  • the key part of the Python implementation code for screening out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments may include:
  • the string formed by concatenating the obtained threat intelligence word segments is taken as the early warning text, and the early warning text is sent to the screen of the client for display.
  • the big data-based threat intelligence early warning text analysis system includes a processor, a memory, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps in the above method embodiment are implemented; the system can run on computing devices such as desktop computers, notebooks, palmtop computers, and cloud data centers
  • a system on which it can run may include, but is not limited to, a processor, a memory, and a server cluster.
  • an embodiment of the present invention provides a big data-based threat intelligence early warning text analysis system; the processor executes the computer program, which runs in the following units of the system:
  • a web crawler unit, used to obtain network data texts through web crawler technology
  • a word segmentation unit, used to segment the network data texts to obtain a word segment set
  • a numerical processing unit, used to perform numerical processing on each word segment in the word segment set to obtain the corresponding word-segment numerical signal of each word segment;
  • a sequence connection degree calculation unit, used to calculate the sequence connection degree between word segments from their word-segment numerical signals;
  • a threat intelligence screening unit, used to screen out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments, combine the obtained threat intelligence word segments into an early warning text, and send it to the screen of the client for display.
  • the big data-based threat intelligence early warning text analysis system can run on computing devices such as desktop computers, notebooks, palmtop computers, and cloud data centers.
  • the big data-based threat intelligence early warning text analysis system includes, but is not limited to, a processor and a memory.
  • the above is only an example of the big data-based threat intelligence early warning text analysis method and system and does not limit them; the system may include more or fewer components than the example, combine certain components, or use different components.
  • the big data-based threat intelligence early warning text analysis system may also include input and output devices, network access devices, buses, etc.
  • the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or any conventional processor; the processor is the control center of the big data-based threat intelligence early warning text analysis system and uses various interfaces and lines to connect the parts of the entire system.
  • the memory can be used to store the computer programs and/or modules; the processor implements the functions of the big data-based threat intelligence early warning text analysis system by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory.
  • the memory may mainly include a program storage area and a data storage area; the program storage area may store an operating system and at least one application program required by a function (such as a sound playback function or an image playback function); the data storage area may store data created based on the use of the mobile phone (such as audio data or a phonebook).
  • the memory can include high-speed random access memory, and can also include non-volatile memory, such as hard disk, internal memory, plug-in hard disk, smart memory card (Smart Media Card, SMC), secure digital (Secure Digital, SD) card , flash card (Flash Card), at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • the present invention provides a big data-based threat intelligence early warning text analysis method and system, which obtains network data texts through web crawler technology and segments them to obtain a word segment set, performs numerical processing on each word segment in the set to obtain its corresponding word-segment numerical signal, calculates the sequence connection degree between word segments from those signals, screens out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments, and sends the early warning text to the screen of the client for display, achieving the beneficial effect of analyzing and quickly displaying potential semantic risks based on network data.
  • where a variable or symbol is defined more than once, its scope is limited to the paragraph in which it is defined; alternatively, it is defined again because it corresponds one-to-one with an earlier variable or symbol, with the same count and sequence numbers, and can therefore be defined repeatedly.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

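A big data-based threat intelligence early warning text analysis method and system. The method comprises: obtaining network data texts through web crawler technology (S100); segmenting the network data texts to obtain a word segment set (S200); performing numerical processing on each word segment in the word segment set to obtain the corresponding word-segment numerical signal of each word segment (S300); calculating the sequence connection degree between word segments from their word-segment numerical signals (S400); and screening out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments, combining the obtained threat intelligence word segments into an early warning text, and sending the early warning text to the screen of the client for display (S500). The method achieves the beneficial effect of analyzing and quickly displaying potential semantic risks based on network data.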

Description

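Big data-based threat intelligence early warning text analysis method and system
Technical Field
The present invention belongs to the technical fields of big data analysis, text processing, and network security threat early warning, and specifically relates to a big data-based threat intelligence early warning text analysis method and system.
Background Art
As society becomes increasingly information-driven and big data technology becomes more widespread, the volume of text data produced and demanded in the storage and dissemination of information on social media is rising sharply. The collection and storage of system security data, and the discovery and investigation of information security threats, accordingly impose higher requirements on security protection technology and management standardization than before. With both the total volume of stored text and the rate at which it is generated growing rapidly, the analysis and management of an excessive amount of web page data places new demands on technical systems. The threat early warning monitoring system, method, and deployment architecture based on big data analysis provided in the patent with publication number CN107196910A can be used for network security threat situational awareness and in-depth analysis in a variety of business scenarios, but it still cannot effectively analyze the semantic risk of network data.
Summary of the Invention
The purpose of the present invention is to propose a big data-based threat intelligence early warning text analysis method and system that solves one or more technical problems in the prior art, or at least provides a beneficial alternative or creates the conditions for one.
The sharp rise in the production of and demand for text data imposes higher requirements than before on security protection technology and management standardization for the collection and storage of system security data and for the discovery and investigation of information security threats; the semantic risk of network data needs to be analyzed effectively.
The present invention provides a big data-based threat intelligence early warning text analysis method and system, which obtains network data texts through web crawler technology and segments them to obtain a word segment set, performs numerical processing on each word segment in the set to obtain its corresponding word-segment numerical signal, calculates the sequence connection degree between word segments from those signals, screens out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments, and sends the early warning text to the screen of the client for display.
To achieve the above purpose, according to one aspect of the present invention, a big data-based threat intelligence early warning text analysis method is provided, the method comprising the following steps: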
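S100, obtain network data texts through web crawler technology;
S200, segment the network data texts to obtain a word segment set;
S300, perform numerical processing on each word segment in the word segment set to obtain the corresponding word-segment numerical signal of each word segment;
S400, calculate the sequence connection degree between word segments from their word-segment numerical signals;
S500, screen out the word segments whose sequence connection degree with other word segments is abnormal as threat intelligence word segments, combine the obtained threat intelligence word segments into an early warning text, and send the early warning text to the screen of the client for display.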
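Further, in S100, the network data texts are obtained through web crawler technology as follows: web page text on Internet social media is obtained through web crawler technology, the obtained web page text is saved as string data, and the saved text file is taken as a network data text. The set formed by all network data texts is denoted Txtset; the variable t denotes the sequence number of a network data text in Txtset, and the network data text with sequence number t in Txtset is denoted Txtset(t).
Further, in S200, the network data texts are segmented to obtain the word segment set as follows: a Chinese word segmentation algorithm is used to segment the string data stored in all network data texts, the resulting string fragments are recorded as word segments, repeated word segments are removed, and the set of word segments is recorded as the word segment set. The word segment set is a set whose elements are distinct and ordered.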
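The segmentation step S200 can be sketched in Python as follows. This is a minimal illustration that assumes reverse maximum matching (RMM), which the embodiment lists as one admissible Chinese word segmentation algorithm. The dictionary contents, the `max_len` window, and the function names `rmm_segment` and `build_tokenset` are illustrative assumptions and do not come from the patent text.

```python
def rmm_segment(text, dictionary, max_len=4):
    """Reverse maximum matching: scan from the end of the text and take the
    longest dictionary entry that ends at the current position."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first and shrink it until it is in the
        # dictionary; a single character is always accepted as a fallback.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()  # pieces were collected from right to left
    return tokens


def build_tokenset(texts, dictionary, max_len=4):
    """Segment every text and keep each word segment once, in first-seen
    order, so the result is both distinct and ordered."""
    seen = set()
    tokenset = []
    for text in texts:
        for token in rmm_segment(text, dictionary, max_len):
            if token not in seen:
                seen.add(token)
                tokenset.append(token)
    return tokenset


# Hypothetical dictionary and texts, for illustration only.
demo_dictionary = {"研究", "研究生", "生命", "起源"}
demo_tokenset = build_tokenset(["研究生命起源", "生命起源"], demo_dictionary)
```

A plain list plus a `seen` set is used here instead of a Python `set` alone because the text requires the word segment set to be ordered as well as free of duplicates.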
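Further, in S300, each word segment in the word segment set is converted into its corresponding word-segment numerical signal as follows. Denote the word segment set as the set Tokenset; let the variable n be the number of elements in Tokenset and the variable i be the sequence number of an element in Tokenset, i ∈ [1,n]; Tokenset(i) denotes the element with sequence number i in Tokenset. The binary numbers obtained by converting the hexadecimal national standard code of each character in the string of the word segment Tokenset(i) are added together, and the result is taken as the binary string bistr(i); that is, bistr(i) is the binary string of the word segment Tokenset(i), the element with sequence number i in Tokenset. The binary string bistr(i) is a string consisting of the characters "0" and "1". Let the variable v be the string length of bistr(i); bistr(i) is then split character by character into a v-dimensional array of v values that are each 0 or 1. The v-dimensional array obtained for each word segment Tokenset(i) is denoted tv(i); the set of the arrays tv(i) of all word segments Tokenset(i) is denoted tvset, and the sequence number of tv(i) in tvset is i. Let t be the sequence number of an element in tv(i), t ∈ [1,v], and let tv(i)_t be the element with sequence number t in tv(i). Let Dtp() be the function that processes an array, so that Dtp(tv(i)) denotes processing the array tv(i) with Dtp(); π is the circle constant and cos() is the cosine function. Dtp(tv(i)) is computed as: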
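$$\mathrm{Dtp}(tv(i)) = \left[\, tv(i)_1 \cos\left(\pi \cdot \tfrac{1}{n}\right),\ \ldots,\ tv(i)_t \cos\left(\pi \cdot \tfrac{t}{n}\right),\ \ldots,\ tv(i)_v \cos\left(\pi \cdot \tfrac{v}{n}\right) \right]$$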
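The result of Dtp(tv(i)) is an array of the same size as tv(i), in which the element with sequence number t is tv(i)_t*cos(π*(t/n)). The array Dtp(tv(i)) is denoted Dtp(i) and is the word-segment numerical signal corresponding to the word segment Tokenset(i) with sequence number i in Tokenset. The length of the word-segment numerical signal array is v, and the sequence number of an element in it is t. The set of the word-segment numerical signals of all word segments in Tokenset is denoted Dtpset, and the sequence number of the array Dtp(i) in Dtpset is i.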
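The numerical processing in S300 can be illustrated with a short Python sketch. It assumes that the "national standard code" is the GB2312 encoding available as Python's `gb2312` codec; the function names `bistr`, `to_tv`, and `dtp` mirror the symbols in the text but are otherwise illustrative. The sample word segment is built from the two character codes D2BB and D6BB used in the embodiment's worked example.

```python
import math


def bistr(token, encoding="gb2312"):
    """Add the national standard code values of all characters of a word
    segment and return the sum as a string of "0" and "1" characters."""
    total = 0
    for ch in token:
        # Each character's encoded bytes are read as one big-endian integer,
        # e.g. the bytes D6 BB become 0xD6BB.
        total += int.from_bytes(ch.encode(encoding), "big")
    return bin(total)[2:]


def to_tv(token, encoding="gb2312"):
    """Split bistr into the v-dimensional array tv of 0/1 values."""
    return [int(c) for c in bistr(token, encoding)]


def dtp(tv_i, n):
    """Element t (1-based) of the result is tv_t * cos(pi * (t / n)),
    where n is the number of word segments in Tokenset."""
    return [x * math.cos(math.pi * (t / n)) for t, x in enumerate(tv_i, start=1)]


# Word segment made of the two characters whose codes are D2BB and D6BB.
sample_token = bytes.fromhex("d2bbd6bb").decode("gb2312")
sample_bistr = bistr(sample_token)        # "11010100101110110"
sample_tv = to_tv(sample_token)           # 17 values, each 0 or 1
sample_signal = dtp(sample_tv, n=4)       # n = 4 is an arbitrary demo value
```

Positions where tv holds 0 stay 0 in the signal, so the signal carries the bit pattern of the summed codes modulated by the cosine term.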
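Further, in S400, the sequence connection degree between word segments is calculated from their word-segment numerical signals as follows. The set of word segments obtained by segmenting the string data stored in the network data text Txtset(t) is taken as Tokenset_t; the set of all Tokenset_t obtained by segmenting each network data text Txtset(t) in Txtset is denoted Tokensets. Every element of Tokenset_t also exists in Tokenset. Let the variable s be the sequence number of an element in Tokenset_t, with the element numbered s denoted Tokenset_t(s); let the variable k be the total number of elements in Tokenset_t, s ∈ [1,k]. The word-segment numerical signal in Dtpset of the word segment Tokenset_t(s), obtained through its sequence number s, is denoted Dtp(s);
A set Seqset is created. The number of elements in Seqset equals the number of elements in Txtset; the sequence number of an element in Seqset is t, the element numbered t is denoted Seqset(t), and each element Seqset(t) is an ordered set;
The procedure for calculating the sequence connection degree between word segments is: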
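S401, start the procedure; obtain Tokenset_t; obtain the element Seqset(t) with sequence number t in Seqset and clear the elements of Seqset(t);
S402, set s to 1; create the variable s2; create the variable u and set it to 0;
S403, obtain Dtp(s) through s;
S4041, set s2 to the value of s;
S4042, increase s2 by 1;
S4043, obtain Dtp(s2) through s2;
S4044, define Rel() as the function that calculates the degree of association between the word-segment numerical signals of two word segments; Rel(Dtp(s),Dtp(s2)), the degree of association between Dtp(s) and Dtp(s2), is denoted Rel(s,s2). Let Dtp(s)(t) be the element with sequence number t in the array Dtp(s) and Dtp(s2)(t) the element with sequence number t in the array Dtp(s2). Rel(s,s2) is calculated as follows: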
Figure PCTCN2022124189-appb-000002
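S4045, calculate the degree of association between Tokenset_t(s) and the other word segments in Tokenset_t. Let the variable s3 be the sequence number running from the s2-th to the k-th element of Tokenset_t, and denote the degree of association between Tokenset_t(s) and the s2-th to k-th word segments in Tokenset_t as Rel(s,s2,k), calculated as: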
Figure PCTCN2022124189-appb-000003
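The resulting Rel(s,s2,k) is the degree of association between Tokenset_t(s) and the s2-th to k-th word segments in Tokenset_t; s3 is the summation variable, s3 ∈ [s2,k];
S4046, calculate the threshold. The function exp() denotes the exponential function with the natural constant e as its base. The threshold is calculated as: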
Figure PCTCN2022124189-appb-000004
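The resulting value is the value of the threshold, and u is set to the value of the threshold;
The sequence association degree is computed from Rel(s,s2,k) and u: denote the sequence association degree between Tokenset_t(s) and the other tokens in Tokenset_t as Seq(Tokenset_t(s)), computed as: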
Seq(Tokenset_t(s))=u*Rel(s,s2,k),
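The resulting sequence association degree Seq(Tokenset_t(s)) is denoted Seq_t_s and added to the set Seqset(t) as the element of Seqset(t) with index s;
S4047, set u to the value of Seq_t_s; go to S4051;
S4051, determine whether s is greater than or equal to k; if so, go to S4052; otherwise, increase s by 1 and go to S403;
S4052, store the set Seqset(t) as the element with index t in Seqset; end the procedure;
The element Seqset(t) with index t in Seqset corresponds to the token set Tokenset_t obtained by segmenting the network data text Txtset(t) with index t in Txtset, and each element of Seqset(t) is the sequence association degree of the token with the corresponding index.
Further, in S500, the method for screening out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens, combining the obtained threat intelligence tokens into an early-warning text, and sending the early-warning text to the screen of the client for display is as follows. Whether the sequence association degree of a token is abnormal is determined as follows: for any token Tokenset_t(s) in the token set Tokenset_t of any network data text Txtset(t), if the value of its sequence association degree Seq(Tokenset_t(s)) in Tokenset_t is greater than the value of that token in the other elements of the set Tokensets, the network data text Txtset(t) is abnormal. Denote the set of the elements of Tokensets other than Tokenset_t as Cu(Tokenset_t), with Tokenset_t`∈Cu(Tokenset_t); let t` denote the index of an element in Cu(Tokenset_t), so that Tokenset_t` is the element with index t` in Cu(Tokenset_t), and Tokenset_t`(s)∈Tokenset_t`. The function len() returns the number of elements of a set, and Seq(Tokenset_t`(s)) denotes the sequence association degree between Tokenset_t`(s) and the other tokens in Tokenset_t`. The formula for determining whether the value of Seq(Tokenset_t(s)) in Tokenset_t is greater than the value of the token Tokenset_t(s) in the other elements of Tokensets is: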
Figure PCTCN2022124189-appb-000005
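If the above formula is satisfied, the token Tokenset_t(s) is abnormal in Tokenset_t; the token Tokenset_t(s) and the Tokenset_t containing it are sent to an output device for display or printing, that is, the string obtained by concatenating the threat intelligence tokens is taken as the early-warning text, and the early-warning text is sent to the screen of the client for display.
The present invention further provides a big-data-based threat intelligence early-warning text analysis system, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the big-data-based threat intelligence early-warning text analysis method. The system can run on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud data centers; the runnable system may include, but is not limited to, a processor, a memory, and a server cluster. The processor executes the computer program, which runs in the following units of the system:
a web crawler unit, configured to obtain network data text by web crawler technology;
a word segmentation unit, configured to segment the network data text to obtain a token set;
a numerical processing unit, configured to perform numerical processing on each token in the token set to obtain the token numerical signal corresponding to each token;
a sequence association degree computing unit, configured to compute the sequence association degree between tokens from their token numerical signals;
a threat intelligence screening unit, configured to screen out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens, and to combine the obtained threat intelligence tokens into an early-warning text that is sent to the screen of the client for display.
The beneficial effects of the present invention are as follows. The present invention provides a big-data-based threat intelligence early-warning text analysis method and system, in which network data text is obtained by web crawler technology and segmented to obtain a token set; each token in the token set is numerically processed to obtain its token numerical signal; the sequence association degree between tokens is computed from the token numerical signals; tokens whose sequence association degree with other tokens is abnormal are screened out as threat intelligence tokens; and the early-warning text is sent to the screen of the client for display. Potential semantic risks in network data are thereby analyzed and displayed quickly.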
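Brief Description of the Drawings
The above and other features of the present invention will become more apparent from the detailed description of the embodiments shown in the accompanying drawings, in which the same reference numerals denote the same or similar elements. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a flowchart of a big-data-based threat intelligence early-warning text analysis method;
FIG. 2 is a system structure diagram of a big-data-based threat intelligence early-warning text analysis system.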
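Detailed Description
The concept, specific structure, and technical effects of the present invention are described clearly and completely below with reference to the embodiments and the accompanying drawings, so that the purpose, solutions, and effects of the present invention can be fully understood. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
In the description of the present invention, “several” means one or more and “a plurality of” means two or more; “greater than”, “less than”, “exceeding”, and the like are understood as excluding the stated number, while “above”, “below”, “within”, and the like are understood as including it. Where “first” and “second” are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their order.
FIG. 1 is a flowchart of a big-data-based threat intelligence early-warning text analysis method according to the present invention. A big-data-based threat intelligence early-warning text analysis method and system according to an embodiment of the present invention are described below with reference to FIG. 1.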
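The present invention proposes a big-data-based threat intelligence early-warning text analysis method, which specifically comprises the following steps:
S100, obtaining network data text by web crawler technology;
S200, segmenting the network data text to obtain a token set;
S300, performing numerical processing on each token in the token set to obtain the token numerical signal corresponding to each token;
S400, computing the sequence association degree between tokens from their token numerical signals;
S500, screening out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens, combining the obtained threat intelligence tokens into an early-warning text, and sending the early-warning text to the screen of the client for display.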
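Further, in S100, the method for obtaining network data text by web crawler technology is as follows: web page text on Internet social media is obtained by web crawler technology, the obtained web page text is saved as string data, and each saved text file is taken as a network data text. The set of all network data texts is denoted Txtset, the variable t denotes the index of a network data text in Txtset, and the network data text with index t in Txtset is denoted Txtset(t). The web crawler technology includes any one of a focused web crawler, an incremental web crawler, and a Deep Web crawler.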
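As a rough illustration of turning fetched pages into the set Txtset, the sketch below strips markup from HTML strings using only the Python standard library. The names `page_to_text` and `build_txtset` are hypothetical, the fetching step itself (for example with `urllib.request.urlopen`) is left to the caller, and no particular crawler type from the text is implemented.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def page_to_text(html):
    """Return the visible text of one page as a single string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def build_txtset(pages):
    """Txtset as a list; Txtset(t) in the text corresponds to txtset[t - 1]."""
    return [page_to_text(html) for html in pages]
```

Keeping extraction separate from fetching means the same function can be applied to pages gathered by any of the crawler types the text lists.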
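Further, in S200, the method for segmenting the network data text to obtain the token set is as follows: a Chinese word segmentation algorithm (any one of the reverse maximum matching algorithm RMM, the bidirectional maximum matching method, or an N-gram model) is used to segment the string data stored in all the network data texts, the resulting string fragments are taken as tokens, repeated tokens are removed, and the set formed by the tokens is denoted the token set. The token set is a set whose elements are distinct and ordered.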
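One of the named options, reverse maximum matching, can be sketched as follows. This is a minimal illustration that assumes a caller-supplied dictionary `vocab` and a maximum word length `max_len`; both, along with the function names, are hypothetical and not taken from the patent.

```python
def rmm_segment(text, vocab, max_len=4):
    """Reverse maximum matching: scan from the end, taking the longest dictionary word."""
    tokens = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the window from the left until it matches a dictionary word
        # or only a single character is left.
        while start < end - 1 and text[start:end] not in vocab:
            start += 1
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens


def build_tokenset(texts, vocab, max_len=4):
    """Segment every text and keep each distinct token once, in first-seen order."""
    seen = {}
    for text in texts:
        for token in rmm_segment(text, vocab, max_len):
            seen.setdefault(token, None)
    return list(seen)
```

A dict is used in `build_tokenset` because it preserves insertion order, which gives the distinct, ordered token set the text asks for.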
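Further, in S300, the method for performing numerical processing on each token in the token set to obtain the token numerical signal corresponding to each token is as follows. Denote the token set as the set Tokenset; let the variable n denote the number of elements in Tokenset and the variable i denote the index of an element in Tokenset, i∈[1,n]; Tokenset(i) denotes the element with index i in Tokenset. For the token Tokenset(i), the hexadecimal value of each character of its string in the national standard code is converted to a binary number, and these binary numbers are added together; the sum is taken as the binary string bistr(i).
For example, “这是一只猫” (“This is a cat”) can be segmented into the array [“这是”, “一只”, “猫”], in which the string “一只” is one token. When Tokenset(i) refers to “一只”, the two characters in its string are “一” and “只”, and the hexadecimal value of each in the national standard code is converted to binary: the hexadecimal value of “一” is D2BB, written as the string “D2BB”, and the binary form of D2BB is 1101001010111011;
the hexadecimal value of “只” is D6BB, written as the string “D6BB”, and the binary form of D6BB is 1101011010111011;
1101001010111011+1101011010111011=11010100101110110, and 11010100101110110 written as the string “11010100101110110” is bistr(i). The digits of bistr(i) are separated into “1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0”, and these 0 and 1 values form the array [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0], where the commas are separators.
The variable bistr(i) denotes the binary string of the token Tokenset(i), the element with index i in Tokenset. The binary string bistr(i) is a string consisting of the characters “0” and “1”, and the variable v denotes its length. By string splitting, bistr(i) is divided into a v-dimensional array of v values, each 0 or 1. The v-dimensional array obtained for each token Tokenset(i) is denoted tv(i), and the set of the arrays tv(i) of all tokens is denoted tvset, where tv(i) has index i in tvset. Let t denote the index of an element in tv(i), t∈[1,v], and let tv(i)_t denote the element with index t in tv(i). Let Dtp() be the function that performs token numerical processing on an array, so that Dtp(tv(i)) denotes token numerical processing of the array tv(i) with Dtp(); π is the circle constant and cos() is the cosine function. Dtp(tv(i)) is computed as: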
Figure PCTCN2022124189-appb-000006
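The result of Dtp(tv(i)) is an array of the same size as tv(i), whose element with index t is tv(i)_t*cos(π*(t/n)). The array Dtp(tv(i)) is denoted Dtp(i) and is the token numerical signal corresponding to the token Tokenset(i) with index i in Tokenset. Because token numerical values correspond one-to-one with tokens, the tokens and the token arrays agree in number and index; the token numerical signal therefore also has array length v, and its elements are likewise indexed by t. The set of the token numerical signals corresponding to the tokens in Tokenset is denoted Dtpset, and the array Dtp(i) has index i in Dtpset.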
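The numerical processing described above can be sketched in Python as follows. The sketch assumes that the “national standard code” is GB2312, which is consistent with the values D2BB and D6BB in the worked example; the function names and the `encoding` parameter are hypothetical. A character outside the chosen encoding raises `UnicodeEncodeError`.

```python
import math


def bistr(token, encoding="gb2312"):
    """Sum each character's code value and return the sum as a binary string."""
    total = sum(int.from_bytes(ch.encode(encoding), "big") for ch in token)
    return format(total, "b")


def dtp(token, n, encoding="gb2312"):
    """Token numerical signal: bit t (1-based) weighted by cos(pi * t / n).

    n is the number of tokens in Tokenset, as the text defines it.
    """
    bits = [int(c) for c in bistr(token, encoding)]  # the array tv(i)
    return [b * math.cos(math.pi * t / n) for t, b in enumerate(bits, start=1)]
```

For the token “一只”, `bistr` yields the 17-character string from the worked example, so `dtp` returns an array of length v = 17.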
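Further, in S400, the method for computing the sequence association degree between tokens from their token numerical signals is as follows. The set of tokens obtained by segmenting the string data stored in the network data text Txtset(t) is denoted Tokenset_t, and the set of all Tokenset_t obtained by segmenting each network data text Txtset(t) in Txtset is denoted Tokensets. Every element of Tokenset_t is also an element of Tokenset. Let the variable s denote the index of an element in Tokenset_t, let Tokenset_t(s) denote the element with index s in Tokenset_t, and let the variable k denote the total number of elements in Tokenset_t, s∈[1,k]. The token numerical signal in Dtpset of the token Tokenset_t(s), obtained through its index s, is denoted Dtp(s);
A set Seqset is provided, with the same number of elements as the set Txtset. The elements of Seqset are indexed by t, the element with index t is denoted Seqset(t), and each element Seqset(t) is an ordered set;
The procedure for computing the sequence association degree between tokens is:
S401, start the procedure; obtain Tokenset_t; obtain the element Seqset(t) with index t in Seqset and empty Seqset(t);
S402, set s to 1; provide a variable s2; provide a variable u and set u to 0;
S403, obtain Dtp(s) through s;
S4041, set s2 to the value of s;
S4042, increase s2 by 1;
S4043, obtain Dtp(s2) through s2;
S4044, define Rel() as the function that computes the correlation degree between the token numerical signals of two tokens, so that Rel(Dtp(s),Dtp(s2)), written Rel(s,s2), is the correlation degree between Dtp(s) and Dtp(s2); let Dtp(s)(t) denote the element with index t in the array Dtp(s) and Dtp(s2)(t) the element with index t in the array Dtp(s2); Rel(s,s2) is computed as: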
Figure PCTCN2022124189-appb-000007
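S4045, compute the correlation degree between Tokenset_t(s) and the other tokens in Tokenset_t: let the variable s3 denote the index of the s2-th to k-th elements of Tokenset_t, and denote the correlation degree between Tokenset_t(s) and the s2-th to k-th tokens of Tokenset_t as Rel(s,s2,k), computed as: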
Figure PCTCN2022124189-appb-000008
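The resulting Rel(s,s2,k) is the correlation degree between Tokenset_t(s) and the s2-th to k-th tokens of Tokenset_t;
S4046, compute the threshold, where exp() denotes the exponential function with the natural constant e as its base; the threshold, denoted u1, is computed as: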
Figure PCTCN2022124189-appb-000009
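The resulting value is the value of the threshold u1, and u is set to the value of u1;
The sequence association degree is computed from Rel(s,s2,k) and u: denote the sequence association degree between Tokenset_t(s) and the other tokens in Tokenset_t as Seq(Tokenset_t(s)), computed as: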
Seq(Tokenset_t(s))=u*Rel(s,s2,k),
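The resulting sequence association degree Seq(Tokenset_t(s)) is denoted Seq_t_s and added to the set Seqset(t) as the element of Seqset(t) with index s;
S4047, set u to the value of Seq_t_s; go to S4051;
S4051, determine whether s is greater than or equal to k; if so, go to S4052; otherwise, increase s by 1 and go to S403;
S4052, store the set Seqset(t) as the element with index t in Seqset; end the procedure;
The element Seqset(t) with index t in Seqset corresponds to the token set Tokenset_t obtained by segmenting the network data text Txtset(t) with index t in Txtset, and each element of Seqset(t) is the sequence association degree of the token with the corresponding index.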
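The control flow of S401 to S4052 can be sketched as a single loop. Because the formulas for Rel() and for the threshold u1 appear only as figures, this sketch takes both as caller-supplied functions; the argument list `threshold(u, rel_total)` is an assumption, as is reading Rel(s,s2,k) as the sum of Rel over s3 from s2 to k (the text only says that s3 is the summation variable).

```python
def sequence_association(signals, rel, threshold):
    """Return Seqset(t) for one text, following the control flow S401-S4052.

    signals   -- [Dtp(1), ..., Dtp(k)] for the tokens of Tokenset_t
    rel       -- caller-supplied Rel(Dtp(s), Dtp(s2)); the patent's formula is a figure
    threshold -- caller-supplied u1 = threshold(u, rel_total); hypothetical signature
    """
    k = len(signals)
    seq = []    # S401: Seqset(t) starts empty
    u = 0.0     # S402
    for s in range(k):  # S403 / S4051: s runs over 1..k (0-based here)
        # S4041-S4045: accumulate Rel over s3 = s2..k, where s2 = s + 1
        rel_total = sum(rel(signals[s], signals[s3]) for s3 in range(s + 1, k))
        u = threshold(u, rel_total)  # S4046: u is set to u1
        seq_s = u * rel_total        # Seq(Tokenset_t(s)) = u * Rel(s, s2, k)
        seq.append(seq_s)
        u = seq_s                    # S4047
    return seq  # S4052
```

For the last token (s = k) the range from s2 to k is empty, so under this reading its accumulated correlation degree, and hence its sequence association degree, is 0.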
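Further, in S500, the method for screening out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens, combining the obtained threat intelligence tokens into an early-warning text, and sending the early-warning text to the screen of the client for display is as follows. Whether the sequence association degree of a token is abnormal is determined as follows: for any token Tokenset_t(s) in the token set Tokenset_t of any network data text Txtset(t), if the value of its sequence association degree Seq(Tokenset_t(s)) in Tokenset_t is greater than the value of that token in the other elements of the set Tokensets, the network data text Txtset(t) is abnormal. Denote the set of the elements of Tokensets other than Tokenset_t as Cu(Tokenset_t), with Tokenset_t`∈Cu(Tokenset_t); let t` denote the index of an element in Cu(Tokenset_t), so that Tokenset_t` is the element with index t` in Cu(Tokenset_t), and Tokenset_t`(s) is the element with index s in Tokenset_t`, Tokenset_t`(s)∈Tokenset_t`. The function len() returns the number of elements of a set, and Seq(Tokenset_t`(s)) denotes the sequence association degree between Tokenset_t`(s) and the other tokens in Tokenset_t`. The formula for determining whether the value of Seq(Tokenset_t(s)) in Tokenset_t is greater than the value of the token Tokenset_t(s) in the other elements of Tokensets is: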
Figure PCTCN2022124189-appb-000010
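If the above formula is satisfied, the token Tokenset_t(s) is abnormal in Tokenset_t, and the token Tokenset_t(s) and the Tokenset_t containing it are sent to an output device for display or printing;
Preferably, the key part of a Python implementation for screening out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens may include: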
Figure PCTCN2022124189-appb-000011
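That is, the string obtained by concatenating the threat intelligence tokens is taken as the early-warning text, and the early-warning text is sent to the screen of the client for display.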
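Since the patent's own Python listing and its anomaly formula are available here only as figure placeholders, the following is an independent sketch rather than a reproduction. It assumes that the comparison is against the mean of the same token's sequence association degree in the other texts (suggested by the use of len()), that tokens are matched by their string rather than by index, and that a token appearing in no other text is not flagged; all names are hypothetical.

```python
def find_threat_tokens(seq_by_text, t):
    """Tokens of text t whose Seq exceeds the mean Seq of the same token elsewhere.

    seq_by_text -- list of dicts, one per text: token -> sequence association degree
    t           -- 0-based index of the text under test
    """
    flagged = []
    for token, value in seq_by_text[t].items():
        others = [m[token] for j, m in enumerate(seq_by_text) if j != t and token in m]
        # Assumption: with no other occurrence there is nothing to compare against.
        if others and value > sum(others) / len(others):
            flagged.append(token)
    return flagged


def warning_text(tokens, sep=""):
    """Concatenate the threat intelligence tokens into the early-warning text."""
    return sep.join(tokens)
```

A caller would typically build `seq_by_text` by pairing each text's tokens with its Seqset(t) values and then pass the flagged tokens to `warning_text` for display.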
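The big-data-based threat intelligence early-warning text analysis system comprises: a processor, a memory, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the above embodiment of the big-data-based threat intelligence early-warning text analysis method. The system can run on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud data centers; the runnable system may include, but is not limited to, a processor, a memory, and a server cluster.
An embodiment of the present invention provides a big-data-based threat intelligence early-warning text analysis system, shown in FIG. 2. The system of this embodiment comprises: a processor, a memory, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the above embodiment of the big-data-based threat intelligence early-warning text analysis method. The processor executes the computer program, which runs in the following units of the system:
a web crawler unit, configured to obtain network data text by web crawler technology;
a word segmentation unit, configured to segment the network data text to obtain a token set;
a numerical processing unit, configured to perform numerical processing on each token in the token set to obtain the token numerical signal corresponding to each token;
a sequence association degree computing unit, configured to compute the sequence association degree between tokens from their token numerical signals;
a threat intelligence screening unit, configured to screen out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens, and to combine the obtained threat intelligence tokens into an early-warning text that is sent to the screen of the client for display.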
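One way to picture how the five units fit together is a small pipeline object whose fields are the units. This is only a structural sketch: the class name, the field names, and the callable signatures are hypothetical, and each unit is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ThreatTextPipeline:
    crawl: Callable[[], List[str]]                          # web crawler unit -> Txtset
    segment: Callable[[str], List[str]]                     # word segmentation unit
    digitize: Callable[[str, int], List[float]]             # numerical processing unit
    associate: Callable[[List[List[float]]], List[float]]   # sequence association unit
    screen: Callable[[List[dict], int], List[str]]          # threat screening unit

    def run(self) -> List[str]:
        """Return one early-warning text per crawled text."""
        texts = self.crawl()
        token_lists = [self.segment(t) for t in texts]
        n = len({tok for toks in token_lists for tok in toks})  # size of Tokenset
        seq_by_text = []
        for toks in token_lists:
            signals = [self.digitize(tok, n) for tok in toks]
            seq_by_text.append(dict(zip(toks, self.associate(signals))))
        return ["".join(self.screen(seq_by_text, t)) for t in range(len(texts))]
```

Passing the units in as callables keeps each one replaceable, which matches the text's statement that the units may be realized in different ways.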
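The big-data-based threat intelligence early-warning text analysis system can run on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud data centers. The system includes, but is not limited to, a processor and a memory. Those skilled in the art will understand that the example is merely an example of the big-data-based threat intelligence early-warning text analysis method and system and does not limit them; the system may include more or fewer components than the example, combine certain components, or use different components. For example, the system may further include input/output devices, network access devices, buses, and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the big-data-based threat intelligence early-warning text analysis system and uses various interfaces and lines to connect the parts of the entire system.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the big-data-based threat intelligence early-warning text analysis method and system by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the device (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.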
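The present invention provides a big-data-based threat intelligence early-warning text analysis method and system, in which network data text is obtained by web crawler technology and segmented to obtain a token set; each token in the token set is numerically processed to obtain its token numerical signal; the sequence association degree between tokens is computed from the token numerical signals; tokens whose sequence association degree with other tokens is abnormal are screened out as threat intelligence tokens; and the early-warning text is sent to the screen of the client for display. Potential semantic risks in network data are thereby analyzed and displayed quickly.
Note: if a variable or symbol is defined more than once in this application, its scope is limited to the paragraph in which it appears; alternatively, where a redefined variable or symbol corresponds one-to-one with an earlier variable or symbol, the two agree in number and index and may therefore be redefined.
Although the present invention has been described in considerable detail and with particular reference to several embodiments, it is not intended to be limited to any of these details or embodiments or to any particular embodiment, so that the intended scope of the invention is effectively covered. Furthermore, the invention has been described above in terms of embodiments foreseeable by the inventors in order to provide a useful description, and insubstantial modifications of the invention not presently foreseen may nonetheless represent equivalents of it.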

Claims (7)

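  1. A big-data-based threat intelligence early-warning text analysis method, characterized in that the method comprises the following steps:
    S100, obtaining network data text by web crawler technology;
    S200, segmenting the network data text to obtain a token set;
    S300, performing numerical processing on each token in the token set to obtain the token numerical signal corresponding to each token;
    S400, computing the sequence association degree between tokens from their token numerical signals;
    S500, screening out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens, combining the obtained threat intelligence tokens into an early-warning text, and sending the early-warning text to the screen of the client for display.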
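  2. The big-data-based threat intelligence early-warning text analysis method according to claim 1, characterized in that, in S100, the method for obtaining network data text by web crawler technology is: obtaining web page text on Internet social media by web crawler technology, saving the obtained web page text as string data, and taking each saved text file as a network data text; the set of all network data texts is denoted Txtset, the variable t denotes the index of a network data text in Txtset, and the network data text with index t in Txtset is denoted Txtset(t).
  3. The big-data-based threat intelligence early-warning text analysis method according to claim 1, characterized in that, in S200, the method for segmenting the network data text to obtain the token set is: using a Chinese word segmentation algorithm to segment the string data stored in all the network data texts, taking the resulting string fragments as tokens, removing repeated tokens, and denoting the set formed by the tokens as the token set.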
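  4. The big-data-based threat intelligence early-warning text analysis method according to claim 1, characterized in that, in S300, the method for performing numerical processing on each token in the token set to obtain the token numerical signal corresponding to each token is: denoting the token set as the set Tokenset, letting the variable n denote the number of elements in Tokenset and the variable i denote the index of an element in Tokenset, i∈[1,n], with Tokenset(i) denoting the element with index i in Tokenset; converting the hexadecimal value of each character of the string of the token Tokenset(i) to a binary number and adding these binary numbers together, the sum being taken as the binary string bistr(i), where bistr(i) denotes the binary string of the token Tokenset(i), the element with index i in Tokenset; the binary string bistr(i) is a string consisting of the characters “0” and “1”, and the variable v denotes its length; by string splitting, bistr(i) is divided into a v-dimensional array of v values, each 0 or 1; the v-dimensional array obtained for each token Tokenset(i) is denoted tv(i), and the set of the arrays tv(i) of all tokens is denoted tvset, where tv(i) has index i in tvset; t denotes the index of an element in tv(i), t∈[1,v], and tv(i)_t denotes the element with index t in tv(i); Dtp() is the function that performs token numerical processing on an array, so that Dtp(tv(i)) denotes token numerical processing of the array tv(i) with Dtp(); π is the circle constant and cos() is the cosine function; Dtp(tv(i)) is computed as: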
    Figure PCTCN2022124189-appb-100001
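    the result of Dtp(tv(i)) is an array of the same size as tv(i), whose element with index t is tv(i)_t*cos(π*(t/n)); the array Dtp(tv(i)) is denoted Dtp(i) and is the token numerical signal corresponding to the token Tokenset(i) with index i in Tokenset; because token numerical values correspond one-to-one with tokens, the tokens and the token arrays agree in number and index, so the token numerical signal also has array length v and its elements are likewise indexed by t; the set of the token numerical signals corresponding to the tokens in Tokenset is denoted Dtpset, and the array Dtp(i) has index i in Dtpset.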
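  5. The big-data-based threat intelligence early-warning text analysis method according to claim 4, characterized in that, in S400, the method for computing the sequence association degree between tokens from their token numerical signals is: the set of tokens obtained by segmenting the string data stored in the network data text Txtset(t) is denoted Tokenset_t, and the set of all Tokenset_t obtained by segmenting each network data text Txtset(t) in Txtset is denoted Tokensets; every element of Tokenset_t is also an element of Tokenset; the variable s denotes the index of an element in Tokenset_t, Tokenset_t(s) denotes the element with index s in Tokenset_t, and the variable k denotes the total number of elements in Tokenset_t, s∈[1,k]; the token numerical signal in Dtpset of the token Tokenset_t(s), obtained through its index s, is denoted Dtp(s);
    a set Seqset is provided, with the same number of elements as the set Txtset; the elements of Seqset are indexed by t, the element with index t is denoted Seqset(t), and each element Seqset(t) is an ordered set;
    the procedure for computing the sequence association degree between tokens is:
    S401, start the procedure; obtain Tokenset_t; obtain the element Seqset(t) with index t in Seqset and empty Seqset(t);
    S402, set s to 1; provide a variable s2; provide a variable u and set u to 0;
    S403, obtain Dtp(s) through s;
    S4041, set s2 to the value of s;
    S4042, increase s2 by 1;
    S4043, obtain Dtp(s2) through s2;
    S4044, define Rel() as the function that computes the correlation degree between the token numerical signals of two tokens, so that Rel(Dtp(s),Dtp(s2)), written Rel(s,s2), is the correlation degree between Dtp(s) and Dtp(s2); let Dtp(s)(t) denote the element with index t in the array Dtp(s) and Dtp(s2)(t) the element with index t in the array Dtp(s2); Rel(s,s2) is computed as: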
    Figure PCTCN2022124189-appb-100002
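    S4045, compute the correlation degree between Tokenset_t(s) and the other tokens in Tokenset_t: let the variable s3 denote the index of the s2-th to k-th elements of Tokenset_t, and denote the correlation degree between Tokenset_t(s) and the s2-th to k-th tokens of Tokenset_t as Rel(s,s2,k), computed as: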
    Figure PCTCN2022124189-appb-100003
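    the resulting Rel(s,s2,k) is the correlation degree between Tokenset_t(s) and the s2-th to k-th tokens of Tokenset_t;
    S4046, compute the threshold, where exp() denotes the exponential function with the natural constant e as its base; the threshold, denoted u1, is computed as: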
    Figure PCTCN2022124189-appb-100004
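    the resulting value is the value of the threshold u1, and u is set to the value of u1;
    the sequence association degree is computed from Rel(s,s2,k) and u: denote the sequence association degree between Tokenset_t(s) and the other tokens in Tokenset_t as Seq(Tokenset_t(s)), computed as: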
    Seq(Tokenset_t(s))=u*Rel(s,s2,k),
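    the resulting sequence association degree Seq(Tokenset_t(s)) is denoted Seq_t_s and added to the set Seqset(t) as the element of Seqset(t) with index s;
    S4047, set u to the value of Seq_t_s; go to S4051;
    S4051, determine whether s is greater than or equal to k; if so, go to S4052; otherwise, increase s by 1 and go to S403;
    S4052, store the set Seqset(t) as the element with index t in Seqset; end the procedure;
    the element Seqset(t) with index t in Seqset corresponds to the token set Tokenset_t obtained by segmenting the network data text Txtset(t) with index t in Txtset, and each element of Seqset(t) is the sequence association degree of the token with the corresponding index.
  6. The big-data-based threat intelligence early-warning text analysis method according to claim 5, characterized in that, in S500, the method for screening out tokens whose sequence association degree with other tokens is abnormal as threat intelligence tokens, combining the obtained threat intelligence tokens into an early-warning text, and sending the early-warning text to the screen of the client for display is: whether the sequence association degree of a token is abnormal is determined as follows: for any token Tokenset_t(s) in the token set Tokenset_t of any network data text Txtset(t), if the value of its sequence association degree Seq(Tokenset_t(s)) in Tokenset_t is greater than the value of that token in the other elements of the set Tokensets, the network data text Txtset(t) is abnormal; the set of the elements of Tokensets other than Tokenset_t is denoted Cu(Tokenset_t), with Tokenset_t`∈Cu(Tokenset_t); t` denotes the index of an element in Cu(Tokenset_t), Tokenset_t` is the element with index t` in Cu(Tokenset_t), Tokenset_t`(s)∈Tokenset_t`, and Tokenset_t`(s) is the element with index s in Tokenset_t`; the function len() returns the number of elements of a set, and Seq(Tokenset_t`(s)) denotes the sequence association degree between Tokenset_t`(s) and the other tokens in Tokenset_t`; the formula for determining whether the value of Seq(Tokenset_t(s)) in Tokenset_t is greater than the value of the token Tokenset_t(s) in the other elements of Tokensets is: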
    Figure PCTCN2022124189-appb-100005
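    if the above formula is satisfied, the token Tokenset_t(s) is abnormal in Tokenset_t; the token Tokenset_t(s) and the Tokenset_t containing it are sent to an output device for display or printing, that is, the string obtained by concatenating the threat intelligence tokens is taken as the early-warning text, and the early-warning text is sent to the screen of the client for display.
  7. A big-data-based threat intelligence early-warning text analysis system, characterized in that the system comprises: a processor, a memory, and a computer program stored in the memory and running on the processor; when executing the computer program, the processor implements the steps of the big-data-based threat intelligence early-warning text analysis method of claim 1; and the system runs on a computing device among desktop computers, notebook computers, palmtop computers, and cloud data centers.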
PCT/CN2022/124189 2021-10-13 2022-10-09 一种基于大数据的威胁情报预警文本分析方法及系统 WO2023061304A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111189527.XA CN113627179B (zh) 2021-10-13 2021-10-13 一种基于大数据的威胁情报预警文本分析方法及系统
CN202111189527.X 2021-10-13

Publications (1)

Publication Number Publication Date
WO2023061304A1 true WO2023061304A1 (zh) 2023-04-20

Family

ID=78391229

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124189 WO2023061304A1 (zh) 2021-10-13 2022-10-09 一种基于大数据的威胁情报预警文本分析方法及系统

Country Status (2)

Country Link
CN (1) CN113627179B (zh)
WO (1) WO2023061304A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116859831A (zh) * 2023-05-15 2023-10-10 广东思创智联科技股份有限公司 一种基于物联网的工业大数据处理方法及系统

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627179B (zh) * 2021-10-13 2021-12-21 广东机电职业技术学院 一种基于大数据的威胁情报预警文本分析方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170650A (zh) * 2016-12-07 2018-06-15 北京京东尚科信息技术有限公司 文本比较方法以及文本比较装置
CN111400439A (zh) * 2020-02-26 2020-07-10 平安科技(深圳)有限公司 网络不良数据监控方法、装置及存储介质
US20200380207A1 (en) * 2018-02-20 2020-12-03 Nippon Telegraph And Telephone Corporation Morpheme analysis learning device, morpheme analysis device, method, and program
CN113627179A (zh) * 2021-10-13 2021-11-09 广东机电职业技术学院 一种基于大数据的威胁情报预警文本分析方法及系统

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120239667A1 (en) * 2011-03-15 2012-09-20 Microsoft Corporation Keyword extraction from uniform resource locators (urls)
CN106886567B (zh) * 2017-01-12 2019-11-08 北京航空航天大学 基于语义扩展的微博突发事件检测方法及装置
CN108319666B (zh) * 2018-01-19 2021-09-28 国网浙江省电力有限公司营销服务中心 一种基于多模态舆情分析的供电服务评估方法
CN109698823B (zh) * 2018-11-29 2021-05-07 广东电网有限责任公司信息中心 一种网络威胁发现方法
CN112182461A (zh) * 2020-08-21 2021-01-05 杭州安恒信息技术股份有限公司 网页敏感度的计算方法、装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116859831A (zh) * 2023-05-15 2023-10-10 广东思创智联科技股份有限公司 一种基于物联网的工业大数据处理方法及系统
CN116859831B (zh) * 2023-05-15 2024-01-26 广东思创智联科技股份有限公司 一种基于物联网的工业大数据处理方法及系统

Also Published As

Publication number Publication date
CN113627179B (zh) 2021-12-21
CN113627179A (zh) 2021-11-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22880243

Country of ref document: EP

Kind code of ref document: A1