CN112487306A - Automatic event marking and classifying method based on knowledge graph

Info

Publication number: CN112487306A
Application number: CN202011417045.0A
Authority: CN
Inventors: 王晓玲; 赵鑫; 袁佳豪; 王韵弘
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2020-12-07
Filing date: 2020-12-07
Publication date: 2021-03-12
Anticipated expiration: 2040-12-07
Also published as: CN112487306B

Abstract

The invention discloses an automatic event marking and classifying method based on knowledge graphs. The method constructs a knowledge graph for each required domain category, crawls the texts published on social media within a preset time window, extracts key phrases and screens them to obtain burst phrases, and clusters the burst phrases into burst phrase clusters, each of which corresponds to a burst event. For each burst event, the TF-IDF scores of its burst phrases on each knowledge graph are calculated and summed to give the event's TF-IDF score on that knowledge graph; if the score is greater than a preset threshold, the event is marked with the corresponding domain category, which determines the marking and classification of the event. By screening and clustering the burst phrases in social media text to identify burst events automatically, and then calculating the TF-IDF score of each burst event on the knowledge graph of every domain category, the method achieves automatic and accurate marking and classification of social media events.

Description

Automatic event marking and classifying method based on knowledge graph

Technical Field

The invention belongs to the technical field of event marking and classification, and particularly relates to an automatic event marking and classifying method based on a knowledge graph.

Background

In recent years, with the rapid development of social media, platforms such as Twitter and microblogs have gradually become an important way for people to obtain news. As a result, more and more work focuses on analyzing social media information. One important line of work is event extraction from social media data, that is, extracting the events described by social media posts. However, the extracted events (event key phrases, summaries, and so on) inevitably include some that are of no interest. The extracted events therefore need to be marked and classified (into categories such as military, politics, and geography) so that the category of each event is known, events in categories of no interest can be filtered out, and events of interest can be selected. How to accurately determine the category of an event from the small amount of information that describes it, and how to handle an event that may belong to several categories, still lack good solutions and require further research.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an automatic event marking and classifying method based on a knowledge graph, so that automatic and accurate marking and classification of social media events are realized.

In order to achieve the above object, the automatic event marking and classifying method based on knowledge graph of the present invention comprises the following steps:

S1: Setting N domain categories according to actual needs, collecting text data for each domain category, and constructing a knowledge graph G_n for each, n = 1, 2, …, N;

S2: Presetting a time window T, crawling each text published on social media within the time window, extracting key phrases from each text, and forming a key phrase set A from the extracted key phrases; calculating the burst degree W_s of each key phrase s in the key phrase set A by the following formula:

W_s＝p_s×log(u_s)×log(r_s)×log(log(f_s))

where p_s represents the burst probability of key phrase s within the time window T, u_s represents the number of users who used key phrase s within T, r_s represents the number of times texts containing key phrase s were forwarded within T, and f_s represents the sum of the follow counts of the users who used key phrase s within T;

sorting all key phrases according to the burst degree from high to low, and selecting the first K key phrases as burst phrases to be added into a burst phrase set B;

S3: Dividing the time window T evenly into M consecutive, disjoint sub-time windows, the m-th sub-time window being denoted T_m; for each burst phrase e ∈ B, counting the text set text(e, m) and the number of texts f₁(e, m) containing the burst phrase e in each sub-time window T_m, as well as the number of texts f₂(e) containing the burst phrase e over the whole time window T, and calculating the proportion of burst phrase e in sub-time window T_m as d(e, m) = f₁(e, m)/f₂(e);

Let two burst phrases be e_a and e_b. First, the similarity sim(text(e_a, m), text(e_b, m)) between the text sets text(e_a, m) and text(e_b, m) is calculated, and then the similarity S(e_a, e_b) of the two burst phrases is calculated using the following formula:

The burst phrases are clustered according to the similarity between burst phrases to obtain K burst phrase clusters C_k, k = 1, 2, …, K; each burst phrase cluster C_k is a burst event within the time window T;

S4: For the burst event corresponding to each burst phrase cluster C_k, calculating, from the text data covered by the knowledge graph G_n of each domain category constructed in step S1, the TF-IDF score score(v, n) of each burst phrase v in C_k on the knowledge graph G_n, and then summing to obtain the TF-IDF score score(k, n) of the burst event on the knowledge graph G_n:

score(k, n) = Σ_{v∈C_k} score(v, n)

S5: Presetting a TF-IDF score threshold; for the burst event corresponding to burst phrase cluster C_k, if its TF-IDF score score(k, n) on the knowledge graph G_n of a domain category is greater than the threshold, the event is marked with that domain category, thereby determining the marking and classification of the event.

The automatic event marking and classifying method based on knowledge graphs of the present invention constructs a knowledge graph for each required domain category, crawls the texts published on social media within a preset time window, extracts key phrases and screens them to obtain burst phrases, and clusters the burst phrases into burst phrase clusters. The TF-IDF scores of the burst phrases of each burst event on each knowledge graph are calculated and summed to give the event's TF-IDF score on that knowledge graph; if the score is greater than a preset threshold, the event is marked with the corresponding domain category, thereby determining the marking and classification of the event.

By screening and clustering the burst phrases in social media text, the method identifies burst events automatically; it then calculates the TF-IDF score of each burst event on the knowledge graph of every domain category, and thus achieves automatic and accurate marking and classification of social media events.

Drawings

FIG. 1 is a flow diagram of an embodiment of a method for knowledge-graph based automated event tagging and classification in accordance with the present invention.

Detailed Description

Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the subject matter of the invention.

Examples

FIG. 1 is a flow diagram of an embodiment of a method for knowledge-graph based automated event tagging and classification in accordance with the present invention. As shown in FIG. 1, the automatic event labeling and classifying method based on knowledge graph of the present invention comprises the following specific steps:

S101: Constructing knowledge graphs of the domain categories:

N domain categories are set according to actual needs, text data is collected for each domain category, and a knowledge graph G_n is constructed for each, n = 1, 2, …, N.

A knowledge graph is a family of graphs that shows the development of knowledge and its structural relationships; it uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the relationships among them. For internet social media, the domain categories set in this embodiment are Military, People, Industry, Safety, Meteorology, and Geography. Data related to these domain categories is crawled from Wikipedia, a knowledge graph is constructed from the data for each category, and events are marked and classified with the help of the information in these knowledge graphs.

S102: Screening burst phrases:

A time window T is preset, each text published on social media within the time window is crawled, key phrases are extracted from each text, and the extracted key phrases form a key phrase set A. The burst degree W_s of each key phrase s in the key phrase set A is calculated by the following formula:

W_s＝p_s×log(u_s)×log(r_s)×log(log(f_s))

where p_s represents the burst probability of key phrase s within the time window T, i.e.

p_s = t_s / Σ_{s′∈A} t_{s′}

where t_s and t_{s′} respectively represent the numbers of occurrences of key phrases s and s′ within the time window T, s, s′ ∈ A; u_s represents the number of users who used key phrase s within T; r_s represents the number of times texts containing key phrase s were forwarded within T; and f_s represents the sum of the follow counts of the users who used key phrase s within T.

All key phrases are sorted by burst degree from high to low, and the top K key phrases are selected as burst phrases and added to the burst phrase set B; the value of K is set as required.

Table 1 is a list of the degree of burstiness of some key phrases in this example.

Key phrase	Burst degree
Iran	20.1310
Zarif	6.91061
foreign minister	3.68816
Human rights	3.21209
president	2.54122
resignation	2.53455
fellow diplomats	1.22547
foreign policy	0.32457
hinting at	0.30289
condemns	0.02536
In front of	0.01785
except for	0.00566

TABLE 1

In this embodiment, the top 3 key phrases are selected as burst phrases and added to the burst phrase set B, i.e., B = {Iran, Zarif, foreign minister}.
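As an illustration of step S102, the burst degree and the top-K selection can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `PhraseStats` fields and the function names are hypothetical, p_s is assumed to be the phrase's share of all key phrase occurrences in the window, natural logarithms are assumed, and a phrase for which a logarithm would be undefined is simply given a burst degree of 0.

```python
import math
from dataclasses import dataclass


@dataclass
class PhraseStats:
    """Counts for one key phrase within the time window T (hypothetical field names)."""
    occurrences: int  # t_s: occurrences of the phrase
    users: int        # u_s: users who used the phrase
    forwards: int     # r_s: forwards of texts containing the phrase
    follow_sum: int   # f_s: summed follow counts of those users


def burst_degrees(stats):
    """Return {phrase: W_s}, W_s = p_s * log(u_s) * log(r_s) * log(log(f_s))."""
    total = sum(s.occurrences for s in stats.values())
    degrees = {}
    for phrase, s in stats.items():
        # Assumption: where a logarithm is undefined, the burst degree is 0.
        if total == 0 or s.users < 1 or s.forwards < 1 or s.follow_sum <= 1:
            degrees[phrase] = 0.0
            continue
        p_s = s.occurrences / total  # assumed form of the burst probability
        degrees[phrase] = (p_s * math.log(s.users) * math.log(s.forwards)
                           * math.log(math.log(s.follow_sum)))
    return degrees


def top_k_burst_phrases(stats, k):
    """Sort key phrases by burst degree, high to low, and keep the first k."""
    degrees = burst_degrees(stats)
    return sorted(degrees, key=degrees.get, reverse=True)[:k]
```

Calling `top_k_burst_phrases(stats, 3)` on a dictionary of per-phrase counts would yield a burst phrase set B of the kind shown above; the counts themselves have to come from the crawled data.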

S103: Constructing burst events based on burst phrase clustering:

The time window T is divided evenly into M consecutive, disjoint sub-time windows, and the m-th sub-time window is denoted T_m, m = 1, 2, …, M. For each burst phrase e ∈ B, the text set text(e, m) and the number of texts f₁(e, m) containing the burst phrase e are counted for each sub-time window T_m, together with the number of texts f₂(e) containing the burst phrase e over the whole time window T, and the proportion of burst phrase e in sub-time window T_m is calculated as d(e, m) = f₁(e, m)/f₂(e).
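The sub-time-window proportion d(e, m) can be illustrated with a short Python sketch. It assumes that each text containing burst phrase e is represented only by a numeric timestamp and that the time window is the half-open interval [window_start, window_end); the function and parameter names are illustrative and are not part of the described method.

```python
def sub_window_ratios(timestamps, window_start, window_end, num_sub_windows):
    """Return [d(e, 1), ..., d(e, M)] for one burst phrase e.

    timestamps: publication times of the texts that contain e.
    """
    width = (window_end - window_start) / num_sub_windows
    counts = [0] * num_sub_windows  # f1(e, m) for each sub-time window
    for t in timestamps:
        if not (window_start <= t < window_end):
            continue  # ignore texts outside the time window T
        m = min(int((t - window_start) / width), num_sub_windows - 1)
        counts[m] += 1
    total = sum(counts)  # f2(e): texts containing e over the whole window
    if total == 0:
        return [0.0] * num_sub_windows
    return [c / total for c in counts]
```

Because every counted text falls into exactly one sub-time window, the returned proportions sum to 1 whenever at least one text contains the phrase.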

The similarity between every pair of burst phrases in the burst phrase set B is then calculated as follows:

The burst phrases are clustered according to the similarity between burst phrases to obtain K burst phrase clusters C_k, k = 1, 2, …, K; each burst phrase cluster C_k is a burst event within the time window T.

In this embodiment, the similarity between text sets is the TF-IDF (term frequency-inverse document frequency) similarity. TF-IDF is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for assessing how important a word is to one document in a document collection or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus. The TF-IDF similarity of two text sets is calculated in this embodiment as follows:

1) Segment the two text sets into words, and merge the two resulting word sets to obtain a word set φ.

2) For each word i in the word set φ, i = 1, 2, …, |φ|, where |φ| is the number of words in φ, calculate its TF-IDF values p_{i,1} and p_{i,2} in the two text sets. The TF-IDF value is calculated as follows: for word i, count its term frequency TF(i) in the text set and compute its inverse text frequency IDF(i) = log(D/D_i), where D is the number of texts in the text set and D_i is the number of texts in the text set that contain word i; the TF-IDF value is TF(i) × IDF(i).

3) Construct the TF-IDF vector of each text set from the TF-IDF values of the words: P₁ = (p_{1,1}, p_{2,1}, …, p_{|φ|,1}) and P₂ = (p_{1,2}, p_{2,2}, …, p_{|φ|,2}).

4) Calculate the cosine similarity between the TF-IDF vectors of the two text sets and take it as the similarity between the text sets.

In this embodiment, the burst phrases are clustered with the Jarvis-Patrick clustering algorithm, which clusters on the basis of the similarity between burst phrases. Briefly, an SNN (shared nearest neighbor) similarity graph is built from the burst phrase similarities, the graph is sparsified with a similarity threshold, and the connected components of the sparsified graph are taken as the clustering result.
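The four-step text set similarity described above can be sketched in Python as follows. This is only an illustrative sketch: it assumes lowercase whitespace tokenization in place of a real word segmenter, uses the raw word count as TF(i), treats a word that is absent from a text set as having TF-IDF value 0, and returns 0 when either vector is all zeros. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(texts, vocabulary):
    """TF-IDF values of each vocabulary word within one text set."""
    tokenized = [t.lower().split() for t in texts]
    tf = Counter(w for tokens in tokenized for w in tokens)       # TF(i)
    df = Counter(w for tokens in tokenized for w in set(tokens))  # D_i
    num_texts = len(tokenized)                                    # D
    return [tf[w] * math.log(num_texts / df[w]) if df[w] else 0.0
            for w in vocabulary]


def text_set_similarity(texts_a, texts_b):
    """Cosine similarity between the TF-IDF vectors of two text sets."""
    # Step 1: merged word set phi (sorted only to fix the vector order).
    vocabulary = sorted({w for t in list(texts_a) + list(texts_b)
                         for w in t.lower().split()})
    # Steps 2 and 3: one TF-IDF vector per text set.
    p1 = tfidf_vector(texts_a, vocabulary)
    p2 = tfidf_vector(texts_b, vocabulary)
    # Step 4: cosine similarity.
    dot = sum(x * y for x, y in zip(p1, p2))
    norm1 = math.sqrt(sum(x * x for x in p1))
    norm2 = math.sqrt(sum(y * y for y in p2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Note that with IDF(i) = log(D/D_i), a word that occurs in every text of a set receives a TF-IDF value of 0 in that set, so very small text sets can yield all-zero vectors; the sketch returns 0 in that case rather than dividing by zero.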

Table 2 shows the burst phrase clusters obtained by clustering the burst phrases in this embodiment.

TABLE 2
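For readers unfamiliar with Jarvis-Patrick clustering, the procedure outlined in this embodiment (build a shared nearest neighbor graph, sparsify it with a threshold, take connected components) can be sketched as below. The sketch follows one common formulation, in which two items are linked only if each is in the other's k-nearest-neighbor list and they share at least `min_shared` neighbors. The parameters `k` and `min_shared` and all names are assumptions for illustration; the embodiment does not state its parameter values.

```python
def jarvis_patrick(items, similarity, k, min_shared):
    """Cluster items with a shared nearest neighbor (SNN) graph.

    items: list of hashable items (for example, burst phrases).
    similarity: function (a, b) -> float, higher means more similar.
    """
    # k nearest neighbors of each item by similarity.
    neighbors = {}
    for a in items:
        others = sorted((b for b in items if b != a),
                        key=lambda b: similarity(a, b), reverse=True)
        neighbors[a] = set(others[:k])

    # Sparsified SNN graph: keep an edge only between mutual neighbors
    # that share at least min_shared nearest neighbors.
    adjacency = {a: set() for a in items}
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a in neighbors[b] and b in neighbors[a]:
                if len(neighbors[a] & neighbors[b]) >= min_shared:
                    adjacency[a].add(b)
                    adjacency[b].add(a)

    # Connected components of the sparsified graph are the clusters.
    clusters, seen = [], set()
    for start in items:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters
```

In the setting of this embodiment, `items` would be the burst phrase set B and `similarity` the burst phrase similarity S(e_a, e_b); an item with no qualifying edge ends up in a cluster of its own.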

S104: Calculating the TF-IDF score of each burst event:

For the burst event corresponding to each burst phrase cluster C_k, the TF-IDF score score(v, n) of each burst phrase v in C_k on the knowledge graph G_n of each domain category is calculated from the text data covered by the knowledge graph G_n constructed in step S101, and the scores are then summed to obtain the TF-IDF score score(k, n) of the burst event on the knowledge graph G_n:

score(k, n) = Σ_{v∈C_k} score(v, n)

Table 3 shows the TF-IDF scores of the events corresponding to the burst phrase clusters on the domain category knowledge graphs in this embodiment.

TABLE 3

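S105: Event marking and classification:

A TF-IDF score threshold is preset. For the burst event corresponding to burst phrase cluster C_k, if its TF-IDF score score(k, n) on the knowledge graph G_n of a domain category is greater than the threshold, the event is marked with that domain category, which determines the marking and classification of the event.

With the score threshold set for this embodiment, the event is marked and classified as "Military" and "People", which is consistent with human observation.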
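Steps S104 and S105 together amount to a sum followed by a threshold test, which naturally allows one event to receive several categories. A minimal Python sketch is given below. It assumes the per-phrase scores score(v, n) have already been computed and are passed in as a nested dictionary; the function name, the data layout, and the numbers in the usage example are hypothetical and are not taken from Table 3.

```python
def classify_event(cluster, phrase_scores, threshold):
    """Sum per-phrase scores per category and keep categories above threshold.

    cluster: burst phrases v of one burst phrase cluster C_k.
    phrase_scores: {category n: {phrase v: score(v, n)}} (assumed layout).
    Returns (event_scores, labels), where event_scores[n] is score(k, n).
    """
    event_scores = {
        category: sum(scores.get(v, 0.0) for v in cluster)
        for category, scores in phrase_scores.items()
    }
    labels = [c for c, s in event_scores.items() if s > threshold]
    return event_scores, labels


# Usage with made-up scores, for illustration only.
example_scores = {
    "Military": {"phrase_1": 0.5, "phrase_2": 0.25},
    "People": {"phrase_1": 0.5},
    "Geography": {"phrase_2": 0.125},
}
scores, labels = classify_event(["phrase_1", "phrase_2"], example_scores, 0.25)
```

An event whose score stays at or below the threshold for every category receives no label, which is how events of no interest can be filtered out.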

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims

1. An automatic event marking and classifying method based on knowledge graph is characterized by comprising the following steps:

S2: Presetting a time window T, crawling each text published on social media within the time window, extracting key phrases from each text, and forming a key phrase set A from the extracted key phrases; calculating the burst degree W_s of each key phrase s in the key phrase set A by the following formula:

W_s＝p_s×log(u_s)×log(r_s)×log(log(f_s))

S3: Dividing the time window T evenly into M consecutive, disjoint sub-time windows, the m-th sub-time window being denoted T_m; for each burst phrase e ∈ B, counting the text set text(e, m) and the number of texts f₁(e, m) containing the burst phrase e in each sub-time window T_m, as well as the number of texts f₂(e) containing the burst phrase e over the whole time window T, and calculating the proportion of burst phrase e in sub-time window T_m as d(e, m) = f₁(e, m)/f₂(e);

Clustering the burst phrases according to the similarity between burst phrases to obtain K burst phrase clusters C_k, k = 1, 2, …, K, each burst phrase cluster C_k being a burst event within the time window T;

S4: For the burst event corresponding to each burst phrase cluster C_k, calculating, from the text data covered by the knowledge graph G_n of each domain category constructed in step S1, the TF-IDF score score(v, n) of each burst phrase v in C_k on the knowledge graph G_n, and then summing to obtain the TF-IDF score score(k, n) of the burst event on the knowledge graph G_n:

score(k, n) = Σ_{v∈C_k} score(v, n)

S5: Presetting a TF-IDF score threshold

2. The automated event labeling and classification method according to claim 1, wherein the text set similarity in step S3 is TF-IDF similarity, and the calculation method comprises the following steps:

1) respectively segmenting the two text sets, and combining the two obtained word sets to obtain a word set phi;

2) for each word i in the word set φ, i = 1, 2, …, |φ|, where |φ| represents the number of words in the word set φ, calculating its TF-IDF values p_{i,1} and p_{i,2} in the two text sets;

3) constructing the TF-IDF vector of each text set from the TF-IDF values of the words: P₁ = (p_{1,1}, p_{2,1}, …, p_{|φ|,1}) and P₂ = (p_{1,2}, p_{2,2}, …, p_{|φ|,2});

4) And calculating cosine similarity between TF-IDF vectors corresponding to the two text sets, namely taking the cosine similarity as the similarity between the text sets.

3. The automated event labeling and classification method according to claim 1, wherein the clustering of the burst phrases in step S3 adopts a Jarvis-Patrick clustering algorithm.