CN111090994A

CN111090994A - Chinese-internet-forum-text-oriented event place attribution province identification method

Info

Publication number: CN111090994A
Application number: CN201911101388.3A
Authority: CN
Inventors: 陈进东; 刘琳琳; 杜雨璇; 张健; 齐林
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2019-11-12
Filing date: 2019-11-12
Publication date: 2020-05-01

Abstract

The invention relates to a method for identifying the province to which the event location of a Chinese Internet forum text belongs, comprising the following steps. Step one, text word segmentation: 1. constructing a place-name-to-province query dictionary; 2. Chinese word segmentation based on the jieba tool. Step two, event location identification: 1. extracting and constructing feature values; 2. identifying the text event location; 3. deduplicating multiple event locations. Step three, determining the province: for the identified event location of a forum post text, the place-name-to-province query dictionary is queried directly to determine the province to which the event location belongs. The invention provides a clear approach to word segmentation of complex text and, in particular, to deduplicating multiple event locations and identifying the province to which an event location belongs once event locations have been identified. The method is simple to implement and easy to generalize, and compared with traditional place name identification it offers markedly better granularity and accuracy.

Description

Chinese-internet-forum-text-oriented event place attribution province identification method

Technical Field

The invention relates to the fields of computer science and technology, natural language processing, public opinion analysis, text mining and the like, and in particular to a method for identifying the province to which the event location of a Chinese Internet forum text belongs.

Background

Identifying the province in which the event described by a network text occurred makes it possible to count and analyze the main public opinion events and public opinion conditions of different provinces, to compare public opinion levels and differences across provinces, and to support precise government management and intelligent decision-making. The basis for identifying that province is identifying the event location of the network text. With the help of natural language processing tools, location dictionaries, classification models and the like, some progress has been made in event location identification for network texts such as news.

Chinese web forums are becoming more and more important worldwide, and a large number of posts and comments appear every day. Content published by users often contains rich public opinion and location information, including words of various types such as simple place names, compound place names, organization names, enterprise names and landmark names. From the content published by forum users, the main public opinion events and public opinion conditions of different provinces can be identified, reflecting local public opinion more directly and accurately and supporting government decision-making. However, most forum users write informally and the corpus quality is lower than that of news, so accurately identifying event locations from this large volume of forum text is a difficult problem.

At present there is no patent specifically directed at identifying the province to which the event location of a Chinese Internet forum post text belongs. The patent entitled "An event and place extraction method for Chinese news text" (CN 104731768A) discloses a place name extraction method for news text, implemented by extracting candidate event places, constructing feature vectors, and identifying event places in the news text. However, that patent only identifies the event location; it offers no clear approach to word segmentation of complex text, deduplication of multiple event locations, or identification of the province to which an event location belongs.

Disclosure of Invention

The invention aims to provide a method for identifying the province to which the event location of a Chinese Internet forum text belongs. The method segments the post texts of a Chinese Internet forum with the jieba Chinese word segmentation tool, uses a support vector machine to perform binary classification of the place names obtained from segmentation, and finally determines the province in which the event of the post text occurred by means of a place-name-to-province query dictionary.

In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:

a method for identifying the province to which the event location of a Chinese Internet forum text belongs, comprising the following steps:

step one, text word segmentation

1. Constructing a place-name-to-province query dictionary: a query dictionary mapping 'city, county, town, sub-district office and landmark → province' is constructed from the four-level administrative division place name lexicon, the government agencies and organizations lexicon, and the Chinese scenic spots lexicon in the Sogou input method lexicon collection;

2. Chinese word segmentation based on the jieba tool: the jieba Chinese word segmentation tool is used with a custom dictionary to which the four-level administrative division place name lexicon, the government agencies and organizations lexicon and the Chinese scenic spots lexicon are added, and the content of a forum post text T is segmented in accurate mode;

step two, event location identification

1. Extracting and constructing feature values: part-of-speech tagging is performed with jieba, and the place names, organization names and landmark nouns among the segmented words are extracted to form a candidate event location set W_T; for each element w'_i of set W_T, two features are selected to form its feature vector: the contextual feature c_i of w'_i in post text T and the position feature p_i of w'_i in post text T;

2. Text event location identification

Event locations are manually annotated for 5000 forum post texts; an SVM event location classifier is trained on the contextual features and position features of w'_i in the post text together with the manual annotations, and is then used to classify each feature vector w'_i in set W_T as event location or non-event location, thereby identifying the event location of the post text;

3. multiple event location deduplication

When several event locations are identified, the cosine similarity between the content of the forum post text and each event location is computed using distributed word vectors learned without supervision by a word2vec model, and the event location with the highest cosine similarity is selected as the single event location of the forum post text;

step three, determining the province

For the identified event location of the forum post text, the place-name-to-province query dictionary is queried directly to determine the province to which the event location belongs.

The contextual feature of each w'_i in set W_T within post text T is represented by the weight of the regular expression that w'_i matches, denoted c_i.

(1) If w'_i matches one of the regular expressions in formulas (1) to (10) in post text T, let the k-th regular expression be r_k, where k = 1, ..., 10. The expressions r_k are as follows:

r₁ = ^\w+generation$ (1)

r₂ = ^\，\w$ (2)

r₃ = ^at+\w$ (3)

r₄ = ^\：\w$ (4)

r₅ = ^report+\w$ (5)

r₆ = ^explosive+\w$ (6)

r₇ = ^is+\w$ (7)

r₈ = ^report+\w$ (8)

r₉ = ^name+\w$ (9)

r₁₀ = ^located\w+$ (10)

The weight of a match with the k-th regular expression is given by the value tfidf(k), expressed as

$$c_i = \mathrm{tfidf}_{i,j}(k) = \mathrm{tf}_{i,j}(k) \times \mathrm{idf}_{i,j}(k) \tag{11}$$

Here tfidf is the name of the standard algorithm, tf denotes term frequency, idf denotes inverse document frequency, and k indicates a match with the k-th regular expression.

tf_{i,j}(k) is defined by the following equation:

$$\mathrm{tf}_{i,j}(k) = \frac{n_{i,j}(k)}{N(k)}$$

where n_{i,j}(k) is the number of times w'_i matches the k-th regular expression in post text j, and N(k) is the number of times w'_i matches the k-th regular expression across all post texts in the forum.

idf_{i,j}(k) is defined by the following equation:

$$\mathrm{idf}_{i,j}(k) = \log\frac{|D|}{1 + |\{j : r_k \in d_j\}|}$$

where |D| is the total number of posts in the forum, r_k is the k-th regular expression, and |{j : r_k ∈ d_j}| is the number of posts d_j in the forum's post text set that contain a match for the k-th regular expression; the +1 in the denominator prevents division by zero when no post in the corpus matches the k-th regular expression.

(2) If w'_i matches none of the regular expressions in formulas (1) to (10) in post text T, then c_i = 0.
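As a worked illustration of formula (11), the following computes c_i for one hypothetical candidate. It assumes the term frequency is the ratio of the in-post match count to the forum-wide match count, that the inverse frequency is the logarithm of |D| over one plus the number of posts matching r_k, and that the logarithm is natural (the base is not stated in the text). All numbers are invented for the example.

```latex
\begin{align*}
n_{i,j}(k) &= 3, \quad N(k) = 12, \quad |D| = 1000, \quad |\{j : r_k \in d_j\}| = 99 \\
\mathrm{tf}_{i,j}(k) &= \frac{3}{12} = 0.25 \\
\mathrm{idf}_{i,j}(k) &= \ln\frac{1000}{1 + 99} = \ln 10 \approx 2.3026 \\
c_i &= 0.25 \times 2.3026 \approx 0.5756
\end{align*}
```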

For the position feature p_i there are two cases:

(1) The position information appears in the title of post text T: p_i = 0.99.

(2) The position information appears in the non-title text of post text T. In this case p_i is computed from loc(p_i, T) and length(T), where loc(p_i, T) is the number of words between the start of post text T and the first occurrence of w'_i, and length(T) is the total word count of post text T.

The invention has the beneficial effects that:

The invention provides a method for identifying the province to which the event location of a Chinese Internet forum text belongs, implemented by extracting candidate event locations from a Chinese Internet forum, identifying the event location, and determining the province to which it belongs. The method offers a clear approach to word segmentation of complex text and, in particular, to deduplicating multiple event locations and identifying the province to which an event location belongs once event locations have been identified. The method is simple to implement and easy to generalize, and compared with traditional place name identification it offers markedly better granularity and accuracy.

Drawings

The invention has the following drawings:

FIG. 1 is a schematic flow diagram of the event location province identification process for Chinese Internet forum texts according to the invention;

FIG. 2 is a schematic flow diagram of text classification based on a support vector machine according to the invention;

FIG. 3 is a schematic diagram of the network structure of the word2vec model used in the invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

As shown in FIGS. 1-3, the method for identifying the province to which the event location of a Chinese Internet forum text belongs according to the present invention includes the following steps:

Step one: text word segmentation

1. Constructing a place-name-to-province query dictionary. A query dictionary mapping 'city, county, town, sub-district office and landmark → province' is constructed from the four-level administrative division place name lexicon, the government agencies and organizations lexicon, and the Chinese scenic spots lexicon in the Sogou input method lexicon collection.

2. Chinese word segmentation based on the jieba tool. To ensure higher segmentation accuracy, the jieba Chinese word segmentation tool is used with a custom dictionary (userdict) to which the four-level administrative division place name lexicon, the government agencies and organizations lexicon and the Chinese scenic spots lexicon are added, and the content of a post text T (any given post text in the forum) is segmented in accurate mode.
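The query dictionary of item 1 can be sketched as a plain lookup table. The snippet below is a minimal illustration only: the three small lists stand in for the much larger lexicons named in the text, the helper names `build_province_dictionary` and `lookup_province` are hypothetical, and keeping the first province seen for a repeated name is an assumption, since the text does not say how ambiguous names are handled.

```python
from typing import Dict, Iterable, Optional, Tuple

def build_province_dictionary(*lexicons: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Merge (place name, province) pairs from several lexicons into one table."""
    table: Dict[str, str] = {}
    for lexicon in lexicons:
        for place, province in lexicon:
            # Assumption: the first province seen for a name wins.
            table.setdefault(place, province)
    return table

def lookup_province(table: Dict[str, str], place: str) -> Optional[str]:
    """Return the province for a place name, or None if the name is unknown."""
    return table.get(place)

# Hypothetical sample entries standing in for the three lexicons.
admin_divisions = [("苏州市", "江苏省"), ("昆山市", "江苏省"), ("海淀区", "北京市")]
government_bodies = [("苏州市公安局", "江苏省")]
scenic_spots = [("拙政园", "江苏省"), ("黄山风景区", "安徽省")]

PROVINCE_OF = build_province_dictionary(admin_divisions, government_bodies, scenic_spots)

if __name__ == "__main__":
    print(lookup_province(PROVINCE_OF, "拙政园"))
```

The same table serves step three, where the identified event location is looked up directly.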

Step two: event location identification

1. Extracting and constructing feature values. Part-of-speech tagging is performed with jieba, and the place names, organization names and landmark nouns among the segmented words are extracted to form a candidate event location set W_T. For each element w'_i of set W_T, two features are selected to form its feature vector: the contextual feature c_i of w'_i in post text T and the position feature p_i of w'_i in post text T.

Feature one: contextual feature c_i

(1) Following the idea of the TF-IDF algorithm, if w'_i matches one of the regular expressions in formulas (1) to (10) in post text T, let the k-th regular expression be r_k, where k = 1, ..., 10. The expressions r_k are as follows:

r₁ = ^\w+generation$ (1)

r₂ = ^\，\w$ (2)

r₃ = ^at+\w$ (3)

r₄ = ^\：\w$ (4)

r₅ = ^report+\w$ (5)

r₆ = ^explosive+\w$ (6)

r₇ = ^is+\w$ (7)

r₈ = ^report+\w$ (8)

r₉ = ^name+\w$ (9)

r₁₀ = ^located\w+$ (10)

$$c_i = \mathrm{tfidf}_{i,j}(k) = \mathrm{tf}_{i,j}(k) \times \mathrm{idf}_{i,j}(k) \tag{11}$$

a) tf_{i,j}(k) is defined by the following equation:

$$\mathrm{tf}_{i,j}(k) = \frac{n_{i,j}(k)}{N(k)}$$

where n_{i,j}(k) is the number of times w'_i matches the k-th regular expression in post text j, and N(k) is the number of times w'_i matches the k-th regular expression across all post texts in the forum.

b) idf_{i,j}(k) is defined by the following equation:

$$\mathrm{idf}_{i,j}(k) = \log\frac{|D|}{1 + |\{j : r_k \in d_j\}|}$$

where |D| is the total number of posts in the forum, r_k is the k-th regular expression, and |{j : r_k ∈ d_j}| is the number of posts d_j in the forum's post text set that contain a match for the k-th regular expression; the +1 in the denominator prevents division by zero when no post in the corpus matches the k-th regular expression.

(2) If w'_i matches none of the regular expressions in formulas (1) to (10) in post text T, then c_i = 0.
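The contextual weight c_i can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: the template `在{place}` is a hypothetical stand-in for one of the ten patterns, a pattern is treated as a template around the candidate name, the term frequency is taken as the in-post count over the forum-wide count, and the logarithm is natural. The function names and the tiny corpus are invented for the example.

```python
import math
import re
from typing import List

def count_matches(template: str, target: str, text: str) -> int:
    """Count non-overlapping matches of the template with {place} filled in."""
    return len(re.findall(template.format(place=target), text))

def context_feature(template: str, place: str, post: str, corpus: List[str]) -> float:
    """tf * idf weight of one trigger pattern for one candidate place."""
    literal = re.escape(place)
    n = count_matches(template, literal, post)            # matches in this post
    if n == 0:
        return 0.0                                        # case (2): c_i = 0
    total = sum(count_matches(template, literal, d) for d in corpus)
    if total == 0:
        return 0.0                                        # post is not part of the corpus
    # Posts in which the pattern matches any word, not only this candidate.
    containing = sum(1 for d in corpus if count_matches(template, r"\w+", d) > 0)
    tf = n / total
    idf = math.log(len(corpus) / (1 + containing))        # +1 avoids division by zero
    return tf * idf

# Hypothetical four-post corpus.
CORPUS = [
    "昨天在苏州发生了一起事故，在苏州的朋友注意安全",
    "在苏州工作很多年",
    "今天天气不错",
    "周末去公园散步",
]

if __name__ == "__main__":
    print(context_feature("在{place}", "苏州", CORPUS[0], CORPUS))
```

With a very small corpus the idf term can be zero or negative; a real corpus of forum posts would be far larger than the number of posts matching any single pattern.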

Feature two: position feature p_i

(1) The position information appears in the title of post text T. A single location appearing in the title is in most cases the event location, although reading the corpus shows that titles occasionally contain several place names. A candidate appearing in the title is therefore given a large weight, p_i = 0.99.

(2) The position information appears in the non-title text of post text T. In this case p_i is computed from the position of the first occurrence of w'_i in T relative to the total length of T.
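A minimal sketch of the position feature follows. The title weight of 0.99 comes from the text; the formula for the non-title case is not recoverable from this document, so the linear decay `1 - loc / length` used below is purely an assumption chosen so that earlier mentions score higher. The function name and the token lists are hypothetical.

```python
from typing import List, Optional

TITLE_WEIGHT = 0.99  # stated in the text for candidates found in the title

def position_feature(place: str, title_tokens: List[str],
                     body_tokens: List[str]) -> Optional[float]:
    """Position feature p_i for a candidate place, or None if it never occurs."""
    if place in title_tokens:
        return TITLE_WEIGHT
    if place not in body_tokens:
        return None
    loc = body_tokens.index(place)   # number of words before the first occurrence
    # ASSUMPTION: stand-in formula, not taken from the source.
    return 1.0 - loc / len(body_tokens)

if __name__ == "__main__":
    body = ["昨天", "在", "昆山", "发生", "事故"]
    print(position_feature("昆山", ["交通", "事故"], body))
```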

2. Text event location identification

Event locations are manually annotated in forum post texts; to ensure the accuracy of the classifier, 5000 post texts are annotated. An SVM event location classifier is trained on the contextual features and position features of w'_i in the post text together with the manual annotations, and is then used to classify each feature vector w'_i in set W_T as event location or non-event location, thereby identifying the event location of the post text.
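The binary classification step can be illustrated with a small linear SVM trained by sub-gradient descent on the regularised hinge loss. This is a stand-in written in plain Python so that it runs without dependencies: the text names neither a kernel nor a library, and the six training vectors (c_i, p_i), the hyperparameters and the function names are all hypothetical.

```python
from typing import List, Sequence, Tuple

def train_linear_svm(samples: Sequence[Sequence[float]], labels: Sequence[int],
                     lam: float = 0.01, eta: float = 0.1,
                     epochs: int = 200) -> Tuple[List[float], float]:
    """Fit w, b by sub-gradient descent on hinge loss + L2 penalty; labels are +1/-1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:
                # The sample violates the margin: move the boundary towards it.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (event location) or -1 (non-event location)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical annotated feature vectors (c_i, p_i).
X = [(0.5, 0.99), (0.4, 0.9), (0.3, 0.95), (0.0, 0.2), (0.0, 0.1), (0.05, 0.3)]
Y = [1, 1, 1, -1, -1, -1]

W, B = train_linear_svm(X, Y)

if __name__ == "__main__":
    print(predict(W, B, (0.6, 0.99)), predict(W, B, (0.0, 0.05)))
```

In practice an established SVM library would normally replace the hand-written trainer; the sketch only shows how the two features feed a two-class decision.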

3. Multiple event location deduplication

When several event locations are identified, the cosine similarity between the content of the forum post text and each event location is computed using distributed word vectors learned without supervision by a word2vec model, and the event location with the highest cosine similarity is selected as the single event location of the forum post text.
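The selection among several identified event locations can be sketched as follows. The three-dimensional vectors are toy values standing in for word2vec embeddings, and representing the post content by the mean of its word vectors is an assumption, since the text does not say how the post-level vector is formed. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

def pick_event_place(tokens: List[str], candidates: List[str],
                     vectors: Dict[str, List[float]]) -> str:
    """Keep the candidate whose vector is closest to the post vector."""
    doc_vec = mean_vector(tokens, vectors)
    zero = [0.0] * len(doc_vec)
    return max(candidates, key=lambda p: cosine(doc_vec, vectors.get(p, zero)))

# Toy embeddings; real ones would come from a trained word2vec model.
VECTORS = {
    "苏州": [0.9, 0.1, 0.0],
    "上海": [0.1, 0.9, 0.0],
    "园林": [0.8, 0.2, 0.1],
    "事故": [0.6, 0.3, 0.2],
}

if __name__ == "__main__":
    print(pick_event_place(["苏州", "园林", "事故", "上海"], ["苏州", "上海"], VECTORS))
```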

Step three: home province determination

Matters not described in detail in this specification are within the knowledge of those skilled in the art.

Claims

1. A method for identifying the province to which the event location of a Chinese Internet forum text belongs, characterized by comprising the following steps:

step one, text word segmentation

step two, event location identification

1. Extracting and constructing feature values: part-of-speech tagging is performed with jieba, and the place names, organization names and landmark nouns among the segmented words are extracted to form a candidate event location set W_T; for each element w'_i of set W_T, two features are selected to form its feature vector: the contextual feature c_i of w'_i in post text T and the position feature p_i of w'_i in post text T;

2. Text event location identification

Event locations are manually annotated for 5000 forum post texts; an SVM event location classifier is trained on the contextual features and position features of w'_i in the post text together with the manual annotations, and is then used to classify each feature vector w'_i in set W_T as event location or non-event location, thereby identifying the event location of the post text;

3. multiple event location deduplication

step three, determining the province

2. The method for identifying the province to which the event location of a Chinese Internet forum text belongs as claimed in claim 1, wherein:

r₁ = ^\w+generation$ (1)

r₂ = ^\，\w$ (2)

r₃ = ^at+\w$ (3)

r₄ = ^\：\w$ (4)

r₅ = ^report+\w$ (5)

r₆ = ^explosive+\w$ (6)

r₇ = ^is+\w$ (7)

r₈ = ^report+\w$ (8)

r₉ = ^name+\w$ (9)

r₁₀ = ^located\w+$ (10)

$$c_i = \mathrm{tfidf}_{i,j}(k) = \mathrm{tf}_{i,j}(k) \times \mathrm{idf}_{i,j}(k) \tag{11}$$

wherein tf_{i,j}(k) is defined by the following equation:

$$\mathrm{tf}_{i,j}(k) = \frac{n_{i,j}(k)}{N(k)}$$

and idf_{i,j}(k) is defined by the following equation:

$$\mathrm{idf}_{i,j}(k) = \log\frac{|D|}{1 + |\{j : r_k \in d_j\}|}$$

3. The method for identifying the province to which the event location of a Chinese Internet forum text belongs as claimed in claim 1, wherein for the position feature p_i there are two cases:

(2) The position information appears in the non-title text of the post text T,