CN106874448B

CN106874448B - Method and device for mining earthquake subject term from microblog

Info

Publication number: CN106874448B
Application number: CN201710074352.5A
Authority: CN
Inventors: 张晓东; 陈欣意; 邹再超; 李林; 苏伟; 刘峻明; 朱德海; 孙瑞志
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2017-02-10
Filing date: 2017-02-10
Publication date: 2020-03-06
Anticipated expiration: 2037-02-10
Also published as: CN106874448A

Abstract

The invention provides a method and a device for mining earthquake subject terms from microblogs. The method comprises the following steps: extracting feature words from each microblog text in a microblog text set containing earthquake vocabulary, and calculating the weight of each feature word in the microblog text in which it appears based on a TF-PDF formula; obtaining the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text; and obtaining the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in that microblog text, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as the earthquake subject terms. The method improves the accuracy of the popularity calculation for feature words and the accuracy of extracting hot subject terms from earthquake microblog text data.

Description

Method and device for mining earthquake subject term from microblog

Technical Field

The invention relates to the technical field of seismic information acquisition, in particular to a method and a device for mining seismic subject terms from microblogs.

Background

When an earthquake occurs, a large amount of data about the disaster spreads through microblogs. How to mine and extract hot earthquake subject information from a large number of microblog texts is a problem to be solved. The collected earthquake microblog information consists of short Chinese texts mixed with a large amount of content that carries little information from the public, such as factual news reports and repeated forwards. The earthquake microblog information therefore needs to be screened and sorted, and the subject terms that best represent the public need to be extracted according to an evaluation of the propagation influence of the microblog information.

Term Frequency-Proportional Document Frequency (TF-PDF) is a commonly used weighting technique for information retrieval and text mining. For hot topic extraction, the feature vocabulary that represents a hot topic should appear frequently in a large number of documents. TF-PDF assigns a larger weight to words that appear frequently in many documents across multiple channels, and a smaller weight otherwise, which makes it a feature word weighting method well suited to extracting hot words.

Existing microblog subject term extraction methods still have the following problems when mining hot subject terms from earthquake microblog information:

1. the noise introduced by earthquake news microblogs and by pure forwarding of their content is ignored;

2. numerals, date and time expressions, and geographic location expressions, which can accurately express earthquake information, are not mined and extracted;

3. the combined effect of the activity, propagation force and coverage of each microblog text on the feature words it contains is not considered.

These problems all affect the calculation of the popularity of subject terms, so that the results of mining earthquake subject terms lack accuracy and efficiency.

Disclosure of Invention

The present invention provides a method and apparatus for mining seismic subject terms from microblogs that overcomes or at least partially solves the above-mentioned problems.

According to one aspect of the invention, a method for mining earthquake subject terms from microblogs is provided, and the method comprises the following steps:

S1, extracting feature words from each microblog text in the microblog text set containing the earthquake vocabulary, and calculating the weight of each feature word in the microblog text where the feature word is located based on a TF-PDF formula;

S2, obtaining the influence of each microblog text based on the activity, the propagation force and the coverage corresponding to the microblog text; and

S3, based on the influence of each microblog text and the weight of each feature word in the microblog text, obtaining the popularity of each feature word, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as earthquake subject terms.

According to another aspect of the invention, an apparatus for mining seismic subject terms from microblogs is provided, which includes:

a weight calculation unit, used for extracting feature words from each microblog text in the microblog text set containing the earthquake vocabulary, and calculating the weight of each feature word in the microblog text where the feature word is located based on a TF-PDF formula;

an influence calculation unit, used for obtaining the influence of each microblog text based on the activity, the propagation force and the coverage corresponding to the microblog text; and

a subject term obtaining unit, used for obtaining the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in the microblog text, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as the earthquake subject terms.

The method and the device for mining earthquake subject terms from microblog texts have the following advantages. They take microblogs as the data source and work from the public's perspective, obtaining earthquake subject terms by jointly considering the feature word weights computed with the TF-PDF formula and the influence of each microblog text. This improves the accuracy of the popularity calculation for feature words and the accuracy of extracting hot subject terms from earthquake microblog text data. It also supplies an important supplementary data source and analysis method for research on earthquake information propagation and on earthquake disaster prevention and reduction, and is therefore of strong practical significance.

Drawings

FIG. 1 is a flow chart of mining seismic subject terms from microblogs according to an embodiment of the invention;

FIG. 2 is a screenshot of a microblog with a title in the prior art;

FIG. 3 is a screenshot of a microblog containing a topic in the prior art.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

FIG. 1 shows a flowchart of mining earthquake subject terms from microblogs according to an embodiment of the present invention. As can be seen from FIG. 1, the method includes steps S1 to S3 set out above.

Among the microblog texts retrieved with 'earthquake' as the keyword, much of the content is factual description that does not contain people's feelings: microblogs generated by the earthquake network center or the media, and forwards of them by individuals. Such information is of little value for earthquake microblog analysis, yet its volume and forwarding count are large. The invention considers that such news microblogs have a negative influence on the analysis of earthquake microblog information, so earthquake news microblogs and their forwards are identified and then cleaned and filtered out of the data set.

In one embodiment, the step S1 is preceded by:

collecting microblog texts containing earthquake vocabulary within a certain time range to form a microblog text set; and

removing from the microblog text set the microblog texts containing specific words, specific topics or specific titles, the microblog texts sent by bloggers with media authentication, and the microblog texts that are pure forwards.

By convention, microblog posts follow these formatting rules:

A title is displayed within 【 】 symbols. FIG. 2 shows a microblog containing a title, captured at random from a microblog platform by the inventor. As is apparent from FIG. 2, the title of the microblog, 'earthquake news', is enclosed in the 【 】 symbols.

A topic is displayed between two # symbols. FIG. 3 shows a microblog containing a topic, captured at random from a microblog platform by the inventor. As can be seen clearly in FIG. 3, the topic of the microblog, 'earthquake news in rush', is enclosed between two # symbols.

By checking whether a microblog contains the 【 】 symbols or a pair of # symbols, the title or the topic of the microblog can be obtained.
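The marker search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: it assumes titles are enclosed in full-width 【 】 brackets (ASCII square brackets are also accepted), and the function name `extract_markers` is hypothetical.

```python
import re

# Assumption: titles sit inside 【 】 (or ASCII [ ]), topics between two '#' signs.
TITLE_RE = re.compile(r"[【\[]([^【】\[\]]+)[】\]]")
TOPIC_RE = re.compile(r"#([^#]+)#")


def extract_markers(text):
    """Return the titles and topics found in one microblog text."""
    return {
        "titles": TITLE_RE.findall(text),
        "topics": TOPIC_RE.findall(text),
    }


# Made-up sample text, for illustration only.
sample = "【earthquake news】 #earthquake news in rush# felt strong shaking just now"
print(extract_markers(sample))
```

A text without either marker yields two empty lists, which lets a later filtering step treat it as an ordinary public post.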

In one embodiment, the microblogs to be rejected include the following categories:

(1) microblogs with titles such as 'earthquake news in rush', 'news in rush', and the like;

(2) microblogs containing topics such as '#latest message#', '#earthquake news#', '#earthquake live broadcast#', '#earthquake latest dynamic#', and the like;

(3) microblogs containing vocabulary such as 'China Earthquake Networks', 'China Earthquake Administration', 'Xinhua News Agency flash news', 'statistics', and the like;

(4) microblogs that simply forward any of the above three types of microblogs.

The above microblogs are deleted because the method provided by the invention is intended to mine and extract the earthquake hot words expressed by the public in microblog information, and news information together with its many pure forwards would distort the expression of the public's information.
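The cleaning rules above can be sketched as a simple filter. This is an illustrative sketch under stated assumptions: the `Post` fields, the function names, and the English blocklist strings are hypothetical stand-ins (a real filter would match the original Chinese strings), and, following the removal step described earlier, every pure forward is dropped.

```python
import re
from dataclasses import dataclass

# Hypothetical English stand-ins for the Chinese titles, topics and words.
BLOCKED_TITLES = {"earthquake news in rush", "news in rush"}
BLOCKED_TOPICS = {"latest message", "earthquake news",
                  "earthquake live broadcast", "earthquake latest dynamic"}
BLOCKED_WORDS = ("China Earthquake Networks", "China Earthquake Administration",
                 "Xinhua News Agency flash news", "statistics")

TITLE_RE = re.compile(r"[【\[]([^【】\[\]]+)[】\]]")
TOPIC_RE = re.compile(r"#([^#]+)#")


@dataclass
class Post:
    text: str
    media_verified: bool = False  # blogger holds media authentication
    pure_forward: bool = False    # forwarded without any added comment


def is_news_like(post):
    """True if the post matches any removal rule."""
    titles = [t.strip() for t in TITLE_RE.findall(post.text)]
    topics = [t.strip() for t in TOPIC_RE.findall(post.text)]
    return (
        post.media_verified
        or post.pure_forward
        or any(t in BLOCKED_TITLES for t in titles)
        or any(t in BLOCKED_TOPICS for t in topics)
        or any(w in post.text for w in BLOCKED_WORDS)
    )


def clean(posts):
    """Keep only the posts that express the public's own content."""
    return [p for p in posts if not is_news_like(p)]


posts = [
    Post("【earthquake news in rush】 M5.0 quake reported"),
    Post("#latest message# roads closed"),
    Post("China Earthquake Networks measured magnitude 5.0"),
    Post("felt the house shake", pure_forward=True),
    Post("official update", media_verified=True),
    Post("felt the house shake, hope everyone is safe"),
]
print([p.text for p in clean(posts)])
```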

If a microblog text contains traditional Chinese characters, word segmentation becomes difficult and the segmentation result may be wrong. Therefore, in one embodiment, traditional characters are converted to simplified characters before the microblog text is segmented.

In one embodiment, traditional Chinese characters are converted to simplified Chinese characters using ChineseConverter.dll from the Visual Studio International Pack class library for the .NET Framework.

In one embodiment, the step S1 includes:

S1.1, segmenting each microblog text in the microblog text set into words, and labeling the part of speech of each word; and correspondingly extracting word combinations of different parts of speech in each microblog text based on different regular expressions.

A regular expression is a logical formula for operating on character strings: predefined specific characters and combinations of them form a 'rule string', and this rule string expresses a filtering logic for character strings.

Given a regular expression and another string, the following can be achieved:

1. determining whether the given string conforms to the filtering logic of the regular expression (referred to as 'matching');

2. extracting the specific part that is wanted from the string by means of the regular expression.

Table 1 shows a part-of-speech tagging table in the embodiment of the present invention, which includes code information, examples, and remark information of different parts-of-speech.

Table 1. Part-of-speech tagging table

S1.2, extracting nouns, verbs, quantifiers, numerals and time words from all the word combinations obtained in step S1.1 to serve as the feature words.

S1.3, representing the microblog texts as vectors using a vector space model, wherein the feature words correspond to feature items in the vectors, and calculating the weight of each feature word in each microblog text based on a TF-PDF formula.

Representing the text as vectors with a vector space model allows the text to be processed with vector space operations. When a text is converted into a vector, each word in the document corresponds to a feature item in the vector, the dimensions corresponding to the words in all the documents form the whole space, and the weight is the value along each dimension.

The parts of speech at least comprise nouns, numerals, quantifiers, location words, magnitude words, time words, date words and verbs.
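Steps S1.1 and S1.2 can be illustrated with a short Python sketch. It reuses the third regular expression given later in this description (written with ASCII `?:` groups) and assumes that segmented text is laid out as whitespace-separated `word/tag` tokens. The tag mapping (n = noun, v = verb, q = quantifier, m = numeral, t = time word), the function names, and the sample sentence are assumptions for illustration, not details confirmed by the source.

```python
import re
from typing import List

# Third regular expression of the embodiment: a verb followed by optional
# noun, numeral and quantifier tokens.
VERB_COMBO = re.compile(
    r"(?:\S*/v\s|\S*/vi\s)(?:\S*/n\s|)(?:\S*/m\s\S*/m\s|\S*/m\s|)"
    r"(?:\S*/qt\s|\S*/q\s|)(?:\S*/m\s|\S*/vn\s|)(?:\S*/n\s|)"
)

# Assumed tag codes for the feature-word parts of speech.
FEATURE_TAGS = {"n", "v", "q", "m", "t"}


def extract_combinations(tagged: str) -> List[str]:
    """S1.1 (second half): pull word combinations out of tagged text."""
    # The pattern expects whitespace after every token, so pad the end.
    return [m.strip() for m in VERB_COMBO.findall(tagged + " ")]


def feature_words(combinations: List[str]) -> List[str]:
    """S1.2: keep only words whose tag is one of the feature tags."""
    words = []
    for combo in combinations:
        for token in combo.split():
            word, _, tag = token.rpartition("/")
            if tag in FEATURE_TAGS:
                words.append(word)
    return words


# Made-up tagged sentence: "Sichuan/ns occur/v earthquake/n".
tagged = "四川/ns 发生/v 地震/n"
combos = extract_combinations(tagged)
print(combos)
print(feature_words(combos))
```

In a full pipeline the four expressions would each be applied to the same tagged text and their results pooled before the feature words are selected.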

The invention considers that the main factors for determining the influence of the microblog comprise the following aspects:

Popularity and influence of the blogger

Generally, the higher a user's popularity, the greater the user's influence, the more likely the microblogs released by the user are to attract attention, the more easily the released messages reach a wide propagation range, and the greater the influence of the information on the topic.

The daily average influence of a blogger can be measured quantitatively and evaluated comprehensively through indexes such as the number of visits to the blogger's microblogs, the comments and forwards they receive, the number of the blogger's active fans, and the degree of authentication.

Number of fans

The extent to which content sent by a user is heard in the first round is largely determined by the number of fans. The wider the first-round audience, the greater the possibility of being forwarded, because the possibility of reaching second-round and third-round audiences increases accordingly. The scope and influence of later-stage information dissemination are largely determined by the size of the first link of the chain, in which information propagates rapidly.

Number of accounts followed by the fans

The number of accounts that the fans follow is a negative factor for microblog influence. The more content a fan follows, the more interfering information the fan receives, the more likely the fan is to filter information out, and the lower the probability that the microblog is forwarded. If the number of forwards decreases, the influence of the information itself weakens accordingly.

Quality of fans

On Sina Weibo, fan quality is reflected by VIP real-name authenticated users, who are marked with the letter 'V'. The average number of fans of VIP real-name authenticated users is usually far greater than that of ordinary users, and the celebrity effect has a very large influence on the information propagation process. The higher the quality of the fans, the more users the content can reach, the wider the range that the information can reach through potential second-round and further rounds of listening, and the greater the influence of the information.

Activity of the blogger and the fans

The activity of the blogger determines the amount of information sent. The higher the activity, the more information is sent, and the lower the probability that any single piece of information is received. This is easily observed among ordinary users without strong influence, who are often highly active.

If the activity of the fans is low, then even if the blogger has high popularity and a large number of fans, the influence of the information sent by the blogger is relatively weakened.

Influence of the microblog platform itself

The influence of microblog platforms differs according to the scale, reach and user activity of each platform.

In summary, the invention provides an index of comprehensive microblog influence, which comprises activity, propagation force and coverage.

In one embodiment, the step S2 includes:

Obtaining the activity corresponding to the microblog text based on the sum of the average number of microblogs sent per day and the average number of forwards with comments per day by the blogger of the microblog text within the time range.

For example, suppose the blogger sends 5 microblogs per day on average, forwards 3 microblogs with comments, and forwards 2 microblogs without comments. Pure forwarding is not counted toward activity, so the activity is 5 + 3 = 8.

And obtaining the propagation force corresponding to the microblog text based on the sum of the number of times the microblog text has been forwarded and the number of comments it has received.

And obtaining the coverage corresponding to the microblog text based on the number of active fans of the blogger of the microblog text.

And setting 3 influence parameters, corresponding respectively to the activity, the propagation force and the coverage, based on the time elapsed since the earthquake.

In one embodiment, the parameters can be adjusted according to actual needs. For example, in the early stage of an earthquake disaster, coverage represents the propagation potential of a microblog and is relatively more important, so the influence parameter corresponding to coverage can be increased; when the disaster and the rescue work are nearing their end, propagation force is more important, so the influence parameter corresponding to propagation force can be increased.

Obtaining the influence of each microblog text based on the activity, the propagation force, the coverage and the 3 influence parameters corresponding to the microblog text:

$$p = a \cdot \text{activity} + b \cdot \text{propagation force} + c \cdot \text{coverage}$$

wherein p is the influence, and a, b and c are the influence parameters corresponding respectively to the activity, the propagation force and the coverage.
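The influence calculation of step S2 can be sketched as follows. This is a minimal illustration under stated assumptions: the field names, the reading of propagation force as forwards plus comments, and the phase-dependent parameter values in `PHASE_PARAMS` are all hypothetical, since the source gives no concrete values for a, b and c.

```python
from dataclasses import dataclass


@dataclass
class PostStats:
    avg_daily_posts: float             # blogger's average microblogs sent per day
    avg_daily_comment_forwards: float  # blogger's average forwards with comments per day
    forwards: int                      # times this microblog was forwarded
    comments: int                      # comments this microblog received
    active_fans: int                   # number of the blogger's active fans


def activity(s: PostStats) -> float:
    # Pure forwards are deliberately excluded from activity.
    return s.avg_daily_posts + s.avg_daily_comment_forwards


def propagation(s: PostStats) -> float:
    # Assumed reading: forwards plus comments of this microblog.
    return s.forwards + s.comments


def coverage(s: PostStats) -> float:
    return s.active_fans


# Hypothetical parameter sets (a, b, c): coverage weighs more early on,
# propagation force weighs more toward the end of the rescue phase.
PHASE_PARAMS = {"early": (0.2, 0.3, 0.5), "late": (0.2, 0.5, 0.3)}


def influence(s: PostStats, phase: str) -> float:
    a, b, c = PHASE_PARAMS[phase]
    return a * activity(s) + b * propagation(s) + c * coverage(s)


# Worked example from the text: 5 posts + 3 forwards with comments = activity 8.
stats = PostStats(5, 3, forwards=40, comments=10, active_fans=200)
print(activity(stats), influence(stats, "early"), influence(stats, "late"))
```

Because the three components are raw counts on different scales, the choice of a, b and c determines how strongly each one dominates the result.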

In one embodiment, the popularity in step S3 is calculated by the following formula:

wherein q(j, t) represents the popularity of the feature word j in the time range t, D represents the microblog text set in the time range t, p(d) is the influence of the microblog text d, and w_{d,j} represents the TF-PDF weight of the feature word j in the microblog text d.
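One reading consistent with these symbol definitions is an influence-weighted sum over the text set, q(j, t) = Σ_{d∈D} p(d) · w_{d,j}. The sketch below implements that reading together with the descending sort of step S3. The summation form is an assumption rather than the patent's stated formula, and the function names and sample values are hypothetical.

```python
from collections import defaultdict


def popularity(weights, influence):
    """Assumed form: q(j) = sum over texts d of p(d) * w[d][j].

    weights:   {text_id: {feature_word: TF-PDF weight}}
    influence: {text_id: influence p(d)}
    """
    q = defaultdict(float)
    for d, word_weights in weights.items():
        for word, w in word_weights.items():
            q[word] += influence[d] * w
    return dict(q)


def top_terms(q, k):
    """Sort feature words by popularity, descending, and keep the top k."""
    return sorted(q, key=q.get, reverse=True)[:k]


# Made-up weights and influences for two microblog texts.
weights = {
    "d1": {"aftershock": 0.5, "rescue": 0.25},
    "d2": {"rescue": 0.5},
}
influence = {"d1": 2.0, "d2": 4.0}

q = popularity(weights, influence)
print(q)
print(top_terms(q, 1))
```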

In one embodiment, the calculation formula of the occurrence frequency of the feature word i in the microblog text d is as follows:

$$tf_i = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

wherein n_{i,j} represents the number of occurrences of the feature word i in the microblog text d, and \sum_k n_{k,j} represents the total number of occurrences of all the feature words in the microblog text d.
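Under the usual reading of these definitions (the count of feature word i divided by the total count of all feature words in the same text), the occurrence frequency can be computed as in the following sketch. The function name and the sample words are illustrative.

```python
from collections import Counter


def term_frequency(feature_words):
    """Frequency of each feature word within a single microblog text."""
    counts = Counter(feature_words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Made-up feature words extracted from one text.
tf = term_frequency(["earthquake", "rescue", "earthquake", "aftershock"])
print(tf)
```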

In an embodiment, the correspondingly extracting vocabulary combinations of different parts of speech of each microblog text based on different regular expressions includes:

extracting a combination of nouns, numerals or quantifiers in the microblog text based on the first regular expression:

"(?:\S*/n\s\S*/n\s|\S*/n\s)(?:\S*/m\s\S*/m\s|\S*/m\s)(?:\S*/n\s|\S*/qv\s|\S*/q\s|)(?:\S*/n\s|)|(?:\S*/mq\s|\S*/m\s\S*/m\s|\S*/m\s)(?:\S*/qt\s|\S*/qv\s|\S*/q\s|)(?:\S*/m\s|\S*/ns\s|)(?:\S*/n\s|)"

extracting a combination of location words, magnitude words or time words in the microblog text based on the second regular expression:

"(?:\S*/ns\s\S*/ns\s\S*/ns\s\S*/ns\s\S*/ns\s|\S*/ns\s\S*/ns\s\S*/ns\s\S*/ns\s|\S*/ns\s\S*/ns\s\S*/ns\s|\S*/ns\s\S*/ns\s|\S*/ns\s)(?:\S*/v\s|\S*/n\s|)(?:\S*/m\s|)(?:\S*/t\s\S*/t\s\S*/t\s|\S*/t\s\S*/t\s|\S*/t\s|\S*/qt\s|)(?:\S*/m\s|)(?:\S*/v\s|\S*/q\s|\S*/t\s|)(?:\S*/n\s|)(?:\S*/vi\s|)"

extracting a combination of verbs, nouns or quantifiers in the microblog text based on the third regular expression:

"(?:\S*/v\s|\S*/vi\s)(?:\S*/n\s|)(?:\S*/m\s\S*/m\s|\S*/m\s|)(?:\S*/qt\s|\S*/q\s|)(?:\S*/m\s|\S*/vn\s|)(?:\S*/n\s|)"

extracting a combination of date words or time words in the microblog text based on a fourth regular expression:

In one embodiment, the TF-PDF formula is:

$$w_{d,i} = tf_i \cdot \exp\left(\frac{df_i}{D}\right)$$

wherein w_{d,i} represents the weight of the feature word i in the microblog text d, tf_i represents the occurrence frequency of the feature word i in the microblog text d, df_i represents the number of microblogs in the microblog text set that contain the feature word i, and D is the total number of microblog texts in the microblog text set.
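The weight formula above can be sketched directly in Python. This is a minimal illustration that follows the formula as stated in this description (term frequency multiplied by the exponential of the document proportion); the function name, the input layout, and the sample words are assumptions.

```python
import math
from collections import Counter


def tf_pdf_weights(docs):
    """Compute w[d][i] = tf_i * exp(df_i / D) for every text d.

    docs: a list of feature-word lists, one list per microblog text.
    Returns one {feature_word: weight} dict per text, in the same order.
    """
    total_docs = len(docs)

    # df_i: number of texts that contain feature word i at least once.
    df = Counter()
    for words in docs:
        df.update(set(words))

    weights = []
    for words in docs:
        counts = Counter(words)
        n = sum(counts.values())
        weights.append({
            word: (c / n) * math.exp(df[word] / total_docs)
            for word, c in counts.items()
        })
    return weights


# Made-up feature words for two microblog texts.
docs = [["quake", "quake", "rescue"], ["quake"]]
for w in tf_pdf_weights(docs):
    print(w)
```

Each returned dict is the sparse form of one text's vector in the vector space model: the keys are the feature items and the values are their weights.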

The invention also provides a device for mining earthquake subject terms from microblogs, which comprises:

the weight calculation unit is used for extracting a feature word from each microblog text in the microblog text set and calculating the weight of each feature word in each microblog text based on a TF-PDF formula;

In one embodiment, the weight calculation unit includes:

the vocabulary combination module is used for segmenting each microblog text in the microblog text set, marking the part of speech of each vocabulary, and correspondingly extracting vocabulary combinations of different parts of speech in each microblog text based on different regular expressions;

the feature word acquisition module is used for extracting nouns, verbs, quantifiers, numerals and time words from all the word combinations obtained by the vocabulary combination module to serve as the feature words; and

the weight obtaining module is used for representing the microblog texts as vectors using a vector space model, wherein the feature words correspond to feature items in the vectors, and calculating the weight of each feature word in each microblog text based on a TF-PDF formula;

wherein the parts of speech at least comprise nouns, numerals, quantifiers, location words, magnitude words, time words, date words and verbs.

In one embodiment, the influence calculation unit is specifically configured to:

based on the sum of the average microblog sending number and the average comment forwarding number of the bloggers of the microblog texts in the time range, obtaining the activity corresponding to the microblog texts;

obtaining the propagation force corresponding to the microblog text based on the sum of the number of times the microblog text has been forwarded and the number of comments it has received;

acquiring coverage corresponding to the microblog text based on the number of active fans of the bloggers of the microblog text;

setting 3 influence parameters, corresponding respectively to the activity, the propagation force and the coverage, based on the time elapsed since the earthquake; and

obtaining the influence of each microblog text based on the activity, the propagation force, the coverage and the 3 influence parameters corresponding to the microblog text.

Finally, the method of the present application is only a preferred embodiment and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for mining earthquake subject terms from microblogs is characterized by comprising the following steps:

S3, obtaining the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in the microblog text, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as earthquake subject terms;

the feature words comprise nouns, verbs, quantifiers, numerals and time words;

the step S1 is preceded by:

removing microblog texts containing specific words, specific topics or specific titles, microblog texts sent by bloggers with media authentication and microblog texts which are simply forwarded from the microblog text set;

wherein the specific vocabulary includes one or more of: China Earthquake Networks, China Earthquake Administration, statistics, and Xinhua News Agency flash news;

the specific topics include one or more of: latest message, earthquake live broadcast, and earthquake latest dynamic;

the specific titles include one or more of: earthquake news in rush and news in rush.

2. The method of claim 1, wherein the step S1 includes:

S1.1, segmenting each microblog text in the microblog text set into words, labeling the part of speech of each word, and correspondingly extracting word combinations of different parts of speech in each microblog text based on different regular expressions;

S1.2, extracting nouns, verbs, quantifiers, numerals and time words from all the word combinations obtained in step S1.1 to serve as the feature words; and

S1.3, representing the microblog texts as vectors using a vector space model, wherein the feature words correspond to feature items in the vectors, and calculating the weight of each feature word in each microblog text based on a TF-PDF formula.

3. The method of claim 1, wherein the step S2 includes:

4. The method of claim 1, wherein the popularity in the step S3 is calculated by the formula:

5. The method of claim 2, wherein correspondingly extracting vocabulary combinations of different parts of speech in each microblog text based on different regular expressions comprises:

extracting a combination of nouns, numerals or quantifiers in the microblog text based on the first regular expression;

extracting a combination of verbs, nouns or quantifiers in the microblog text based on the third regular expression; and

extracting a combination of date words or time words in the microblog text based on the fourth regular expression.

6. The method of claim 2, wherein the TF-PDF formula is:

$$w_{d,i} = tf_i \cdot \exp\left(\frac{df_i}{D}\right)$$

7. The method of claim 2, wherein step S1.1 is preceded by: and converting the microblog text in the traditional Chinese format into the simplified Chinese format.

8. The method of claim 6, wherein the frequency of occurrence of the feature word i in the microblog text d is calculated by the formula:

$$tf_i = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

wherein n_{i,j} represents the number of occurrences of the feature word i in the microblog text d, and \sum_k n_{k,j} represents the total number of occurrences of all the feature words in the microblog text d.

9. An apparatus for mining seismic subject terms from microblogs, comprising:

the subject term obtaining unit is used for obtaining the popularity of each feature term based on the influence of each microblog text and the weight of each feature term in the microblog text, performing descending order arrangement according to the popularity of each feature term, and taking the feature term with the top rank as an earthquake subject term;

the apparatus is further configured to:

the specific titles include one or more of: earthquake news in rush and news in rush.