CN106874448B - Method and device for mining earthquake subject term from microblog - Google Patents

Info

Publication number
CN106874448B
Authority
CN
China
Prior art keywords
microblog
microblog text
words
text
earthquake
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710074352.5A
Other languages
Chinese (zh)
Other versions
CN106874448A (en
Inventor
张晓东
陈欣意
邹再超
李林
苏伟
刘峻明
朱德海
孙瑞志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Agricultural University
Priority to CN201710074352.5A
Publication of CN106874448A
Application granted
Publication of CN106874448B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention provides a method and a device for mining earthquake subject terms from microblogs. The method comprises the following steps: extracting feature words from each microblog text in a microblog text set containing earthquake vocabulary, and calculating the weight of each feature word in the microblog text in which it appears based on a TF-PDF formula; obtaining the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text; and obtaining the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in that text, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as the earthquake subject terms. The method improves the accuracy of the feature word popularity calculation and of extracting hot subject terms from earthquake microblog text data.

Description

Method and device for mining earthquake subject term from microblog
Technical Field
The invention relates to the technical field of earthquake information acquisition, and in particular to a method and a device for mining earthquake subject terms from microblogs.
Background
When an earthquake occurs, a large amount of data about the disaster spreads through microblogs. How to mine hot earthquake subject information from a large number of microblog texts is a problem to be solved. The collected earthquake microblog information consists of short Chinese texts mixed with a large amount of content that carries little public opinion, such as news reports and repeated forwards. The information therefore needs to be screened and sorted, and the subject terms that best represent the public need to be extracted according to an evaluation of each microblog's propagation influence.
Term Frequency-Proportional Document Frequency (TF-PDF) is a commonly used weighting technique for information retrieval and text mining. For hot topic extraction, the feature words representing a hot topic should appear frequently in a large number of documents. TF-PDF assigns larger weights to words that appear frequently in many documents across multiple channels, and smaller weights otherwise, which makes it a feature word weighting method well suited to extracting hot words.
Existing microblog subject term extraction methods still have the following problems when mining hot subject terms from earthquake microblog information:
1. the noise introduced by earthquake news microblogs and pure forwards of their content is ignored;
2. numerals, date and time expressions, and geographic location expressions, which can express earthquake information precisely, are not mined and extracted;
3. the combined influence of each microblog text's activity, propagation force and coverage on the feature words it contains is not considered.
These problems all affect the calculation of subject term popularity, so that the results of earthquake subject term mining lack accuracy and efficiency.
Disclosure of Invention
The present invention provides a method and apparatus for mining seismic subject terms from microblogs that overcomes or at least partially solves the above-mentioned problems.
According to one aspect of the invention, a method for mining earthquake subject terms from microblogs is provided, comprising the following steps:
S1, extracting feature words from each microblog text in a microblog text set containing earthquake vocabulary, and calculating the weight of each feature word in the microblog text in which it appears based on a TF-PDF formula;
S2, obtaining the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text; and
S3, obtaining the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in that text, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as the earthquake subject terms.
According to another aspect of the invention, a device for mining earthquake subject terms from microblogs is provided, comprising:
a weight calculation unit, configured to extract feature words from each microblog text in a microblog text set containing earthquake vocabulary, and to calculate the weight of each feature word in the microblog text in which it appears based on a TF-PDF formula;
an influence calculation unit, configured to obtain the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text; and
a subject term obtaining unit, configured to obtain the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in that text, to sort the feature words in descending order of popularity, and to take the top-ranked feature words as the earthquake subject terms.
The method and the device for mining earthquake subject terms from microblog texts provided by the invention have the following beneficial effects: taking microblogs as the data source and starting from the public perspective, the earthquake subject terms are obtained by jointly considering the feature word weights obtained from the TF-PDF formula and the influence of the microblog texts. This improves the accuracy of the feature word popularity calculation and of extracting hot subject terms from earthquake microblog text data, and supplies an important supplementary data source and analysis method for information analysis in research on earthquake information propagation and earthquake disaster prevention and mitigation, which is of strong practical significance.
Drawings
FIG. 1 is a flow chart of mining earthquake subject terms from microblogs according to an embodiment of the invention;
FIG. 2 is a screenshot of a prior-art microblog containing a title;
FIG. 3 is a screenshot of a prior-art microblog containing a topic.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
FIG. 1 shows a flow chart of mining earthquake subject terms from microblogs according to an embodiment of the present invention. As can be seen from FIG. 1, the method includes:
S1, extracting feature words from each microblog text in a microblog text set containing earthquake vocabulary, and calculating the weight of each feature word in the microblog text in which it appears based on a TF-PDF formula;
S2, obtaining the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text; and
S3, obtaining the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in that text, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as the earthquake subject terms.
Among the microblog texts retrieved with "earthquake" as the keyword, many are posted by the earthquake network center or the media, or are forwards of such posts by individuals. Their content is mostly factual description and does not express people's feelings, so they have little value for analyzing earthquake microblogs, yet their number and forwarding volume are large. The invention considers that such news microblogs negatively affect the analysis of earthquake microblog information. Earthquake news microblogs and their forwards are therefore identified and then cleaned and filtered out of the data set.
In one embodiment, before step S1 the method further includes:
collecting microblog texts containing earthquake vocabulary within a certain time range to form the microblog text set; and
removing from the microblog text set the microblog texts containing specific words, specific topics or specific titles, the microblog texts sent by media-verified bloggers, and the microblog texts that are pure forwards.
By convention, the posting rules of microblogs are as follows:
Titles are displayed inside 【】 symbols. FIG. 2 shows a microblog containing a title, captured at random from a microblog platform by the inventors; as is apparent from FIG. 2, the title "earthquake news" of the microblog is enclosed in 【】 symbols.
Topics are displayed between two # symbols. FIG. 3 shows a microblog containing a topic, captured at random from a microblog platform by the inventors; as can be clearly seen in FIG. 3, the topic "earthquake news flash" of the microblog is enclosed between two # symbols.
By checking whether a microblog contains the 【】 symbols or a pair of # symbols, its title or topic can be obtained.
In one embodiment, the microblogs to be removed fall into the following categories:
(1) microblogs whose titles contain "earthquake news flash", "news flash", and the like;
(2) microblogs containing topics such as "#latest message#", "#earthquake news#", "#earthquake live broadcast#", "#latest earthquake developments#", and the like;
(3) microblogs containing words such as "China Earthquake Networks", "China Earthquake Administration", "Xinhua News Agency flash", "statistics", and the like;
(4) microblogs that are pure forwards of the above three types of microblogs.
These microblogs are removed because the method provided by the invention is intended mainly to mine the earthquake hot words expressed by the public from microblog information, and news information and its many pure forwards would distort the expression of public information.
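The removal rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the blocked lists are English stand-ins for the Chinese terms, the function name `is_noise` and the `media_verified` and `pure_forward` flags are hypothetical, and titles are assumed to be enclosed in 【】 while topics sit between two # symbols.

```python
import re

# Hypothetical English stand-ins for the title, topic and word lists above.
BLOCKED_TITLES = {"earthquake news flash", "news flash"}
BLOCKED_TOPICS = {"latest message", "earthquake news",
                  "earthquake live broadcast", "latest earthquake developments"}
BLOCKED_WORDS = ("China Earthquake Networks", "China Earthquake Administration",
                 "Xinhua News Agency flash", "statistics")

TITLE_RE = re.compile(r"【([^】]*)】")  # title enclosed in 【】 (assumed convention)
TOPIC_RE = re.compile(r"#([^#]+)#")    # topic enclosed between two # symbols


def is_noise(text, media_verified=False, pure_forward=False):
    """Return True if a microblog should be removed from the text set."""
    if media_verified or pure_forward:
        return True
    if any(t.strip() in BLOCKED_TITLES for t in TITLE_RE.findall(text)):
        return True
    if any(t.strip() in BLOCKED_TOPICS for t in TOPIC_RE.findall(text)):
        return True
    return any(word in text for word in BLOCKED_WORDS)


posts = [
    {"text": "【earthquake news flash】M5.1 quake reported",
     "media_verified": True, "pure_forward": False},
    {"text": "#earthquake news# details follow",
     "media_verified": False, "pure_forward": False},
    {"text": "The house shook badly, hope everyone is safe",
     "media_verified": False, "pure_forward": False},
]
kept = [p["text"] for p in posts
        if not is_noise(p["text"], p["media_verified"], p["pure_forward"])]
```

Only the third post, which carries no blocked title, topic or word and is neither media-verified nor a pure forward, remains in `kept`.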
If a microblog text contains traditional Chinese characters, word segmentation becomes difficult and produces incorrect results. Therefore, in one embodiment, traditional characters are converted to simplified characters before the microblog text is segmented.
In one embodiment, the traditional-to-simplified conversion is performed using ChineseConverter.dll from the Visual Studio International Pack class library for the .NET Framework.
In one embodiment, step S1 includes:
S1.1, segmenting each microblog text in the microblog text set into words and labeling the part of speech of each word; and extracting word combinations of different parts of speech from each microblog text based on different regular expressions.
A regular expression is a logical formula that operates on character strings: predefined specific characters and combinations of them form a "rule string", which expresses a filtering logic for character strings.
Given a regular expression and a string, the following can be achieved:
1. determining whether the given string conforms to the filtering logic of the regular expression (referred to as "matching");
2. extracting the specific parts of interest from the string by means of the regular expression.
Table 1 shows the part-of-speech tagging table of the embodiment of the present invention, which lists the code, examples and remarks for each part of speech.
[Table 1: part-of-speech tagging table (images BDA0001223797390000061 and BDA0001223797390000071 in the original)]
S1.2, extracting nouns, verbs, quantifiers, numerals and time words from all the word combinations obtained in step S1.1 to serve as the feature words.
S1.3, representing the microblog texts as vectors using a vector space model, in which the feature words correspond to the feature items of the vectors, and calculating the weight of each feature word in each microblog text based on the TF-PDF formula.
Representing a text as a vector with the vector space model allows it to be processed with vector space operations. When a text is converted into a vector, each word in the document corresponds to one feature item of the vector, the dimensions corresponding to the words of all documents span the whole space, and the weight is the value along each dimension.
The parts of speech include at least nouns, numerals, quantifiers, location words, magnitudes, time words, date words and verbs.
The invention considers that the main factors determining the influence of a microblog include the following:
Popularity and influence of the blogger
Generally, the more popular a user is, the greater the user's influence, the more likely the user's microblogs are to attract attention, the more easily the user's messages reach a wide audience, and the greater the influence of that information on the topic.
A blogger's average daily influence can be measured quantitatively and evaluated comprehensively through indicators such as the number of visits to the blogger's microblogs, how often they are commented on and forwarded, the blogger's number of active fans, and the blogger's verification level.
Number of fans
The extent to which content sent by a user is first heard is largely determined by the user's number of fans. The wider this first audience, the more likely the content is to be forwarded, because the likelihood of it being forwarded by the second and third audiences rises correspondingly. The reach and influence of later-stage information dissemination are largely determined by the size of the first link of the propagation chain, along which information spreads rapidly.
Number of accounts followed by the fans
The number of accounts a fan follows is a negative factor for microblog influence. The more content a fan follows, the more interfering information the fan receives, the more likely a given message is to be filtered out, and the lower the probability that the microblog is forwarded. As the number of forwards falls, the influence of the information itself weakens.
Quality of the fans
On Sina Weibo, fan quality is reflected by VIP real-name verified users, who are marked with the letter "V". The average number of fans of VIP real-name verified users is usually far greater than that of ordinary users, and the celebrity effect has a very large impact on information propagation. The higher the quality of the fans, the larger the audience reached, the wider the range of potential second-hand and multi-hand dissemination, and the greater the influence of the information.
Activity of the blogger and of the fans
A blogger's activity determines how much information the blogger sends. The higher the activity, the more information is sent and the lower the probability that any single message is received; this is easily observed among ordinary users who are highly active but lack strong influence.
If the fans' activity is low, the influence of the information a blogger sends is weakened accordingly, even if the blogger is popular and has many fans.
Influence of the microblog platform itself
Microblog platforms differ in influence, depending on each platform's scale, reach and user activity.
In summary, the invention proposes an index of the comprehensive influence of a microblog comprising activity, propagation force and coverage.
In one embodiment, step S2 includes:
obtaining the activity corresponding to the microblog text based on the sum of the average number of microblogs posted per day and the average number of forwards with comments per day by the blogger of the microblog text within the time range.
For example, suppose the blogger posts 5 microblogs per day on average, forwards 3 microblogs with comments, and forwards 2 microblogs without comment. Because pure forwards are not counted toward activity, the activity is 5 + 3 = 8.
obtaining the propagation force corresponding to the microblog text based on the sum of the number of forwards and the number of comments the microblog text has received.
obtaining the coverage corresponding to the microblog text based on the number of active fans of the blogger of the microblog text.
setting three influence parameters, corresponding to the activity, the propagation force and the coverage respectively, according to the time of the earthquake.
In one embodiment, the parameters can be adjusted according to actual needs. For example, in the early stage of an earthquake disaster, coverage represents the propagation potential of a microblog and is relatively more important, so the influence parameter corresponding to coverage can be increased; when the disaster and the rescue work draw to a close, propagation force is more important, so the influence parameter corresponding to propagation force is increased.
The influence of each microblog text is obtained from the activity, propagation force and coverage corresponding to the microblog text and the three influence parameters:
$p = a \cdot \text{activity} + b \cdot \text{propagation force} + c \cdot \text{coverage}$
where p is the influence, and a, b and c are the influence parameters corresponding to the activity, the propagation force and the coverage respectively.
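A minimal sketch of the influence calculation follows. The class `PostStats`, its field names and the parameter values are hypothetical, and the raw counts are used without normalization because the text does not say whether the three components are rescaled before weighting.

```python
from dataclasses import dataclass


@dataclass
class PostStats:
    avg_daily_posts: float             # blogger's average posts per day in the time range
    avg_daily_comment_forwards: float  # blogger's average forwards with comments per day
    forwards: int                      # times this microblog was forwarded
    comments: int                      # comments this microblog received
    active_fans: int                   # blogger's number of active fans


def influence(s: PostStats, a: float = 1.0, b: float = 1.0, c: float = 1.0) -> float:
    """p = a * activity + b * propagation force + c * coverage."""
    activity = s.avg_daily_posts + s.avg_daily_comment_forwards
    propagation = s.forwards + s.comments
    coverage = s.active_fans
    return a * activity + b * propagation + c * coverage


stats = PostStats(5, 3, forwards=10, comments=4, active_fans=200)
early = influence(stats, a=1.0, b=1.0, c=2.0)  # early stage: coverage weighted up
late = influence(stats, a=1.0, b=2.0, c=1.0)   # late stage: propagation weighted up
```

With these illustrative numbers the activity is 5 + 3 = 8, matching the worked example in the text; in practice the fan count would dominate unless the components are scaled to comparable ranges.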
In one embodiment, the popularity in step S3 is calculated by the following formula:
[Formula image BDA0001223797390000101: popularity q(j, t)]
where $q(j, t)$ represents the popularity of feature word j in the time range t, D represents the microblog text set in the time range t, $p(d)$ is the influence of microblog text d, and $w_{d,j}$ represents the TF-PDF weight of feature word j in microblog text d.
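The popularity formula itself survives only as an image reference here. One form consistent with the symbol definitions is an influence-weighted sum of the TF-PDF weights over the texts in the time range, $q(j, t) = \sum_{d \in D} p(d) \cdot w_{d,j}$; this is an assumption, not a confirmed reading of the original. Under that assumption, step S3 can be sketched as follows (the function names and the sample data are hypothetical):

```python
from collections import defaultdict


def popularity(texts):
    """texts: iterable of (influence, {feature_word: tf_pdf_weight}) pairs."""
    q = defaultdict(float)
    for p_d, weights in texts:
        for word, w in weights.items():
            q[word] += p_d * w  # assumed form: sum over d of p(d) * w_{d,j}
    return dict(q)


def top_terms(texts, k=3):
    """Sort feature words by descending popularity and keep the top k."""
    q = popularity(texts)
    return sorted(q, key=lambda word: (-q[word], word))[:k]


texts = [
    (2.0, {"aftershock": 0.5, "rescue": 0.25}),
    (4.0, {"rescue": 0.5}),
]
ranked = top_terms(texts, k=2)
```

Ties are broken alphabetically only to keep the ordering deterministic; the source does not specify a tie-breaking rule.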
In one embodiment, the frequency of occurrence of feature word i in microblog text d is calculated as:
$tf_i = \dfrac{n_{i,d}}{\sum_k n_{k,d}}$
where $n_{i,d}$ represents the number of occurrences of feature word i in microblog text d, and $\sum_k n_{k,d}$ represents the total number of occurrences of all feature words in microblog text d.
In one embodiment, extracting word combinations of different parts of speech from each microblog text based on different regular expressions includes:
extracting combinations of nouns, numerals or quantifiers from the microblog text based on the first regular expression:
“(?:\S*/n\s\S*/n\s|\S*/n\s)(?:\S*/m\s\S*/m\s|\S*/m\s)(?:\S*/n\s|\S*/qv\s|\S*/q\s|)(?:\S*/n\s|)|(?:\S*/mq\s|\S*/m\s\S*/m\s|\S*/m\s)(?:\S*/qt\s|\S*/qv\s|\S*/q\s|)(?:\S*/m\s|\S*/ns\s|)(?:\S*/n\s|)”
extracting combinations of location words, magnitudes or time words from the microblog text based on the second regular expression:
“(?:\S*/ns\s\S*/ns\s\S*/ns\s\S*/ns\s\S*/ns\s|\S*/ns\s\S*/ns\s\S*/ns\s\S*/ns\s|\S*/ns\s\S*/ns\s\S*/ns\s|\S*/ns\s\S*/ns\s|\S*/ns\s)(?:\S*/v\s|\S*/n\s|)(?:\S*/m\s|)(?:\S*/t\s\S*/t\s\S*/t\s|\S*/t\s\S*/t\s|\S*/t\s|\S*/qt\s|)(?:\S*/m\s|)(?:\S*/v\s|\S*/q\s|\S*/t\s|)(?:\S*/n\s|)(?:\S*/vi\s|)”
extracting combinations of verbs, nouns or quantifiers from the microblog text based on the third regular expression:
“(?:\S*/v\s|\S*/vi\s)(?:\S*/n\s|)(?:\S*/m\s\S*/m\s|\S*/m\s|)(?:\S*/qt\s|\S*/q\s|)(?:\S*/m\s|\S*/vn\s|)(?:\S*/n\s|)”
extracting combinations of date words or time words from the microblog text based on the fourth regular expression:
“(?:\S*/t\s\S*/t\s\S*/t\s\S*/t\s\S*/t\s|\S*/t\s\S*/t\s\S*/t\s\S*/t\s|\S*/t\s\S*/t\s\S*/t\s|\S*/t\s\S*/t\s|\S*/t\s)(?:\S*/m\s|\S*/ns\s|)(?:\S*/t\s|\S*/qt\s|\S*/q\s|)(?:\S*/n\s|)|\b\S*/m\s\S*/qt\s\S*/t\s\S*/m\s\S*/q\s\S*/t\s\S*/t\s\S*/m\b”
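To illustrate how such expressions operate, the sketch below applies the third regular expression, copied verbatim, to a tagged sentence using Python's `re` module. The input format (each token written as `word/tag` and followed by a space) and the sample sentence are assumptions about the segmenter output, and `extract_combinations` is a hypothetical helper name.

```python
import re

# Third regular expression from the text, copied verbatim.
VERB_PATTERN = re.compile(
    r"(?:\S*/v\s|\S*/vi\s)(?:\S*/n\s|)(?:\S*/m\s\S*/m\s|\S*/m\s|)"
    r"(?:\S*/qt\s|\S*/q\s|)(?:\S*/m\s|\S*/vn\s|)(?:\S*/n\s|)"
)


def extract_combinations(tagged):
    """Return each matched combination as a list of words with tags stripped."""
    return [
        [token.rsplit("/", 1)[0] for token in match.group(0).split()]
        for match in VERB_PATTERN.finditer(tagged)
    ]


# Hypothetical segmenter output: "word/tag" tokens, each followed by a space.
tagged = "四川/ns 发生/v 地震/n 7/m 级/q "
combos = extract_combinations(tagged)
```

The location word tagged `ns` is skipped because the pattern must start at a verb (`/v` or `/vi`); the match then absorbs the following noun, numeral and quantifier as one combination.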
In one embodiment, the TF-PDF formula is:
$w_{d,i} = tf_i \cdot \exp(df_i / D)$
where $w_{d,i}$ represents the weight of feature word i in microblog text d, $tf_i$ represents the frequency of occurrence of feature word i in microblog text d, $df_i$ represents the number of microblogs in the microblog text set that contain feature word i, and D is the total number of microblog texts in the microblog text set.
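The weighting can be sketched as below, taking the term frequency as the word's count divided by the total count of feature words in the same text, as defined earlier. The function name `tf_pdf_weights` and the sample feature word lists are hypothetical.

```python
import math
from collections import Counter


def tf_pdf_weights(docs):
    """docs: list of feature word lists, one per microblog text.

    Returns one {word: weight} dict per text, with
    weight = tf * exp(df / D).
    """
    total_docs = len(docs)
    # df: number of texts that contain each feature word at least once
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        n_total = sum(counts.values())
        weights.append({
            word: (n / n_total) * math.exp(df[word] / total_docs)
            for word, n in counts.items()
        })
    return weights


docs = [["quake", "quake", "rescue"], ["quake", "aftershock"]]
w = tf_pdf_weights(docs)
```

Here "quake" appears in both texts, so its exponential factor is exp(2/2), while "rescue" appears in one of two texts and gets exp(1/2); words shared by more texts are therefore boosted rather than penalized, unlike in TF-IDF.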
The invention also provides a device for mining earthquake subject terms from microblogs, comprising:
a weight calculation unit, configured to extract feature words from each microblog text in the microblog text set and to calculate the weight of each feature word in each microblog text based on a TF-PDF formula;
an influence calculation unit, configured to obtain the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text; and
a subject term obtaining unit, configured to obtain the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in that text, to sort the feature words in descending order of popularity, and to take the top-ranked feature words as the earthquake subject terms.
In one embodiment, the weight calculation unit includes:
a word combination module, configured to segment each microblog text in the microblog text set into words, label the part of speech of each word, and extract word combinations of different parts of speech from each microblog text based on different regular expressions;
a feature word acquisition module, configured to extract nouns, verbs, quantifiers, numerals and time words from all the word combinations obtained in step S1.1 to serve as the feature words; and
a weight obtaining module, configured to represent the microblog texts as vectors using a vector space model, in which the feature words correspond to the feature items of the vectors, and to calculate the weight of each feature word in each microblog text based on the TF-PDF formula;
wherein the parts of speech include at least nouns, numerals, quantifiers, location words, magnitudes, time words, date words and verbs.
In one embodiment, the influence calculation unit is specifically configured to:
obtain the activity corresponding to the microblog text based on the sum of the average number of microblogs posted per day and the average number of forwards with comments per day by the blogger of the microblog text within the time range;
obtain the propagation force corresponding to the microblog text based on the sum of the number of forwards and the number of comments the microblog text has received;
obtain the coverage corresponding to the microblog text based on the number of active fans of the blogger of the microblog text;
set three influence parameters, corresponding to the activity, the propagation force and the coverage respectively, according to the time of the earthquake; and
obtain the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text and the three influence parameters.
Finally, the method of the present application is only a preferred embodiment and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for mining earthquake subject terms from microblogs, characterized by comprising the following steps:
S1, extracting feature words from each microblog text in a microblog text set containing earthquake vocabulary, and calculating the weight of each feature word in the microblog text in which it appears based on a TF-PDF formula;
S2, obtaining the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text; and
S3, obtaining the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in that text, sorting the feature words in descending order of popularity, and taking the top-ranked feature words as the earthquake subject terms;
wherein the feature words comprise nouns, verbs, quantifiers, numerals and time words;
before step S1, the method further comprises:
collecting microblog texts containing earthquake vocabulary within a certain time range to form the microblog text set; and
removing from the microblog text set the microblog texts containing specific words, specific topics or specific titles, the microblog texts sent by media-verified bloggers, and the microblog texts that are pure forwards;
wherein the specific words include one or more of: China Earthquake Networks, China Earthquake Administration, statistics, and Xinhua News Agency flash;
the specific topics include one or more of: latest message, earthquake live broadcast, and latest earthquake developments;
the specific titles include one or more of: earthquake news flash and news flash.
2. The method of claim 1, wherein step S1 includes:
S1.1, segmenting each microblog text in the microblog text set into words, labeling the part of speech of each word, and extracting word combinations of different parts of speech from each microblog text based on different regular expressions;
S1.2, extracting nouns, verbs, quantifiers, numerals and time words from all the word combinations obtained in step S1.1 to serve as the feature words; and
S1.3, representing the microblog texts as vectors using a vector space model, in which the feature words correspond to the feature items of the vectors, and calculating the weight of each feature word in each microblog text based on the TF-PDF formula;
wherein the parts of speech include at least nouns, numerals, quantifiers, location words, magnitudes, time words, date words and verbs.
3. The method of claim 1, wherein step S2 includes:
obtaining the activity corresponding to the microblog text based on the sum of the average number of microblogs posted per day and the average number of forwards with comments per day by the blogger of the microblog text within the time range;
obtaining the propagation force corresponding to the microblog text based on the sum of the number of forwards and the number of comments the microblog text has received;
obtaining the coverage corresponding to the microblog text based on the number of active fans of the blogger of the microblog text;
setting three influence parameters, corresponding to the activity, the propagation force and the coverage respectively, according to the time of the earthquake; and
obtaining the influence of each microblog text based on the activity, propagation force and coverage corresponding to the microblog text and the three influence parameters.
4. The method of claim 1, wherein the popularity in step S3 is calculated by the formula:
[Formula image FDA0002265700020000021: popularity q(j, t)]
where $q(j, t)$ represents the popularity of feature word j in the time range t, D represents the microblog text set in the time range t, $p(d)$ is the influence of microblog text d, and $w_{d,j}$ represents the TF-PDF weight of feature word j in microblog text d.
5. The method of claim 2, wherein correspondingly extracting vocabulary combinations of different parts of speech in each microblog text based on different regular expressions comprises:
extracting a combination of nouns, numerals or quantifiers in the microblog text based on the first regular expression;
extracting a combination of location words, magnitude words or time words in the microblog text based on the second regular expression;
extracting a combination of verbs, nouns or quantifiers in the microblog text based on the third regular expression; and
extracting a combination of date words or time words in the microblog text based on the fourth regular expression.
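The claims do not disclose the regular expressions themselves. The following Python sketch shows one way such an extraction could work on part-of-speech-tagged tokens. The tag letters (`n` noun, `m` numeral, `q` quantifier), the `word/tag` encoding, the pattern `FIRST_PATTERN` and the requirement of at least two consecutive tokens are all assumptions made for illustration.

```python
import re

# Hypothetical "first regular expression": a run of two or more consecutive
# tokens tagged as noun (n), numeral (m) or quantifier (q). The lookbehind
# keeps each repetition aligned to a token boundary.
FIRST_PATTERN = re.compile(r"(?:(?<!\S)\S+/(?:n|m|q)(?:\s+|$)){2,}")


def extract_combinations(tagged_tokens, pattern):
    """Return word combinations whose tag sequence matches the pattern.

    tagged_tokens: list of (word, tag) pairs from any POS tagger.
    """
    # Encode the sentence as "word/tag word/tag ..." so a regex can scan tags.
    text = " ".join(f"{word}/{tag}" for word, tag in tagged_tokens)
    results = []
    for match in pattern.finditer(text):
        words = [tok.rsplit("/", 1)[0] for tok in match.group().split()]
        results.append("".join(words))  # Chinese words are joined without spaces
    return results


tagged = [("发生", "v"), ("6.5", "m"), ("级", "q"), ("地震", "n"), ("了", "u")]
print(extract_combinations(tagged, FIRST_PATTERN))
```

The second to fourth expressions would follow the same shape with a different set of allowed tags.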
6. The method of claim 2, wherein the TF-PDF formula is:
$w_{d,i} = tf_i \cdot \exp(df_i / D)$
wherein w_{d,i} represents the weight of the feature word i in the microblog text d, tf_i represents the frequency of occurrence of the feature word i in the microblog text d, df_i represents the number of microblog texts in the microblog text set that contain the feature word i, and D is the total number of microblog texts in the microblog text set.
7. The method of claim 2, wherein step S1.1 is preceded by: converting microblog texts in traditional Chinese into simplified Chinese.
8. The method of claim 6, wherein the frequency of occurrence of the feature word i in the microblog text d is calculated by the formula:
Figure FDA0002265700020000031
wherein n_{i,j} represents the number of times the feature word i appears in the microblog text d, and \sum_k n_{k,j} represents the total number of occurrences of all feature words in the microblog text d.
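The weighting described in claims 6 and 8 can be sketched in Python as follows. The formula image of claim 8 is not reproduced in this text, so the term frequency is assumed here to be the ratio n_{i,j} / Σ_k n_{k,j} suggested by the symbol definitions; the function name `tf_pdf_weights` and the input layout (one list of feature words per microblog text) are hypothetical.

```python
import math
from collections import Counter


def tf_pdf_weights(docs):
    """Compute w_{d,i} = tf_i * exp(df_i / D) for every text d and word i.

    docs: list of microblog texts, each given as a list of feature words.
    Returns one dict (feature word -> weight) per text, in input order.
    """
    total_docs = len(docs)  # D: total number of microblog texts in the set

    # df_i: number of texts that contain feature word i at least once.
    df = Counter()
    for words in docs:
        df.update(set(words))

    weights = []
    for words in docs:
        counts = Counter(words)      # n_{i,j} for each feature word i
        total = sum(counts.values())  # sum over k of n_{k,j}
        # ASSUMPTION: tf_i is the count of word i divided by the total count.
        weights.append({
            word: (count / total) * math.exp(df[word] / total_docs)
            for word, count in counts.items()
        })
    return weights


docs = [["地震", "地震", "救援"], ["地震"]]
for w in tf_pdf_weights(docs):
    print(w)
```

Unlike TF-IDF, the exponential term rewards words that appear in many texts of the set, which suits a setting where the subject terms are expected to recur across microblogs about the same earthquake.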
9. An apparatus for mining earthquake subject terms from microblogs, comprising:
a weight calculation unit, configured to extract feature words from each microblog text in a microblog text set containing earthquake vocabulary, and to calculate the weight of each feature word in the microblog text in which it appears based on a TF-PDF formula;
an influence calculation unit, configured to obtain the influence of each microblog text based on the activity, the propagation force and the coverage corresponding to the microblog text; and
a subject term obtaining unit, configured to obtain the popularity of each feature word based on the influence of each microblog text and the weight of each feature word in the microblog text, sort the feature words in descending order of popularity, and take the top-ranked feature words as earthquake subject terms;
wherein the feature words comprise nouns, verbs, quantifiers, numerals and time words;
the apparatus is further configured to:
collecting microblog texts containing earthquake vocabularies within a certain time range to form a microblog text set; and
removing, from the microblog text set, microblog texts containing specific vocabulary, specific topics or specific titles, microblog texts posted by media-verified bloggers, and microblog texts that are plain reposts;
wherein the specific vocabulary includes one or more of: China Earthquake Networks, China Earthquake Administration, statistics, and Xinhua News Agency flash news;
the specific topics include one or more of: recent messages, live seismic activity, and recent seismic activity;
the specific titles include one or more of: earthquake rapid report and rapid report.
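The filtering step at the end of claim 9 can be sketched in Python as below. This is only an illustration under stated assumptions: the `Microblog` fields, the function names and the English placeholder phrases are hypothetical, and specific vocabulary, topics and titles are all treated as plain substrings rather than being parsed from topic or title markup.

```python
from dataclasses import dataclass


@dataclass
class Microblog:
    """Hypothetical record for one collected microblog text."""
    text: str
    media_verified: bool   # posted by a blogger with media authentication
    is_plain_repost: bool  # reposted without any added content


def keep(blog, blocked_phrases):
    """Return True if the microblog text should stay in the set."""
    if blog.media_verified or blog.is_plain_repost:
        return False
    # ASSUMPTION: vocabulary, topics and titles are matched as substrings.
    return not any(phrase in blog.text for phrase in blocked_phrases)


def filter_set(blogs, blocked_phrases):
    """Remove official, media and repost noise from the microblog text set."""
    return [b for b in blogs if keep(b, blocked_phrases)]


# Placeholder phrases standing in for the specific vocabulary, topics and titles.
BLOCKED = ["China Earthquake Administration", "rapid report"]
blogs = [
    Microblog("The house shook for ten seconds", False, False),
    Microblog("rapid report: M6.5 earthquake", False, False),
    Microblog("Stay safe everyone", True, False),
    Microblog("Repost", False, True),
]
print([b.text for b in filter_set(blogs, BLOCKED)])
```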
CN201710074352.5A 2017-02-10 2017-02-10 Method and device for mining earthquake subject term from microblog Active CN106874448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710074352.5A CN106874448B (en) 2017-02-10 2017-02-10 Method and device for mining earthquake subject term from microblog

Publications (2)

Publication Number Publication Date
CN106874448A (en) 2017-06-20
CN106874448B (en) 2020-03-06

Family

ID=59166897

Country Status (1)

Country Link
CN (1) CN106874448B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629005B (en) * 2018-05-04 2021-10-22 北京林业大学 Method and device for detecting earthquake emergency subject term
CN108694247B (en) * 2018-05-08 2020-11-20 北京师范大学 Typhoon disaster analysis method based on microblog topic popularity
CN109271509B (en) * 2018-08-23 2021-05-28 武汉斗鱼网络科技有限公司 Live broadcast room topic generation method and device, computer equipment and storage medium
CN110830306B (en) * 2019-11-20 2022-03-29 北京百分点科技集团股份有限公司 Method, device, storage medium and electronic equipment for determining influence of network user
CN112948587A (en) * 2021-03-30 2021-06-11 杭州叙简科技股份有限公司 Microblog public opinion analysis method and device based on earthquake industry and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982157A (en) * 2012-12-03 2013-03-20 北京奇虎科技有限公司 Device and method used for mining microblog hot topics
CN103942340A (en) * 2014-05-09 2014-07-23 电子科技大学 Microblog user interest recognizing method based on text mining
CN104504024A (en) * 2014-12-11 2015-04-08 中国科学院计算技术研究所 Method and system for mining keywords based on microblog content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
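"Research on the Sina Weibo communication strategy of the Chinese website of The Wall Street Journal"; Tang Jingjing; Communication and Copyright (《传播与版权》); 2016-07-15; pp. 103-105 *
"PWSWE: Research on a subject-term extraction algorithm for personal microblogs"; Gao Yongbing et al.; Computer Applications and Software (《计算机应用与软件》); 2015-07-15; Vol. 32, No. 7; Section 2.1, part 4 of Section 2.2, parts 1)-2) of Section 2.4 *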

Also Published As

Publication number Publication date
CN106874448A (en) 2017-06-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant