CN109063983B - Natural disaster damage real-time evaluation method based on social media data - Google Patents


Info

Publication number
CN109063983B
CN109063983B CN201810787884.8A CN201810787884A CN109063983B CN 109063983 B CN109063983 B CN 109063983B CN 201810787884 A CN201810787884 A CN 201810787884A CN 109063983 B CN109063983 B CN 109063983B
Authority
CN
China
Prior art keywords
disaster
data
word
social media
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810787884.8A
Other languages
Chinese (zh)
Other versions
CN109063983A (en)
Inventor
赵吉昌
宁云州
盛浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201810787884.8A
Publication of CN109063983A
Application granted
Publication of CN109063983B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations

Abstract

A natural disaster damage real-time evaluation method based on social media data, comprising the following steps: step 1, acquiring social media text data using natural disasters as keywords; step 2, extracting the text data and processing it to complete word segmentation and emotion labeling of the text data; step 3, establishing a natural disaster related word bank and an anti-word bank; step 4, cleaning the text data based on the natural disaster related word bank, the anti-word bank, and the geographic information of the social media data; and step 5, setting an analysis period, acquiring historical disaster data of the natural disaster in the analysis period, establishing a multiple linear regression model, and iterating until the significance level of every model coefficient is smaller than a set threshold.

Description

Natural disaster damage real-time evaluation method based on social media data
Technical Field
The invention relates to an evaluation method, in particular to a natural disaster damage real-time evaluation method based on social media data.
Background
Natural disaster loss assessment is essential to disaster prevention and control and to rescue and relief work. It is the basis for allocating manpower, funds, and materials for relief, as well as for financial subsidies and insurance compensation, and it plays a decisive role in determining the direction, amount, and engineering scale of investment in disaster prevention and mitigation. Among the many indexes for evaluating natural disaster loss, direct economic loss is the most important basis for disaster relief work. For evaluating the direct economic loss of natural disasters, most prior-art methods are field-measurement evaluations performed after the disaster has occurred. Although relatively accurate, they consume a large amount of time, manpower, and material resources, and suffer from problems such as time lag, double counting, and exaggeration of the disaster situation.
As social media becomes increasingly integrated into daily life, once a natural disaster occurs, large numbers of netizens spontaneously publish disaster-related information on social media and express their emotions. This process can be understood as a form of social sensing in which every netizen is a "sensor" that perceives surrounding events and disseminates relevant information through social media. Against this background, the invention provides a natural disaster loss real-time evaluation method based on social media data, which can make up for the shortcomings of traditional evaluation methods and provide early decision support for work such as rescue and disaster relief.
Disclosure of Invention
A natural disaster damage real-time evaluation method based on social media data, comprising the following steps: step 1, acquiring social media text data using natural disasters as keywords; step 2, extracting the text data and processing it to complete word segmentation and emotion labeling of the text data; step 3, establishing a natural disaster related word bank and an anti-word bank; step 4, cleaning the text data based on the natural disaster related word bank, the anti-word bank, and the geographic information of the social media data; and step 5, setting an analysis period, acquiring historical disaster data of the natural disaster in the analysis period, establishing a multiple linear regression model, and iterating until the significance level of every model coefficient is smaller than a set threshold.
Compared with the prior art, the invention has the following beneficial effects:
The model is data-driven and based on large-scale social media data. As social media becomes integrated into daily life, data collection for the model is convenient, flexible, and timely, which secures the data source and makes real-time evaluation feasible.
The real-time loss estimation model established by the invention is verifiable. When evaluated on historical data, the model shows good statistical significance and goodness of fit, indicating that it is effective.
The loss estimation method established by the invention is easy to compute, fast, and real-time. Unlike other common natural disaster damage assessment methods, it builds the model using, as independent variables, the number of social media posts in each provincial administrative district within the analysis period, classified by emotion. Because social media data are available quickly and in real time, natural disaster damage can be evaluated in real time, which greatly speeds up damage assessment and provides early decision support for work such as rescue and disaster relief.
The invention establishes a disaster related word bank and a disaster anti-word bank. The two word banks support data cleaning with good results, and they can also be used for further analysis and understanding of natural disasters.
The invention uses a high-frequency co-occurrence method, supplemented by third-party word banks, to build the word banks. This ensures the breadth of the disaster related word bank while saving the cost of manual labeling or addition and greatly speeding up word bank construction.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a table of details of a multiple linear regression model prior to iteration in accordance with an embodiment of the present invention;
FIG. 3 is a table of details of a post-iteration multiple linear regression model in accordance with an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a natural disaster damage real-time evaluation method based on social media data. The method is data-driven, evaluates in real time, covers rich spatial and temporal dimensions, cleans data quickly based on word banks, and has good statistical significance. Fig. 1 shows the flow chart of the invention, which comprises the following steps:
Data from social media such as Sina Weibo are obtained using natural disasters as keywords (for example typhoon, rainstorm, and flood), and text data containing the natural disaster keywords are collected to provide the social media data basis.
Attributes such as publication time, publication location, and published text are extracted from the disaster-related social media data.
Word segmentation of the natural disaster related text data is performed with a word segmentation tool. The text is segmented using a Chinese word segmentation tool such as jieba, the emotion of the text is classified using an emotion recognition tool such as MoodLens, and the results are written back to the original data.
Emotion recognition of the natural disaster related text is performed by an emotion classification tool, with emotions classified into six categories: anger, disgust, happiness, sadness, fear, and neutral.
Word frequency statistics are computed on the social media text data related to the natural disaster, and the words are sorted by frequency from high to low; the first 600 words are defined as ultrahigh-frequency words and the first 6000 words as high-frequency words. Words related to the corresponding natural disaster are manually screened from the ultrahigh-frequency words and entered into the disaster related word bank.
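As an illustrative sketch only, the frequency ranking described above could be written as follows in Python. The function name `frequency_tiers`, the sample posts, and the assumption that the input is already segmented into word lists are hypothetical; the cut-offs of 600 and 6000 come from the text, and the manual screening of the ultrahigh-frequency words is not modeled.

```python
from collections import Counter

def frequency_tiers(segmented_posts, ultra_n=600, high_n=6000):
    """Rank words by frequency; return (ultrahigh-frequency, high-frequency) word lists."""
    counts = Counter(word for post in segmented_posts for word in post)
    ranked = [word for word, _ in counts.most_common()]
    return ranked[:ultra_n], ranked[:high_n]

# Hypothetical, already-segmented posts (a real run would use a segmenter's output).
posts = [
    ["typhoon", "landfall", "flood"],
    ["typhoon", "flood", "concert"],
    ["typhoon", "rain"],
]
# Small cut-offs are used here only so the toy data produces two distinct tiers.
ultra, high = frequency_tiers(posts, ultra_n=2, high_n=4)
print(ultra, high)
```

The ultrahigh-frequency list is by construction a prefix of the high-frequency list, which matches the "first 600" and "first 6000" wording.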
The co-occurrence frequency of each high-frequency word with the words in the disaster related word bank is calculated iteratively. From the high-frequency words whose co-occurrence frequency is higher than 90%, words related to the disaster are manually screened out and written into the disaster related word bank. The iteration continues until fewer than 5 words are added to the disaster related word bank in an iteration, at which point the word bank is considered close to convergence and its construction is complete.
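The iteration above can be sketched in Python under some stated assumptions. The co-occurrence frequency is taken as p = m/n as defined in claim 1, and occurrences are assumed here to be counted per post (the text does not say whether the unit is a post, a sentence, or a window). The names `cooccurrence_frequency`, `expand_lexicon`, and the `is_relevant` callback, which stands in for the manual screening, are hypothetical.

```python
def cooccurrence_frequency(word, lexicon, segmented_posts):
    """p = m / n: n posts contain `word`; m of those also contain another lexicon word."""
    n = m = 0
    for post in segmented_posts:
        words = set(post)
        if word in words:
            n += 1
            if (words - {word}) & lexicon:
                m += 1
    return m / n if n else 0.0

def expand_lexicon(seed, high_freq_words, segmented_posts, is_relevant,
                   p_min=0.9, min_new=5):
    """Grow the disaster related word bank until an iteration adds fewer than `min_new` words."""
    lexicon = set(seed)
    while True:
        candidates = [w for w in high_freq_words
                      if w not in lexicon
                      and cooccurrence_frequency(w, lexicon, segmented_posts) > p_min]
        accepted = [w for w in candidates if is_relevant(w)]  # manual screening stand-in
        lexicon.update(accepted)
        if len(accepted) < min_new:
            return lexicon

# Hypothetical data for illustration only.
posts = [
    ["typhoon", "flood", "rescue"],
    ["typhoon", "rescue"],
    ["concert", "star"],
    ["flood", "levee"],
]
high_freq = ["typhoon", "flood", "rescue", "levee", "concert", "star"]
word_bank = expand_lexicon({"typhoon", "flood"}, high_freq, posts, lambda w: True)
print(sorted(word_bank))
```

The loop always terminates because each continuing iteration must add at least `min_new` words from a finite candidate list.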
The disaster anti-word bank is established by two methods. The first is the same as the method for establishing the disaster related word bank, except that the screening criterion is changed to words unrelated to the natural disaster; it is not repeated here. The second uses third-party lexicons such as the Sogou lexicon: lexicons are searched using keywords that obviously do not belong to the disaster, such as "star" and "entertainment", the word segmentation results of the corresponding natural disaster social media texts are matched against these lexicons, and successfully matched words are written into the disaster anti-word bank. The construction of the disaster anti-word bank is completed based on these two methods.
The social media data are cleaned based on the disaster related word bank and the disaster anti-word bank. Specifically, for each piece of data in the social media text data set related to a given natural disaster: first, all words of the piece of data are matched against the disaster related word bank, and if at least one word matches, the data is considered preliminarily related to the disaster and passes the first cleaning stage; second, all words of the data are matched against the disaster anti-word bank, and if at least one word matches, the data is considered unrelated to the disaster and does not pass the second cleaning stage; third, the publication location of the data is checked, and if the information is incomplete, the data does not pass the third cleaning stage. The remaining data are then read in sequence and matched against the two word banks in the same way until all text data related to the natural disaster have been cleaned.
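A minimal Python sketch of the three-stage cleaning described above follows. The record layout (a dict with `words` and `province` keys), the function name `passes_cleaning`, and the sample word banks are assumptions made for illustration; the text only states that the publication location must be complete, and a province field is used here because the model works at provincial resolution.

```python
def passes_cleaning(post, related_bank, anti_bank):
    """Return True only if the post survives all three cleaning stages."""
    words = set(post["words"])
    if not words & related_bank:   # stage 1: needs at least one disaster related word
        return False
    if words & anti_bank:          # stage 2: any anti-word marks the post as unrelated
        return False
    if not post.get("province"):   # stage 3: publication location must be present
        return False
    return True

# Hypothetical word banks and posts.
related = {"typhoon", "flood"}
anti = {"star", "entertainment"}
posts = [
    {"words": ["typhoon", "flood"], "province": "Fujian"},
    {"words": ["typhoon", "star"], "province": "Fujian"},
    {"words": ["flood"], "province": ""},
    {"words": ["concert"], "province": "Hubei"},
]
cleaned = [p for p in posts if passes_cleaning(p, related, anti)]
print(len(cleaned))
```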
Historical disaster data for the natural disaster are acquired as the dependent variable of the model, providing the disaster data basis of the invention. Loss data of similar historical disasters are acquired for model building and evaluation. Taking Sina Weibo and the second extremely heavy precipitation process of 2016 as an example, historical disaster data of that precipitation process are obtained as the dependent variable of the model, and the analysis period is June 30, 2016 to July 6, 2016.
For a given natural disaster, an analysis period is selected (for a typhoon disaster, 15 days before and after typhoon landfall; for a rainstorm disaster, the duration of the heavy rainfall), the spatial resolution is the provincial administrative district, and the social media data are counted by emotion to serve as the independent variables of the model.
For a given natural disaster, the number of microblog posts of each emotion in each provincial administrative district during the analysis period is taken as the x values and the direct economic loss during the analysis period as the y value, and a multiple linear regression model is fitted.
Fig. 2 is the model detail table of one embodiment. The number of Sina Weibo posts of each emotion in each provincial administrative district during the analysis period is taken as the x values, where the number of posts expressing anger is $x_1$, disgust is $x_2$, happiness is $x_3$, sadness is $x_4$, and fear is $x_5$, and a multiple linear regression model is established with the direct economic loss during the analysis period as the y value.
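The construction of the independent variables can be sketched as a simple count per province and emotion. This is an illustration under assumptions: the record keys `province` and `emotion`, the function name `emotion_counts_by_province`, and the sample data are hypothetical, and the posts are assumed to be already cleaned and restricted to the analysis period. The variable order follows the embodiment (anger, disgust, happiness, sadness, fear); neutral posts are not assigned to any variable there.

```python
from collections import Counter

# Order follows the embodiment: x1 anger, x2 disgust, x3 happiness, x4 sadness, x5 fear.
EMOTIONS = ("anger", "disgust", "happiness", "sadness", "fear")

def emotion_counts_by_province(posts):
    """Return {province: [x1, x2, x3, x4, x5]} for cleaned posts in the analysis period."""
    counts = Counter((p["province"], p["emotion"]) for p in posts)
    provinces = sorted({p["province"] for p in posts})
    return {prov: [counts[(prov, e)] for e in EMOTIONS] for prov in provinces}

# Hypothetical cleaned posts; the neutral post contributes to no variable.
sample = [
    {"province": "Hubei", "emotion": "anger"},
    {"province": "Hubei", "emotion": "anger"},
    {"province": "Hubei", "emotion": "sadness"},
    {"province": "Hubei", "emotion": "neutral"},
    {"province": "Anhui", "emotion": "fear"},
]
features = emotion_counts_by_province(sample)
print(features)
```

Each province then contributes one row of x values, paired with that province's direct economic loss as the y value.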
The first column in Fig. 2 is the regression coefficient of each variable, from which the regression model is found to be $y = 32.18 - 0.064x_1 - 0.448x_2 - 0.026x_3 + 0.039x_4 - 0.038x_5$. The second column is the standard error of the estimated coefficient, and the formula for calculating the standard error of the estimated coefficient for the variable x is
[Equation image BDA0001734099680000051]
where
[Equation image BDA0001734099680000052]
is the standard deviation of the model's estimated residuals. The third column is the t statistic, which is calculated as
$$t = \frac{b}{s_b}$$
where $b$ is the regression coefficient and $s_b$ is its standard error from the second column. The fourth column is the p-value: for sample size $n$, the t distribution table is searched for the critical value $t_{\alpha/2}(n-2)$ closest to the t statistic calculated in the third column, and the corresponding significance level $\alpha$ is the p-value. Generally, a coefficient with a p-value greater than 0.05 is considered insignificant, and a coefficient with a p-value less than or equal to 0.05 is considered significant. The fifth and sixth columns are the bounds of the 95% confidence interval, calculated as $(b - t_{\alpha/2}(n-2)\,s_b,\; b + t_{\alpha/2}(n-2)\,s_b)$, where the threshold of the significance level $\alpha$ is generally 0.05.
Variable screening is carried out according to the significance level (p-value) of the model coefficients. When the p-value of any independent variable in the model is greater than the set threshold of 0.05, the variable $x_i$ (i = 1, 2, 3, 4, 5) with the largest p-value is removed and the multiple linear regression model is established again; this is repeated until the p-values of all independent variables are less than 0.05, at which point variable screening is complete. Fig. 3 shows the model finally established in one embodiment, which retains only the independent variables with a significance level less than 0.05 (the meaning and calculation of each quantity are the same as in Fig. 2).
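The screening loop above is a form of backward elimination, and its control flow can be sketched independently of any particular regression library. In the sketch below, `backward_eliminate` and the `fit_pvalues` callback are hypothetical names: the callback is assumed to refit the multiple linear regression on the given variables and return their p-values. The stub `fake_fit` returns made-up p-values purely to exercise the loop; they are not the values from Fig. 2 or Fig. 3.

```python
def backward_eliminate(variables, fit_pvalues, threshold=0.05):
    """Drop the least significant variable and refit until all p-values meet the threshold."""
    kept = list(variables)
    while kept:
        pvalues = fit_pvalues(kept)                  # refit on the remaining variables
        worst = max(kept, key=lambda v: pvalues[v])  # variable with the largest p-value
        if pvalues[worst] <= threshold:
            break                                    # every remaining variable is significant
        kept.remove(worst)
    return kept

def fake_fit(kept):
    """Stand-in for a regression fit; the p-values are invented for illustration."""
    table = {
        5: {"x1": 0.40, "x2": 0.03, "x3": 0.70, "x4": 0.02, "x5": 0.55},
        4: {"x1": 0.30, "x2": 0.02, "x4": 0.01, "x5": 0.45},
        3: {"x1": 0.20, "x2": 0.01, "x4": 0.01},
        2: {"x2": 0.01, "x4": 0.005},
    }
    return table[len(kept)]

selected = backward_eliminate(["x1", "x2", "x3", "x4", "x5"], fake_fit)
print(selected)
```

Only one variable is removed per refit, because p-values of the remaining variables can change once a correlated variable leaves the model.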
The fit of the model is tested: the model has good statistical significance and goodness of fit, there is a good linear relationship between the independent variables and the dependent variable, and the coefficients of the independent variables are significant. The direct economic loss calculation finally obtained from the model is given by the following formula: $y = 26.45513 - 0.62265x_2 + 0.029051x_4$
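As a worked illustration, the final formula can be applied directly to post counts. The function name `estimate_direct_loss` and the input counts are hypothetical, and the unit of y is not stated in the text, so the result is only a number in whatever unit the historical loss data used.

```python
def estimate_direct_loss(disgust_posts, sadness_posts):
    """Apply the embodiment's final model: y = 26.45513 - 0.62265*x2 + 0.029051*x4."""
    return 26.45513 - 0.62265 * disgust_posts + 0.029051 * sadness_posts

# Hypothetical counts for one provincial administrative district in the analysis period.
loss = estimate_direct_loss(disgust_posts=10, sadness_posts=1000)
print(round(loss, 5))
```

With no posts of either emotion the estimate reduces to the intercept, 26.45513.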
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (2)

1. A natural disaster damage real-time evaluation method based on social media data, comprising the following steps: step 1, acquiring social media text data using natural disasters as keywords; step 2, extracting text data and processing the data to complete word segmentation and emotion labeling of the text data; step 3, establishing a natural disaster related word bank and an anti-word bank; step 4, cleaning the text data based on the natural disaster related word bank, the anti-word bank, and the geographic information of the social media data; step 5, setting an analysis period, acquiring historical disaster data of the natural disaster in the analysis period as the dependent variable of the model, establishing a multiple linear regression model, and performing iterative computation until the significance level of the model coefficients is smaller than a set threshold;
in step 1, the model is constructed on the basis of large-scale social media data: data in social media are acquired using natural disasters as keywords, and text data containing the natural disaster keywords are collected to provide the social media data basis;
in step 2, extracting text data comprises extracting the publication time, publication location, and published text attributes from the disaster-related social media data, and the processing uses a word segmentation tool to perform word segmentation of the natural disaster related text data; emotions are labeled in six categories: happiness, anger, disgust, sadness, fear, and neutral;
in step 3, the method for building the natural disaster related word bank comprises: performing word frequency statistics on the social media text data related to the natural disaster and sorting the words by frequency from high to low, wherein the first 600 words are defined as ultrahigh-frequency words and the first 6000 words as high-frequency words; manually screening words related to the corresponding natural disaster from the ultrahigh-frequency words into the disaster related word bank; iteratively calculating the co-occurrence frequency of each high-frequency word with the words in the disaster related word bank, where the co-occurrence frequency is p, the number of occurrences of the high-frequency word is n, and the number of times the high-frequency word occurs together with any word in the disaster related word bank is m, so that p = m/n; and manually screening out words related to the disaster from the high-frequency words whose co-occurrence frequency is higher than 90% and writing them into the disaster related word bank, until the number of words selected into the disaster related word bank in an iteration is less than 5, whereupon construction of the disaster related word bank is complete;
in step 3, the method for building the disaster anti-word bank comprises an iteration method and a third-party lexicon method, wherein the third-party lexicons are searched using keywords that obviously do not belong to the disaster, the word segmentation results of the corresponding natural disaster social media texts are matched against these lexicons, and successfully matched words are written into the disaster anti-word bank;
in step 4, the cleaning method is: for a piece of data in the social media text data set related to a given natural disaster, first, all words of the data are matched against the disaster related word bank, and if at least one word matches, the data is considered preliminarily related to the disaster and cleaning continues; second, all words of the data are matched against the disaster anti-word bank, and if at least one word matches, the data is considered unrelated to the disaster and does not pass data cleaning; third, the publication location of the data is checked, and if the information is incomplete, the data does not pass data cleaning;
in step 5, the independent variables of the model are the microblog post counts of the different emotions in each provincial administrative district in the analysis period, and the dependent variable is the direct economic loss caused by the natural disaster in the analysis period.
2. The method according to claim 1, wherein in step 5 the iteration method is: the number of social media posts of each emotion in each provincial administrative district in the analysis period is selected as the independent variables; through a regression significance test, the independent variable with the worst significance is removed and the multiple linear regression model is established again; this iteration continues until the p-values of all remaining independent variables are smaller than the set threshold, whereupon the variable iteration process of the model ends and only the remaining independent variables enter the final prediction model.
CN201810787884.8A 2018-07-18 2018-07-18 Natural disaster damage real-time evaluation method based on social media data Active CN109063983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810787884.8A CN109063983B (en) 2018-07-18 2018-07-18 Natural disaster damage real-time evaluation method based on social media data

Publications (2)

Publication Number Publication Date
CN109063983A (en) 2018-12-21
CN109063983B (en) 2022-06-21

Family

ID=64817114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810787884.8A Active CN109063983B (en) 2018-07-18 2018-07-18 Natural disaster damage real-time evaluation method based on social media data

Country Status (1)

Country Link
CN (1) CN109063983B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110426735A (en) * 2019-07-02 2019-11-08 武汉大学 A kind of detection method of the earthquake disaster coverage based on social media
CN111898385B (en) * 2020-07-17 2023-08-04 中国农业大学 Earthquake disaster assessment method and system
CN113011174B (en) * 2020-12-07 2023-08-11 红塔烟草(集团)有限责任公司 Method for identifying purse string based on text analysis
CN112818668B (en) * 2021-02-05 2024-03-29 上海市气象灾害防御技术中心(上海市防雷中心) Meteorological disaster data semantic recognition analysis method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346411A (en) * 2013-08-09 2015-02-11 北大方正集团有限公司 Method and equipment for clustering multiple manuscripts
KR101612423B1 (en) * 2013-10-21 2016-04-22 대한민국 Disaster detecting system using social media
CN104484722A (en) * 2014-12-24 2015-04-01 贵州电网公司电力调度控制中心 CIM standard based modeling method for model about power grid disasters influenced by meteorological factors
CN106228462A (en) * 2016-07-11 2016-12-14 浙江大学 A kind of many energy-storage systems Optimization Scheduling based on genetic algorithm
CN106408223A (en) * 2016-11-30 2017-02-15 华北电力大学(保定) Short-term load prediction based on meteorological similar day and error correction
CN107590196A (en) * 2017-08-15 2018-01-16 中国农业大学 Earthquake emergency information screening and evaluating system and system in a kind of social networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
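Mining and analysis of emergency information for sudden events based on social media; Wang Yandong et al.; Geomatics and Information Science of Wuhan University; 2016-01-29; Vol. 41 (No. 3); full text *
Research on social media user behavior in disaster-affected areas during sudden disasters: content analysis and longitudinal analysis of microblog logs related to the "Tianjin 8.12 explosion"; Zong Qianjin et al.; Journal of Information Resources Management; 2017-01-26 (No. 1); full text *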

Also Published As

Publication number Publication date
CN109063983A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109063983B (en) Natural disaster damage real-time evaluation method based on social media data
CN109255506B (en) Internet financial user loan overdue prediction method based on big data
Garip Discovering diverse mechanisms of migration: The Mexico–US stream 1970–2000
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN109189767B (en) Data processing method and device, electronic equipment and storage medium
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN102708153B (en) Self-adaption finding and predicting method and system for hot topics of online social network
CN110111113B (en) Abnormal transaction node detection method and device
CN109635010B (en) User characteristic and characteristic factor extraction and query method and system
CN109002492B (en) Performance point prediction method based on LightGBM
CN110415071B (en) Automobile competitive product comparison method based on viewpoint mining analysis
CN107679031B (en) Advertisement and blog identification method based on stacking noise reduction self-coding machine
CN112307153A (en) Automatic construction method and device of industrial knowledge base and storage medium
CN101710422A (en) Image segmentation method based on overall manifold prototype clustering algorithm and watershed algorithm
CN113822419B (en) Self-supervision graph representation learning operation method based on structural information
CN111860981A (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN111984790B (en) Entity relation extraction method
CN111625578B (en) Feature extraction method suitable for time series data in cultural science and technology fusion field
CN107480126B (en) Intelligent identification method for engineering material category
CN111507528A (en) Stock long-term trend prediction method based on CNN-L STM
CN110738565A (en) Real estate finance artificial intelligence composite wind control model based on data set
CN116503158A (en) Enterprise bankruptcy risk early warning method, system and device based on data driving
CN115579069A (en) Construction method and device of scRNA-Seq cell type annotation database and electronic equipment
He Rain prediction in Australia with active learning algorithm
CN114820074A (en) Target user group prediction model construction method based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant