Typhoon event information aggregation method
Technical field
The invention belongs to the field of big data mining, and in particular relates to a typhoon event information aggregation method.
Background
Typhoons can severely damage natural ecosystems, the social economy, and even the sustainable development of human society. Timely access to information on how a typhoon event evolves is therefore an important basis and reference for disaster emergency response. In the current big data environment, social media, with its high update frequency, multi-source communication channels, and broad participation, shows great potential in disaster management and has gradually become a new way to obtain typhoon event information. However, because social media posts are short texts, the information is highly fragmented, its forms of expression are complex and varied, and its granularity is uneven. Such vast and scattered social media information neither reflects the full picture of a typhoon event's evolution nor lets users trace the event process effectively.
Information aggregation methods describe information resources effectively in order to organize information more rationally and optimize access efficiency, so that users can obtain useful information resources conveniently. Information aggregation for disaster events mainly includes statistics-based methods, topic-model-based methods, and knowledge-element-based methods. (1) Statistical methods use statistical features such as word frequency, TF-IDF, N-grams, and mutual information to compute keyword weights in each information unit, select the most representative keywords, and aggregate on that basis. These methods are simple, subjective, and easy to understand, but because keyword screening is not very accurate, they generally need auxiliary information for a second round of screening. (2) Probabilistic topic models assume that each document has a latent distribution over all topic words, so the probability distribution of topic words can represent the topics in an information unit. However, the performance of these methods depends on choosing the number of topics, while the topics on social media change dynamically in practice. A single social media message may also contain several topics, which makes the interpretability of topic words contentious. (3) Knowledge elements define the logical relationships and hierarchical structure between concepts; common forms include ontologies, semantic networks, and linked data. Knowledge-element-based aggregation builds on knowledge element theory: it constructs a conceptual model describing the structure of disaster events and reorders and organizes information according to the semantic relationships defined in the model, revealing information features and their associations.
At present, statistics-based and topic-model-based methods are the most common ways to aggregate disaster event information. However, the results of these two kinds of methods are coarse-grained and usually only gather the various kinds of information related to a disaster event in one place. By comparison, knowledge-element-based aggregation can decompose and reorganize the original resources according to the conceptual system of the disaster domain and obtain deeply aggregated results with a definite knowledge structure. Existing knowledge models of typhoon events, however, mostly focus on the hierarchical structure and associations of the concepts involved and neglect the description and expression of the dynamic process of a typhoon event. Given that massive and heterogeneous social media resources are scattered, an information aggregation method is needed that integrates typhoon event information in an ordered way according to the evolution of the event.
Summary of the invention
The purpose of the present invention is to provide a typhoon event information aggregation method that screens, organizes, and integrates typhoon event information from scattered sources in social media. The method provides an ordered information basis for detecting the development stage and situation of a typhoon event process, and helps improve the service capability of social media resources in emergency management.
To achieve the above purpose, the present invention provides the following technical solution:
A typhoon event information aggregation method, whose main steps are as follows:
Step 1. Collect message texts related to the typhoon event from social media, extract typhoon event information from them, and convert it into structured information tuples;
Step 2. Object information aggregation based on multi-feature similarity: determine from the similarity between object names whether information tuples belong to the same object, and aggregate the information tuples that describe the same object;
Step 3. State information aggregation based on spatiotemporal features: from the object information aggregation result, select the attribute values and behavior values that satisfy a single time and location condition; the time information, the location information, and the selected attribute values and behavior values together constitute the state information aggregation result of the object at a specific time and place;
Step 4. State-based process information aggregation: from the object information aggregation result, select the spatiotemporal nodes that fall within the required time and location range, perform state information aggregation for each of these nodes, and sort the resulting state information aggregation results to form a process information aggregation result that reflects dynamic characteristics.
Preferably, in step 1, the typhoon event information includes object name, time information, location information, attribute information, and behavior information.
Preferably, in step 2, for different information tuples describing the same object, attribute items and behavior items of the same type are further aggregated.
Preferably, in step 1, typhoon event information extraction includes at least two parts, information element identification and information element association:
Information element identification: determine the constituent objects of a typhoon event and build a classification system, then extract from social media texts the names and feature information describing the different types of objects, where feature information includes time, location, attributes, and behaviors. Attribute information is further divided into attribute items and attribute values: an attribute item denotes the type of the attribute, and an attribute value is the data or quantity that an attribute of that type takes. Behavior information is structured in the same way as attribute information;
Information element association: within a single social media text, associate the feature information with the name of the object it characterizes, forming an information tuple of the form O_n = <T, L, A, B>, where O_n is the object name, T is the time information, L is the location information, A is the attribute information, and B is the behavior information.
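The tuple form O_n = <T, L, A, B> can be sketched as a small data structure. The class and field names below are illustrative assumptions, not part of the method; attributes and behaviors are stored as item-to-value mappings, following the item/value split described for information element identification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InfoTuple:
    """One information tuple O_n = <T, L, A, B> (field names are illustrative)."""
    name: str                           # O_n: object name
    time: Optional[str] = None          # T: time information
    location: Optional[str] = None      # L: location information
    # A: attribute item -> attribute value
    attributes: Dict[str, Optional[str]] = field(default_factory=dict)
    # B: behavior item -> behavior value (None when no value is stated)
    behaviors: Dict[str, Optional[str]] = field(default_factory=dict)


# The typhoon tuple used as an example later in the description.
typhoon = InfoTuple(
    name="typhoon",
    time="2019-08-10 01:45:00",
    location="Wenling City, Zhejiang Province",
    attributes={"wind force": "level 16"},
    behaviors={"landfall": None},
)
```

Either of A and B may be left empty, which matches the later remark that a tuple may lack attribute or behavior information.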
Preferably, in step 2, word vector similarity is used to judge the similarity between object names, attribute items, and behavior items, including the following steps:
S1. Perform word segmentation on all social media text data;
S2. Use the word segmentation results as the training set and train word vectors with the Skip-gram model;
S3. Given object names O_n1 and O_n2, attribute items A_1 and A_2, and behavior items B_1 and B_2, obtain their word vectors E(O_n1), E(O_n2), E(A_1), E(A_2), E(B_1), and E(B_2) from the trained word vector model;
S4. Use cosine similarity to compute the similarity values sim_n, sim_a, and sim_b between E(O_n1) and E(O_n2), E(A_1) and E(A_2), and E(B_1) and E(B_2), respectively. If sim_n ≥ ε_n, sim_a ≥ ε_a, and sim_b ≥ ε_b, where ε_n, ε_a, and ε_b are thresholds, then O_n1 and O_n2, A_1 and A_2, and B_1 and B_2 are the same object name, attribute item, and behavior item, respectively, and the corresponding information can be aggregated.
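The threshold test in S4 can be illustrated with a minimal sketch. The three-dimensional vectors and the threshold value 0.8 below are made-up stand-ins; in the method the vectors would come from the Skip-gram model trained in S2, and the thresholds are not specified in the text.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_same_item(u: Sequence[float], v: Sequence[float], threshold: float) -> bool:
    """Apply the rule sim >= epsilon to two word vectors."""
    return cosine_similarity(u, v) >= threshold


# Toy word vectors (assumption: real ones come from the trained model).
E = {
    "typhoon": [0.9, 0.1, 0.3],
    "tropical cyclone": [0.8, 0.2, 0.35],
    "subway": [0.1, 0.9, 0.0],
}
EPSILON_N = 0.8  # assumed object-name threshold

same = is_same_item(E["typhoon"], E["tropical cyclone"], EPSILON_N)
different = is_same_item(E["typhoon"], E["subway"], EPSILON_N)
```

With these toy vectors, "typhoon" and "tropical cyclone" pass the threshold while "typhoon" and "subway" do not. The same function serves for attribute items and behavior items with ε_a and ε_b.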
Preferably, in step 4, sorting the multiple state information aggregation results includes the following options:
A1. Sort by the time information of the states, in chronological or reverse chronological order;
A2. Sort by the location information of the states, from large spatial scale to small or from small to large;
A3. Sort by the attribute information and behavior information of the states, either by the magnitude or level of the feature values or by their similarity to the user's aggregation condition.
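The three sorting options can be sketched with ordinary key-based sorting. The record layout, the `SCALE_RANK` table of administrative levels, and the province-level record are illustrative assumptions; the two city-level records reuse the wind-force values from the example in the description.

```python
from datetime import datetime

# Hypothetical ranking of spatial scale: smaller rank = larger scale.
SCALE_RANK = {"province": 0, "city": 1, "district": 2}

states = [
    {"time": "2019-08-11 20:50:00", "place": "Qingdao", "scale": "city", "wind_level": 9},
    {"time": "2019-08-10 01:45:00", "place": "Wenling", "scale": "city", "wind_level": 16},
    # Made-up province-level record, included only to show scale ordering.
    {"time": "2019-08-10 01:45:00", "place": "Zhejiang", "scale": "province", "wind_level": 16},
]


def state_time(state: dict) -> datetime:
    """Parse the normalized 'YYYY-MM-DD HH:MM:SS' time of a state."""
    return datetime.strptime(state["time"], "%Y-%m-%d %H:%M:%S")


by_time = sorted(states, key=state_time)                              # A1, chronological
by_time_desc = sorted(states, key=state_time, reverse=True)           # A1, reverse
by_scale = sorted(states, key=lambda s: SCALE_RANK[s["scale"]])       # A2, large to small
by_wind = sorted(states, key=lambda s: s["wind_level"], reverse=True) # A3, by level
```

Sorting by similarity to a user condition (the second variant of A3) would use the same pattern with a similarity score as the key.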
The above technical solution achieves the following technical effects:
The present invention constructs a social-media-based method for aggregating typhoon event process information. After identifying the information tuples of the different objects related to a typhoon event in social media texts, it applies a multi-level "object-state-process" aggregation model. First, in the object layer, the scattered feature information of the same object is aggregated according to the similarity of multi-dimensional features. Second, in the state layer, the attribute information and behavior information of the object that match specific spatiotemporal features are aggregated, unifying the spatiotemporal granularity of the information. Finally, in the process layer, multiple states are sorted according to their spatiotemporal relationships, so that the information is organized in an ordered way. This aggregation model addresses the scattered, multi-granularity, and unordered way information is described on social media and also takes full account of the dynamic evolution of typhoon events. It can obtain the feature information of different objects at any spatiotemporal node and form ordered information that reflects the process characteristics of the typhoon event. In practical applications, it can play an important role in meeting both the emergency task needs of government agencies and the public's need to understand how events unfold.
Description of drawings
Figure 1 shows the multi-level aggregation model for typhoon event process information;
Figure 2 shows the spatiotemporal semantic units constructed in social media text;
Figure 3 shows an example of typhoon event information extracted from social media;
Figure 4 shows the organizational structure of an object information aggregation result, with an example;
Figure 5 shows the organizational structure of a state information aggregation result, with an example;
Figure 6 shows the different stages of process information aggregation;
Figure 7 shows the organizational structure of a process information aggregation result, with an example.
Detailed description
The present invention is further described below with reference to the accompanying drawings and a specific embodiment.
Embodiment
The present invention discloses a social-media-based method for aggregating typhoon event process information, including:
Step 1. Collect message texts related to the typhoon event from social media, extract typhoon event information from them, including object name, time information, location information, attribute information, and behavior information, and convert it into structured information tuples.
Step 2. Object information aggregation based on multi-feature similarity. Determine from the similarity between object names whether information tuples belong to the same object, and aggregate the information tuples that describe the same object. For different information tuples describing the same object, attribute items and behavior items of the same type are further aggregated.
Step 3. State information aggregation based on spatiotemporal features. From the object information aggregation result, select the attribute values and behavior values that satisfy a single time and location condition. The time information, the location information, and the selected attribute values and behavior values together constitute the state information aggregation result of the object at a specific time and place.
Step 4. State-based process information aggregation. From the object information aggregation result, select the spatiotemporal nodes that fall within the required time and location range, perform state information aggregation for each of these nodes, and sort the resulting state aggregation results to form a process information aggregation result that reflects dynamic characteristics.
As a preferred technical solution, typhoon event information extraction in step 1 includes:
1. Determine the constituent objects of a typhoon event and build a classification system, then extract from social media texts the names and feature information describing the different types of objects, where feature information includes time, location, attributes, and behaviors. Attribute information is further divided into attribute items and attribute values: an attribute item denotes the type of the attribute, and an attribute value is the data or quantity that an attribute of that type takes. Behavior information is structured in the same way as attribute information.
2. Within a single social media text, associate the feature information with the name of the object it characterizes, forming an information tuple of the form O_n = <T, L, A, B>, where O_n is the object name, T is the time information, L is the location information, A is the attribute information, and B is the behavior information.
As a preferred technical solution, the constituent objects of a typhoon event are divided into the subject object and affected objects. The cyclone, as the hazard-causing factor, is the subject object of the event, while the other objects that the cyclone damages, acts on, or influences are the affected objects. Affected objects are classified by their nature, mainly into people, infrastructure, transportation facilities, social activities, and similar types. Note that the different objects can be subdivided further as needed, drawing on the classification methods of the relevant fields (Table 1).
Table 1. Main object types in typhoon events
As a preferred technical solution, extracting the names and feature information describing the different types of objects from social media texts includes:
S1. Build an annotated corpus of typhoon event information in social media texts. The annotations cover the name, time, location, attribute, and behavior information elements that describe the different types of objects.
S2. Using the annotated corpus, build a time information extraction model based on a conditional random field model to automatically recognize time information in social media texts.
S3. Using the annotated corpus, build a location information extraction model based on a deep belief network to automatically recognize location information in social media texts.
S4. Using the annotated corpus, derive rule models for object names, attribute information, and behavior information, including trigger word dictionaries and syntactic patterns, to automatically recognize object names, attribute information, and behavior information in social media texts.
As a preferred technical solution, associating the information elements extracted from social media includes:
S1. Construction of spatiotemporal semantic units. Characters, words, phrases, clauses, sentences, and paragraphs are all linguistic units of a text, and the semantic relationships between linguistic units form the basic structure of the text. A linguistic unit, or a combination of linguistic units, that expresses a complete meaning is a semantic unit. When a semantic unit contains time information and spatial information, and thus states explicitly the time and place to which its content applies, this method defines it as a spatiotemporal semantic unit.
An analysis of social media texts that mention typhoon events shows that the distribution of spatiotemporal semantic units falls roughly into three categories: (1) the text describes object information for a single time and location only, which is the case for most social media texts; (2) the text describes object information for different locations at the same time, which is relatively uncommon; (3) the text lists and compares object information for multiple times and locations, as in comprehensive reports, which is rare.
Spatiotemporal information makes it possible to track how the features of an object change within a text. This method therefore divides each social media text into spatiotemporal semantic units based on the extracted spatiotemporal information (Figure 2). The positions at which the spatiotemporal information occurs in the text serve as the basis for the division, as follows:
(1) In the first case, because there is only one time and one location, the whole text forms a single spatiotemporal semantic unit.
(2) In the second and third cases, the text is first divided into multiple time units according to the time information. When a time unit contains several pieces of location information, it is divided further by location, and the resulting spatiotemporal semantic units share the time information of their time unit.
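The two-stage division can be sketched as follows, assuming the text has already been turned into a sequence of tagged tokens by the extraction models. The tag names `TIME`, `LOC`, and `OTHER`, the rule that each time or location mention opens a new unit, and the choice to attach any leading text to the first unit are all assumptions made for this illustration; the description only states that the positions of spatiotemporal information are the basis for division.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (tag, text), tag is "TIME", "LOC", or "OTHER"


def split_on(tokens: List[Token], tag: str) -> List[Tuple[Optional[str], List[Token]]]:
    """Split tokens into chunks; each token carrying `tag` opens a new chunk.

    Returns (anchor_text, remaining_tokens) pairs. Tokens before the first
    anchor are kept with the first chunk.
    """
    chunks: List[Tuple[Optional[str], List[Token]]] = []
    current: List[Token] = []
    anchor: Optional[str] = None
    for tok in tokens:
        if tok[0] == tag:
            if anchor is not None:
                chunks.append((anchor, current))
                current = []
            anchor = tok[1]
        else:
            current.append(tok)
    chunks.append((anchor, current))
    return chunks


def spatiotemporal_units(tokens: List[Token]) -> List[Dict]:
    """Divide by time first, then by location inside each time unit."""
    units = []
    for time, time_tokens in split_on(tokens, "TIME"):
        for loc, rest in split_on(time_tokens, "LOC"):
            units.append({"time": time, "location": loc,
                          "content": [text for _, text in rest]})
    return units


# Category (2): one time, two locations -> two units sharing the time.
example = [
    ("TIME", "2019-08-10 01:45"),
    ("LOC", "Wenling"), ("OTHER", "typhoon made landfall"),
    ("LOC", "Taizhou"), ("OTHER", "roads flooded"),
]
units = spatiotemporal_units(example)
```

A text of category (1) yields exactly one unit, and a text with several time mentions yields at least one unit per time mention.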
S2. Rules for associating object names with feature information. Once a social media text has been divided into spatiotemporal semantic units, the recognized object names and feature information are distributed across the units, so the information elements can be organized by the unit to which they belong. Within each spatiotemporal semantic unit, the information elements are associated in the following order:
(1) Association of feature trigger words with feature values. A feature trigger word and a feature value together make up the feature information of an object. Here this refers specifically to attribute features and behavior features: feature trigger words denote attribute items and behavior items, and feature values denote attribute values and behavior values. In text, trigger words and values appear close together in a "feature trigger word, feature value" pattern. Counting the three words immediately preceding each attribute value shows that a feature trigger word appears among them in more than 99% of cases. Each feature value is therefore associated with the nearest feature trigger word that precedes it.
(2) Association of attribute and behavior information with object names. In Chinese, the object name is usually mentioned first and its features are described afterwards. Therefore, within a spatiotemporal semantic unit, each piece of attribute information and behavior information is associated with the nearest object name that precedes it.
(3) Association of object names with time and location information. The time information and location information of a spatiotemporal semantic unit are associated with each object name that occurs in that unit.
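The three association rules amount to a single left-to-right pass over the tokens of one spatiotemporal semantic unit. The sketch below assumes tokens are already tagged as `NAME`, `ATTR` (attribute trigger word), `BEH` (behavior trigger word), or `VALUE`; these tag names and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (tag, text)


def associate(unit_time: str, unit_location: str, tokens: List[Token]) -> List[Dict]:
    """Build information tuples for one spatiotemporal semantic unit."""
    tuples: List[Dict] = []
    current: Optional[Dict] = None               # nearest preceding object name
    last_trigger: Optional[Tuple[str, str]] = None  # nearest preceding trigger word
    for tag, text in tokens:
        if tag == "NAME":
            # Rule (3): the name inherits the unit's time and location.
            current = {"name": text, "time": unit_time, "location": unit_location,
                       "attributes": {}, "behaviors": {}}
            tuples.append(current)
            last_trigger = None
        elif tag in ("ATTR", "BEH") and current is not None:
            # Rule (2): features attach to the nearest preceding name.
            kind = "attributes" if tag == "ATTR" else "behaviors"
            current[kind][text] = None
            last_trigger = (kind, text)
        elif tag == "VALUE" and current is not None and last_trigger is not None:
            # Rule (1): a value attaches to the nearest preceding trigger word.
            kind, item = last_trigger
            current[kind][item] = text
    return tuples


tokens = [("NAME", "typhoon"), ("ATTR", "wind force"), ("VALUE", "level 16"),
          ("BEH", "landfall")]
result = associate("2019-08-10 01:45:00", "Wenling City, Zhejiang Province", tokens)
```

Trigger words or values that appear before any object name are dropped in this sketch; how the method treats such cases is not stated in the text.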
The object names and feature information associated in this way are filled into tuples of the form O_n = <T, L, A, B> (Figure 3). Note that a spatiotemporal semantic unit may describe only one aspect of the typhoon event, so an object information tuple may lack either the attribute or the behavior information.
As a preferred technical solution, object information aggregation in step 2 includes:
1. Aggregation based on object name. Let the object name in the aggregation condition be N, and compute in turn the similarity sim_n between each object name O_n and N. If sim_n ≥ ε_n, where ε_n is the object similarity threshold, the two names denote the same object, and the information tuples of that object are merged.
The similarity of object names is measured by word vector similarity. Using the word vector model trained with the Skip-gram model, each object name is first mapped to a vector in a multi-dimensional space, and cosine similarity then measures whether the vectors point in the same direction in that space.
For example, O(typhoon) = <2019-08-10 1:45, Wenling City, Zhejiang Province, wind force: level 16, landfall> and O(tropical cyclone) = <2019-08-11 20:50, Qingdao City, Shandong Province, wind force: level 9, landfall> are information tuples extracted from social media. With the object name in the aggregation condition set to "typhoon", the similarity of the object names "typhoon" and "tropical cyclone" in the tuples is evaluated. Both refer to the cyclone itself, so both information tuples are included in the aggregation result.
2. Aggregation by object features. After the information tuples of the same object have been aggregated, there are usually several pieces of attribute and behavior feature information of the same type, and the object information matching specific features can be aggregated further. Starting from the name-based aggregation result, set the object attribute feature A and the behavior feature B of the aggregation condition. For attribute features, word vector similarity is used to compute the similarity sim_a between each attribute item of O_n and A. If sim_a ≥ ε_a, where ε_a is the attribute similarity threshold, the attribute items are the same and their information is aggregated, with each attribute value and its spatiotemporal features retained after aggregation. Otherwise they are different attribute items of the same object and are not aggregated.
For behavior features, word vector similarity is used to compute the similarity sim_b between each behavior item of O_n and B. If sim_b ≥ ε_b, where ε_b is the behavior similarity threshold, the behavior items are the same and their information is aggregated, with each behavior value and its spatiotemporal features retained after aggregation. Otherwise they are different behavior items of the same object and are not aggregated.
For example, starting from the O(typhoon) and O(tropical cyclone) information tuples above, the "wind force" attribute information of the typhoon is aggregated further. Both O(typhoon) and O(tropical cyclone) contain the attribute item "wind force", which meets the similarity threshold, so <2019-08-10 1:45, Wenling City, Zhejiang Province, wind force: level 16> and <2019-08-11 20:50, Qingdao City, Shandong Province, wind force: level 9> form the feature-based aggregation result.
3. Information organization of the object aggregation result. The organizational form of an object information aggregation result is shown in Figure 4, where O(N) denotes the aggregated object, A_l is an attribute item of the aggregated object, a_ls is a specific attribute value, B_n is a behavior item of the aggregated object, b_nu is a specific behavior value, and <T, S> is the time and place at which an attribute value or behavior value occurs. The originally scattered fragments of information are all associated with the object they describe, identical attribute items and behavior items of the object are merged, and each attribute item and behavior item contains the different feature values observed under multiple spatiotemporal conditions.
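A minimal sketch of how name-based and feature-based aggregation could produce this nested structure is given below. The `same_name` and `same_item` predicates stand in for the word-vector threshold tests (sim_n ≥ ε_n, sim_a ≥ ε_a, sim_b ≥ ε_b); here they are replaced by a hand-written alias table and plain string equality, which is an assumption made only to keep the example self-contained.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Stand-in for the word-vector similarity test on object names (assumption).
ALIASES = {"typhoon": "cyclone", "tropical cyclone": "cyclone"}


def same_name(a: str, b: str) -> bool:
    return ALIASES.get(a, a) == ALIASES.get(b, b)


def same_item(a: str, b: str) -> bool:
    return a == b  # stand-in for sim_a >= eps_a / sim_b >= eps_b


def aggregate_object(tuples: List[Dict], target: str,
                     name_ok: Callable[[str, str], bool],
                     item_ok: Callable[[str, str], bool]) -> Dict:
    """Merge the tuples of one object; each item keeps its <T, S> tagged values."""
    result = {"object": target,
              "attributes": defaultdict(list), "behaviors": defaultdict(list)}
    for t in tuples:
        if not name_ok(t["name"], target):
            continue
        for kind in ("attributes", "behaviors"):
            for item, value in t[kind].items():
                # Reuse an existing item key when the new item matches it.
                key = next((k for k in result[kind] if item_ok(k, item)), item)
                result[kind][key].append(
                    {"T": t["time"], "S": t["location"], "value": value})
    return result


tuples = [
    {"name": "typhoon", "time": "2019-08-10 01:45:00", "location": "Wenling",
     "attributes": {"wind force": "level 16"}, "behaviors": {"landfall": None}},
    {"name": "tropical cyclone", "time": "2019-08-11 20:50:00", "location": "Qingdao",
     "attributes": {"wind force": "level 9"}, "behaviors": {"landfall": None}},
]
o_typhoon = aggregate_object(tuples, "typhoon", same_name, same_item)
```

Each attribute or behavior item in `o_typhoon` maps to a list of values, and every value carries the time T and place S at which it was reported.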
As a preferred technical solution, state information aggregation in step 3 includes:
1. Unification of the spatiotemporal reference. A spatiotemporal framework is the basis on which states exist, so state information aggregation requires a unified spatiotemporal reference. In this method, the time reference uses the Gregorian calendar for dates and Beijing time for clock time, and the spatial reference uses the CGCS2000 coordinate system.
2. Normalization of spatiotemporal information. Time information and location information determine whether the associated attribute and behavior information describes the state of an object under a specific spatiotemporal condition. Time information is normalized using the Gregorian year, calendar date, and clock time, in line with everyday usage. The normalized form is defined as "date + time" in the format "YYYY-MM-DD HH:MM:SS", for example "2019-08-10 12:00:00". Location information is converted into a normalized representation under the unified spatial reference, covering place names, addresses, and spatial coordinates. Place names follow the standard names, codes, and categories published by the state at the relevant time; the types of address elements and the ways they are combined follow published national or industry standards; and spatial coordinates are transformed as required by the spatial reference.
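Time normalization into the "YYYY-MM-DD HH:MM:SS" form can be sketched as below. The input pattern (a Chinese-style date such as "2019年8月10日1:45", as in the raw example tuples) and the function name are assumptions; the sketch also assumes the expression is already in Beijing time and performs no time zone conversion.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed raw form: YYYY年M月D日H:MM with optional :SS
_TIME_PATTERN = re.compile(
    r"(\d{4})年(\d{1,2})月(\d{1,2})日\s*(\d{1,2}):(\d{2})(?::(\d{2}))?")


def normalize_time(text: str) -> Optional[str]:
    """Return the first time expression in `text` as 'YYYY-MM-DD HH:MM:SS'."""
    match = _TIME_PATTERN.search(text)
    if match is None:
        return None
    year, month, day, hour, minute = (int(g) for g in match.groups()[:5])
    second = int(match.group(6) or 0)
    return datetime(year, month, day, hour, minute, second).strftime(
        "%Y-%m-%d %H:%M:%S")


normalized = normalize_time("2019年8月10日1:45")
```

Because the normalized strings are zero-padded and ordered from year to second, they can be compared directly as strings when checking whether one time precedes another.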
3. State-oriented aggregation. Set the time feature t and the location feature l of the aggregation. Based on the object-layer information aggregation result O(N), check in each attribute item and behavior item of O(N) whether there is a feature value (attribute value or behavior value) with T = t and S = l; if so, take this feature value as aggregated information. Otherwise, check whether there is a feature value with S = l and T < t whose T is closest to t; if so, take it as aggregated information. If not, check whether there is a feature value whose S is near l, with T < t and T closest to t; if so, take it as aggregated information as well. If no such value exists, that attribute item or behavior item is not aggregated. Traversing all attribute items and behavior items in O(N) in this way selects, for each item, at most one feature value that best matches the spatiotemporal features. The selected attribute and behavior information is aggregated to form the state information aggregation result of the object under the specified spatiotemporal conditions.
For example: a social media message records that at 1:45 on August 10 the cyclone's wind force reached level 16 in Wenling City, Zhejiang Province. When the cyclone state at (2:00, Wenling City) is aggregated, there is no wind-force update between 1:45 and 2:00, so "wind force level 16" is taken as one attribute feature of the cyclone object's state at (2:00, Wenling City). With this mechanism, the aggregation result at any spatiotemporal node is not limited to the object features explicitly stated for that node; it also carries the most recent value of every object feature from earlier times, which ensures the comprehensiveness and completeness of the aggregation result.
4. Information organization of state aggregation results. The state information aggregation result can be organized as shown in Figure 5, where S denotes the state of object O(N) at time t and location l; A_l and a_ls describe the attribute features of the state; B_n and b_nu describe the behavior features of the state; and <T, S> is the time and location at which an attribute or behavior feature was produced.
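The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `Feature`, `pick_feature`, `aggregate_state` and the `neighbors` table are hypothetical names; O(N) is assumed to be a mapping from each attribute or behavior item to a list of timestamped, located values; and which locations count as "near" l is assumed to be supplied by the caller, because the text does not define it. The year 2019 in the sample record is borrowed from the time-format example and is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # Beijing time: fixed UTC+8 offset


def normalize_time(dt: datetime) -> str:
    """Format a timezone-aware datetime as 'YYYY-MM-DD HH:MM:SS' in Beijing time."""
    return dt.astimezone(BEIJING).strftime("%Y-%m-%d %H:%M:%S")


@dataclass(frozen=True)
class Feature:
    value: str    # attribute value or behavior value
    T: datetime   # time at which the value was produced
    S: str        # normalized place name where it was produced


def pick_feature(records, t, l, neighbors):
    """Return at most one feature for the node (t, l), or None."""
    # 1) exact match: T == t and S == l
    for r in records:
        if r.T == t and r.S == l:
            return r
    # 2) same location, latest value strictly before t
    earlier = [r for r in records if r.S == l and r.T < t]
    if earlier:
        return max(earlier, key=lambda r: r.T)
    # 3) nearby location (hypothetical adjacency table), latest value before t
    near = [r for r in records if r.S in neighbors.get(l, ()) and r.T < t]
    if near:
        return max(near, key=lambda r: r.T)
    return None  # this item is not aggregated


def aggregate_state(obj, t, l, neighbors):
    """Traverse every item of O(N) and keep at most one feature per item."""
    state = {}
    for item, records in obj.items():
        chosen = pick_feature(records, t, l, neighbors)
        if chosen is not None:
            state[item] = chosen
    return state


def bj(*args):
    """Helper: build a Beijing-time datetime."""
    return datetime(*args, tzinfo=BEIJING)


# The Wenling City example from the text (year assumed).
cyclone = {"wind force": [Feature("level 16", bj(2019, 8, 10, 1, 45), "Wenling City")]}
state = aggregate_state(cyclone, bj(2019, 8, 10, 2, 0), "Wenling City", neighbors={})
print(normalize_time(bj(2019, 8, 10, 2, 0)), state["wind force"].value)
```

The three checks run in the order given in item 3, so an exact match always wins over an earlier value at the same place, which in turn wins over a value from a nearby place.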
As a preferred technical solution, the process information aggregation in step 4 comprises two parts: state sequence aggregation and event process aggregation. A process is the connection of different states in time and space, and its dynamics are reflected in the changes of attribute and behavior information across states. A typhoon event involves the evolution of multiple objects during the event, so the process of a typhoon event is made up of the different states of multiple objects. Process-layer information aggregation therefore uses step-by-step decomposition, abstracting the link from state information to process information into three levels: object state, state sequence and event process (Figure 6). The object state aggregates the attribute and behavior information of an object at a given time and place; the state sequence records the evolution of a single object and is obtained by aggregating that object's different states; the event process is the joint evolution of multiple objects and is made up of multiple state sequences.
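The three levels can be pictured as nested containers. The following Python sketch is only one possible representation; the class names `ObjectState`, `StateSequence` and `EventProcess`, their fields, and the sample objects and values are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ObjectState:
    """Attribute and behavior information of one object at one time and place."""
    obj_id: str
    t: datetime
    l: str
    attributes: dict = field(default_factory=dict)
    behaviors: dict = field(default_factory=dict)


@dataclass
class StateSequence:
    """Evolution of a single object: its states, in the order they are added."""
    obj_id: str
    states: list = field(default_factory=list)

    def add(self, state: ObjectState) -> None:
        if state.obj_id != self.obj_id:
            raise ValueError("a state sequence holds states of one object only")
        self.states.append(state)


@dataclass
class EventProcess:
    """Joint evolution of several objects: one state sequence per object."""
    name: str
    sequences: dict = field(default_factory=dict)  # obj_id -> StateSequence

    def add(self, state: ObjectState) -> None:
        seq = self.sequences.setdefault(state.obj_id, StateSequence(state.obj_id))
        seq.add(state)


# Made-up sample states, for illustration only.
event = EventProcess("typhoon event")
event.add(ObjectState("cyclone", datetime(2019, 8, 10, 1, 45), "Wenling City",
                      attributes={"wind force": "level 16"}))
event.add(ObjectState("cyclone", datetime(2019, 8, 10, 2, 0), "Wenling City",
                      attributes={"wind force": "level 16"}))
event.add(ObjectState("rainfall", datetime(2019, 8, 10, 2, 0), "Wenling City"))
print({obj_id: len(seq.states) for obj_id, seq in event.sequences.items()})
```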
As a preferred technical solution, state sequence aggregation includes:
S1. Set the time range tr and the spatial range sr of the aggregation and, based on the object information aggregation result O(N), traverse all attribute items and behavior items in O(N) in turn. In each attribute item and behavior item, check whether there is an attribute value or behavior value whose T lies within tr and whose S lies within sr, and collect all <T, S> that fall within tr and sr into a set of spatiotemporal nodes. For every spatiotemporal node in the set, apply the method of step 3 to obtain a state aggregation result, which yields multiple state aggregation results.
S2. Sort all state aggregation results. First sort by the time information of the states, in chronological or reverse chronological order; then by the location information of the states, from large scale to small or from small to large; finally by the attribute and behavior information of the states, either by the magnitude or level of the feature values or by similarity to the user's aggregation conditions. The state sequence ordered by these three criteria is the process aggregation result for a single object.
S3. Information organization of state sequence aggregation results. The state sequence aggregation result can be organized as shown in Figure 5, where P denotes the process undergone by object O(N) over the time range tr and the spatial range sr, and S denotes the object state at the spatiotemporal node <t_n, l_n>.
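Steps S1 and S2 can be sketched as follows. This is a simplified Python illustration under several assumptions: records are `(value, T, S)` tuples; sr is given as a set of place names; `scale_rank` is a hypothetical table that ranks places by spatial scale; `latest_at` is a reduced stand-in for the step 3 aggregation (it omits the nearby-location fallback); and the single `reverse` flag flips all three sort criteria together, which is coarser than the text allows. Apart from the 01:45 wind-force report, the sample values are made up.

```python
from datetime import datetime


def collect_nodes(obj, tr, sr):
    """S1: gather every <T, S> in obj with T inside tr and S inside sr."""
    start, end = tr
    nodes = set()
    for records in obj.values():          # each attribute or behavior item
        for _value, T, S in records:
            if start <= T <= end and S in sr:
                nodes.add((T, S))
    return nodes


def latest_at(obj, t, l):
    """Reduced stand-in for step 3: latest value of each item at l up to t."""
    state = {}
    for item, records in obj.items():
        hits = [r for r in records if r[2] == l and r[1] <= t]
        if hits:
            state[item] = max(hits, key=lambda r: r[1])[0]
    return state


def state_sequence(obj, tr, sr, scale_rank, feature_key=lambda state: 0, reverse=False):
    """S1 + S2: one state per node, sorted by time, location scale, then features."""
    states = [(T, S, latest_at(obj, T, S)) for T, S in collect_nodes(obj, tr, sr)]
    states.sort(key=lambda s: (s[0], scale_rank.get(s[1], 0), feature_key(s[2])),
                reverse=reverse)
    return states


# Illustrative records only (year assumed; later values are made up).
cyclone = {
    "wind force": [
        ("level 16", datetime(2019, 8, 10, 1, 45), "Wenling City"),
        ("level 14", datetime(2019, 8, 10, 3, 0), "Wenling City"),
    ],
    "movement": [("moving north-northwest", datetime(2019, 8, 10, 3, 0), "Wenling City")],
}
tr = (datetime(2019, 8, 10, 0, 0), datetime(2019, 8, 10, 6, 0))
P = state_sequence(cyclone, tr, sr={"Wenling City"}, scale_rank={"Wenling City": 2})
for T, S, state in P:
    print(T, S, state)
```

Because the node set is built from `<T, S>` pairs, two items reported at the same time and place produce a single node and therefore a single state.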
As a preferred technical solution, event process aggregation includes:
S1. Set the time range tr and the spatial range sr of the aggregation. Based on the information aggregation results of multiple objects, O(N_s) to O(N_t), first traverse all attribute items and behavior items in O(N_s) to obtain the <T, S> that fall within tr and sr, then continue with O(N_{s+1}) and so on until O(N_t) has been traversed. Collect all <T, S> that fall within tr and sr into a set of spatiotemporal nodes.
S2. The same sorting mechanism is applied to the state sequences of the multiple objects, so that the overall order of the aggregation result is consistent. For the event-process aggregation result, comparing the state features at successive time nodes reveals the movement of spatial features and the differences in attribute and behavior features, so that the dynamic process of the whole typhoon event is recorded explicitly (Figure 7).
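The multi-object traversal and the comparison of successive states can be sketched as follows. This Python illustration assumes that each object is a mapping from item name to `(value, T, S)` records and that sr is a set of place names; `event_nodes` and `diff_states` are hypothetical names. For brevity the shared ordering here is by time and then place name, not by spatial scale, and the "rainfall" object and all values other than the 01:45 wind-force report are made up.

```python
from datetime import datetime


def event_nodes(objects, tr, sr):
    """S1: traverse O(N_s) ... O(N_t) and collect every <T, S> inside tr and sr."""
    start, end = tr
    nodes = set()
    for obj in objects.values():              # one object after another
        for records in obj.values():          # each attribute or behavior item
            for _value, T, S in records:
                if start <= T <= end and S in sr:
                    nodes.add((T, S))
    return sorted(nodes)                      # S2: one shared ordering for all objects


def diff_states(prev, curr):
    """Items whose value differs between two states, as (before, after) pairs."""
    changed = {}
    for item in sorted(prev.keys() | curr.keys()):
        if prev.get(item) != curr.get(item):
            changed[item] = (prev.get(item), curr.get(item))
    return changed


# Illustrative objects only (year assumed; "rainfall" is a made-up object).
objects = {
    "cyclone": {"wind force": [("level 16", datetime(2019, 8, 10, 1, 45), "Wenling City")]},
    "rainfall": {"intensity": [
        ("heavy rain", datetime(2019, 8, 10, 2, 30), "Wenling City"),
        ("light rain", datetime(2019, 8, 11, 9, 0), "Wenling City"),   # outside tr
    ]},
}
tr = (datetime(2019, 8, 10, 0, 0), datetime(2019, 8, 10, 6, 0))
nodes = event_nodes(objects, tr, sr={"Wenling City"})
print(nodes)

before = {"wind force": "level 16"}
after = {"wind force": "level 14", "movement": "moving north-northwest"}
print(diff_states(before, after))
```

Items present in only one of the two states appear in the difference with `None` on the missing side, which makes newly reported or dropped features visible alongside changed values.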
The preferred embodiments of the invention have been described in detail above, but the invention is not limited to these embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of this application.