CN117974340A - Social media event detection method combining deep learning classification and graph clustering - Google Patents

Publication number
CN117974340A
Authority
CN
China
Prior art keywords
message
graph
social media
clustering
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410373064.XA
Other languages
Chinese (zh)
Other versions
CN117974340B (en
Inventor
线岩团
鲁一苇
余正涛
相艳
黄于欣
王红斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202410373064.XA priority Critical patent/CN117974340B/en
Publication of CN117974340A publication Critical patent/CN117974340A/en
Application granted granted Critical
Publication of CN117974340B publication Critical patent/CN117974340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2323 Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Economics (AREA)
  • Databases & Information Systems (AREA)
  • Discrete Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a social media event detection method combining deep learning classification and graph clustering. A message heterogeneous graph is constructed from the message texts and the feature information extracted from them; multiple shared-feature edges are constructed within message pairs to obtain a multi-relation message graph; the similarity of each message pair is obtained using a deep learning classification model; if the similarity of a message pair reaches a preset threshold, an edge is constructed within the pair, yielding a message homogeneous graph; and the message homogeneous graph is clustered as the input of a graph clustering algorithm to obtain the social media event detection result. The method avoids representing social messages as vectors: it takes the heterogeneous associations of message pairs and the message texts as input, uses the deep learning classification model to judge whether a message pair belongs to the same event, constructs the message homogeneous graph from the predictions for the message pairs, and uses a graph clustering algorithm to discover closely associated clusters of social messages as social events. It can be used for event detection over massive volumes of social media data.

Description

Social media event detection method combining deep learning classification and graph clustering

Technical Field

The present invention relates to the fields of natural language processing and text mining, and in particular to a social media event detection method combining deep learning classification and graph clustering.

Background Art

With the development of the Internet, social media platforms have changed the way people live and have become their main source of information. Social media spreads messages faster and is noticeably more sensitive to new events than traditional media. It is therefore particularly important to analyze social media text in depth and to discover social media events.

Deep learning based social media event detection is the current mainstream approach. Methods of this kind use deep neural networks to learn vector representations of social messages and then discover social media events with distance-based or density-based clustering algorithms. However, because social messages are short and word co-occurrence is sparse, it is difficult for a deep learning model to represent messages of the same event as vectors that lie close together while keeping the vectors of different events far apart. Event clustering is an important step in social media event detection; it aims to organize large volumes of social media data into collections of related content about the same event, helping users better understand and track how events develop. Most clustering algorithms used for event detection rely mainly on word features, and introduce external features to address the high-dimensional sparsity of short-text clustering. Focusing too heavily on word features, however, amplifies the influence of noise and outliers, and cluster centers cannot be found in streaming data, which degrades clustering performance. For streaming social media data, commonly used clustering algorithms depend strongly on the order of arrival and are computationally expensive; they are inefficient on massive volumes of social media text, which greatly reduces model performance.

Summary of the Invention

To this end, the present invention provides a social media event detection method combining deep learning classification and graph clustering. It addresses the following problems of existing deep learning based social media event detection methods: when social messages of the same event are represented as vectors that are close in distance, it is difficult to keep the message vectors of different events far apart; the clustering algorithms focus too heavily on word features; and they are inefficient when processing massive volumes of social media text.

In order to achieve the above object, the present invention provides the following technical solutions:

According to a first aspect of the embodiments of the present invention, a social media event detection method combining deep learning classification and graph clustering is proposed, the method comprising:

extracting feature information from the message texts in a social media data stream, taking the message texts and the feature information as nodes, and connecting each message text to its extracted feature information to form edges, thereby constructing a message heterogeneous graph;

based on the message heterogeneous graph, randomly selecting two message text nodes to form a message pair, and constructing multiple shared-feature edges within each message pair, thereby constructing a multi-relation message graph;

based on the multi-relation message graph, obtaining the similarity of each message pair using a deep learning classification model;

determining whether the similarity of a message pair reaches a preset threshold and, if it does, constructing an edge between the two message text nodes that make up the pair, thereby constructing a message homogeneous graph;

clustering the constructed message homogeneous graph as the input of a graph clustering algorithm, and taking the clustering result as the detected social media events.

Further, the method also comprises:

obtaining a social media text dataset and dividing it into multiple data blocks by publication time, so as to simulate the streaming nature of social media messages and train the model continually.
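The division into data blocks can be illustrated with a minimal Python sketch. The block granularity (one calendar day per block), the field names `id`, `created_at` and `text`, and the helper name `split_into_blocks` are assumptions made for illustration; the text above only states that messages are divided by publication time.

```python
from collections import defaultdict
from datetime import datetime


def split_into_blocks(messages):
    """Group messages into time-ordered blocks, one block per calendar day.

    The one-day granularity is an assumption; any time window could be used.
    """
    blocks = defaultdict(list)
    for msg in sorted(messages, key=lambda m: m["created_at"]):
        blocks[msg["created_at"].date()].append(msg)
    return [blocks[day] for day in sorted(blocks)]


# Hypothetical toy messages, deliberately out of order.
messages = [
    {"id": 2, "created_at": datetime(2024, 3, 2, 9, 0), "text": "b"},
    {"id": 1, "created_at": datetime(2024, 3, 1, 8, 0), "text": "a"},
    {"id": 3, "created_at": datetime(2024, 3, 2, 18, 30), "text": "c"},
]
blocks = split_into_blocks(messages)
block_ids = [[m["id"] for m in block] for block in blocks]
```

Each block can then be fed to the model in chronological order, which is one way to mimic a message stream during training.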

Further, extracting feature information from the message texts in the social media data stream, taking the message texts and the feature information as nodes, and connecting each message text to its extracted feature information to form edges, thereby constructing the message heterogeneous graph, specifically comprises:

given a single message text, extracting the feature information of the message from the text, the feature information including entity information, user information, hashtag information and time information;

defining the message heterogeneous graph by its node set and its edge set, where the node set contains the social media message texts themselves and the various types of event-related feature information they contain, and the edge set contains the edges between messages and their corresponding features.
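As a rough illustration of this construction, the sketch below builds the node set and edge set from messages whose features are assumed to have been extracted already. The dictionary layout, the type names and the function name `build_heterogeneous_graph` are hypothetical; the extraction step itself (for example named-entity recognition) is not shown.

```python
FEATURE_TYPES = ("entity", "user", "hashtag", "time")


def build_heterogeneous_graph(messages):
    """Return (nodes, edges) of a message heterogeneous graph.

    Nodes are (type, value) tuples; every edge links a message node
    to one of the feature nodes extracted from that message.
    """
    nodes, edges = set(), set()
    for msg in messages:
        m_node = ("message", msg["id"])
        nodes.add(m_node)
        for ftype in FEATURE_TYPES:
            for value in msg.get(ftype, []):
                f_node = (ftype, value)
                nodes.add(f_node)
                edges.add((m_node, f_node))
    return nodes, edges


# Hypothetical pre-extracted features for two messages.
sample = [
    {"id": 1, "entity": ["Kunming"], "user": ["alice"],
     "hashtag": ["#flood"], "time": ["2024-03-01"]},
    {"id": 2, "entity": ["Kunming"], "user": ["bob"],
     "hashtag": ["#flood"], "time": ["2024-03-01"]},
]
nodes, edges = build_heterogeneous_graph(sample)
```

Because feature nodes are keyed by type and value, two messages that mention the same entity or hashtag end up connected to the same feature node, which is what later allows shared-feature edges to be derived.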

Further, based on the message heterogeneous graph, randomly selecting two message text nodes to form a message pair, and constructing multiple shared-feature edges within each message pair, thereby constructing the multi-relation message graph, specifically comprises:

defining the multi-relation message graph in terms of a set of message text nodes and, for each shared-feature relation, a set of edges;

where the message text node set contains all message text nodes and its size is the number of message text nodes; each node has its own feature representation, expressed as a fixed-dimensional feature vector; the node types are message, user, entity, hashtag and time information; the features of all nodes form the node feature set; and for each message text node, another node is randomly sampled from the other message texts to form a message pair;

when message texts share feature information of different types, edges belonging to the corresponding shared-feature relations are established separately, so that an edge within a message pair can be associated with several shared-feature relations;

each such edge is defined from a submatrix of the adjacency matrix of the message heterogeneous graph and the transpose of that submatrix, where the rows of the submatrix represent all feature information nodes and the columns represent all message nodes belonging to the relation.
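The idea of shared-feature edges can be sketched with plain set intersections. This is an illustrative simplification, not the matrix form described above: all pairs are enumerated instead of being randomly sampled, so that the result is deterministic, and the names `RELATIONS` and `shared_feature_edges` are hypothetical.

```python
from itertools import combinations

RELATIONS = ("entity", "user", "hashtag", "time")


def shared_feature_edges(messages):
    """Map each message pair to the set of relations under which it shares a feature."""
    edges = {}
    for a, b in combinations(messages, 2):
        shared = {
            r for r in RELATIONS
            if set(a.get(r, [])) & set(b.get(r, []))
        }
        if shared:
            edges[(a["id"], b["id"])] = shared
    return edges


# Hypothetical messages: 1 and 2 overlap, 3 shares nothing with either.
msgs = [
    {"id": 1, "entity": ["Kunming"], "user": ["alice"],
     "hashtag": ["#flood"], "time": ["2024-03-01"]},
    {"id": 2, "entity": ["Kunming"], "user": ["bob"],
     "hashtag": ["#flood"], "time": ["2024-03-01"]},
    {"id": 3, "entity": ["Paris"], "user": ["carol"],
     "hashtag": ["#expo"], "time": ["2024-03-05"]},
]
pair_relations = shared_feature_edges(msgs)
```

A pair that shares an entity, a hashtag and a time stamp thus carries three relation labels on a single edge, which is the sense in which one edge is associated with several shared-feature relations.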

Further, based on the multi-relation message graph, obtaining the similarity of each message pair using the deep learning classification model specifically comprises:

preprocessing the message text, including word segmentation and stop-word removal, and then applying word embedding to the message text with the BERT pre-trained language model, so that each word is represented by a fixed-dimensional word vector;

embedding the message text vectors to obtain an embedding vector for each of the two message texts in a pair, and concatenating the embedding vectors of the two message texts through the encoding layer to obtain the encoding vector of the message pair;

where the combining operator denotes vector concatenation;

passing the resulting encoding vector through a linear transformation layer to obtain a low-dimensional vector, and then through an activation function to obtain the similarity of the two message texts;

where the remaining symbols denote the parameters of the model and the activation function.
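The scoring step (concatenate two embeddings, apply a linear transformation, apply an activation) can be sketched as follows. The sigmoid activation is an assumption, since the name of the activation function is not recoverable from the text; the encoding layer is omitted; and the embeddings, weights and function names are hypothetical toy values rather than trained BERT outputs.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def pair_similarity(h_i, h_j, weights, bias):
    """Concatenate two embeddings, apply one linear unit, squash to (0, 1)."""
    pair_vec = list(h_i) + list(h_j)  # vector concatenation
    z = sum(w * x for w, x in zip(weights, pair_vec)) + bias
    return sigmoid(z)


# Toy 2-dimensional embeddings and hand-picked parameters (not trained).
h_a = [1.0, 0.0]
h_b = [1.0, 0.0]
score = pair_similarity(h_a, h_b, weights=[1.0, -1.0, 1.0, -1.0], bias=0.0)
neutral = pair_similarity(h_a, h_b, weights=[0.0, 0.0, 0.0, 0.0], bias=0.0)
```

In a trained model the weights and bias would be learned from labelled message pairs; a score near 1 would then indicate that the two messages are predicted to belong to the same event.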

Further, determining whether the similarity of a message pair reaches the preset threshold and, if it does, constructing an edge between the two message text nodes that make up the pair, thereby constructing the message homogeneous graph, specifically comprises:

the message homogeneous graph still retains all the common features of the message heterogeneous graph;

where the adjacency matrix of the message homogeneous graph has one row and one column for each message node in the graph; for each node type, the corresponding submatrix of the adjacency matrix of the message heterogeneous graph contains the rows of message type and the columns of that node type, and is used together with its transpose; if two message text nodes are both connected to some node of that type, the corresponding entry computed from the submatrix and its transpose is greater than or equal to 1, and the corresponding entry of the adjacency matrix is then equal to 1;

for a message pair formed by two message texts, determining whether the similarity of the pair reaches the preset threshold, and constructing an edge between the two message texts when it does; once all message pairs have been judged, the initial message homogeneous graph is formed, consisting of the set of message text nodes and the set of edges.
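The thresholding step can be illustrated with a short sketch. The threshold value 0.5, the score dictionary and the function name `build_homogeneous_graph` are hypothetical; "reaches the threshold" is read here as greater than or equal to.

```python
def build_homogeneous_graph(message_ids, pair_scores, threshold=0.5):
    """Build an undirected adjacency map, keeping pairs whose score reaches the threshold."""
    adjacency = {m: set() for m in message_ids}
    for (i, j), score in pair_scores.items():
        if score >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


# Hypothetical similarity scores for three message pairs.
scores = {(1, 2): 0.9, (2, 3): 0.4, (1, 3): 0.5}
graph = build_homogeneous_graph([1, 2, 3], scores, threshold=0.5)
```

Only message nodes remain in the resulting graph, and its edges encode the classifier's same-event predictions, which is what the graph clustering step takes as input.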

Further, clustering the constructed message homogeneous graph as the input of the graph clustering algorithm, and taking the clustering result as the detected social media events, specifically comprises:

expanding communities according to the transitivity of feature associations between messages, then having the message nodes play multiple rounds of a game in which they select better communities until the communities reach a stable state, and taking the clustering result of the community partition as the detected social media events.

Further, expanding communities according to the transitivity of feature associations between messages, then having the message nodes play multiple rounds of a game to select better communities until the communities reach a stable state, specifically comprises:

expanding the communities: if there are two edges among three nodes, they form a half triangle; by the triadic closure principle, two nodes that are each related to a common node are themselves related, so the three nodes are considered to be implicitly associated, that is, they belong to the same community;

iterating for multiple rounds, in each of which every node makes a better choice given the current community partition, from three options: stay in the current community; leave the current community without joining any other; or leave the current community and join another one;

after multiple rounds of the game the algorithm reaches a Nash equilibrium, that is, every node has joined the community it is most satisfied with; a threshold is set to judge whether the algorithm has reached Nash equilibrium.
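The three steps above can be sketched as a toy implementation. The payoff function is not specified in the text, so `utility` below (one point per linked community member, minus a penalty per unlinked member) is purely a hypothetical choice, as are the names `expand_by_triadic_closure` and `play_rounds`, the round limit, and the stopping tolerance `tol`.

```python
def expand_by_triadic_closure(adj):
    """A node with two or more neighbours closes a half triangle, so it and
    its neighbours are merged into one community (union-find)."""
    parent = {v: v for v in adj}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v, nbrs in adj.items():
        if len(nbrs) >= 2:
            for u in nbrs:
                parent[find(u)] = find(v)
    return {v: find(v) for v in adj}


def utility(node, members, adj, penalty):
    """Hypothetical payoff: +1 per linked member, -penalty per unlinked member."""
    others = members - {node}
    linked = len(others & adj[node])
    return linked - penalty * (len(others) - linked)


def play_rounds(adj, label, penalty=1.0, max_rounds=20, tol=0):
    """Best-response rounds; stop when at most `tol` nodes move in a round."""
    label = dict(label)
    for _ in range(max_rounds):
        changes = 0
        for v in sorted(adj):
            groups = {}
            for u, c in label.items():
                groups.setdefault(c, set()).add(u)
            # Option 1: stay. Option 2: leave into a fresh community.
            options = {label[v]: utility(v, groups[label[v]], adj, penalty)}
            options.setdefault(("new", v), 0.0)
            # Option 3: join the community of a neighbour.
            for u in adj[v]:
                c = label[u]
                if c not in options:
                    options[c] = utility(v, groups[c] | {v}, adj, penalty)
            best = max(options, key=options.get)
            if options[best] > options[label[v]]:
                label[v] = best
                changes += 1
        if changes <= tol:
            break
    return label


# Two triangles joined by a single bridge edge (3-4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
initial = expand_by_triadic_closure(adj)
final = play_rounds(adj, initial)
```

On this toy graph the closure step merges all six nodes into one community, and the game rounds then separate the two triangles, since under the assumed payoff each node is better off with only its directly linked neighbours. The sketch recomputes the community membership for every node and is meant for illustration, not for large graphs.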

According to a second aspect of the embodiments of the present invention, a social media event detection system combining deep learning classification and graph clustering is proposed, the system comprising:

a message heterogeneous graph construction module, configured to extract feature information from the message texts in a social media data stream, take the message texts and the feature information as nodes, and connect each message text to its extracted feature information to form edges, thereby constructing a message heterogeneous graph;

a multi-relation message graph construction module, configured to randomly select, based on the message heterogeneous graph, two message text nodes to form a message pair, and to construct multiple shared-feature edges within each message pair, thereby constructing a multi-relation message graph;

a deep learning classification module, configured to obtain, based on the multi-relation message graph, the similarity of each message pair using a deep learning classification model;

a message homogeneous graph construction module, configured to determine whether the similarity of a message pair reaches a preset threshold and, if it does, to construct an edge between the two message text nodes that make up the pair, thereby constructing a message homogeneous graph;

a graph clustering module, configured to cluster the constructed message homogeneous graph as the input of a graph clustering algorithm, and to take the clustering result as the detected social media events.

According to a third aspect of the embodiments of the present invention, an electronic device is proposed, the device comprising a processor and a memory;

the memory is configured to store one or more program instructions;

the processor is configured to run the one or more program instructions so as to perform the steps of the social media event detection method combining deep learning classification and graph clustering according to any of the above.

According to a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is proposed, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the social media event detection method combining deep learning classification and graph clustering according to any of the above.

The present invention proposes a social media event detection method combining deep learning classification and graph clustering. Feature information is extracted from the message texts in a social media data stream, the message texts and the feature information are taken as nodes, and each message text is connected to its extracted feature information to form edges, constructing a message heterogeneous graph. Based on the message heterogeneous graph, two message text nodes are randomly selected to form a message pair, and multiple shared-feature edges are constructed within each message pair, constructing a multi-relation message graph. Based on the multi-relation message graph, the similarity of each message pair is obtained using a deep learning classification model. Whether the similarity of a message pair reaches a preset threshold is determined; if it does, an edge is constructed between the two message text nodes that make up the pair, constructing a message homogeneous graph. The constructed message homogeneous graph is clustered as the input of a graph clustering algorithm, and the clustering result is taken as the detected social media events.
The present invention combines a deep learning classification model with a graph clustering algorithm and avoids representing social messages as vectors. It takes the heterogeneous associations of message pairs and the message texts as input, uses the deep learning classification model to judge whether a message pair belongs to the same event, constructs the message homogeneous graph from the predictions for the message pairs, and uses a graph clustering algorithm to discover closely associated clusters of social messages as social events. Verification shows that the performance indicators of the present invention are better than those of the baseline models. The present invention keeps computational difficulty and complexity in check and can be used for event detection over massive volumes of social media data.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely exemplary; a person of ordinary skill in the art can derive other drawings from them without creative effort.

The structures, proportions and sizes shown in this specification serve only to accompany the content disclosed in the specification, for the understanding of those familiar with the technology. They do not limit the conditions under which the present invention can be implemented and therefore carry no substantive technical meaning. Any modification of structure, change of proportion or adjustment of size that does not affect the effects and purposes the present invention can achieve still falls within the scope covered by the technical content disclosed herein.

FIG. 1 is a flow chart of a social media event detection method combining deep learning classification and graph clustering provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the principle of a social media event detection method combining deep learning classification and graph clustering provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a social media event detection system combining deep learning classification and graph clustering provided by an embodiment of the present invention.

Detailed Description of the Embodiments

The implementation of the present invention is described below through specific embodiments. Those familiar with the technology can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

The first embodiment of the present invention provides a social media event detection method combining deep learning classification and graph clustering, which is described below with reference to FIG. 1 and FIG. 2.

As shown in FIG. 1, in step S101, feature information is extracted from the message texts in a social media data stream, the message texts and the feature information are taken as nodes, and each message text is connected to its extracted feature information to form edges, constructing a message heterogeneous graph.

In this embodiment, the social media messages are divided into data blocks by publication time, so as to simulate the streaming nature of social media messages for continual training. The entity information, user information, hashtag information and time information in the messages are extracted as features to construct the message heterogeneous graph.

Specifically, the following four types of features are extracted from messages to make maximal use of the social media data, and the extracted features are further processed in a unified manner to construct the message heterogeneous graph. Given a single message text, the entity information, user information, hashtag information and time information of the message are extracted from the text. These features and the social media message text itself are taken as nodes, and edges are formed between the message and its extracted feature information, giving the message heterogeneous graph.

The message heterogeneous graph is defined by its node set and its edge set, where the node set contains the social media message texts themselves and the various types of event-related feature information they contain, and the edge set contains the edges between messages and their corresponding features.

As shown in FIG. 1, in step S102, based on the message heterogeneous graph, two message text nodes are randomly selected to form a message pair, and multiple shared-feature edges are constructed within each message pair, constructing a multi-relation message graph.

In this embodiment, the features in the message heterogeneous graph are mapped onto the message pairs to prevent the loss of heterogeneous feature information between event elements of different types. Multiple feature edges are constructed within each message pair to form the multi-relation message graph, which preserves the rich features of social media messages.

Specifically, the multi-relation message graph is defined over the set of nodes in one data block, whose size is the number of nodes. Each node has its own feature representation, expressed as a fixed-dimensional feature vector. The node types are message, user, entity, hashtag and time information, and the features of all nodes form the node feature set. For each message, another message is randomly sampled from the other messages to form a message pair.

当消息共享不同类型的特征元素时分别建立属于不同特征关系的边,是消息对之间的边,边可以关联多种关系,总共有四种不同类型的关 系;When messages share different types of feature elements, edges belonging to different feature relationships are established respectively. Yes, the message is The edges between them can be associated with multiple relationships, with a total of four different types of relationships;

其中是消息异构图的邻接矩阵的子矩阵,行代表所有的特征信息节点,列代 表属于关系的所有消息节点,是矩阵的转置,取二者中较小的一个。 in It is a submatrix of the adjacency matrix of the message heterogeneous graph. The rows represent all feature information nodes and the columns represent the nodes belonging to the relationship. All message nodes, is the transpose of the matrix, taking the smaller of the two.
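A small sketch may make the relation test concrete. The exact formula is not legible in this text, so the sketch assumes a reading in which the number of features two messages share under one relation (the corresponding entry of the submatrix-transpose product for a binary matrix) is clipped with `min(count, 1)`. The names `RELATIONS` and `shared_relations` and the per-message feature sets are hypothetical.

```python
RELATIONS = ("user", "entity", "tag", "time")  # the four relation types


def shared_relations(features_a, features_b):
    """List the relations under which two messages share at least one feature.

    For binary incidence data, len(a & b) equals the matching entry of the
    submatrix-transpose product; min(..., 1) clips it to an edge indicator.
    This clipping rule is an assumed reading of the source.
    """
    found = []
    for relation in RELATIONS:
        shared_count = len(features_a[relation] & features_b[relation])
        if min(shared_count, 1) == 1:
            found.append(relation)
    return found


features = {
    1: {"user": {"alice"}, "entity": {"Kunming"}, "tag": {"flood"}, "time": {"2012-10-10"}},
    2: {"user": {"bob"}, "entity": {"Kunming", "Yunnan"}, "tag": set(), "time": {"2012-10-10"}},
}
pair_relations = shared_relations(features[1], features[2])
```

Here the pair (1, 2) receives one edge for the entity relation and one for the time relation, so the pair carries two of the four possible relation edges.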

As shown in FIG. 1, in step S103, based on the multi-relation message graph, a deep learning classification model is used to obtain the similarity of each message pair.

In this embodiment, the message text and its features are processed with the BERT pre-trained language model to obtain the embedding vector of each message text. The embedding vectors of randomly paired texts are concatenated to form a message pair, which is fed into an encoding layer to obtain the encoding vector of the pair. This encoding vector is then passed through a linear transformation layer to obtain a low-dimensional vector, and an activation function computes the similarity of the message pair.

Specifically, before the text is input, it is preprocessed by word segmentation and stop-word removal. The BERT pre-trained language model then performs word embedding on the message text, representing each word as a fixed-dimensional word vector. The message text vectors are embedded to obtain an embedding vector for each of the two message texts in a pair, and the two embedding vectors are joined by vector concatenation and passed through the encoding layer to obtain the encoding vector of the message pair.

The encoding vector is passed through a linear transformation layer to obtain a low-dimensional vector, and an activation function then yields the similarity of the two message texts. The parameters of these layers are the parameters of the model.
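The scoring path (concatenate, encode, linear layer, activation) can be sketched in plain Python. Several points are assumptions made only for illustration: the toy embeddings stand in for BERT outputs, the encoding layer is modeled as a single dense layer with `tanh`, and the final activation is taken to be a sigmoid, which the text does not name. The helper names `dense` and `pair_similarity` are hypothetical.

```python
import math


def dense(x, weights, bias):
    """Apply one linear layer: weights is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def pair_similarity(emb_i, emb_j, enc_w, enc_b, out_w, out_b):
    """Score a message pair in (0, 1).

    emb_i, emb_j: embedding vectors of the two messages (toy stand-ins for BERT).
    enc_w, enc_b: assumed encoding layer; out_w, out_b: linear transformation layer.
    """
    pair = emb_i + emb_j                                   # vector concatenation
    encoded = [math.tanh(v) for v in dense(pair, enc_w, enc_b)]
    logit = dense(encoded, out_w, out_b)[0]                # low-dimensional output
    return 1.0 / (1.0 + math.exp(-logit))                  # sigmoid (assumed activation)


score = pair_similarity(
    emb_i=[0.2, 0.4],
    emb_j=[0.1, 0.3],
    enc_w=[[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2]],
    enc_b=[0.0, 0.1],
    out_w=[[1.0, -1.0]],
    out_b=[0.0],
)
```

In a real system the weights would be learned from labeled message pairs; the values above are arbitrary and only show the data flow.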

As shown in FIG. 1, in step S104, it is determined whether the similarity of each message pair reaches a preset threshold. If it does, an edge is constructed between the two message text nodes of the pair, yielding a homogeneous message graph.

In this embodiment, to strengthen the association between messages, the many kinds of feature information are merged into a single feature, and the message heterogeneous graph is mapped to a homogeneous message graph so that a graph clustering algorithm can be applied. The similarity of a message pair is the sole basis for constructing the homogeneous message graph, which contains only message nodes and only one kind of edge between messages, an edge that stands for all of their shared elements.

Specifically, the homogeneous message graph still retains all the common features of the message heterogeneous graph. Its adjacency matrix is a square matrix whose size is the total number of message nodes in the graph. For each node type, the adjacency matrix of the message heterogeneous graph has a submatrix containing the rows of the message type and the columns of that node type. This submatrix is combined with its transpose, and the smaller of each resulting entry and 1 is taken. Thus, if two messages are both connected to some node of a given type, the corresponding entry is greater than or equal to 1, and the corresponding entry of the adjacency matrix equals 1.

For each pair of messages, their similarity is evaluated, and when the similarity reaches the threshold an edge is constructed between the two messages. After all pairs have been evaluated, the initial graph is formed from the resulting node set and edge set.
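The thresholding step can be sketched as follows. The function name `build_homogeneous_graph`, the score dictionary layout, and the threshold value 0.75 are illustrative assumptions; "reaches the threshold" is read here as greater than or equal.

```python
def build_homogeneous_graph(message_ids, pair_scores, threshold):
    """Keep one undirected edge per message pair whose similarity reaches the threshold."""
    nodes = set(message_ids)
    edges = {
        frozenset(pair)                      # undirected: (1, 2) and (2, 1) are the same edge
        for pair, score in pair_scores.items()
        if score >= threshold
    }
    return nodes, edges


pair_scores = {(1, 2): 0.91, (2, 3): 0.40, (1, 3): 0.75}
graph_nodes, graph_edges = build_homogeneous_graph([1, 2, 3], pair_scores, threshold=0.75)
```

With these toy scores, messages 1 and 2 and messages 1 and 3 are linked, while the pair (2, 3) falls below the threshold and gets no edge.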

As shown in FIG. 1, in step S105, the constructed homogeneous message graph is used as the input of a graph clustering algorithm, and the clustering result is taken as the detected social media events.

In this embodiment, the constructed homogeneous message graph is clustered by a graph clustering algorithm. Communities are first expanded according to the transitivity of feature associations between messages. The message nodes then play multiple rounds of a game in which each selects a better community, until the communities reach a stable state. The resulting community partition is the clustering result, which realizes social media event detection.

The specific process is as follows. First, the communities are enlarged. If three nodes are joined by two edges, they form a half-triangle. By the triadic closure principle, two nodes that are both related to a common node are also related to each other, so an implicit association is considered to exist among the three nodes, that is, all three belong to the same community.

Next, multiple rounds of iteration are carried out. In each round, every node can make the choice that is best for itself given the current community partition. There are three choices: stay in the current community; leave the current community without joining any other; or leave the current community and join another one.

After multiple rounds of the game, the algorithm reaches a Nash equilibrium, in which every node has joined the community it is most satisfied with. A threshold is set to decide whether the algorithm has reached this state.
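The two phases above can be sketched as follows. The text does not specify the payoff a node uses to rank communities, so the `utility` function here (linked members minus a penalty for unlinked members) is a hypothetical stand-in, as are the function names and the `penalty`, `stop_ratio`, and `max_rounds` parameters. The stopping rule, the share of nodes that moved in a round, is one assumed way to apply a termination threshold.

```python
def expand_by_triadic_closure(nodes, edges):
    """Merge every half-triangle (a node with two or more neighbors) into one community."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for center, nbrs in neighbors.items():
        if len(nbrs) >= 2:                 # center plus two neighbors form a half-triangle
            for nb in nbrs:
                parent[find(nb)] = find(center)
    return {n: find(n) for n in nodes}, neighbors


def utility(node, label, community, neighbors, penalty):
    """Hypothetical payoff: linked members minus a penalty per unlinked member."""
    members = [n for n, l in community.items() if l == label and n != node]
    linked = sum(1 for n in members if n in neighbors[node])
    return linked - penalty * (len(members) - linked)


def play_rounds(community, neighbors, penalty=0.5, stop_ratio=0.001, max_rounds=50):
    """Best-response rounds: each node may stay, go solo, or join a neighbor's community."""
    community = dict(community)
    for _ in range(max_rounds):
        moved = 0
        for node in sorted(community):
            current = community[node]
            options = {("solo", node)} | {community[nb] for nb in neighbors[node]}
            best, best_u = current, utility(node, current, community, neighbors, penalty)
            for label in sorted(options, key=repr):
                u = utility(node, label, community, neighbors, penalty)
                if u > best_u:             # move only on strict improvement
                    best, best_u = label, u
            if best != current:
                community[node] = best
                moved += 1
        if moved / len(community) <= stop_ratio:
            break                          # nearly no node wants to move: treat as stable
    return community


nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (1, 3), (4, 5)]
initial, neighbors = expand_by_triadic_closure(nodes, edges)
final = play_rounds(initial, neighbors)
```

In this toy graph the triangle 1, 2, 3 is merged during expansion, and nodes 4 and 5, which share only a single edge, end up together after the first game round because joining each other strictly improves their payoff.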

To illustrate the effect of the present invention, it is compared with existing methods in experiments on the large-scale public dataset Event2012, whose messages span about 29 days. The messages of the first week form the initial block, and the remaining messages are divided by publication time into 21 message blocks. The evaluation metrics are the same as those of the compared methods: normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted Rand index (ARI) are used to evaluate the clustering results. The termination threshold of the clustering algorithm is set to 0.1%. The experimental results are shown in Tables 1 to 3:

Table 1 Incremental evaluation NMI scores

| Blocks | Word2vec | LDA | WMD | BERT | BiLSTM | PP-GCN | EventX | KPGNN | FinEvent | ours |
|---|---|---|---|---|---|---|---|---|---|---|
| M1 | .19±.00 | .11±.00 | .32±.00 | .36±.00 | .24±.00 | .23±.00 | .36±.00 | .39±.00 | .84±.01 | 0.47±.00 ↓.36 |
| M2 | .50±.00 | .27±.01 | .71±.00 | .78±.00 | .50±.00 | .57±.02 | .68±.00 | .79±.01 | .84±.00 | 0.92±.01 ↑.08 |
| M3 | .39±.00 | .28±.00 | .67±.00 | .75±.00 | .39±.00 | .55±.01 | .63±.00 | .76±.00 | .89±.00 | 0.87±.01 ↓.01 |
| M4 | .34±.00 | .25±.00 | .50±.00 | .60±.00 | .40±.00 | .46±.01 | .63±.00 | .67±.00 | .71±.01 | 0.84±.00 ↑.13 |
| M5 | .41±.00 | .26±.00 | .61±.00 | .72±.00 | .41±.00 | .48±.01 | .59±.00 | .73±.01 | .83±.00 | 0.84±.00 ↑.01 |
| M6 | .53±.00 | .32±.00 | .61±.00 | .78±.00 | .50±.00 | .57±.01 | .70±.00 | .82±.01 | .83±.00 | 0.94±.00 ↑.11 |
| M7 | .25±.00 | .18±.01 | .46±.00 | .54±.00 | .33±.00 | .37±.00 | .51±.00 | .55±.01 | .73±.01 | 0.64±.02 ↓.08 |
| M8 | .46±.00 | .37±.01 | .67±.00 | .79±.00 | .49±.00 | .55±.02 | .71±.00 | .80±.00 | .87±.02 | 0.90±.00 ↑.03 |
| M9 | .35±.00 | .34±.00 | .55±.00 | .70±.00 | .43±.00 | .51±.02 | .67±.00 | .74±.02 | .79±.01 | 0.88±.01 ↑.09 |
| M10 | .51±.00 | .44±.01 | .61±.00 | .74±.00 | .50±.00 | .55±.02 | .68±.00 | .80±.01 | .82±.01 | 0.95±.00 ↑.13 |
| M11 | .37±.00 | .33±.01 | .50±.00 | .68±.00 | .49±.00 | .50±.01 | .65±.00 | .74±.01 | .75±.00 | 0.90±.00 ↑.15 |
| M12 | .30±.00 | .22±.01 | .60±.00 | .59±.00 | .39±.00 | .45±.01 | .61±.00 | .68±.01 | .67±.01 | 0.81±.00 ↑.14 |
| M13 | .37±.00 | .27±.00 | .54±.00 | .63±.00 | .46±.00 | .47±.01 | .58±.00 | .69±.01 | .79±.00 | 0.84±.01 ↑.05 |
| M14 | .36±.00 | .21±.00 | .66±.00 | .64±.00 | .44±.00 | .44±.01 | .57±.00 | .69±.00 | .82±.00 | 0.79±.01 ↓.03 |
| M15 | .27±.00 | .21±.00 | .51±.00 | .54±.00 | .40±.00 | .39±.01 | .49±.00 | .58±.00 | .69±.01 | 0.85±.01 ↑.16 |
| M16 | .49±.00 | .35±.01 | .60±.00 | .75±.00 | .53±.00 | .55±.01 | .62±.00 | .79±.01 | .90±.01 | 0.91±.01 ↑.01 |
| M17 | .33±.00 | .19±.00 | .55±.00 | .63±.00 | .45±.00 | .48±.00 | .58±.00 | .70±.01 | .83±.00 | 0.84±.00 ↑.01 |
| M18 | .29±.00 | .18±.00 | .63±.00 | .57±.00 | .44±.00 | .47±.01 | .59±.00 | .68±.02 | .74±.01 | 0.85±.01 ↑.11 |
| M19 | .37±.00 | .29±.01 | .54±.00 | .66±.00 | .44±.00 | .51±.02 | .60±.00 | .73±.01 | .66±.01 | 0.87±.00 ↑.21 |
| M20 | .38±.00 | .35±.00 | .58±.00 | .68±.00 | .48±.00 | .51±.01 | .67±.00 | .72±.02 | .80±.00 | 0.85±.02 ↑.05 |
| M21 | .31±.00 | .19±.00 | .58±.00 | .59±.00 | .41±.00 | .41±.02 | .53±.00 | .60±.00 | .74±.01 | 0.75±.01 ↑.01 |
| AVG | 0.3700 | 0.2671 | 0.5742 | 0.6533 | 0.4345 | 0.4771 | 0.8024 | 0.6876 | 0.7876 | 0.8338 |

Table 2 Incremental evaluation AMI scores

| Blocks | Word2vec | LDA | WMD | BERT | BiLSTM | PP-GCN | EventX | KPGNN | FinEvent | ours |
|---|---|---|---|---|---|---|---|---|---|---|
| M1 | .08±.00 | .08±.00 | .30±.00 | .34±.00 | .12±.00 | .21±.00 | .06±.00 | .37±.00 | .84±.01 | 0.40±.01 ↓.43 |
| M2 | .41±.00 | .20±.01 | .69±.00 | .76±.00 | .41±.00 | .55±.02 | .29±.02 | .78±.01 | .84±.01 | 0.89±.00 ↑.05 |
| M3 | .31±.00 | .22±.01 | .63±.00 | .73±.00 | .31±.00 | .52±.01 | .18±.01 | .74±.00 | .89±.01 | 0.82±.00 ↓.06 |
| M4 | .24±.00 | .17±.00 | .45±.00 | .55±.00 | .30±.00 | .42±.01 | .19±.01 | .64±.01 | .69±.00 | 0.77±.00 ↑.08 |
| M5 | .33±.00 | .21±.00 | .57±.00 | .71±.00 | .33±.00 | .46±.01 | .14±.00 | .71±.01 | .82±.00 | 0.79±.02 ↓.03 |
| M6 | .40±.00 | .20±.00 | .57±.00 | .74±.00 | .36±.00 | .52±.02 | .27±.00 | .79±.01 | .82±.02 | 0.90±.00 ↑.08 |
| M7 | .13±.00 | .12±.01 | .46±.00 | .50±.00 | .20±.00 | .34±.00 | .13±.00 | .51±.01 | .72±.00 | 0.55±.03 ↓.14 |
| M8 | .33±.00 | .24±.01 | .63±.00 | .75±.00 | .35±.00 | .49±.02 | .21±.00 | .76±.01 | .87±.01 | 0.83±.00 ↓.03 |
| M9 | .24±.00 | .24±.00 | .46±.00 | .66±.00 | .32±.00 | .46±.02 | .19±.00 | .71±.02 | .78±.01 | 0.81±.01 ↑.03 |
| M10 | .39±.00 | .36±.01 | .57±.00 | .70±.00 | .39±.00 | .51±.02 | .24±.00 | .78±.01 | .81±.00 | 0.92±.00 ↑.11 |
| M11 | .26±.00 | .25±.01 | .42±.00 | .65±.00 | .37±.00 | .46±.01 | .24±.00 | .71±.01 | .74±.00 | 0.86±.00 ↑.12 |
| M12 | .23±.00 | .16±.01 | .58±.00 | .56±.00 | .32±.00 | .42±.01 | .16±.00 | .66±.01 | .67±.02 | 0.72±.01 ↑.05 |
| M13 | .23±.00 | .19±.00 | .50±.00 | .59±.00 | .31±.00 | .43±.01 | .16±.00 | .67±.01 | .79±.00 | 0.80±.00 ↑.01 |
| M14 | .26±.00 | .15±.00 | .64±.00 | .61±.00 | .34±.00 | .41±.01 | .14±.00 | .65±.00 | .82±.01 | 0.72±.01 ↓.09 |
| M15 | .15±.00 | .13±.00 | .47±.00 | .50±.00 | .26±.00 | .35±.01 | .07±.00 | .54±.00 | .67±.01 | 0.80±.00 ↑.13 |
| M16 | .36±.00 | .27±.01 | .59±.00 | .72±.00 | .41±.00 | .52±.01 | .19±.00 | .77±.01 | .90±.01 | 0.88±.02 ↓.01 |
| M17 | .24±.00 | .13±.00 | .57±.00 | .60±.00 | .35±.00 | .45±.00 | .18±.00 | .68±.01 | .82±.01 | 0.79±.00 ↓.02 |
| M18 | .21±.00 | .12±.00 | .60±.00 | .53±.00 | .35±.00 | .45±.01 | .16±.00 | .66±.02 | .74±.00 | 0.81±.00 ↑.07 |
| M19 | .28±.00 | .22±.01 | .49±.00 | .63±.00 | .35±.00 | .48±.02 | .16±.00 | .71±.01 | .66±.01 | 0.83±.01 ↑.17 |
| M20 | .24±.00 | .23±.00 | .55±.00 | .62±.00 | .34±.00 | .45±.02 | .18±.00 | .68±.02 | .78±.00 | 0.76±.00 ↓.02 |
| M21 | .21±.00 | .13±.00 | .52±.00 | .57±.00 | .31±.00 | .38±.02 | .10±.00 | .57±.00 | .64±.01 | 0.69±.01 ↑.05 |
| AVG | 0.2633 | 0.1914 | 0.5362 | 0.6200 | 0.3238 | 0.4419 | 0.1733 | 0.6710 | 0.7767 | 0.7781 |

Table 3 Incremental evaluation ARI scores

| Blocks | Word2vec | LDA | WMD | BERT | BiLSTM | PP-GCN | EventX | KPGNN | FinEvent | ours |
|---|---|---|---|---|---|---|---|---|---|---|
| M1 | .01±.00 | .01±.00 | .04±.00 | .03±.00 | .03±.00 | .05±.00 | .01±.00 | .07±.01 | .90±.00 | 0.11±.01 ↓.78 |
| M2 | .49±.00 | .08±.00 | .48±.00 | .64±.00 | .49±.00 | .67±.03 | .45±.02 | .76±.02 | .90±.01 | 0.91±.00 ↑.01 |
| M3 | .16±.00 | .02±.01 | .28±.00 | .43±.00 | .17±.00 | .47±.01 | .09±.01 | .58±.00 | .89±.01 | 0.78±.01 ↓.10 |
| M4 | .07±.00 | .07±.00 | .11±.00 | .19±.00 | .11±.00 | .24±.01 | .07±.01 | .29±.01 | .27±.01 | 0.56±.01 ↑.29 |
| M5 | .17±.00 | .06±.00 | .26±.00 | .44±.00 | .19±.00 | .34±.00 | .04±.00 | .47±.03 | .63±.02 | 0.76±.01 ↑.13 |
| M6 | .25±.00 | .07±.01 | .16±.00 | .44±.00 | .18±.00 | .55±.03 | .14±.00 | .72±.03 | .74±.00 | 0.90±.00 ↑.16 |
| M7 | .02±.00 | .01±.00 | .08±.00 | .07±.00 | .08±.00 | .11±.02 | .02±.00 | .12±.00 | .45±.01 | 0.15±.02 ↓.28 |
| M8 | .17±.00 | .03±.00 | .22±.00 | .50±.00 | .08±.00 | .43±.04 | .09±.00 | .60±.01 | .72±.01 | 0.79±.00 ↑.07 |
| M9 | .08±.00 | .03±.01 | .12±.00 | .33±.00 | .27±.00 | .31±.02 | .07±.00 | .46±.02 | .68±.00 | 0.63±.02 ↓.03 |
| M10 | .23±.00 | .09±.02 | .20±.00 | .44±.00 | .22±.00 | .50±.07 | .13±.00 | .70±.06 | .74±.01 | 0.88±.00 ↑.14 |
| M11 | .09±.00 | .03±.01 | .12±.00 | .27±.00 | .17±.00 | .38±.02 | .16±.00 | .49±.03 | .60±.01 | 0.93±.00 ↑.33 |
| M12 | .09±.00 | .02±.01 | .27±.00 | .31±.00 | .13±.00 | .34±.03 | .07±.00 | .48±.01 | .26±.00 | 0.60±.01 ↑.14 |
| M13 | .06±.00 | .01±.00 | .13±.00 | .14±.00 | .13±.00 | .19±.01 | .04±.00 | .29±.03 | .75±.02 | 0.73±.01 ↓.01 |
| M14 | .10±.00 | .02±.00 | .33±.00 | .30±.00 | .16±.00 | .29±.01 | .10±.00 | .42±.02 | .81±.01 | 0.58±.00 ↓.22 |
| M15 | .09±.00 | .01±.00 | .16±.00 | .10±.00 | .14±.00 | .15±.00 | .01±.00 | .17±.00 | .46±.00 | 0.86±.00 ↑.40 |
| M16 | .10±.00 | .11±.01 | .32±.00 | .41±.00 | .10±.00 | .51±.03 | .08±.00 | .66±.05 | .88±.01 | 0.85±.01 ↓.02 |
| M17 | .06±.00 | .02±.00 | .26±.00 | .24±.00 | .17±.00 | .35±.03 | .12±.00 | .43±.05 | .81±.01 | 0.70±.03 ↓.08 |
| M18 | .21±.00 | .02±.00 | .35±.00 | .24±.00 | .19±.00 | .39±.03 | .08±.00 | .47±.04 | .52±.01 | 0.76±.01 ↑.24 |
| M19 | .28±.00 | .03±.00 | .12±.00 | .32±.00 | .16±.00 | .41±.02 | .07±.00 | .51±.03 | .35±.01 | 0.75±.02 ↑.40 |
| M20 | .24±.00 | .02±.01 | .19±.00 | .33±.00 | .20±.00 | .41±.01 | .11±.00 | .51±.04 | .71±.01 | 0.67±.01 ↓.03 |
| M21 | .21±.00 | .01±.01 | .19±.00 | .18±.00 | .16±.00 | .20±.03 | .01±.00 | .20±.01 | .48±.00 | 0.51±.00 ↑.03 |
| AVG | 0.1514 | 0.0367 | 0.2090 | 0.3024 | 0.1681 | 0.3471 | 0.0933 | 0.4476 | 0.6452 | 0.6861 |

The experimental results show that, compared with the baseline model FinEvent, the proposed model improves the average NMI by 4.62 percentage points, the average AMI by 0.14 percentage points, and the average ARI by 4.09 percentage points. It can be seen that the present invention, while keeping computational difficulty and complexity under control, focuses on graph structure and uses a graph clustering algorithm to mine the connections among social media messages, which improves the performance of social media event detection.
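The reported gains match the differences between the AVG rows of Tables 1 to 3 for FinEvent and the proposed method, read as absolute differences on a 0 to 1 scale and expressed in percentage points. The short check below reproduces that arithmetic; the variable names are chosen here for illustration.

```python
# AVG rows from Tables 1 to 3: (FinEvent, ours)
averages = {
    "NMI": (0.7876, 0.8338),
    "AMI": (0.7767, 0.7781),
    "ARI": (0.6452, 0.6861),
}

# Absolute difference, scaled to percentage points and rounded to two decimals.
gains = {
    metric: round((ours - baseline) * 100, 2)
    for metric, (baseline, ours) in averages.items()
}
```

These are absolute gains, not relative ones; for example, the relative NMI improvement over 0.7876 would be larger than 4.62%.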

Corresponding to the social media event detection method combining deep learning classification and graph clustering disclosed above, an embodiment of the present invention further discloses a social media event detection system combining deep learning classification and graph clustering, as shown in FIG. 3, which specifically includes:

a message heterogeneous graph construction module, used to extract feature information from message texts in a social media data stream, take the message texts and the feature information as nodes, and connect each message text to its extracted feature information to form edges, thereby constructing a message heterogeneous graph;

a multi-relation message graph construction module, used to randomly select, based on the message heterogeneous graph, two message text nodes to form a message pair, and to construct multiple shared-feature edges within each message pair, thereby constructing a multi-relation message graph;

a deep learning classification module, used to obtain, based on the multi-relation message graph, the similarity of each message pair with a deep learning classification model;

a homogeneous message graph construction module, used to determine whether the similarity of a message pair reaches a preset threshold and, if so, to construct an edge between the two message text nodes of the pair, thereby constructing a homogeneous message graph;

a graph clustering module, used to cluster with the constructed homogeneous message graph as the input of a graph clustering algorithm, the clustering result being taken as the detected social media events.
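How the five modules chain together can be sketched as a single function that takes each module as a callable. Everything named here (`detect_events`, the parameter names, the stub lambdas, and the use of connected components as a stand-in clustering step) is a hypothetical illustration of the data flow, not the system's actual interface.

```python
def connected_components(nodes, edges):
    """Stand-in clustering: group nodes joined by edges (label propagation to the minimum)."""
    label = {n: n for n in nodes}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    groups = {}
    for node, lab in label.items():
        groups.setdefault(lab, []).append(node)
    return sorted(sorted(g) for g in groups.values())


def detect_events(messages, build_hetero, build_pairs, score_pairs, threshold, cluster):
    """Wire the modules: heterogeneous graph, pairs, scores, thresholded edges, clusters."""
    hetero = build_hetero(messages)          # message heterogeneous graph module
    pairs = build_pairs(hetero)              # multi-relation message graph module
    scores = score_pairs(pairs)              # deep learning classification module
    edges = {p for p, s in scores.items() if s >= threshold}  # homogeneous graph module
    return cluster(set(messages), edges)     # graph clustering module


events = detect_events(
    messages=[1, 2, 3],
    build_hetero=lambda msgs: {m: {} for m in msgs},                      # stub
    build_pairs=lambda g: [(a, b) for a in g for b in g if a < b],        # stub
    score_pairs=lambda pairs: {p: 0.9 if p == (1, 2) else 0.1 for p in pairs},  # stub
    threshold=0.5,
    cluster=connected_components,
)
```

Passing the modules as callables keeps each stage replaceable, which mirrors the modular decomposition of the system described above.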

It should be noted that, for a detailed description of the social media event detection system combining deep learning classification and graph clustering provided in this embodiment of the present invention, reference may be made to the description of the corresponding social media event detection method above, which is not repeated here.

In addition, an embodiment of the present invention further provides an electronic device comprising a processor and a memory. The memory is used to store one or more program instructions, and the processor is used to run the one or more program instructions to execute the steps of the social media event detection method combining deep learning classification and graph clustering described above.

It should be noted that, for a detailed description of the electronic device provided in this embodiment of the present invention, reference may be made to the description of the social media event detection method above, which is not repeated here.

In addition, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the social media event detection method combining deep learning classification and graph clustering described above are implemented.

It should be noted that, for a detailed description of the computer-readable storage medium provided in this embodiment of the present invention, reference may be made to the description of the social media event detection method above, which is not repeated here.

In the embodiments of the present invention, the processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The processor can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The processor reads the information in the storage medium and completes the steps of the above method in combination with its hardware.

The storage medium may be a memory, which may be, for example, a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.

The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory.

The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM).

The storage media described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.

Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented by a combination of hardware and software. When software is used, the corresponding functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Although the present invention has been described in detail above by way of general description and specific embodiments, it will be obvious to those skilled in the art that modifications or improvements can be made on the basis of the present invention. Therefore, such modifications or improvements made without departing from the spirit of the present invention all fall within the scope of protection claimed by the present invention.

Claims (10)

1. A social media event detection method combining deep learning classification and graph clustering, characterized in that the method comprises:
extracting feature information from message texts in a social media data stream, taking the message texts and the feature information as nodes, and connecting each message text to its extracted feature information to form edges, thereby constructing a message heterogeneous graph;
based on the message heterogeneous graph, randomly selecting two message text nodes to form a message pair, and constructing multiple shared-feature edges within each message pair, thereby constructing a multi-relation message graph;
based on the multi-relation message graph, obtaining the similarity of each message pair with a deep learning classification model;
determining whether the similarity of a message pair reaches a preset threshold and, if so, constructing an edge between the two message text nodes of the pair, thereby constructing a homogeneous message graph;
clustering with the constructed homogeneous message graph as the input of a graph clustering algorithm, and taking the clustering result as the detected social media events.

2. The social media event detection method combining deep learning classification and graph clustering according to claim 1, characterized in that the method further comprises:
obtaining a social media text dataset and dividing it into multiple data blocks by publication time, so as to simulate the streaming nature of social media messages and train the model continuously.

3. The social media event detection method combining deep learning classification and graph clustering according to claim 1, characterized in that extracting feature information from message texts in a social media data stream, taking the message texts and the feature information as nodes, and connecting each message text to its extracted feature information to form edges, thereby constructing a message heterogeneous graph, specifically comprises:
given a single message text, extracting the feature information of the message from the text, the feature information including entity information, user information, topic tag information, and time information;
defining the message heterogeneous graph by a node set and an edge set, wherein the node set represents the social media message texts themselves and the various types of event-related feature information they contain, and the edge set represents the edges between each message and its corresponding features.

4. The social media event detection method combining deep learning classification and graph clustering according to claim 3, characterized in that, based on the message heterogeneous graph, randomly selecting two message text nodes to form a message pair and constructing multiple shared-feature edges within each message pair, thereby constructing a multi-relation message graph, specifically comprises:
defining the multi-relation message graph over the set of message text nodes, wherein each node has its own feature representation expressed as a fixed-dimensional feature vector, the node types are message, user, entity, topic tag, and time, the features of all nodes together form the feature set, and each node is paired with a node randomly sampled from the other message texts to form a message pair;
when message texts share different types of feature information, establishing a separate edge for each shared-feature relation, so that an edge within a message pair can be associated with multiple shared-feature relations, wherein each relation is computed from a submatrix of the adjacency matrix of the message heterogeneous graph, whose rows represent all feature information nodes and whose columns represent all message nodes belonging to that relation, together with the transpose of that submatrix.

5. The social media event detection method combining deep learning classification and graph clustering according to claim 1, characterized in that, based on the multi-relation message graph, obtaining the similarity of each message pair with a deep learning classification model specifically comprises:
preprocessing the message text, including word segmentation and stop-word removal, and then performing word embedding on the message text with the BERT pre-trained language model, representing each word as a fixed-dimensional word vector;
embedding the message text vectors to obtain an embedding vector for each of the two message texts, and concatenating the two embedding vectors through the encoding layer to obtain the encoding vector of the message pair;
passing the encoding vector through a linear transformation layer to obtain a low-dimensional vector, and obtaining the similarity of the two message texts through an activation function, wherein the parameters of these layers are parameters of the model.

6. The social media event detection method combining deep learning classification and graph clustering according to claim 3, characterized in that it is determined whether the similarity of the message pair reaches a preset threshold.
If the similarity of the message pair reaches the preset threshold, an edge is constructed between the two message text nodes constituting the message pair to construct a message isomorphism graph, which specifically includes: 消息同构图仍保留消息异构图的所有公共特征,The message isomorphic graph still retains all the common features of the message heterogeneous graph. ,其中/>是消息同构图的邻接矩阵,其中/>是图中消息节点的总数,/>表示节点类型,/>是消息异构图中邻接矩阵的子矩阵,包含类型的行和/>类型的列,/>是矩阵的转置;如果消息文本节点/>和消息文本节点/>连接到一些类型的/>节点时,/>将大于或等于1,则/>将等于1; , where/> is the adjacency matrix of the message isomorphism graph, where/> is the total number of message nodes in the graph, /> Indicates the node type, /> is a submatrix of the adjacency matrix in the message heterogeneous graph, containing Type of line and /> Type of column, /> is the transpose of the matrix; if the message text node /> and the message text node /> Connect to some type of /> Node, /> will be greater than or equal to 1, then/> will be equal to 1; 对于消息文本和消息文本/>组成的消息对,判断消息对的相似度/>是否达到预设阈值/>,当相似度达到预设阈值/>时便构建一条边/>,判断完所有的消息对后构成初始消息同构图/>,其中/>是消息文本节点集合,/>是边集合。For message text and the message text/> The message pairs are composed of the following: Whether the preset threshold is reached/> , when the similarity reaches the preset threshold/> When an edge is constructed/> , after judging all the message pairs, the initial message isomorphism graph is formed/> , where/> Is a collection of message text nodes, /> is the edge set. 7.根据权利要求1所述的一种结合深度学习分类与图聚类的社交媒体事件检测方法,其特征在于,将构建的消息同构图作为图聚类算法的输入进行聚类,聚类结果作为检测到的社交媒体事件,具体包括:7. 
According to claim 1, a social media event detection method combining deep learning classification and graph clustering is characterized in that the constructed message isomorphism graph is used as the input of the graph clustering algorithm for clustering, and the clustering result is used as the detected social media event, specifically including: 根据消息之间特征关联的传递性进行社区扩展,之后消息节点间进行多轮博弈选择更优的社区,最终实现社区的稳定状态,将社区划分得到的聚类结果作为检测到的社交媒体事件。The community is expanded according to the transitivity of feature associations between messages, and then multiple rounds of games are conducted between message nodes to select a better community, ultimately achieving a stable state of the community. The clustering results obtained by community division are used as detected social media events. 8.根据权利要求7所述的一种结合深度学习分类与图聚类的社交媒体事件检测方法,其特征在于,根据消息之间特征关联的传递性进行社区扩展,之后消息节点间进行多轮博弈选择更优的社区,最终实现社区的稳定状态,具体包括:8. According to claim 7, a social media event detection method combining deep learning classification and graph clustering is characterized in that community expansion is performed according to the transitivity of feature associations between messages, and then multiple rounds of game selection are performed between message nodes to select a better community, and finally a stable state of the community is achieved, which specifically includes: 扩大社区规模,倘若三个点之间有两条边便形成一个半三角,根据三元闭包原理,两个节点在与一共同节点有关系的情况下也具有关联,则认为三个点之间存在隐性的关联关系,即三个点同属于一个社区;Expanding the community scale, if there are two edges between three points, a half triangle is formed. According to the ternary closure principle, two nodes are also related when they are related to a common node. It is considered that there is an implicit association between the three points, that is, the three points belong to the same community. 进行多轮迭代,在每轮迭代中所有节点均根据当前的社区划分来做出更优的选择,共有以下三种选择:不改变当前社区;离开现在的社区并不加入任何其他社区;离开当前社区并加入另一个社区;Multiple rounds of iterations are carried out. 
In each round of iteration, all nodes make better choices based on the current community division. There are three choices: do not change the current community; leave the current community and do not join any other community; leave the current community and join another community; 多轮博弈后算法达到纳什均衡,即所有节点均加入自己最为满意的社区,通过设置阈值对算法是否达到纳什均衡进行判断。After multiple rounds of games, the algorithm reaches Nash equilibrium, that is, all nodes join the community that they are most satisfied with. The algorithm can be judged whether it has reached Nash equilibrium by setting a threshold. 9.一种结合深度学习分类与图聚类的社交媒体事件检测系统,其特征在于,所述系统包括:9. A social media event detection system combining deep learning classification and graph clustering, characterized in that the system includes: 消息异构图构建模块,用于对社交媒体数据流中的消息文本进行特征信息提取,将消息文本及特征信息作为节点,并将消息文本与其提取的特征信息连接构成边,构建得到消息异构图;A message heterogeneous graph construction module is used to extract feature information from message texts in social media data streams, use message texts and feature information as nodes, and connect message texts with their extracted feature information to form edges, thereby constructing a message heterogeneous graph; 多关系消息图构建模块,用于基于所述消息异构图,随机选取两个消息文本节点构建消息对,并在每对消息对中构建多条共享特征边,构建得到多关系消息图;A multi-relation message graph construction module is used to randomly select two message text nodes to construct message pairs based on the message heterogeneous graph, and to construct multiple shared feature edges in each message pair to obtain a multi-relation message graph; 深度学习分类模块,用于基于所述多关系消息图,利用深度学习分类模型得到消息对的相似度;A deep learning classification module, used to obtain the similarity of message pairs based on the multi-relation message graph using a deep learning classification model; 消息同构图构建模块,用于判断消息对的相似度是否达到预设阈值,若消息对的相似度达到预设阈值,则在组成消息对的两个消息文本节点之间构建一条边,构建得到消息同构图;A message isomorphism graph construction module is used to determine whether the similarity of a message pair reaches a preset threshold. 
If the similarity of the message pair reaches the preset threshold, an edge is constructed between the two message text nodes constituting the message pair to construct a message isomorphism graph; 图聚类模块,用于将构建的消息同构图作为图聚类算法的输入进行聚类,聚类结果作为检测到的社交媒体事件。The graph clustering module is used to cluster the constructed message isomorphism graph as the input of the graph clustering algorithm, and the clustering result is used as the detected social media event. 10.一种电子设备,其特征在于,所述设备包括:处理器和存储器;10. An electronic device, characterized in that the device comprises: a processor and a memory; 所述存储器用于存储一个或多个程序指令;The memory is used to store one or more program instructions; 所述处理器,用于运行一个或多个程序指令,用以执行如权利要求1至8任一项所述的一种结合深度学习分类与图聚类的社交媒体事件检测方法的步骤。The processor is used to run one or more program instructions to execute the steps of a social media event detection method combining deep learning classification and graph clustering as described in any one of claims 1 to 8.
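The flow of claims 1, 3, 4, 6 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions: the toy messages and their features are hypothetical, the names `shared_relations`, `similarity`, `build_message_graph` and `expand_communities` are invented for illustration, a Jaccard overlap of features stands in for the BERT-based classification model of claim 5, every pair of messages is scored instead of randomly sampled pairs, and the game-based refinement toward a Nash equilibrium is left out. The sketch only shows shared-feature relations, thresholded edge construction, and half-triangle community expansion.

```python
from itertools import combinations

# Hypothetical toy messages; each maps a feature type to the values extracted from the text.
messages = {
    "m1": {"entity": {"earthquake", "Kunming"}, "hashtag": {"#quake"}, "user": {"u1"}},
    "m2": {"entity": {"earthquake", "Kunming"}, "hashtag": {"#quake"}, "user": {"u2"}},
    "m3": {"entity": {"earthquake"}, "hashtag": {"#quake"}, "user": {"u3"}},
    "m4": {"entity": {"concert"}, "hashtag": {"#music"}, "user": {"u4"}},
}

def shared_relations(a, b):
    """Feature types shared by two messages (one multi-relation edge per type)."""
    return {t for t in a.keys() & b.keys() if a[t] & b[t]}

def similarity(a, b):
    """Assumed stand-in for the learned classifier: Jaccard overlap of typed features."""
    fa = {(t, v) for t, values in a.items() for v in values}
    fb = {(t, v) for t, values in b.items() for v in values}
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0

def build_message_graph(msgs, threshold):
    """Add a message-to-message edge whenever the pair similarity reaches the threshold."""
    return {
        (i, j)
        for i, j in combinations(sorted(msgs), 2)
        if similarity(msgs[i], msgs[j]) >= threshold
    }

def expand_communities(nodes, edges):
    """Merge every half-triangle (a node with two or more neighbours) into one community."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for centre, nbrs in neighbours.items():
        if len(nbrs) >= 2:  # two edges meeting at `centre` form a half-triangle
            for n in nbrs:
                parent[find(n)] = find(centre)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())

edges = build_message_graph(messages, threshold=0.35)
events = expand_communities(messages, edges)
print(edges)
print(events)
```

With the threshold of 0.35 assumed here, the three earthquake messages end up in one community and the concert message stays alone. Because the expansion step only merges half-triangles, two messages joined by a single isolated edge are not merged at this stage; in the claimed method the subsequent game rounds would decide their final community.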
CN202410373064.XA 2024-03-29 2024-03-29 Social media event detection method combining deep learning classification and graph clustering Active CN117974340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410373064.XA CN117974340B (en) 2024-03-29 2024-03-29 Social media event detection method combining deep learning classification and graph clustering

Publications (2)

Publication Number Publication Date
CN117974340A (en) 2024-05-03
CN117974340B (en) 2024-06-18

Family

ID=90859854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410373064.XA Active CN117974340B (en) 2024-03-29 2024-03-29 Social media event detection method combining deep learning classification and graph clustering

Country Status (1)

Country Link
CN (1) CN117974340B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118537157A (en) * 2024-05-10 2024-08-23 石家庄铁道大学 Social event detection method based on hierarchical multi-relation structure entropy minimization
CN119441632A (en) * 2024-10-15 2025-02-14 昆明理工大学 A method for generating social event context based on double-layer structure entropy graph clustering

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260356A (en) * 2015-10-10 2016-01-20 西安交通大学 Chinese interactive text emotion and topic identification method based on multitask learning
CN105956197A (en) * 2016-06-15 2016-09-21 杭州量知数据科技有限公司 Social media graph representation model-based social risk event extraction method
US20160343027A1 (en) * 2015-05-22 2016-11-24 Facebook, Inc. Clustering users of a social networking system based on user interactions with content items associated with a topic
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
CN108959323A (en) * 2017-05-25 2018-12-07 腾讯科技(深圳)有限公司 Video classification methods and device
CN110457711A (en) * 2019-08-20 2019-11-15 电子科技大学 A topic recognition method for social media events based on keywords
CN113158194A (en) * 2021-03-30 2021-07-23 西北大学 Vulnerability model construction method and detection method based on multi-relation graph network
CN114093422A (en) * 2021-11-23 2022-02-25 湖南大学 A prediction method and system for miRNA-gene interaction based on multi-relational graph convolutional network
CN115859793A (en) * 2022-11-21 2023-03-28 河北工业大学 Attention-based method and system for detecting abnormal behavior of users in heterogeneous information networks
CN115936104A (en) * 2022-09-16 2023-04-07 中国银联股份有限公司 Method and apparatus for training machine learning models
CN115952362A (en) * 2023-01-03 2023-04-11 西北工业大学 A self-evolving fake news detection method for social media
CN116468586A (en) * 2023-04-20 2023-07-21 福建商学院 Intelligent wholesale method and system for appeal events in social media
CN116681176A (en) * 2023-06-12 2023-09-01 济南大学 A Traffic Flow Prediction Method Based on Clustering and Heterogeneous Graph Neural Network
CN117670571A (en) * 2024-01-30 2024-03-08 昆明理工大学 Incremental social media event detection method based on heterogeneous message graph relation embedding

Also Published As

Publication number Publication date
CN117974340B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
CN117974340A (en) Social media event detection method combining deep learning classification and graph clustering
Peng et al. CCL: Cross-modal correlation learning with multigrained fusion by hierarchical network
CN103258000B (en) Method and device for clustering high-frequency keywords in webpages
CN111563192B (en) Entity alignment method, device, electronic equipment and storage medium
CN102890698B (en) Method for automatically describing microblogging topic tag
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN103914527B (en) Graphic image recognition and matching method based on genetic programming algorithms of novel coding modes
CN110543603B (en) Collaborative filtering recommendation method, device, equipment and medium based on user behaviors
TW202217597A (en) Image incremental clustering method, electronic equipment, computer storage medium thereof
CN107517201A (en) A Network Vulnerability Identification Method Based on Timing Removal
CN118278519B (en) Knowledge graph completion method and related equipment
CN110853672B (en) Data expansion method and device for audio scene classification
CN110753065A (en) Network behavior detection method, device, device and storage medium
CN108763221B (en) Attribute name representation method and device
CN106776641B (en) Data processing method and device
CN104933143A (en) Method and device for acquiring recommended object
CN115599541A (en) A sorting device and method
CN115905630A (en) Graph database query method, device, equipment and storage medium
CN112115261B (en) Knowledge Graph Data Expansion Method Based on Symmetric and Reciprocal Relationship Statistics
CN117788629B (en) Image generation method, device and storage medium with style personalization
CN113743079A (en) Text similarity calculation method and device based on co-occurrence entity interaction graph
WO2024131115A1 (en) Text sorting method and apparatus, and electronic device and storage medium
CN108921213B (en) Entity classification model training method and device
Cho et al. Finite difference schemes for an axisymmetric nonlinear heat equation with blow-up
CN109033746A (en) A kind of protein complex recognizing method based on knot vector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant