CN111291182A - Hotspot event discovery method, device, equipment and storage medium - Google Patents

Publication number
CN111291182A
Authority
CN
China
Prior art keywords
articles
event
same
graph
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010033828.2A
Other languages
Chinese (zh)
Inventor
郑勇升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010033828.2A
Publication of CN111291182A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hot event discovery method comprising the following steps: collecting articles published by specified websites; clustering all articles with a preset clustering algorithm to obtain a rough clustering result; pairing the articles within each category of the rough clustering result two by two to construct text pairs; preprocessing each text pair and feeding it in turn into a preset text pair model, which outputs the similarity of the two articles and whether they belong to the same event; constructing an event graph in which each article is a vertex, every two articles belonging to the same event are joined by an edge, and the similarity between the two articles is the weight of that edge; and segmenting the event graph into a plurality of sub-event graphs, wherein the articles corresponding to all vertices in the same sub-event graph are the articles of the same hot event. The invention also discloses a hot event discovery device, equipment, and a computer-readable storage medium. The invention effectively improves the clustering accuracy of hot events on the Internet.

Description

Hotspot event discovery method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a hot event discovery method, a device, equipment and a storage medium.
Background
A hot event is an event that is widely discussed or propagated in society (on the Internet). Hot event discovery is an important field of natural language processing: finding hot events among the massive number of articles on the Internet requires a series of text processing steps and the use of various algorithms and models for mining. Hot event discovery plays an important role in public opinion monitoring, customer marketing opportunity discovery, intelligent recommendation, public opinion guidance, and the like.
Traditional hot event discovery algorithms mainly use unsupervised text clustering. For example, after the TF-IDF algorithm is used to extract term frequency features from documents, a clustering algorithm such as K-means or LDA (latent Dirichlet allocation) aggregates similar documents to form the set of articles for a hot event. However, unsupervised text clustering has a common problem: similar articles that are aggregated together do not necessarily describe the same event, and articles about the same event may fail to be aggregated because they are worded differently. The clustering effect is therefore poor, which limits the practicality of extracting hot events with unsupervised text clustering.
Disclosure of Invention
The invention mainly aims to provide a hot event discovery method, a hot event discovery device, equipment and a storage medium, and aims to solve the technical problem of improving the clustering accuracy of hot events.
In order to achieve the above object, the present invention provides a method for discovering a hot spot event, including the following steps:
adopting a web crawler technology to collect articles released by a specified website;
determining the category number of rough clustering based on the number of the crawled articles, and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;
pairing the articles within each category of the rough clustering result two by two to construct text pairs;
preprocessing each text pair and feeding it in turn into a preset text pair model, which outputs the similarity of the two articles and whether they belong to the same event;
constructing an event graph by taking each article as a vertex of the graph, connecting every two articles of the same event with an edge, and taking the similarity between the two articles as the weight of the corresponding edge;
and segmenting the event graph to obtain a plurality of sub-event graphs, wherein the articles corresponding to all the vertices in the same sub-event graph are the articles corresponding to the same hot event.
Optionally, before the step of collecting the articles published by the specified website by using the web crawler technology, the method further includes:
obtaining training samples for training the text pair model, wherein each training sample is a text pair consisting of two articles, the training samples comprise positive samples and negative samples, the positive samples are obtained by pairing different articles of the same event two by two, and the negative samples are obtained by pairing articles of different events two by two and sampling the resulting pairs, and by pairing articles of similar events two by two;
performing word segmentation on the two articles in each text pair to obtain a plurality of independent words or characters;
encoding each word or character with a preset word dictionary or character dictionary to obtain the character encoding vectors corresponding to the two articles in each text pair;
inputting the character encoding vectors corresponding to the two articles in each text pair into an embedding layer to convert them into matrix vectors, and inputting the matrix vectors corresponding to the two articles of the same text pair into two separate convolutional layers at the same level, respectively, for feature extraction;
inputting the features extracted by the two convolutional layers into a pooling layer, respectively, to calculate the similarity between the corresponding features of the two articles in the same text pair and to reduce the dimension of the extracted features;
and inputting the similarity between the corresponding features of the two articles in each text pair, together with the dimension-reduced features, into a fully connected layer for classification and normalization to obtain the text pair model.
Optionally, the segmenting the event graph to obtain a plurality of sub-event graphs includes:
initializing the event graph by assigning each vertex to its own partition, so that the number of initial partitions equals the number of vertices;
tentatively moving each vertex, one by one, into the partition of each of its neighboring vertices, calculating the change in modularity of the event graph before and after each tentative move, and recording the neighboring vertex that yields the maximum modularity change;
if the maximum modularity change is greater than 0, moving the vertex into the partition of the neighboring vertex that yields the maximum modularity change; otherwise, abandoning the tentative move for that vertex;
repeating the tentative-move process until the partition of every vertex no longer changes;
compressing all the vertices in the same partition into one new vertex to construct a new event graph, wherein the weights of the edges between vertices within the same partition become the weight of the self-loop of the new vertex, and the weights of the edges between different partitions become the weights of the edges between the new vertices;
and repeating the process of constructing a new event graph until the modularity of the whole event graph no longer changes, wherein each partition corresponds to one sub-event graph.
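The initialization and tentative-move steps above resemble the local-moving phase of Louvain-style community detection. The following is a minimal sketch of that phase only, assuming the event graph is stored as a symmetric adjacency dictionary; the function names, the brute-force recomputation of modularity for every trial move, and the small tolerance `eps` are illustrative assumptions, not details taken from the patent. The compression into a new graph is not shown here.

```python
def modularity(graph, partition):
    """Modularity of a weighted, undirected graph stored as {vertex: {neighbor: weight}}."""
    two_m = sum(w for nbrs in graph.values() for w in nbrs.values())  # each edge is seen twice
    if two_m == 0:
        return 0.0
    internal, total = {}, {}
    for v, nbrs in graph.items():
        c = partition[v]
        for u, w in nbrs.items():
            total[c] = total.get(c, 0.0) + w            # sum of weighted degrees in partition c
            if partition[u] == c:
                internal[c] = internal.get(c, 0.0) + w  # weight that stays inside partition c
    return sum(internal.get(c, 0.0) / two_m - (total[c] / two_m) ** 2 for c in total)


def local_move(graph, eps=1e-12):
    """Start with one partition per vertex, then greedily apply the best positive move."""
    partition = {v: v for v in graph}  # as many initial partitions as vertices
    moved = True
    while moved:                       # repeat until no vertex changes partition
        moved = False
        for v in graph:
            base = modularity(graph, partition)
            home = partition[v]
            best_gain, best_target = eps, home
            for u in graph[v]:         # try the partition of every neighbor
                target = partition[u]
                if target == home:
                    continue
                partition[v] = target
                gain = modularity(graph, partition) - base
                if gain > best_gain:
                    best_gain, best_target = gain, target
            partition[v] = best_target  # stays at home when no move has positive gain
            if best_target != home:
                moved = True
    return partition


# Two tightly linked triples joined by one weak edge (hypothetical similarities).
edges = [("A", "B", 0.9), ("A", "C", 0.9), ("B", "C", 0.9),
         ("D", "E", 0.9), ("D", "F", 0.9), ("E", "F", 0.9), ("C", "D", 0.1)]
graph = {}
for a, b, w in edges:
    graph.setdefault(a, {})[b] = w
    graph.setdefault(b, {})[a] = w

partition = local_move(graph)
print(partition)
```

On this toy graph the two triples end up in separate partitions. Recomputing modularity from scratch keeps the sketch short; a practical implementation would update the modularity change incrementally.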
Optionally, the determining the category number of the rough clustering based on the number of the crawled articles, and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result includes:
determining the category number of rough clusters based on the number of the crawled articles;
performing word segmentation on the crawled articles to obtain a plurality of independent words or characters;
and converting the segmented words or characters into word vectors, and roughly clustering the word vectors corresponding to the crawled articles into the determined number of categories with a preset clustering algorithm to obtain the rough clustering result.
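The rough clustering steps above can be illustrated with a small K-means sketch. It assumes, purely for illustration, that each article has already been segmented into tokens, that plain term-frequency vectors stand in for the word vectors, and that the first `k` articles seed the centroids; none of these choices is specified by the patent.

```python
from collections import Counter


def tf_vector(tokens, vocab):
    """Term-frequency vector of one tokenized article over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Plain K-means; seeding with the first k vectors is a simplifying assumption."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each article goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already segmented articles.
docs = [
    ["stock", "market", "falls"],
    ["typhoon", "hits", "coast"],
    ["stock", "market", "rallies"],
    ["typhoon", "coast", "flood"],
]
vocab = sorted({t for d in docs for t in d})
vectors = [tf_vector(d, vocab) for d in docs]
labels = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

The first and third articles share a category, as do the second and fourth. Finer distinctions within a category are left to the text pair model.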
Optionally, after the step of sequentially pairing the articles in the same category from the rough clustering result to construct text pairs, the method further includes:
acquiring the titles of the articles in each text pair constructed within the same category;
judging whether the titles of the two articles in the same text pair are the same;
if the titles are the same, retaining the corresponding text pair; otherwise, removing the corresponding text pair.
Further, to achieve the above object, the present invention further provides a device for discovering a hot spot event, where the device for discovering a hot spot event includes:
the acquisition module is used for acquiring articles released by a specified website by adopting a web crawler technology;
the clustering module is used for determining the category number of rough clustering based on the number of the crawled articles and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;
the matching module is used for pairing the articles within each category of the rough clustering result two by two to construct text pairs;
the model processing module is used for preprocessing each text pair and feeding it in turn into a preset text pair model, which outputs the similarity of the two articles and whether they belong to the same event;
the graph building module is used for constructing the event graph by taking each article as a vertex of the graph, connecting every two articles of the same event with an edge, and taking the similarity between the two articles as the weight of the corresponding edge;
and the segmentation module is used for segmenting the event graph to obtain a plurality of sub-event graphs, wherein articles corresponding to all vertexes in the same sub-event graph are articles corresponding to the same hotspot event.
Optionally, the hotspot event discovery device further includes:
the acquisition module is used for obtaining training samples for training the text pair model, wherein each training sample is a text pair consisting of two articles, the training samples comprise positive samples and negative samples, the positive samples are obtained by pairing different articles of the same event two by two, and the negative samples are obtained by pairing articles of different events two by two and sampling the resulting pairs, and by pairing articles of similar events two by two;
the word segmentation module is used for performing word segmentation on the two articles in each text pair to obtain a plurality of independent words or characters;
the encoding module is used for encoding each word or character with a preset word dictionary or character dictionary to obtain the character encoding vectors corresponding to the two articles in each text pair;
the model building module is used for inputting the character encoding vectors corresponding to the two articles in each text pair into an embedding layer to convert them into matrix vectors, and inputting the matrix vectors corresponding to the two articles of the same text pair into two separate convolutional layers at the same level, respectively, for feature extraction; inputting the features extracted by the two convolutional layers into a pooling layer, respectively, to calculate the similarity between the corresponding features of the two articles in the same text pair and to reduce the dimension of the extracted features; and inputting the similarity between the corresponding features of the two articles in each text pair, together with the dimension-reduced features, into a fully connected layer for classification and normalization to obtain the text pair model.
Optionally, the segmentation module comprises:
the initialization unit is used for initializing the event graph by assigning each vertex to its own partition, so that the number of initial partitions equals the number of vertices;
the partitioning unit is used for tentatively moving each vertex, one by one, into the partition of each of its neighboring vertices, calculating the change in modularity of the event graph before and after each tentative move, and recording the neighboring vertex that yields the maximum modularity change; if the maximum modularity change is greater than 0, moving the vertex into the partition of the neighboring vertex that yields the maximum modularity change, and otherwise abandoning the tentative move for that vertex; and repeating the tentative-move process until the partition of every vertex no longer changes;
the new graph building unit is used for compressing all the vertices in the same partition into one new vertex to construct a new event graph, wherein the weights of the edges between vertices within the same partition become the weight of the self-loop of the new vertex, and the weights of the edges between different partitions become the weights of the edges between the new vertices; and repeating the process of constructing a new event graph until the modularity of the whole event graph no longer changes, wherein each partition corresponds to one sub-event graph.
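The compression performed by the new graph building unit can be sketched as follows. The sketch assumes the event graph is stored as a symmetric adjacency dictionary and that a partition assignment is already available; the function name `aggregate` and the labels `P1` and `P2` are hypothetical. In this representation every internal edge is visited once from each endpoint, so the self-loop entry holds twice the internal edge weight, which keeps the weighted degree of each compressed vertex unchanged. Other implementations may store the internal weight once; the convention is an assumption of this sketch.

```python
def aggregate(graph, partition):
    """Collapse every partition into one vertex of a new, coarser graph."""
    new_graph = {c: {} for c in set(partition.values())}
    for v, nbrs in graph.items():
        cv = partition[v]
        for u, w in nbrs.items():
            cu = partition[u]
            # cv == cu accumulates into the self-loop of the new vertex;
            # cv != cu accumulates into the edge between two new vertices.
            new_graph[cv][cu] = new_graph[cv].get(cu, 0.0) + w
    return new_graph


# Hypothetical event graph: edge weights are article similarities.
graph = {
    "A": {"B": 0.9, "C": 0.9},
    "B": {"A": 0.9, "C": 0.9},
    "C": {"A": 0.9, "B": 0.9, "D": 0.1},
    "D": {"C": 0.1, "E": 0.8},
    "E": {"D": 0.8},
}
partition = {"A": "P1", "B": "P1", "C": "P1", "D": "P2", "E": "P2"}

coarse = aggregate(graph, partition)
print(coarse)
```

The coarse graph has two vertices joined by an edge of weight 0.1, and each new vertex carries a self-loop summarizing its internal edges.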
Optionally, the clustering module is specifically configured to:
determining the category number of rough clusters based on the number of the crawled articles;
performing word segmentation on the crawled articles to obtain a plurality of independent words or characters;
and converting the segmented words or characters into word vectors, and roughly clustering the word vectors corresponding to the crawled articles into the determined number of categories with a preset clustering algorithm to obtain the rough clustering result.
Optionally, the hotspot event discovery device further includes:
the text pair removing module is used for acquiring the titles of the articles in each text pair constructed within the same category; judging whether the titles of the two articles in the same text pair are the same; and, if the titles are the same, retaining the corresponding text pair, and otherwise removing the corresponding text pair.
Further, to achieve the above object, the present invention also provides a hotspot event discovery device, which includes a memory, a processor, and a hotspot event discovery program stored on the memory and operable on the processor, where the hotspot event discovery program, when executed by the processor, implements the steps of the hotspot event discovery method according to any one of the above embodiments.
Further, to achieve the above object, the present invention also provides a computer readable storage medium, on which a hot spot event discovery program is stored, where the hot spot event discovery program, when executed by a processor, implements the steps of the hot spot event discovery method according to any one of the above items.
On the basis of text preprocessing and a traditional clustering method, the invention adds a deep-learning-based classification model for precise classification, which improves the accuracy of event discovery and allows the model to be trained on corpora chosen to suit the requirements. The text pair model learns a matching task on text pairs: it learns the important features that distinguish the two texts in a pair and calculates the similarity of those features, so that it can judge whether the two texts describe the same event and, at the same time, output the similarity of the pair for further use in the graph-based aggregation algorithm. The invention effectively improves the discovery of hot events on the Internet, and the discovered events can be applied in fields such as public opinion monitoring, customer marketing opportunity discovery, and intelligent recommendation.
Drawings
Fig. 1 is a schematic structural diagram of the operating environment of a hotspot event discovery device according to an embodiment of the present application;
Fig. 2 is a flowchart of a hotspot event discovery method according to a first embodiment of the present invention;
Fig. 3 is a flowchart of a hotspot event discovery method according to a second embodiment of the present invention;
Fig. 4 is a detailed flowchart of step S160 in Fig. 2;
Fig. 5 is a detailed flowchart of step S120 in Fig. 2;
Fig. 6 is a flowchart of a hotspot event discovery method according to a third embodiment of the present invention;
Fig. 7 is a functional block diagram of a hotspot event discovery apparatus according to a first embodiment of the present invention;
Fig. 8 is a functional block diagram of a hotspot event discovery apparatus according to a second embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a hot event discovery device.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an operating environment of a hotspot event discovery device according to an embodiment of the present application.
As shown in fig. 1, the hotspot event discovery device includes a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., a magnetic disk memory); alternatively, the memory 1005 may be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not limit the hotspot event discovery device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in fig. 1, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a computer program. The operating system is a program for managing and controlling the hot event discovery device and software resources, and supports the operation of the hot event discovery program and other software and/or programs.
In the hardware structure of the hotspot event discovery device shown in fig. 1, the network interface 1004 is mainly used for accessing a network; the user interface 1003 is mainly used for detecting a confirmation instruction, an editing instruction, and the like. And the processor 1001 may be configured to invoke a hotspot event discovery program stored in the memory 1005 and perform the operations of the following embodiments of the hotspot event discovery method.
Based on the above hardware structure of the hot event discovery device, various embodiments of the hot event discovery method are provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a hot spot event discovery method according to a first embodiment of the present invention. In this embodiment, the method for discovering the hot event includes the following steps:
Step S110, adopting a web crawler technology to collect articles published by a specified website;
In this embodiment, articles published by specified websites are preferably collected directly for hot event identification; the manner of crawling the text published by the websites is not limited. Preferably, the designated crawler program is deployed on multiple machines using Docker containers, so that the designated content is crawled by multiple machines.
Step S120, determining the category number of rough clustering based on the number of the crawled articles, and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;
In this embodiment, to obtain a better clustering effect, a mapping between the number of articles and the number of clustering categories is preferably preset. For example, 10,000 articles can be grouped into more than 10 categories, and more than 100,000 articles can be grouped into 100 categories, with the aim of keeping the number of articles in each category after clustering below 1,000.
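One hypothetical way to express such a mapping is a simple rule that keeps the average category size below a cap. The function name and the default cap of 1,000 are assumptions for illustration; the patent only states that a preset relationship exists. The rule bounds the average size, not the size of any individual category.

```python
def rough_cluster_count(num_articles, max_per_category=1000):
    """Smallest category count whose average category size stays strictly below the cap."""
    if num_articles <= 0:
        raise ValueError("num_articles must be positive")
    return num_articles // max_per_category + 1


print(rough_cluster_count(10_000))   # 11
print(rough_cluster_count(100_000))  # 101
```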
In this embodiment, a large number of articles can be roughly divided into several categories by the preset clustering algorithm. Although rough clustering can already group articles into candidate hot events, similar articles that are aggregated together may not describe the same event, and articles about the same event may not be aggregated into the same category because they are worded differently. The rough clustering result therefore needs further optimization.
Step S130, sequentially taking articles in the same category from the rough clustering result, and pairing the articles in the same category to construct a text pair;
In this embodiment, a text pair is a group of two articles. For example, if a category contains three articles A, B, and C, the constructed text pairs are (A, B), (A, C), and (B, C).
In this embodiment, text pairs are constructed only from articles in the same category. Articles in different categories do not usually belong to the same or similar events, so no text pairs need to be constructed across categories.
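The pairing described above can be sketched with `itertools.combinations`. The function name and the shape of the `clusters` dictionary are assumptions for illustration.

```python
from itertools import combinations


def build_text_pairs(clusters):
    """clusters maps a category id to the list of article ids in that category."""
    pairs = []
    for articles in clusters.values():
        # Pair articles only within a category, never across categories.
        pairs.extend(combinations(articles, 2))
    return pairs


clusters = {0: ["A", "B", "C"], 1: ["D", "E"]}
pairs = build_text_pairs(clusters)
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C'), ('D', 'E')]
```

A category of n articles yields n(n-1)/2 pairs, so a category of 1,000 articles already produces 499,500 pairs. This quadratic growth is one plausible reason for limiting the size of each rough category.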
Step S140, preprocessing each text pair and feeding it in turn into a preset text pair model, which outputs the similarity of the two articles and whether they belong to the same event;
In this embodiment, after the text pairs are constructed, they are preprocessed, for example by word segmentation and conversion into word vectors. The preprocessed data of each text pair is then fed in turn into the text pair model for processing, such as feature extraction and feature similarity comparison. The model outputs the similarity of the two articles and a judgment of whether they belong to the same event.
In this embodiment, a text pair model is trained in advance in a machine learning manner, and feature extraction and feature similarity comparison can be performed on the contents of two articles through the model, so that whether the two articles belong to the same event or not is identified. The text pair model can identify whether two articles in the same clustering category belong to the same event or not, and can also identify whether two articles between different clustering categories belong to the same event or not.
Step S150, constructing an event graph by taking each article as a vertex of the graph, connecting every two articles of the same event with an edge, and taking the similarity between the two articles as the weight of the corresponding edge;
In this embodiment, the text pair model can identify the similarity of any two articles, whether in the same category or in different categories, and whether the two articles belong to the same event, but it cannot identify all the articles belonging to the same event as a whole. This embodiment therefore constructs an event graph and clusters the articles again at the level of hot events, so as to determine the articles corresponding to the same hot event.
In this embodiment, all articles are taken as the vertices of the graph, pairwise connecting lines of the articles of the same event are taken as the edges of the graph, and the similarity between the articles is taken as the weight of the corresponding edges to construct the event graph.
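A minimal sketch of this construction follows. It assumes the model output is available as a dictionary mapping each article pair to a `(similarity, same_event)` tuple; that format, the function name, and the similarity values are hypothetical.

```python
def build_event_graph(articles, predictions):
    """Return the event graph as a symmetric adjacency dict {vertex: {neighbor: weight}}."""
    graph = {a: {} for a in articles}  # every article becomes a vertex
    for (a, b), (similarity, same_event) in predictions.items():
        if same_event:                 # only same-event pairs get an edge
            graph[a][b] = similarity   # the similarity is the edge weight
            graph[b][a] = similarity
    return graph


# Hypothetical model outputs for pairs built within two rough categories.
predictions = {
    ("A", "B"): (0.9, True),
    ("A", "C"): (0.2, False),
    ("B", "C"): (0.8, True),
    ("D", "E"): (0.1, False),
}
graph = build_event_graph(["A", "B", "C", "D", "E"], predictions)
print(graph)
```

Articles D and E remain isolated vertices because no pair involving them was judged to be the same event.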
Step S160, segmenting the event graph to obtain a plurality of sub-event graphs, wherein the articles corresponding to all vertices in the same sub-event graph are the articles corresponding to the same hotspot event.
In this embodiment, after the event graph containing all articles and the same-event relations between them is constructed, it is segmented by a preset graph segmentation algorithm based on the degree of association between vertices to obtain a plurality of sub-event graphs. The articles corresponding to all vertices in the same sub-event graph are the articles of the same hot event, so hot event clustering is achieved on top of the pairwise same-event judgments.
On the basis of text preprocessing and a traditional clustering method, this embodiment adds a deep-learning-based classification model for precise classification, which improves the accuracy of event discovery and allows the model to be trained on corpora chosen to suit the requirements. The text pair model of this embodiment learns a matching task on text pairs: it learns the important features that distinguish the two texts in a pair and calculates the similarity of those features, so that it can judge whether the two texts describe the same event and, at the same time, output the similarity of the pair for further use in the graph-based aggregation algorithm. This embodiment can effectively improve the discovery of hot events on the Internet, and the discovered events can be applied in fields such as public opinion monitoring, customer marketing opportunity discovery, and intelligent recommendation.
Referring to fig. 3, fig. 3 is a flowchart illustrating a hot spot event discovery method according to a second embodiment of the present invention. In this embodiment, before the step S110, the method further includes:
Step S210, obtaining training samples for training the text pair model, wherein each training sample is a text pair consisting of two articles, the training samples comprise positive samples and negative samples, the positive samples are obtained by pairing different articles of the same event two by two, and the negative samples are obtained by pairing articles of different events two by two and sampling the resulting pairs, and by pairing articles of similar events two by two;
In this embodiment, each training sample is a text pair consisting of two articles. Using text pairs as training samples makes it convenient to compare features and thus determine whether the two articles belong to the same event.
The embodiment is provided with a positive sample and a negative sample:
(1) positive sample
Positive samples are obtained by pairing different articles of the same event two by two. For example, if articles A, B, and C all report the same event, they form three positive samples: (A, B), (A, C), and (B, C).
(2) Negative sample
Negative samples are obtained in two ways: articles of different events are paired two by two and the resulting pairs are sampled, and articles of similar events are paired two by two, with all of these pairs selected as negative samples.
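The sample construction described above can be sketched as follows. The sketch assumes that events are given as a dictionary of article lists and that the pairs of similar events are supplied by hand; the function name, the `neg_sample_size` parameter, and the fixed random seed are illustrative assumptions.

```python
import random
from itertools import combinations, product


def build_training_pairs(events, similar_events, neg_sample_size, seed=0):
    """Return (positives, negatives) as lists of (article_a, article_b, label)."""
    # Positive samples: every pair of different articles within the same event.
    positives = [(a, b, 1) for arts in events.values() for a, b in combinations(arts, 2)]

    similar = {frozenset(pair) for pair in similar_events}
    hard_negatives, easy_negatives = [], []
    for e1, e2 in combinations(events, 2):
        cross = [(a, b, 0) for a, b in product(events[e1], events[e2])]
        if frozenset((e1, e2)) in similar:
            hard_negatives.extend(cross)   # similar events: keep every pair
        else:
            easy_negatives.extend(cross)   # unrelated events: sampled below

    rng = random.Random(seed)
    sampled = rng.sample(easy_negatives, min(neg_sample_size, len(easy_negatives)))
    return positives, hard_negatives + sampled


events = {"e1": ["A", "B", "C"], "e2": ["D", "E"], "e3": ["F"]}
positives, negatives = build_training_pairs(events, [("e1", "e2")], neg_sample_size=2)
print(len(positives), len(negatives))  # 4 8
```

Here the 6 pairs between the similar events e1 and e2 are all kept, while only 2 of the 5 pairs involving the unrelated event e3 are sampled.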
Step S220, performing word segmentation processing on two articles in each text pair to obtain a plurality of independent words or characters;
Step S230, encoding each word or character with a preset word dictionary or character dictionary to obtain the character encoding vectors corresponding to the two articles in each text pair;
In this embodiment, to facilitate feature extraction, the articles in each text pair are first segmented into a plurality of independent words or characters. The words or characters obtained by segmentation are then encoded with the preset word dictionary or character dictionary to form character encoding vectors, from which text features can be extracted.
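The segmentation and encoding steps can be illustrated with a small sketch. It assumes whitespace splitting as a stand-in for a real word segmenter, a dictionary built on the fly in place of the preset one, and hypothetical `<pad>` and `<unk>` entries together with a fixed `max_len`; none of these details comes from the patent.

```python
def build_vocab(tokenized_docs):
    """Assign an integer id to every token; ids 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, truncating or padding to a fixed length."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Whitespace splitting stands in for word segmentation here.
article_a = "stock market falls sharply".split()
article_b = "market rallies".split()

vocab = build_vocab([article_a])           # a preset dictionary would be loaded instead
vec_a = encode(article_a, vocab, max_len=5)
vec_b = encode(article_b, vocab, max_len=5)
print(vec_a)  # [2, 3, 4, 5, 0]
print(vec_b)  # [3, 1, 0, 0, 0]
```

The token "rallies" is not in the dictionary and is therefore mapped to the `<unk>` id.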
Step S240, inputting the character coding vectors corresponding to the two articles in each text pair into the embedding layer to be converted into matrix vectors, and respectively inputting the matrix vectors corresponding to the two articles in the same text pair into two independent convolution layers on the same layer to carry out feature extraction;
step S250, respectively inputting the features respectively extracted by the two convolution layers into a pooling layer so as to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features;
and step S260, inputting the similarity between the corresponding features of the two articles in each text pair, together with the dimension-reduced features, into a fully connected layer for classification and normalization processing to obtain the text pair model.
In this embodiment, a PairCNN model is preferably used to perform deep learning on text pairs. The character encoding vectors are first converted into matrix vectors by an embedding layer; two CNN networks with the same parameters then extract features from the two texts; the similarity between the features output by the pooling layer is calculated; and the similarity, together with the features output by the pooling layer, is used as the input of a classifier. The classifier is a fully connected layer: Softmax is applied to its output, the loss is computed against the label of the sample, and the model parameters are updated iteratively with a stochastic gradient descent algorithm, finally yielding the text pair model.
In the embodiment, a text pair is introduced as a training sample, and two independent and same-layer convolutional layers are used for feature extraction in the model training process, so that feature extraction and similarity comparison between different articles are realized, and a technical basis is provided for clustering of hot events. The text pair model of the embodiment not only identifies the features of any two articles in the same category, but also identifies the features of any two articles in different categories, so that the hot event discovery accuracy can be improved.
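The forward pass described above can be illustrated with a small pure-Python sketch. It is not the patented model: the weights are random rather than trained, cosine similarity is an assumed choice of similarity measure, and every name and size (`pair_cnn_forward`, `params`, the filter width) is hypothetical. Training with Softmax loss and stochastic gradient descent is omitted.

```python
import math
import random

def conv_maxpool(embedded, filters, width):
    """Apply each filter to every window of `width` tokens, then ReLU and max-pool over time."""
    pooled = []
    for f in filters:                       # f: flat weights of length width * emb_dim
        best = 0.0                          # ReLU floor
        for start in range(len(embedded) - width + 1):
            window = [x for tok in embedded[start:start + width] for x in tok]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        pooled.append(best)
    return pooled

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    return [x / sum(e) for x in e]

def pair_cnn_forward(ids_a, ids_b, params):
    emb, filters, width, fc = params["emb"], params["filters"], params["width"], params["fc"]
    # The same filters are applied to both articles (shared parameters).
    feat_a = conv_maxpool([emb[i] for i in ids_a], filters, width)
    feat_b = conv_maxpool([emb[i] for i in ids_b], filters, width)
    sim = cosine(feat_a, feat_b)
    joint = feat_a + feat_b + [sim]         # pooled features plus their similarity
    logits = [sum(w * x for w, x in zip(row, joint)) for row in fc]
    return sim, softmax(logits)             # similarity, [P(different), P(same)]

rng = random.Random(0)
vocab_size, emb_dim, n_filters, width = 10, 4, 3, 2
params = {
    "emb": [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(vocab_size)],
    "filters": [[rng.uniform(-1, 1) for _ in range(width * emb_dim)] for _ in range(n_filters)],
    "width": width,
    "fc": [[rng.uniform(-1, 1) for _ in range(2 * n_filters + 1)] for _ in range(2)],
}
sim, probs = pair_cnn_forward([2, 3, 4, 0], [4, 2, 5, 0], params)
```

The two outputs correspond to what the method later consumes: the similarity becomes an edge weight, and the class probabilities decide whether the two articles are treated as the same event.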
Referring to fig. 4, fig. 4 is a schematic view of a detailed flow of the step S160 in fig. 2. In this embodiment, the step S160 further includes:
step S1601, initializing the event graph to divide each vertex into different partitions, wherein the number of the initial partitions is the same as that of the vertices;
In this embodiment, the event graph is initialized before it is divided: each vertex in the graph is treated as an independent partition, so the number of initial partitions equals the number of vertices. For example, if the event graph has 100 vertices, it is initially divided into 100 partitions, one vertex per partition.
Step S1602, performing partition trial division on each vertex one by one to divide each vertex into partitions in which neighboring vertices adjacent to the vertex are located, calculating modularity variation values of the event graph corresponding to the vertices before and after division, and recording the neighboring vertices corresponding to the maximum modularity variation value;
This embodiment introduces a metric for evaluating the quality of a partitioning of the network: modularity. The change in modularity reflects how good a given way of assigning a vertex to a partition is. The modularity is calculated as follows:
Figure BDA0002365313170000111
wherein W represents the modularity, m represents an intermediate parameter, c represents a partition, and i, j represent vertices i and j within partition c; A_ij represents the weight of the edge between vertex i and vertex j, Σin represents the sum of the weights of the edges within partition c, and Σtot represents the sum of the weights of all edges connected to the vertices within partition c.
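The formula image itself is not reproduced in this text. For orientation only, the standard modularity used by the Louvain method is consistent with the symbols defined above and can be written as below. This is an assumption, not a transcription of the patent's figure: the weighted degree k_i is a symbol introduced here, m is taken to be the total edge weight of the graph, and Σin is taken as the sum of A_ij over ordered pairs of vertices in c.

```latex
W \;=\; \frac{1}{2m}\sum_{c}\sum_{i,j \in c}\left[A_{ij} - \frac{k_i k_j}{2m}\right]
  \;=\; \sum_{c}\left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^{2}\right],
\qquad
k_i = \sum_{j} A_{ij},\quad
2m = \sum_{i,j} A_{ij},\quad
\Sigma_{in} = \sum_{i,j \in c} A_{ij},\quad
\Sigma_{tot} = \sum_{i \in c} k_i
```

The two forms agree because the double sum of k_i k_j over a partition equals the square of Σtot.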
In this embodiment, for each vertex, an attempt is made to assign the vertex to the partition where its neighbor vertex is located, calculate the modularity variation values of the event graph before and after assignment, and record the neighbor vertex with the largest modularity variation value.
Step S1603, if the maximum modularity variation value is larger than 0, dividing the corresponding vertex into the partition where the neighbor vertex corresponding to the maximum modularity variation value is located, otherwise, abandoning the tentative division of the vertex;
step S1604, repeatedly executing the processing flow of tentative partition division until the partitions corresponding to all the vertices no longer change;
In this embodiment, a maximum modularity variation value greater than 0 means that the corresponding division improves the partitioning, so the tentative division is accepted and the vertex is divided into the partition where the neighbor vertex corresponding to the maximum modularity variation value is located; otherwise, the tentative division of the vertex is abandoned.
In this embodiment, after one round of tentative division over all vertices is completed, the next round is carried out; that is, S1602 and S1603 are executed repeatedly until the partitions corresponding to all vertices no longer change, at which point the tentative division stops.
Step S1605, compressing all the vertices in the same partition into a new vertex to construct a new event graph, setting the weight of the edges between the vertices in the same partition as the weight of the self-loop of the new vertex, and setting the weight of the edges between different partitions as the weight of the edge between the new vertices;
step S1606, repeatedly executing the process flow of constructing a new event graph until the modularity of the entire event graph no longer changes, where one partition corresponds to one sub-event graph.
In this embodiment, once the tentative division is complete, that is, the current event graph cannot be divided any further, the original event graph is compressed to construct a new event graph, and the tentative division is then repeated on the new event graph; that is, S1601-S1605 are executed repeatedly until the modularity of the entire event graph no longer changes. In other words, if tentative division of the newly constructed event graph produces no change in modularity, division of the event graph stops. Each partition in the final event graph corresponds to one sub-event graph, and the articles corresponding to all vertices in the same sub-event graph are articles on the same hotspot event, so that hotspot events are clustered on the basis of the multiple same-event relations.
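Steps S1601 to S1606 follow the pattern of the Louvain community detection method, and can be sketched as below. This is an illustrative sketch under assumptions: the graph is a symmetric dict-of-dicts, a self-loop stores the sum of A_ij over ordered vertex pairs inside the compressed partition, the modularity is the standard Louvain form, and all function names are hypothetical.

```python
from collections import defaultdict

def build_adj(edges):
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return dict(adj)

def modularity(adj, part):
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    inside, total = defaultdict(float), defaultdict(float)
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            total[part[u]] += w
            if part[u] == part[v]:
                inside[part[u]] += w
    return sum(inside[c] / two_m - (total[c] / two_m) ** 2 for c in total)

def local_moving(adj):
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    part = {u: u for u in adj}               # S1601: one partition per vertex
    tot = dict(degree)                       # summed degree of each partition
    changed = True
    while changed:                           # S1604: repeat until nothing moves
        changed = False
        for u in adj:                        # S1602: try each neighbouring partition
            links = defaultdict(float)
            for v, w in adj[u].items():
                if v != u:
                    links[part[v]] += w
            old = part[u]
            tot[old] -= degree[u]            # take u out of its partition
            base = links.get(old, 0.0) - tot[old] * degree[u] / two_m
            best, best_delta = old, 1e-12
            for c, w_uc in links.items():    # delta is proportional to the modularity change
                delta = (w_uc - tot[c] * degree[u] / two_m) - base
                if delta > best_delta:
                    best, best_delta = c, delta
            part[u] = best                   # S1603: move only if the change is positive
            tot[best] += degree[u]
            changed |= best != old
    return part

def aggregate(adj, part):
    """S1605: compress each partition into one vertex; internal weight becomes a self-loop."""
    new_adj = defaultdict(lambda: defaultdict(float))
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            new_adj[part[u]][part[v]] += w
    return {c: dict(nbrs) for c, nbrs in new_adj.items()}

def louvain(adj):
    membership = {u: u for u in adj}
    while True:                              # S1606: repeat until no vertex is merged
        part = local_moving(adj)
        if len(set(part.values())) == len(adj):
            return membership
        membership = {u: part[c] for u, c in membership.items()}
        adj = aggregate(adj, part)

edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
         ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0), ("c", "d", 0.1)]
adj = build_adj(edges)
membership = louvain(adj)
```

In this toy graph the two tightly connected triangles end up as two partitions, that is, two sub-event graphs, while the weak edge between c and d is cut.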
Referring to fig. 5, fig. 5 is a schematic view of a detailed flow of the step S120 in fig. 2. In this embodiment, the step S120 further includes:
step S1201, determining the category number of rough clustering based on the number of the crawled articles;
step S1202, performing word segmentation processing on the crawled article to obtain a plurality of independent words or characters;
step S1203, converting the words or characters after word segmentation into word vectors, and according to the category number, roughly clustering the word vectors corresponding to each crawled text by using a preset clustering algorithm to obtain a rough clustering result.
To obtain a better clustering effect, a correspondence between the number of articles and the number of clustering categories is preferably set in advance. For example, 10,000 articles may be clustered into more than 10 categories and 100,000 articles into more than 100 categories, so that each category contains fewer than 1000 articles after clustering.
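One possible rule for mapping the article count to a category count is sketched below. The function name and the `max_per_category` parameter are assumptions. Note that such a rule only bounds the average category size; the clustering algorithm itself may still produce uneven categories.

```python
def coarse_category_count(n_articles, max_per_category=1000):
    """Return a category count strictly greater than n_articles / max_per_category."""
    return n_articles // max_per_category + 1

k_small = coarse_category_count(10_000)    # more than 10 categories
k_large = coarse_category_count(100_000)   # more than 100 categories
```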
In this embodiment, the sentences in each article are decomposed into a plurality of independent words or characters by techniques such as jieba word segmentation, so that the segmented words or characters can subsequently be converted into word vectors. The method of converting to word vectors is not limited in this embodiment; for example, each word or character is encoded based on a preset character dictionary or word dictionary, and the encoded words or characters are then converted into word vectors by a method such as word2vec.
In this embodiment, the preset clustering method may use a traditional K-means or LDA topic model.
In this embodiment, a preset clustering algorithm roughly divides a large number of articles into several categories. Although rough clustering can group hotspot event articles, a category may aggregate similar articles that do not describe the same event, and articles on the same event may fail to be aggregated into the same category because of differences in wording. The rough clustering result therefore needs further optimization.
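Rough clustering with K-means can be sketched in plain Python as follows. The sketch makes several assumptions: an article vector is the average of its word vectors, the toy `word_vectors` table stands in for word2vec output, and the function names and parameters are hypothetical.

```python
import random

def doc_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens of one article."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

word_vectors = {"flood": [1.0, 0.0], "rain": [0.9, 0.1], "vote": [0.0, 1.0], "poll": [0.1, 0.9]}
articles = [["flood", "rain"], ["rain"], ["vote", "poll"], ["poll"]]
points = [doc_vector(tokens, word_vectors, dim=2) for tokens in articles]
labels = kmeans(points, k=2)
```

The first two articles and the last two articles receive different labels; which integer each group receives depends on the random initialization.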
Referring to fig. 6, fig. 6 is a flowchart illustrating a hot spot event discovery method according to a third embodiment of the present invention. In this embodiment, after the step S130, the method further includes:
step S310, acquiring titles of articles in each constructed text pair in the same category;
step S320, judging whether the titles of the articles in the same text pair are the same;
and step S330, if the titles are the same, retaining the corresponding text pair; otherwise, removing the corresponding text pair.
In this embodiment, to reduce the model input and improve processing efficiency, the constructed text pairs are filtered before being input into the text pair model; specifically, text pairs whose articles are completely dissimilar are removed by comparing the article titles.
In this embodiment, the manner of determining whether article titles are the same is not limited. For example, keywords may be compared: if the keywords in the two article titles are identical, the titles are determined to be the same. Alternatively, semantic recognition may be used to determine whether the titles of the two articles in a text pair are the same. If the titles are the same, the text pair is retained as an object to be classified; otherwise, the text pair is removed.
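The keyword-comparison variant of the title filter can be sketched as below. The stop-word list, whitespace tokenization, and function names are assumptions made for illustration; a semantic comparison would replace `title_keywords` with a model-based check.

```python
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "on"})   # illustrative list

def title_keywords(title):
    """Keyword set of a title: lower-cased tokens minus stop words."""
    return frozenset(w for w in title.lower().split() if w not in STOPWORDS)

def filter_pairs_by_title(pairs, titles):
    """Keep only the text pairs whose two titles have identical keyword sets."""
    return [(a, b) for a, b in pairs
            if title_keywords(titles[a]) == title_keywords(titles[b])]

titles = {1: "Flood in the city", 2: "City flood", 3: "Election results"}
kept = filter_pairs_by_title([(1, 2), (1, 3), (2, 3)], titles)
```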
The invention also provides a device for discovering the hot event.
Referring to fig. 7, fig. 7 is a functional module diagram of a hot spot event discovery apparatus according to a first embodiment of the present invention. In this embodiment, the hotspot event discovery device includes:
the acquisition module 10 is used for acquiring articles released by a specified website by adopting a web crawler technology;
the clustering module 20 is configured to determine the number of categories of rough clustering based on the number of the crawled articles, and cluster all the articles by using a preset clustering algorithm to obtain a rough clustering result;
the matching module 30 is used for sequentially matching every two articles in the same category from the rough clustering result to construct a text pair;
the model processing module 40 is used for preprocessing each text pair, then inputting the text pairs in turn into a preset text pair model for processing, and outputting the similarity of any two articles and whether the two articles belong to the same event;
the graph building module 50 is used for building an event graph by taking each article as a vertex of the graph, taking pairwise connecting lines of the articles of the same event as edges of the graph and taking the similarity between the articles as the weight of corresponding edges;
the segmenting module 60 is configured to segment the event graph to obtain a plurality of sub-event graphs, where articles corresponding to all vertices in the same sub-event graph are articles corresponding to the same hotspot event.
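The work of the matching module and the graph building module can be sketched as follows. The `predict` callable stands in for the trained text pair model and, like the function name and the data layout, is a hypothetical placeholder.

```python
from itertools import combinations

def build_event_graph(categories, predict):
    """categories: list of article-id lists from rough clustering.
    predict(a, b) -> (similarity, same_event) stands in for the text pair model."""
    vertices = [article for category in categories for article in category]
    edges = {}
    for category in categories:
        # Pair articles only within the same rough category.
        for a, b in combinations(category, 2):
            similarity, same_event = predict(a, b)
            if same_event:
                edges[(a, b)] = similarity   # edge weight is the similarity
    return vertices, edges

def toy_predict(a, b):
    # Toy stand-in: only articles 1 and 2 are judged to report the same event.
    return (0.9, True) if {a, b} == {1, 2} else (0.2, False)

vertices, edges = build_event_graph([[1, 2, 3], [4, 5]], toy_predict)
```

The resulting vertices and weighted edges are the input that the segmenting module divides into sub-event graphs.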
Because this apparatus embodiment corresponds to the embodiments of the hotspot event discovery method of the present invention described above, it is not described again in detail here.
On top of text preprocessing and a traditional clustering method, this embodiment adds a deep learning-based classification model for accurate classification, which improves the accuracy of event discovery, and the model can be trained on a corpus selected according to requirements. The text pair model of this embodiment learns a matching task over text pairs, learning the features that are important for distinguishing the two texts and computing the similarity of those features, so that it can judge whether two texts describe the same event and can also output the similarity of the text pair for further use in the graph-based aggregation algorithm. This embodiment can effectively improve the discovery of hot events on the Internet, and the discovered events can be applied in fields such as public opinion monitoring, discovery of customer marketing opportunities, and intelligent recommendation.
Referring to fig. 8, fig. 8 is a functional module diagram of a hot spot event discovery apparatus according to a second embodiment of the present invention. Based on the first embodiment of the foregoing apparatus, in this embodiment, the apparatus for discovering a hotspot event further includes:
an obtaining module 70, configured to obtain training samples for training a text pair model, where the training samples are text pairs and include positive samples and negative samples, one text pair includes two articles, the positive samples are obtained by pairing different articles of the same event, and the negative samples are obtained by pairing articles of different events and sampling the resulting pairs, and by pairing articles of similar events;
a word segmentation module 80, configured to perform word segmentation processing on the two articles in each text pair to obtain multiple independent words or characters;
the encoding module 90 is configured to encode each word or character by using a preset character dictionary or word dictionary to obtain the character encoding vectors corresponding to the two articles in each text pair;
the model building module 100 is configured to input the character encoding vectors corresponding to the two articles in each text pair into the embedding layer to convert them into matrix vectors, and input the matrix vectors corresponding to the two articles in the same text pair into two independent convolution layers on the same layer respectively to perform feature extraction; respectively input the features extracted by the two convolution layers into the pooling layer to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features; and input the similarity between the corresponding features of the two articles in each text pair, together with the dimension-reduced features, into a fully connected layer for classification and normalization processing to obtain the text pair model.
Because this apparatus embodiment corresponds to the embodiments of the hotspot event discovery method of the present invention described above, it is not described again in detail here.
Optionally, in a specific embodiment, the segmentation module includes:
the initialization unit is used for initializing the event graph so as to divide each vertex into different partitions, wherein the number of the initial partitions is the same as that of the vertices;
the partitioning unit is used for performing tentative partition division on each vertex one by one, so as to divide each vertex into the partition where a neighbor vertex adjacent to it is located, calculating the modularity change value of the event graph before and after the division, and recording the neighbor vertex corresponding to the maximum modularity change value; if the maximum modularity change value is larger than 0, dividing the corresponding vertex into the partition where the neighbor vertex corresponding to the maximum modularity change value is located, otherwise, abandoning the tentative division of the vertex; and repeatedly executing the processing flow of tentative partition division until the partitions corresponding to all the vertices no longer change;
the new graph building unit is used for compressing all the vertices in the same partition into a new vertex to build a new event graph, setting the weight of the edges between the vertices in the same partition as the weight of the self-loop of the new vertex, and setting the weight of the edges between different partitions as the weight of the edge between the new vertices; and repeatedly executing the process flow of constructing a new event graph until the modularity of the whole event graph no longer changes, wherein one partition corresponds to one sub-event graph.
Optionally, in a specific embodiment, the clustering module is specifically configured to:
determining the category number of rough clusters based on the number of the crawled articles;
performing word segmentation processing on the crawled article to obtain a plurality of independent words or characters;
and converting the words or characters after word segmentation into word vectors, and carrying out rough clustering on the word vectors corresponding to the crawled articles by adopting a preset clustering algorithm according to the category number to obtain a rough clustering result.
Optionally, in a specific embodiment, the hotspot event discovery device further includes:
the text pair removing module is used for acquiring the titles of the articles in each constructed text pair in the same category; judging whether the titles of articles in the same text pair are the same; if the text pairs are the same, the corresponding text pairs are reserved, otherwise, the corresponding text pairs are removed.
The present invention also provides a non-volatile computer-readable storage medium.
In this embodiment, the computer-readable storage medium stores a hotspot event discovery program, and the hotspot event discovery program, when executed by a processor, implements the steps of the hotspot event discovery method in any one of the above embodiments. The method implemented by the hotspot event discovery program executed by the processor may refer to the embodiments of the hotspot event discovery method of the present invention, and therefore, redundant description is not repeated.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims (10)

1. A hot spot event discovery method is characterized by comprising the following steps:
adopting a web crawler technology to collect articles released by a specified website;
determining the category number of rough clustering based on the number of the crawled articles, and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;
sequentially pairing articles in the same category from the rough clustering result to construct text pairs;
preprocessing each text pair, then inputting the text pairs in turn into a preset text pair model for processing, and outputting the similarity of any two articles and whether the two articles belong to the same event;
constructing an event graph by taking each article as a vertex of the graph, taking pairwise connecting lines of the articles of the same event as edges of the graph and taking the similarity between the articles as the weight of corresponding edges;
and segmenting the event graph to obtain a plurality of sub-event graphs, wherein the articles corresponding to all the vertexes in the same sub-event graph are the articles corresponding to the same hot event.
2. The method for discovering hot spot events according to claim 1, wherein before the step of collecting the articles published by the specified website by using web crawler technology, the method further comprises:
obtaining training samples for training a text pair model, wherein the training samples are text pairs and comprise positive samples and negative samples, one text pair comprises two articles, the positive samples are obtained by pairing different articles of the same event, and the negative samples are obtained by pairing articles of different events and sampling the resulting pairs, and by pairing articles of similar events;
performing word segmentation processing on two articles in each text pair to obtain a plurality of independent words or characters;
encoding each word or character by adopting a preset character dictionary or word dictionary to obtain the character encoding vectors corresponding to the two articles in each text pair;
inputting the character coding vectors corresponding to the two articles in each text pair into the embedding layer to be converted into matrix vectors, and respectively inputting the matrix vectors corresponding to the two articles in the same text pair into two independent convolution layers on the same layer to perform feature extraction;
respectively inputting the features extracted from the two convolutional layers into the pooling layer to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features;
and inputting the similarity between the corresponding features of the two articles in each text pair, together with the dimension-reduced features, into a fully connected layer for classification and normalization processing to obtain the text pair model.
3. The method for discovering hot spot events according to claim 1, wherein the step of segmenting the event graph to obtain a plurality of sub-event graphs comprises:
initializing the event graph to divide each vertex into different partitions, wherein the number of the initial partitions is the same as that of the vertices;
performing tentative partition division on each vertex one by one, so as to divide each vertex into the partition where a neighbor vertex adjacent to it is located, calculating the modularity change value of the event graph before and after the division, and recording the neighbor vertex corresponding to the maximum modularity change value;
if the maximum modularity change value is larger than 0, dividing the corresponding vertex into the partition where the neighbor vertex corresponding to the maximum modularity change value is located, otherwise, abandoning the tentative division of the vertex;
repeatedly executing the processing flow of tentative partition division until the partitions corresponding to all the vertices no longer change;
compressing all the vertices in the same partition into a new vertex to construct a new event graph, setting the weight of the edges between the vertices in the same partition as the weight of the self-loop of the new vertex, and setting the weight of the edges between different partitions as the weight of the edge between the new vertices;
and repeatedly executing the process flow of constructing a new event graph until the modularity of the whole event graph is not changed any more, wherein one partition corresponds to one sub-event graph.
4. The method for discovering hot spot events according to claim 1, wherein the determining a category number of the rough clustering based on a number of articles crawled and clustering all the articles using a preset clustering algorithm to obtain a rough clustering result comprises:
determining the category number of rough clusters based on the number of the crawled articles;
performing word segmentation processing on the crawled article to obtain a plurality of independent words or characters;
and converting the words or characters after word segmentation into word vectors, and carrying out rough clustering on the word vectors corresponding to the crawled articles by adopting a preset clustering algorithm according to the category number to obtain a rough clustering result.
5. The method for discovering hot spot events according to any one of claims 1-4, wherein after the step of sequentially pairing articles in the same category from the rough clustering result to construct text pairs, further comprising:
acquiring titles of articles in each constructed text pair in the same category;
judging whether the titles of articles in the same text pair are the same;
if the text pairs are the same, the corresponding text pairs are reserved, otherwise, the corresponding text pairs are removed.
6. A hotspot event discovery device, comprising:
the acquisition module is used for acquiring articles released by a specified website by adopting a web crawler technology;
the clustering module is used for determining the category number of rough clustering based on the number of the crawled articles and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;
the matching module is used for sequentially matching every two articles in the same category from the rough clustering result to construct a text pair;
the model processing module is used for preprocessing each text pair, then inputting the text pairs in turn into a preset text pair model for processing, and outputting the similarity of any two articles and whether the two articles belong to the same event;
the graph building module is used for taking each article as a vertex of the graph, taking pairwise connecting lines of the articles of the same event as edges of the graph and taking the similarity between the articles as the weight of the corresponding edges to build the event graph;
and the segmentation module is used for segmenting the event graph to obtain a plurality of sub-event graphs, wherein articles corresponding to all vertexes in the same sub-event graph are articles corresponding to the same hotspot event.
7. The hotspot event discovery apparatus of claim 6, wherein said hotspot event discovery apparatus further comprises:
the obtaining module is used for obtaining training samples for training a text pair model, wherein the training samples are text pairs and comprise positive samples and negative samples, one text pair comprises two articles, the positive samples are obtained by pairing different articles of the same event, and the negative samples are obtained by pairing articles of different events and sampling the resulting pairs, and by pairing articles of similar events;
the word segmentation module is used for carrying out word segmentation on two articles in each text pair to obtain a plurality of independent words or characters;
the encoding module is used for encoding each word or character by adopting a preset character dictionary or word dictionary to obtain the character encoding vectors corresponding to the two articles in each text pair;
the model building module is used for inputting the character encoding vectors corresponding to the two articles in each text pair into the embedding layer to be converted into matrix vectors, and respectively inputting the matrix vectors corresponding to the two articles in the same text pair into two independent convolution layers on the same layer to carry out feature extraction; respectively inputting the features extracted by the two convolution layers into the pooling layer to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features; and inputting the similarity between the corresponding features of the two articles in each text pair, together with the dimension-reduced features, into a fully connected layer for classification and normalization processing to obtain the text pair model.
8. The hotspot event discovery apparatus of claim 6, wherein said segmentation module comprises:
the initialization unit is used for initializing the event graph so as to divide each vertex into different partitions, wherein the number of the initial partitions is the same as that of the vertices;
the partitioning unit is used for performing tentative partition division on each vertex one by one, so as to divide each vertex into the partition where a neighbor vertex adjacent to it is located, calculating the modularity change value of the event graph before and after the division, and recording the neighbor vertex corresponding to the maximum modularity change value; if the maximum modularity change value is larger than 0, dividing the corresponding vertex into the partition where the neighbor vertex corresponding to the maximum modularity change value is located, otherwise, abandoning the tentative division of the vertex; and repeatedly executing the processing flow of tentative partition division until the partitions corresponding to all the vertices no longer change;
the new graph building unit is used for compressing all the vertices in the same partition into a new vertex to build a new event graph, setting the weight of the edges between the vertices in the same partition as the weight of the self-loop of the new vertex, and setting the weight of the edges between different partitions as the weight of the edge between the new vertices; and repeatedly executing the process flow of constructing a new event graph until the modularity of the whole event graph no longer changes, wherein one partition corresponds to one sub-event graph.
9. A hotspot event discovery device, comprising a memory, a processor and a hotspot event discovery program stored on the memory and executable on the processor, wherein the hotspot event discovery program, when executed by the processor, implements the steps of the hotspot event discovery method according to any one of claims 1 to 5.
10. A computer-readable storage medium, having stored thereon a hotspot event discovery program, which when executed by a processor, implements the steps of the hotspot event discovery method of any one of claims 1-5.
CN202010033828.2A 2020-01-13 2020-01-13 Hotspot event discovery method, device, equipment and storage medium Pending CN111291182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010033828.2A CN111291182A (en) 2020-01-13 2020-01-13 Hotspot event discovery method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010033828.2A CN111291182A (en) 2020-01-13 2020-01-13 Hotspot event discovery method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111291182A true CN111291182A (en) 2020-06-16

Family

ID=71025434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010033828.2A Pending CN111291182A (en) 2020-01-13 2020-01-13 Hotspot event discovery method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111291182A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139082A (en) * 2021-05-14 2021-07-20 北京字节跳动网络技术有限公司 Multimedia content processing method, apparatus, device and medium

Similar Documents

Publication Publication Date Title
TWI677852B (en) A method and apparatus, electronic equipment, computer readable storage medium for extracting image feature
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN106599037B (en) Normalized recommendation method based on tag semantics
CN111932386B (en) User account determining method and device, information pushing method and device, and electronic equipment
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN112836509A (en) Expert system knowledge base construction method and system
CN111368529B (en) Mobile terminal sensitive word recognition method, device and system based on edge calculation
CN112860685A (en) Automatic recommendation of analysis of data sets
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN116610818A (en) Construction method and system of power transmission and transformation project knowledge base
CN110765276A (en) Entity alignment method and device in knowledge graph
CN113869609A (en) Method and system for predicting confidence of frequent subgraph of root cause analysis
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
CN111291182A (en) Hotspot event discovery method, device, equipment and storage medium
CN112463974A (en) Method and device for establishing knowledge graph
CN116796288A (en) Industrial document-oriented multi-mode information extraction method and system
Menon et al. Gmm-based document clustering of knowledge graph embeddings
CN108197295B (en) Application method of attribute reduction in text classification based on multi-granularity attribute tree
Dhoot et al. Efficient Dimensionality Reduction for Big Data Using Clustering Technique
CN111160077A (en) Large-scale dynamic face clustering method
CN114997360A (en) Evolution parameter optimization method, system and storage medium of neural architecture search algorithm
Xu et al. Estimating similarity of rich internet pages using visual information
CN107391674B (en) New type mining method and device
CN112632229A (en) Text clustering method and device
US11537647B2 (en) System and method for decision driven hybrid text clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination