CN111291182A

CN111291182A - Hotspot event discovery method, device, equipment and storage medium

Info

Publication number: CN111291182A
Application number: CN202010033828.2A
Authority: CN
Inventors: 郑勇升
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2020-06-16

Abstract

The invention discloses a hot event discovery method, which comprises the following steps: collecting articles published by a specified website; clustering all articles using a preset clustering algorithm to obtain a rough clustering result; pairing the articles in each category of the rough clustering result, two at a time, to construct text pairs; preprocessing each text pair and inputting it into a preset text pair model, which outputs the similarity of the two articles and whether they belong to the same event; constructing an event graph in which each article is a vertex, each pair of articles belonging to the same event is joined by an edge, and the similarity between the two articles is the weight of the corresponding edge; and segmenting the event graph to obtain a plurality of sub-event graphs, wherein the articles corresponding to all vertices in the same sub-event graph are the articles of the same hot event. The invention also discloses a hot event discovery device, equipment and a computer-readable storage medium. The invention effectively improves the clustering accuracy of hot events on the Internet.

Description

Hotspot event discovery method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a hot event discovery method, a device, equipment and a storage medium.

Background

A hot event is an event that is widely discussed or propagated in society (on the Internet). Hot event discovery is an important field of natural language processing: discovering hot events among the massive number of articles on the Internet requires a series of text processing steps and the use of various algorithms and models for mining. Hot event discovery plays an important role in public opinion monitoring, customer marketing opportunity discovery, intelligent recommendation, public opinion guidance and the like.

Traditional hot event discovery algorithms mainly use unsupervised text clustering. For example, after the TF-IDF algorithm is used to extract word frequency features of documents, clustering algorithms such as K-means or LDA (Latent Dirichlet Allocation) are used to aggregate similar documents into groups of hot event articles. However, unsupervised text clustering has a general problem: the aggregated similar articles do not necessarily describe the same event, or articles on the same event are not aggregated because of differences in wording. The clustering effect is therefore poor; that is, extracting hot events with an unsupervised text clustering algorithm is of limited practical use.

Disclosure of Invention

The invention mainly aims to provide a hot event discovery method, a hot event discovery device, equipment and a storage medium, and aims to solve the technical problem of improving the clustering accuracy of hot events.

In order to achieve the above object, the present invention provides a method for discovering a hot spot event, including the following steps:

adopting a web crawler technology to collect articles released by a specified website;

determining the category number of rough clustering based on the number of the crawled articles, and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;

sequentially pairing articles in the same category from the rough clustering result to construct text pairs;

preprocessing each text pair, then inputting the text pairs in sequence into a preset text pair model for processing, and outputting the similarity of the two articles in each pair and whether they belong to the same event;

constructing an event graph by taking each article as a vertex of the graph, taking pairwise connecting lines of the articles of the same event as edges of the graph and taking the similarity between the articles as the weight of corresponding edges;

and segmenting the event graph to obtain a plurality of sub-event graphs, wherein the articles corresponding to all the vertexes in the same sub-event graph are the articles corresponding to the same hot event.

Optionally, before the step of collecting the articles published by the specified website by using the web crawler technology, the method further includes:

obtaining training samples for training a text pair model, wherein the training samples are text pairs and comprise positive samples and negative samples, one text pair comprises two articles, the positive samples are obtained by pairwise matching of different articles in the same event, and the negative samples are obtained by pairwise matching of articles in different events followed by sampling, together with pairwise matching of articles in similar events;

performing word segmentation processing on two articles in each text pair to obtain a plurality of independent words or characters;

encoding each word or character by adopting a preset word dictionary or character dictionary to obtain character encoding vectors corresponding to the two articles in each text pair;

inputting the character coding vectors corresponding to the two articles in each text pair into the embedding layer to be converted into matrix vectors, and respectively inputting the matrix vectors corresponding to the two articles in the same text pair into two independent convolution layers on the same layer to perform feature extraction;

respectively inputting the features extracted from the two convolutional layers into the pooling layer to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features;

and inputting the similarity between the corresponding features of the two articles in each text pair and the features subjected to dimension reduction into a full-connection layer for classification and normalization processing to obtain the text pair model.

Optionally, the segmenting the event graph to obtain a plurality of sub-event graphs includes:

initializing the event graph to divide each vertex into different partitions, wherein the number of the initial partitions is the same as that of the vertices;

performing partition trial division on each vertex one by one, so as to tentatively divide each vertex into the partitions in which its neighbor vertices are located, calculating the modularity change values of the event graph before and after each trial division, and recording the neighbor vertex corresponding to the maximum modularity change value;

if the maximum modularity change value is larger than 0, dividing the corresponding vertex into the partition where the neighbor vertex corresponding to the maximum modularity change value is located; otherwise, abandoning the trial division of that vertex;

repeatedly executing the processing flow of partition trial division until the partitions corresponding to all the vertices no longer change;

compressing all the vertices in the same partition into a new vertex to construct a new event graph, setting the weight of the edges between the vertices in the same partition as the weight of the ring (self-loop) of the new vertex, and setting the weight of the edges between different partitions as the weight of the edge between the corresponding new vertices;

and repeatedly executing the process flow of constructing a new event graph until the modularity of the whole event graph is not changed any more, wherein one partition corresponds to one sub-event graph.
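The compression step above can be illustrated with a minimal Python sketch. It assumes that the weights of multiple edges falling between the same two partitions are summed, as is usual in Louvain-style aggregation; the function name `compress_partitions` and the edge-dictionary layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def compress_partitions(edges, partition):
    """Collapse every partition into one new vertex.

    edges: dict mapping an undirected edge (u, v) to its weight, each edge listed once.
    partition: dict mapping each vertex to its partition id.
    Returns a dict mapping (p, q) partition pairs to a weight; a key (p, p)
    is the ring (self-loop) of the new vertex p.
    """
    new_edges = defaultdict(float)
    for (u, v), weight in edges.items():
        p, q = partition[u], partition[v]
        key = (p, q) if p <= q else (q, p)  # keep one canonical key per pair
        new_edges[key] += weight  # assumption: parallel weights are summed
    return dict(new_edges)


# Five articles, already divided into two partitions (0 and 1).
edges = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.2, ("D", "E"): 0.7}
partition = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
compressed = compress_partitions(edges, partition)
```

Here the edges A-B and B-C become the ring of new vertex 0, D-E becomes the ring of new vertex 1, and C-D becomes the single edge between the two new vertices.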

Optionally, the determining the category number of the rough clustering based on the number of the crawled articles, and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result includes:

determining the category number of rough clusters based on the number of the crawled articles;

performing word segmentation processing on the crawled article to obtain a plurality of independent words or characters;

and converting the words or characters after word segmentation into word vectors, and carrying out rough clustering on the word vectors corresponding to the crawled articles by adopting a preset clustering algorithm according to the category number to obtain a rough clustering result.

Optionally, after the step of sequentially pairing the articles in the same category from the rough clustering result to construct text pairs, the method further includes:

acquiring titles of articles in each constructed text pair in the same category;

judging whether the titles of articles in the same text pair are the same;

if the titles are the same, the corresponding text pair is retained; otherwise, the corresponding text pair is removed.

Further, to achieve the above object, the present invention further provides a device for discovering a hot spot event, where the device for discovering a hot spot event includes:

the acquisition module is used for acquiring articles released by a specified website by adopting a web crawler technology;

the clustering module is used for determining the category number of rough clustering based on the number of the crawled articles and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;

the matching module is used for sequentially matching every two articles in the same category from the rough clustering result to construct a text pair;

the model processing module is used for preprocessing each text pair, then inputting the text pairs in sequence into a preset text pair model for processing, and outputting the similarity of the two articles in each pair and whether they belong to the same event;

the graph building module is used for taking each article as a vertex of the graph, taking pairwise connecting lines of the articles of the same event as edges of the graph and taking the similarity between the articles as the weight of the corresponding edges to build the event graph;

and the segmentation module is used for segmenting the event graph to obtain a plurality of sub-event graphs, wherein articles corresponding to all vertexes in the same sub-event graph are articles corresponding to the same hotspot event.

Optionally, the hotspot event discovery device further includes:

the acquisition module is used for acquiring training samples for training a text pair model, wherein the training samples are text pairs and comprise positive samples and negative samples, one text pair comprises two articles, the positive samples are obtained by pairwise matching of different articles in the same event, and the negative samples are obtained by pairwise matching of articles in different events and sampling and pairwise matching of articles in similar events;

the word segmentation module is used for carrying out word segmentation on two articles in each text pair to obtain a plurality of independent words or characters;

the encoding module is used for encoding each word or character by adopting a preset word dictionary or character dictionary to obtain character encoding vectors corresponding to the two articles in each text pair;

the model building module is used for inputting the character coding vectors corresponding to the two articles in each text pair into the embedding layer to be converted into matrix vectors, and respectively inputting the matrix vectors corresponding to the two articles in the same text pair into two independent convolution layers on the same layer to carry out feature extraction; respectively inputting the features extracted from the two convolutional layers into the pooling layer to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features; and inputting the similarity between the corresponding features of the two articles in each text pair and the features subjected to dimension reduction into a full-connection layer for classification and normalization processing to obtain the text pair model.

Optionally, the segmentation module comprises:

the initialization unit is used for initializing the event graph so as to divide each vertex into different partitions, wherein the number of the initial partitions is the same as that of the vertices;

the partitioning unit is used for performing partition trial division on each vertex one by one, so as to tentatively divide each vertex into the partitions in which its neighbor vertices are located, calculating the modularity change values of the event graph before and after each trial division, and recording the neighbor vertex corresponding to the maximum modularity change value; if the maximum modularity change value is larger than 0, dividing the corresponding vertex into the partition where the neighbor vertex corresponding to the maximum modularity change value is located; otherwise, abandoning the trial division of that vertex; and repeatedly executing the processing flow of partition trial division until the partitions corresponding to all the vertices no longer change;

the new graph building unit is used for compressing all the vertexes in the same partition into a new vertex to build a new event graph, setting the weight of edges among the vertexes in the same partition as the weight of a ring of the new vertex, and setting the weight of edges among different partitions as the weight of edges among the new vertexes; and repeatedly executing the process flow of constructing a new event graph until the modularity of the whole event graph is not changed any more, wherein one partition corresponds to one sub-event graph.

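Optionally, the clustering module is specifically configured to:

determine the category number of rough clusters based on the number of the crawled articles;

perform word segmentation processing on the crawled articles to obtain a plurality of independent words or characters;

and convert the words or characters after word segmentation into word vectors, and carry out rough clustering on the word vectors corresponding to the crawled articles by adopting a preset clustering algorithm according to the category number to obtain a rough clustering result.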

Optionally, the hotspot event discovery device further includes:

the text pair removing module is used for acquiring the titles of the articles in each constructed text pair in the same category; judging whether the titles of the articles in the same text pair are the same; if the titles are the same, the corresponding text pair is retained; otherwise, the corresponding text pair is removed.

Further, to achieve the above object, the present invention also provides a hotspot event discovery device, which includes a memory, a processor, and a hotspot event discovery program stored on the memory and operable on the processor, where the hotspot event discovery program, when executed by the processor, implements the steps of the hotspot event discovery method according to any one of the above embodiments.

Further, to achieve the above object, the present invention also provides a computer readable storage medium, on which a hot spot event discovery program is stored, where the hot spot event discovery program, when executed by a processor, implements the steps of the hot spot event discovery method according to any one of the above items.

On the basis of text preprocessing and a traditional clustering method, the invention adds a deep learning-based classification model for accurate classification, so that the accuracy of event discovery is improved, and the model can be trained on a corpus selected according to requirements. The text pair model learns a matching task over text pairs, so as to learn the important features that distinguish the two texts in a pair and to calculate the similarity of those features. It can thereby judge whether the two texts describe the same event, and at the same time output the similarity of the text pair for further use in a graph-based aggregation algorithm. The invention effectively improves the discovery of hot events on the Internet, and the discovered events can be applied in fields such as public opinion monitoring, customer marketing opportunity discovery and intelligent recommendation.

Drawings

Fig. 1 is a schematic structural diagram of an operating environment of a hotspot event discovery device according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a hot spot event discovery method according to a first embodiment of the present invention;

Fig. 3 is a flowchart illustrating a hot spot event discovery method according to a second embodiment of the present invention;

Fig. 4 is a schematic view of a detailed process of step S160 in Fig. 2;

Fig. 5 is a schematic view of a detailed process of step S120 in Fig. 2;

Fig. 6 is a flowchart illustrating a hot spot event discovery method according to a third embodiment of the present invention;

Fig. 7 is a functional block diagram of a hot spot event discovery apparatus according to a first embodiment of the present invention;

Fig. 8 is a functional block diagram of a hot spot event discovery apparatus according to a second embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a hot event discovery device.

Referring to fig. 1, fig. 1 is a schematic structural diagram of an operating environment of a hotspot event discovery device according to an embodiment of the present application.

As shown in fig. 1, the hotspot event discovery device includes: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the hardware configuration of the hotspot event discovery device shown in fig. 1 does not constitute a limitation of the hotspot event discovery device, and may include more or less components than those shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a computer program. The operating system is a program for managing and controlling the hot event discovery device and software resources, and supports the operation of the hot event discovery program and other software and/or programs.

In the hardware structure of the hotspot event discovery device shown in fig. 1, the network interface 1004 is mainly used for accessing a network; the user interface 1003 is mainly used for detecting a confirmation instruction, an editing instruction, and the like. And the processor 1001 may be configured to invoke a hotspot event discovery program stored in the memory 1005 and perform the operations of the following embodiments of the hotspot event discovery method.

Based on the above hardware structure of the hot event discovery device, various embodiments of the hot event discovery method are provided.

Referring to fig. 2, fig. 2 is a flowchart illustrating a hot spot event discovery method according to a first embodiment of the present invention. In this embodiment, the method for discovering the hot event includes the following steps:

Step S110, adopting a web crawler technology to collect articles published by a specified website;

In this embodiment, articles published by a specified website are preferably collected directly for hot event identification; the manner of crawling the published text is not limited. Preferably, the designated crawler program is deployed on multiple machines using Docker containers, so that the designated content is crawled by multiple machines.

Step S120, determining the category number of rough clustering based on the number of the crawled articles, and clustering all the articles by using a preset clustering algorithm to obtain a rough clustering result;

In this embodiment, to obtain a better clustering effect, a correspondence between the number of articles and the number of clustering categories is preferably preset. For example, 10,000 articles can be grouped into more than 10 categories, and more than 100,000 articles can be grouped into 100 categories, so as to ensure that the number of articles in each category after clustering is less than 1000.
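One way to derive a category number from the article count is sketched below. The rule used here (integer division by a target size, plus one) is a hypothetical stand-in for the preset correspondence described above, and the name `coarse_cluster_count` and the parameter `max_per_cluster` are assumptions made for illustration. The rule only keeps the average category size below the target; an actual clustering run may still produce uneven categories.

```python
def coarse_cluster_count(num_articles, max_per_cluster=1000):
    """Return a category number so that the average category size stays below max_per_cluster.

    Hypothetical rule for illustration; a real deployment might use a lookup table instead.
    """
    if num_articles <= 0:
        raise ValueError("num_articles must be positive")
    # Adding one guarantees num_articles / k is strictly less than max_per_cluster.
    return num_articles // max_per_cluster + 1


k_small = coarse_cluster_count(500)
k_medium = coarse_cluster_count(10_000)
k_large = coarse_cluster_count(100_000)
```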

In this embodiment, a large number of articles can be roughly divided into several categories by a preset clustering algorithm. Although rough clustering can group hot event articles, the similar articles aggregated into one category do not necessarily describe the same event, and articles on the same event may fail to be aggregated into the same category because of differences in wording. The rough clustering result therefore needs to be further optimized.

Step S130, sequentially taking articles in the same category from the rough clustering result, and pairing the articles in the same category to construct a text pair;

In this embodiment, a text pair is a group of two articles. For example, if three articles A, B and C exist in a certain category, the constructed text pairs are (A, B), (A, C) and (B, C).

In this embodiment, only two articles in the same category are paired to construct a text pair, and articles in different categories do not usually belong to the same or similar events, so that the text pair does not need to be constructed.
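The pairing within each category can be sketched with the standard library. The function name `build_text_pairs` and the dictionary layout of the rough clustering result are assumptions made for illustration.

```python
from itertools import combinations


def build_text_pairs(clusters):
    """Pair articles two at a time, only within the same rough category.

    clusters: dict mapping a category id to the list of articles in that category.
    Returns a list of (article_1, article_2) tuples.
    """
    pairs = []
    for articles in clusters.values():
        # combinations() yields every unordered pair exactly once.
        pairs.extend(combinations(articles, 2))
    return pairs


clusters = {0: ["A", "B", "C"], 1: ["D", "E"]}
pairs = build_text_pairs(clusters)
```

A category with n articles yields n(n-1)/2 pairs, which is one reason to keep each rough category small.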

Step S140, preprocessing each text pair, then inputting the text pairs in sequence into a preset text pair model for processing, and outputting the similarity of the two articles in each pair and whether they belong to the same event;

In this embodiment, after the text pairs are constructed, each text pair is preprocessed, for example by word segmentation and conversion into word vectors. The preprocessed text pairs are then input in sequence into the text pair model, which performs processing such as feature extraction and feature similarity comparison, and outputs the similarity of the two articles together with a judgment of whether they belong to the same event.

In this embodiment, a text pair model is trained in advance in a machine learning manner, and feature extraction and feature similarity comparison can be performed on the contents of two articles through the model, so that whether the two articles belong to the same event or not is identified. The text pair model can identify whether two articles in the same clustering category belong to the same event or not, and can also identify whether two articles between different clustering categories belong to the same event or not.

Step S150, constructing an event graph by taking each article as a vertex of the graph, taking pairwise connecting lines of the articles of the same event as edges of the graph and taking the similarity between the articles as the weight of corresponding edges;

In this embodiment, the text pair model can identify the similarity of any two articles, in the same category or in different categories, and whether those two articles belong to the same event, but it cannot identify, as a whole, all the articles that belong to the same event. This embodiment therefore constructs an event graph and clusters the pairwise same-event results again, so as to determine all the articles corresponding to the same hot event.

In this embodiment, all articles are taken as the vertices of the graph, pairwise connecting lines of the articles of the same event are taken as the edges of the graph, and the similarity between the articles is taken as the weight of the corresponding edges to construct the event graph.
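The graph construction can be sketched with a plain adjacency dictionary. The function name `build_event_graph` and the tuple layout of the model output (article, article, similarity, same-event flag) are assumptions made for illustration; the similarity values are invented sample numbers.

```python
def build_event_graph(articles, predictions):
    """Build a weighted undirected event graph.

    articles: iterable of article ids; every article becomes a vertex.
    predictions: iterable of (a, b, similarity, same_event) tuples,
        standing in for the output of the text pair model.
    Returns dict vertex -> {neighbor: weight}.
    """
    graph = {article: {} for article in articles}
    for a, b, similarity, same_event in predictions:
        if same_event:  # only same-event pairs are connected
            graph[a][b] = similarity
            graph[b][a] = similarity
    return graph


articles = ["A", "B", "C", "D"]
predictions = [
    ("A", "B", 0.92, True),
    ("A", "C", 0.35, False),
    ("B", "C", 0.88, True),
]
event_graph = build_event_graph(articles, predictions)
```

Article D has no same-event partner, so it remains an isolated vertex of the graph.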

Step S160, segmenting the event graph to obtain a plurality of sub-event graphs, wherein the articles corresponding to all vertices in the same sub-event graph are the articles corresponding to the same hotspot event.

In this embodiment, after an event graph including all articles and articles corresponding to the same event is constructed, the event graph is segmented by a preset graph segmentation algorithm based on the degree of association between vertices, so as to obtain a plurality of sub-event graphs, and the articles corresponding to all vertices in the same sub-event graph are the articles corresponding to the same hot event, so that hot event clustering is further implemented on the basis of a plurality of the same events.

On the basis of text preprocessing and a traditional clustering method, this embodiment adds a deep learning-based classification model for accurate classification, so that the accuracy of event discovery is improved, and the model can be trained on a corpus selected according to requirements. The text pair model of this embodiment learns a matching task over text pairs, so as to learn the important features that distinguish the two texts in a pair and to calculate the similarity of those features. It can thereby judge whether the two texts describe the same event, and at the same time output the similarity of the text pair for further use in a graph-based aggregation algorithm. This embodiment can effectively improve the discovery of hot events on the Internet, and the discovered events can be applied in fields such as public opinion monitoring, customer marketing opportunity discovery and intelligent recommendation.

Referring to fig. 3, fig. 3 is a flowchart illustrating a hot spot event discovery method according to a second embodiment of the present invention. In this embodiment, before the step S110, the method further includes:

Step S210, obtaining training samples for training a text pair model, wherein the training samples are text pairs and comprise positive samples and negative samples, one text pair comprises two articles, the positive samples are obtained by pairwise matching of different articles in the same event, and the negative samples are obtained by pairwise matching of articles in different events followed by sampling, together with pairwise matching of articles in similar events;

In this embodiment, each training sample is a text pair consisting of two articles. Using text pairs as training samples facilitates feature comparison between the two articles, so as to determine whether they belong to the same event.

This embodiment uses positive samples and negative samples:

(1) Positive samples

Positive samples are obtained by pairwise matching of different articles in the same event. For example, if articles A, B and C all report the same event, the three articles can be used to form three positive samples: (A, B), (A, C) and (B, C).

(2) Negative samples

Negative samples are obtained in two ways: articles of different events are matched pairwise and the resulting pairs are sampled, and articles of similar events are matched pairwise with all of the resulting pairs selected as negative samples.
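The sample construction described above can be sketched as follows. The function name `build_training_pairs`, the `similar_events` list marking which events count as similar, and the `negative_sample_size` parameter are assumptions made for illustration; how similar events are identified is not specified here.

```python
import random
from itertools import combinations, product


def build_training_pairs(events, similar_events, negative_sample_size, seed=0):
    """Build labeled text pairs: label 1 for positive, 0 for negative.

    events: dict mapping an event id to the list of its articles.
    similar_events: list of (event_id_1, event_id_2) pairs treated as similar events.
    """
    # Positive samples: every pair of different articles within the same event.
    positives = [(a, b, 1) for arts in events.values() for a, b in combinations(arts, 2)]

    similar = {frozenset(pair) for pair in similar_events}
    similar_negatives, other_negatives = [], []
    for e1, e2 in combinations(events, 2):
        cross_pairs = [(a, b, 0) for a, b in product(events[e1], events[e2])]
        if frozenset((e1, e2)) in similar:
            similar_negatives.extend(cross_pairs)  # similar events: keep all pairs
        else:
            other_negatives.extend(cross_pairs)  # different events: sampled below

    rng = random.Random(seed)
    count = min(negative_sample_size, len(other_negatives))
    return positives + similar_negatives + rng.sample(other_negatives, count)


events = {"e1": ["A", "B", "C"], "e2": ["D", "E"], "e3": ["F"]}
training_pairs = build_training_pairs(events, [("e1", "e2")], negative_sample_size=2)
```

Keeping every pair from similar events gives the model the hardest negatives, while sampling the remaining cross-event pairs keeps the data set from being dominated by easy negatives.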

Step S220, performing word segmentation processing on two articles in each text pair to obtain a plurality of independent words or characters;

Step S230, encoding each word or character by adopting a preset word dictionary or character dictionary to obtain character encoding vectors corresponding to the two articles in each text pair;

In this embodiment, to facilitate feature extraction, each article in a text pair is first segmented into a plurality of independent words or characters. The words or characters obtained by segmentation are then encoded using a preset word dictionary or character dictionary to form character encoding vectors, from which text features can be extracted.
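Segmentation and dictionary encoding can be sketched as below. A whitespace split stands in for a real word segmenter, and the `<pad>` and `<unk>` entries, the fixed vector length and all function names are assumptions made for illustration rather than details taken from this embodiment.

```python
def segment(text):
    """Stand-in word segmenter: lowercases and splits on whitespace."""
    return text.lower().split()


def build_vocabulary(tokenized_articles):
    """Assign an integer id to each distinct token; ids 0 and 1 are reserved."""
    vocabulary = {"<pad>": 0, "<unk>": 1}
    for tokens in tokenized_articles:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(tokens, vocabulary, length):
    """Map tokens to ids, truncating or padding to a fixed length."""
    ids = [vocabulary.get(token, vocabulary["<unk>"]) for token in tokens[:length]]
    return ids + [vocabulary["<pad>"]] * (length - len(ids))


article_a = "Typhoon makes landfall in coastal city"
article_b = "Coastal city braces as typhoon makes landfall"
tokens_a, tokens_b = segment(article_a), segment(article_b)
vocabulary = build_vocabulary([tokens_a])  # the preset dictionary, built here from one article
vector_a = encode(tokens_a, vocabulary, length=8)
vector_b = encode(tokens_b, vocabulary, length=8)
```

Words missing from the dictionary ("braces" and "as" in the second article) map to the reserved unknown id, and both vectors have the same length so that they can be fed to the embedding layer as a pair.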

Step S240, inputting the character coding vectors corresponding to the two articles in each text pair into the embedding layer to be converted into matrix vectors, and respectively inputting the matrix vectors corresponding to the two articles in the same text pair into two independent convolution layers on the same layer to carry out feature extraction;

Step S250, respectively inputting the features extracted by the two convolutional layers into a pooling layer, so as to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features;

Step S260, inputting the similarity between the corresponding features of the two articles in each text pair and the features subjected to dimension reduction into a fully connected layer for classification and normalization processing to obtain the text pair model.

In this embodiment, a PairCNN model is preferably used for deep learning on text pairs. The character encoding vectors are first converted into matrix vectors by an embedding layer; two CNN networks with the same parameters then extract features from the two texts; the similarity is calculated from the outputs of the pooling layer; and the similarity, together with the features output by the pooling layer, is used as the input of a classifier. The classifier uses a fully connected layer: Softmax is applied to its output, the loss is computed against the label of the sample, and the model parameters are updated iteratively by stochastic gradient descent, finally yielding the text pair model.
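The final classification step, Softmax over the fully connected layer output followed by a loss against the sample label, can be sketched in plain Python. Cross-entropy is assumed as the loss because the text above names only "Loss"; the two logits are invented sample values, and the ordering of the classes (index 0 for different events, index 1 for the same event) is likewise an assumption.

```python
import math


def softmax(logits):
    """Normalize raw scores into probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[label])


# Hypothetical fully connected output for one text pair:
# index 0 = "different events", index 1 = "same event".
logits = [0.3, 1.7]
probabilities = softmax(logits)
loss = cross_entropy(probabilities, label=1)  # label 1: a positive sample
same_event = probabilities[1] > probabilities[0]
```

During training, the gradient of this loss with respect to the model parameters drives the stochastic gradient descent update; at inference time only the probabilities are needed.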

In the embodiment, a text pair is introduced as a training sample, and two independent and same-layer convolutional layers are used for feature extraction in the model training process, so that feature extraction and similarity comparison between different articles are realized, and a technical basis is provided for clustering of hot events. The text pair model of the embodiment not only identifies the features of any two articles in the same category, but also identifies the features of any two articles in different categories, so that the hot event discovery accuracy can be improved.

Referring to fig. 4, fig. 4 is a schematic view of a detailed flow of the step S160 in fig. 2. In this embodiment, the step S160 further includes:

Step S1601, initializing the event graph to divide each vertex into a different partition, wherein the number of the initial partitions is the same as that of the vertices;

In this embodiment, before the event graph is segmented, it is initialized: each vertex in the graph is regarded as an independent partition, so the number of initial partitions equals the number of vertices. For example, if there are 100 vertices in the event graph, the event graph is divided into 100 partitions, with one vertex per partition.

Step S1602, performing partition trial division on each vertex one by one to divide each vertex into partitions in which neighboring vertices adjacent to the vertex are located, calculating modularity variation values of the event graph corresponding to the vertices before and after division, and recording the neighboring vertices corresponding to the maximum modularity variation value;

This embodiment introduces a metric for evaluating the quality of a network partition: modularity. The change in modularity reflects how good a given assignment of a vertex to a partition is. The calculation formula of the modularity is as follows:

wherein W represents the modularity, m represents an intermediate parameter, c represents a partition, and i, j represent vertices i, j within partition c; A_ij represents the weight of the edge between vertex i and vertex j, Σin represents the sum of the weights of the edges within partition c, and Σtot represents the sum of the weights of all edges connected to the vertices within partition c.
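The formula itself is not reproduced in this text. A form consistent with the symbols defined above, assuming the standard modularity of the Louvain community detection method (an assumption; the text does not name the method), would be the following. The symbols k_i (weighted degree of vertex i), c_i (partition of vertex i), δ (1 if its arguments are equal, otherwise 0), and k_{i,in} (total weight between vertex i and the vertices of the target partition) are introduced here for illustration, and m is taken to be the total edge weight of the graph.

```latex
W = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
  = \sum_{c} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^{2} \right],
\qquad m = \frac{1}{2} \sum_{i,j} A_{ij}, \quad k_i = \sum_{j} A_{ij}

\Delta W = \frac{k_{i,in}}{m} - \frac{\Sigma_{tot}\, k_i}{2 m^{2}}
```

The second equality holds when Σin is summed over ordered vertex pairs within c, so that each internal edge is counted twice, and Σtot is the sum of k_i over the vertices in c. ΔW is the modularity change when an isolated vertex i is moved into partition c under the same assumptions.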

In this embodiment, for each vertex, an attempt is made to assign the vertex to the partition where its neighbor vertex is located, calculate the modularity variation values of the event graph before and after assignment, and record the neighbor vertex with the largest modularity variation value.

Step S1603, if the maximum modularity change is greater than 0, moving the corresponding vertex into the partition of the neighbor vertex corresponding to the maximum modularity change; otherwise, abandoning the tentative partitioning of the vertex;

step S1604, repeatedly executing the tentative partitioning until the partitions of all vertices no longer change;

In this embodiment, a maximum modularity change greater than 0 means that the corresponding assignment is better than the current one, so the tentative partitioning is accepted and the vertex is moved into the partition of the neighbor vertex corresponding to the maximum modularity change; otherwise, the tentative partitioning of the vertex is abandoned.

In this embodiment, after one round of tentative partitioning for all vertices is completed, the next round of tentative partitioning for all vertices is continued, that is, S1602 and S1603 are repeatedly executed until the partitions corresponding to all vertices are not changed any more, and the tentative partitioning is stopped.

Step S1605, compressing all vertices in the same partition into one new vertex to construct a new event graph, wherein the weights of the edges between vertices in the same partition become the weight of the self-loop of the new vertex, and the weights of the edges between different partitions become the weight of the edge between the corresponding new vertices;

step S1606, repeatedly executing the process of constructing a new event graph until the modularity of the entire event graph no longer changes, wherein one partition corresponds to one sub-event graph.

In this embodiment, once the tentative partitioning is complete, that is, once no vertex of the current event graph can be moved any further, the event graph is compressed to construct a new event graph, and the tentative partitioning is repeated on the new graph, that is, S1601-S1605 are executed repeatedly until the modularity of the entire event graph no longer changes. In other words, if tentative partitioning of a newly constructed event graph leaves its modularity unchanged, the partitioning stops. Each partition in the final event graph corresponds to one sub-event graph, and the articles corresponding to all vertices in the same sub-event graph are articles on the same hot event, so that hot events are clustered on the basis of the articles identified as belonging to the same event.
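Steps S1601-S1606 can be sketched in pure Python as below. This is an illustrative sketch under stated assumptions rather than the patented implementation: the graph is a symmetric dict of dicts, the function names are hypothetical, the modularity follows the standard Louvain form, and a self-loop entry stores twice the internal edge weight so that vertex degrees are preserved after compression.

```python
from collections import defaultdict

def local_moving(A):
    """S1601-S1604: one partition per vertex, then greedy tentative moves."""
    two_m = sum(sum(nbrs.values()) for nbrs in A.values())
    part = {u: u for u in A}                          # S1601
    if two_m == 0:
        return part
    deg = {u: sum(A[u].values()) for u in A}
    tot = dict(deg)                                   # degree sum per partition
    moved = True
    while moved:                                      # S1604
        moved = False
        for u in A:                                   # S1602
            old = part[u]
            tot[old] -= deg[u]                        # take u out of its partition
            link = defaultdict(float)                 # weight from u to each partition
            for v, w in A[u].items():
                if v != u:
                    link[part[v]] += w

            def gain(c):
                return link[c] / (two_m / 2) - tot[c] * deg[u] / (two_m ** 2 / 2)

            best, best_gain = old, gain(old)
            for c in list(link):
                g = gain(c)
                if g > best_gain + 1e-12:             # S1603: positive change only
                    best, best_gain = c, g
            tot[best] += deg[u]
            if best != old:
                part[u] = best
                moved = True
    return part

def aggregate(A, part):
    """S1605: compress each partition into one vertex."""
    B = defaultdict(lambda: defaultdict(float))
    for u, nbrs in A.items():
        B[part[u]]                                    # keep isolated partitions
        for v, w in nbrs.items():
            B[part[u]][part[v]] += w                  # diagonal entries are self-loops
    return {c: dict(nbrs) for c, nbrs in B.items()}

def louvain(A):
    """S1606: repeat moving and compression until no partition changes."""
    members = {u: [u] for u in A}
    while True:
        part = local_moving(A)
        if all(part[u] == u for u in A):
            return list(members.values())
        merged = defaultdict(list)
        for u, c in part.items():
            merged[c].extend(members[u])
        members = dict(merged)
        A = aggregate(A, part)

def modularity(A, communities):
    two_m = sum(sum(nbrs.values()) for nbrs in A.values())
    q = 0.0
    for com in communities:
        inside = sum(w for u in com for v, w in A[u].items() if v in set(com))
        tot = sum(sum(A[u].values()) for u in com)
        q += inside / two_m - (tot / two_m) ** 2
    return q

# Toy event graph: two groups of similar articles joined by one weak edge.
graph = {
    "a1": {"a2": 0.9, "a3": 0.9},
    "a2": {"a1": 0.9, "a3": 0.9},
    "a3": {"a1": 0.9, "a2": 0.9, "b1": 0.1},
    "b1": {"a3": 0.1, "b2": 0.9, "b3": 0.9},
    "b2": {"b1": 0.9, "b3": 0.9},
    "b3": {"b1": 0.9, "b2": 0.9},
}
sub_event_graphs = louvain(graph)
```

On this toy graph the weak edge between `a3` and `b1` is not enough to merge the two groups, so each group ends up as its own sub-event graph.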

Referring to fig. 5, fig. 5 is a schematic view of a detailed flow of the step S120 in fig. 2. In this embodiment, the step S120 further includes:

step S1201, determining the category number of rough clustering based on the number of the crawled articles;

step S1202, performing word segmentation processing on the crawled article to obtain a plurality of independent words or characters;

step S1203, converting the words or characters after word segmentation into word vectors, and according to the category number, roughly clustering the word vectors corresponding to each crawled text by using a preset clustering algorithm to obtain a rough clustering result.

To obtain a better clustering effect, a correspondence between the number of articles and the number of clustering categories is preferably set in advance. For example, 10,000 articles may be clustered into more than 10 categories, and more than 100,000 articles into 100 categories, so that the number of articles in each category after clustering is kept below 1000.
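One simple way to encode such a correspondence is sketched below. The function name and the rule itself are assumptions; the text only gives example figures and the target of fewer than 1000 articles per category. The rule bounds the average category size, not the size of every individual category.

```python
def rough_category_count(num_articles, max_per_category=1000):
    """Smallest category count whose average size stays below max_per_category."""
    if num_articles < 0:
        raise ValueError("num_articles must be non-negative")
    return num_articles // max_per_category + 1

category_count = rough_category_count(10_000)   # more than 10 categories
```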

In this embodiment, the sentences in an article are decomposed into a plurality of independent words or characters by means of jieba word segmentation or a similar technique, so that the segmented words or characters can subsequently be converted into word vectors. The conversion method is not limited in this embodiment. For example, each word or character is first encoded based on a preset word dictionary or character dictionary, and the resulting codes are then converted into word vectors using word2vec or a similar method.
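The dictionary-based encoding step can be sketched as follows. The token lists are assumed to be the output of a segmenter such as jieba, and the function names, the `<pad>` and `<unk>` entries, and the fixed length are illustrative assumptions not stated in the text. Converting the codes to word vectors (for example with word2vec) is a separate step that is not shown.

```python
def build_dictionary(tokenized_articles, reserved=("<pad>", "<unk>")):
    """Assign an integer code to every distinct token, in order of first appearance."""
    vocab = {tok: i for i, tok in enumerate(reserved)}
    for tokens in tokenized_articles:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len):
    """Map tokens to codes, truncating or padding to a fixed length."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

vocab = build_dictionary([["stock", "market", "falls"], ["market", "rallies"]])
codes = encode(["market", "crash"], vocab, max_len=4)
```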

In this embodiment, the preset clustering algorithm may be traditional K-means or an LDA topic model.
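A minimal K-means sketch for the rough clustering step is given below. It assumes each article has already been reduced to one numeric vector (for example, an average of its word vectors, which the text does not specify), and it uses a deterministic farthest-point initialization chosen here only for reproducibility. The function names and toy vectors are illustrative.

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Return one cluster label per point."""
    # Farthest-point initialization (an assumption, not from the text).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:                  # assignments are stable
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                           # keep old centroid if cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

article_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
rough_labels = kmeans(article_vectors, k=2)
```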

Referring to fig. 6, fig. 6 is a flowchart illustrating a hot spot event discovery method according to a third embodiment of the present invention. In this embodiment, after the step S130, the method further includes:

step S310, acquiring titles of articles in each constructed text pair in the same category;

step S320, judging whether the titles of the articles in the same text pair are the same;

and step S330, if the titles are the same, retaining the corresponding text pair; otherwise, removing the corresponding text pair.

In this embodiment, in order to reduce the model input and improve the processing efficiency of the model, the constructed text pairs are filtered before being input into the model. Specifically, text pairs whose articles are entirely dissimilar are removed by comparing the article titles.

In this embodiment, the manner of determining whether the article titles are the same is not limited. For example, keywords may be compared, and the titles are determined to be the same if the keywords in the two titles are exactly the same. Alternatively, whether the titles of the two articles in a text pair are the same may be determined by semantic recognition. If the titles are the same, the text pair is retained as an object to be classified; otherwise, the text pair is removed.
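The pairing of articles within each rough category and the title filter of steps S310-S330 can be sketched as follows. The keyword-comparison variant is shown. Whitespace tokenization, the stop-word list, and all function names are assumptions made for illustration; Chinese titles would need a word segmenter instead.

```python
from itertools import combinations

STOPWORDS = frozenset({"the", "a", "an", "of", "in", "on", "for"})  # illustrative only

def build_text_pairs(rough_clusters):
    """Pair every two articles that fall in the same rough category."""
    pairs = []
    for article_ids in rough_clusters.values():
        pairs.extend(combinations(article_ids, 2))
    return pairs

def title_keywords(title):
    return {word for word in title.lower().split() if word not in STOPWORDS}

def filter_pairs_by_title(pairs, titles):
    """Keep a pair only if both titles have exactly the same keyword set."""
    return [(a, b) for a, b in pairs
            if title_keywords(titles[a]) == title_keywords(titles[b])]

rough_clusters = {0: ["n1", "n2", "n3"], 1: ["n4"]}
titles = {
    "n1": "Central bank cuts rates",
    "n2": "central bank cuts the rates",
    "n3": "Typhoon lands in the south",
    "n4": "Team wins final",
}
all_pairs = build_text_pairs(rough_clusters)
kept_pairs = filter_pairs_by_title(all_pairs, titles)
```

Exact keyword equality is a strict filter; a looser overlap threshold or the semantic comparison mentioned above would retain more pairs at the cost of more model input.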

The invention also provides a device for discovering the hot event.

Referring to fig. 7, fig. 7 is a functional module diagram of a hot spot event discovery apparatus according to a first embodiment of the present invention. In this embodiment, the hotspot event discovery device includes:

the acquisition module 10 is used for acquiring articles released by a specified website by adopting a web crawler technology;

the clustering module 20 is configured to determine the number of categories of rough clustering based on the number of the crawled articles, and cluster all the articles by using a preset clustering algorithm to obtain a rough clustering result;

the matching module 30 is used for sequentially matching every two articles in the same category from the rough clustering result to construct a text pair;

the model processing module 40 is used for preprocessing each text pair, then sequentially inputting the text pairs into a preset text pair model for processing, and outputting the similarity of any two articles and whether the two articles belong to the same event;

the graph building module 50 is used for building an event graph by taking each article as a vertex of the graph, taking pairwise connecting lines of the articles of the same event as edges of the graph and taking the similarity between the articles as the weight of corresponding edges;

the segmenting module 60 is configured to segment the event graph to obtain a plurality of sub-event graphs, where articles corresponding to all vertices in the same sub-event graph are articles corresponding to the same hotspot event.
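The construction performed by the graph building module can be sketched as below. The shape of the model output (a mapping from an article pair to a similarity and a same-event flag) and the function name are assumptions made for illustration.

```python
def build_event_graph(article_ids, predictions):
    """Vertices are articles; same-event pairs become edges weighted by similarity."""
    graph = {article: {} for article in article_ids}
    for (a, b), (similarity, same_event) in predictions.items():
        if same_event:
            graph[a][b] = similarity              # undirected: store both directions
            graph[b][a] = similarity
    return graph

predictions = {
    ("n1", "n2"): (0.92, True),
    ("n1", "n3"): (0.15, False),
    ("n2", "n3"): (0.40, True),
}
event_graph = build_event_graph(["n1", "n2", "n3", "n4"], predictions)
```

Articles that share no same-event pair, such as `n4` here, remain as isolated vertices and end up in their own sub-event graph after segmentation.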

Since the embodiments of the hot event discovery apparatus correspond to the embodiments of the hot event discovery method of the present invention described above, they are not described in detail again here.

Referring to fig. 8, fig. 8 is a functional module diagram of a hot spot event discovery apparatus according to a second embodiment of the present invention. Based on the first embodiment of the foregoing apparatus, in this embodiment, the apparatus for discovering a hotspot event further includes:

an obtaining module 70, configured to obtain training samples for training a text pair model, where the training samples are text pairs and include positive samples and negative samples, one text pair includes two articles, the positive samples are obtained by pairing different articles of the same event, and the negative samples are obtained by pairing articles of different events and sampling the resulting pairs, and by pairing articles of similar events;

a word segmentation module 80, configured to perform word segmentation processing on two articles in each text pair to obtain multiple independent words or characters;

the encoding module 90 is configured to encode each word or character by using a preset word dictionary or character dictionary to obtain the character encoding vectors corresponding to the two articles in each text pair;

the model building module 100 is configured to input the character encoding vectors corresponding to the two articles in each text pair into the embedding layer to convert the character encoding vectors into matrix vectors, and input the matrix vectors corresponding to the two articles in the same text pair into two independent convolutional layers on the same layer respectively to perform feature extraction; respectively input the features extracted by the two convolutional layers into the pooling layer to calculate the similarity between the corresponding features of the two articles in the same text pair and reduce the dimension of the extracted features; and input the similarity between the corresponding features of the two articles in each text pair and the dimension-reduced features into a fully connected layer for classification and normalization processing to obtain the text pair model.
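The sample construction performed by the obtaining module can be sketched as follows. The input is assumed to be a mapping from a labeled event to its articles; the function name, the label encoding (1 for the same event, 0 otherwise), and the fixed random seed are illustrative assumptions. Pairing articles of similar events to obtain harder negatives is not shown, because the text does not say how similar events are identified.

```python
import random
from itertools import combinations

def build_training_pairs(events, num_negative, seed=0):
    """Return (article_a, article_b, label) triples for training."""
    # Positive samples: different articles of the same event.
    positives = [(a, b, 1)
                 for articles in events.values()
                 for a, b in combinations(articles, 2)]
    # Negative candidates: articles of different events, then sampled.
    cross = [(a, b, 0)
             for (_, arts_1), (_, arts_2) in combinations(events.items(), 2)
             for a in arts_1
             for b in arts_2]
    rng = random.Random(seed)
    negatives = rng.sample(cross, min(num_negative, len(cross)))
    return positives + negatives

events = {"event_1": ["a1", "a2", "a3"], "event_2": ["b1", "b2"]}
samples = build_training_pairs(events, num_negative=3)
```

Sampling the cross-event pairs keeps the negative class from swamping the positives, since the number of cross-event pairs grows much faster than the number of same-event pairs.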

Optionally, in a specific embodiment, the segmentation module includes:

Optionally, in a specific embodiment, the clustering module is specifically configured to:

Optionally, in a specific embodiment, the hotspot event discovery device further includes:

The present invention also provides a non-volatile computer-readable storage medium.

In this embodiment, the computer-readable storage medium stores a hotspot event discovery program, and the hotspot event discovery program, when executed by a processor, implements the steps of the hotspot event discovery method in any one of the above embodiments. The method implemented by the hotspot event discovery program executed by the processor may refer to the embodiments of the hotspot event discovery method of the present invention, and therefore, redundant description is not repeated.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The present invention has been described with reference to the accompanying drawings, but it is not limited to the above embodiments, which are illustrative rather than restrictive. Those skilled in the art can make various changes without departing from the spirit of the invention and the scope defined by the appended claims, and all equivalent changes made on the basis of the specification and drawings are intended to fall within the scope of protection of the invention.

Claims

1. A hot spot event discovery method is characterized by comprising the following steps:

2. The method for discovering hot spot events according to claim 1, wherein before the step of collecting the articles published by the specified website by using web crawler technology, the method further comprises:

3. The method for discovering hot spot events according to claim 1, wherein the step of segmenting the event graph to obtain a plurality of sub-event graphs comprises:

4. The method for discovering hot spot events according to claim 1, wherein the determining a category number of the rough clustering based on a number of articles crawled and clustering all the articles using a preset clustering algorithm to obtain a rough clustering result comprises:

5. The method for discovering hot spot events according to any one of claims 1-4, wherein after the step of sequentially pairing articles in the same category from the rough clustering result to construct text pairs, further comprising:

judging whether the titles of articles in the same text pair are the same;

6. A hotspot event discovery device, comprising:

7. The hotspot event discovery apparatus of claim 6, wherein said hotspot event discovery apparatus further comprises:

8. The hotspot event discovery apparatus of claim 6, wherein said segmentation module comprises:

9. A hotspot event discovery device, comprising a memory, a processor and a hotspot event discovery program stored on the memory and executable on the processor, wherein the hotspot event discovery program, when executed by the processor, implements the steps of the hotspot event discovery method according to any one of claims 1 to 5.

10. A computer-readable storage medium, having stored thereon a hotspot event discovery program, which when executed by a processor, implements the steps of the hotspot event discovery method of any one of claims 1-5.