CN113821739A

CN113821739A - Local event detection method, device, equipment and storage medium

Info

Publication number: CN113821739A
Application number: CN202111381988.7A
Authority: CN
Inventors: 宋轩; 李永康; 范子沛; 尹渡; 冯德帆; 邓锦亮; 王宏俊
Original assignee: Southwest University of Science and Technology
Current assignee: Southwest University of Science and Technology; Southern University of Science and Technology
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2021-12-21
Anticipated expiration: 2041-11-22
Also published as: CN113821739B

Abstract

The invention discloses a local event detection method, apparatus, device and storage medium. The method comprises: acquiring tweet data of a preset region in real time; performing two-stage classification on each piece of tweet data through a preset two-stage classifier to obtain a first-stage label and a second-stage label of each piece of tweet data, and acquiring the tweet data belonging to the same event category according to the first-stage label and the second-stage label; respectively acquiring the position information of each piece of tweet data belonging to the same event category; clustering the tweet data belonging to the same event category according to the text, the release time and the position information of the tweet data to obtain tweet clusters belonging to the same event category; and respectively generating an event summary of each tweet cluster as the local event corresponding to that tweet cluster. The invention can ensure both the real-time performance and the accuracy of local event detection.

Description

Local event detection method, device, equipment and storage medium

Technical Field

The present invention relates to the field of data mining technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting a local event.

Background

Real-time detection of local events in a city is very important for city management, because it helps city managers perceive what is happening and implement policy. Most citizens are occupied with daily work and have few channels for learning about events occurring near where they live. Local event detection can provide them with more local information, so that they can participate more, feel that their concerns are noticed, and enjoy a greater sense of well-being. However, traditional news media have limited resources and tend to focus only on a few high-priority events in a city (such as major accidents or major competitions), and their reports tend to be significantly delayed (for example, today's newspaper reports an event that occurred yesterday). Real-time detection of local events has therefore long been a difficult problem.

As network terminals such as mobile phones and computers have entered everyday life, the development of online social media has enabled people to share their lives online in real time. Microblog (Weibo), Twitter and Instagram are representative platforms. On these platforms users can share pictures, text and videos in real time, can attach positioning information, and can choose to make their posts public so that any user can view them. By the end of 2020, microblog had 523 million monthly active users, Twitter had more than 330 million, and Instagram had more than 1 billion. This huge user base updates a massive amount of information on social platforms every day, and that information includes local events shared by many users, such as taking part in a sporting event or witnessing a traffic accident on a road. Tweets about such a local event are not only geographically close but also semantically identical or related. Unlike large-scale news, the number of tweets relevant to a single city event is often very small, perhaps only ten or twenty, so mining the events occurring in a city from massive social media tweet streams in real time is also a difficult problem.

At present, the prior art proposes a method for detecting microblog emergencies, which comprises the following steps: first, acquiring a microblog text data set, and filtering noise from it based on the attention each microblog text receives and the influence of its publisher; establishing a plurality of time windows of a preset duration, and dividing the microblog texts in the data set into the corresponding time windows; preprocessing the microblog texts in each time window; and extracting the burst feature word set of each time window based on preset feature attributes, then calculating the similarity between the burst feature words in the target time window to generate the burst events of the target time window.

A microblog emergency detection method based on a BERT-BTM network has also been proposed, which comprises the following steps: processing the microblog data set (word segmentation and stop-word removal) to obtain an original data set, and then vector-encoding the original data set with a pre-trained BERT model, so that each microblog text is represented by a group of fixed-length vectors; then constructing a BERT-BTM model according to the Dirichlet prior parameter alpha and the prior parameter beta i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set; and finally, constructing a BERT-BTM network according to the emergency word set and the co-occurrence relations between the words in it, and completing emergency detection by partitioning the BERT-BTM network.

However, the above methods mainly focus on detecting emergencies, which often become news hotspots and attract great interest. Such news can quickly gain hundreds of millions of views and is accompanied by a large number of related tweets reporting and commenting on it, so detecting these events in a social media information stream is relatively simple. Detecting such events, however, does not help city managers know the state of the city in real time, nor does it help residents learn about activities taking place near where they live.

Local event detection focuses on events occurring within a particular city. These events are closely tied to citizens' daily lives: learning about surrounding events promptly is very important for improving citizens' sense of well-being, and it also lets city managers quickly understand the current situation of the city and prepare response plans for related problems and potential risks.

Moreover, the above emergency detection methods cannot detect events in real time. They operate on an existing microblog data set collected over a period of time, which means that a time window must exist and the model can only run after that time window has ended. A data set based on time windows differs greatly from the continuously arriving real-time information stream of online social media, and forcing such a method onto the real-time stream can degrade the model's performance or even prevent it from running.

In addition, a local event detection method based on microblog data with geographical position labels has also been proposed. In this method, the spatial information, time information and text information of a microblog are encoded and mapped into the same low-dimensional vector space, and the three parts are then concatenated into one vector that represents the whole microblog. After the vector representations are obtained, the continuously arriving microblog stream is clustered with a Bayesian mixture clustering model, where each cluster is a potential local event, and a logistic regression classifier is then constructed to classify the resulting clusters and judge whether each one is a real local event.

However, this method has to encode and cluster online all microblogs carrying geo-tags. It requires a large amount of computing resources, has high computational complexity and takes a long time, which is unfavorable for perceiving local events in real time.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a local event detection method, apparatus, device and storage medium that improve the efficiency of local event detection while ensuring that detection is both real-time and accurate.

In a first aspect, the present invention provides a local event detection method, including:

acquiring tweet data of a preset region in real time, wherein the tweet data of the preset region comprises tweet data whose check-in information is a point of interest in the preset region and tweet data containing keywords corresponding to the preset region;

performing two-stage classification on each piece of tweet data through a preset two-stage classifier to obtain a first-stage label and a second-stage label of each tweet data, wherein the first-stage label is used for indicating whether the tweet data has a potential event, the second-stage label is used for indicating an event type to which the tweet data belongs when the potential event exists, and the tweet data belonging to the same event type is obtained according to the first-stage label and the second-stage label of each tweet data;

respectively acquiring the position information of each piece of tweet data belonging to the same event type;

clustering the tweet data belonging to the same event category according to the text, the release time and the position information of the tweet data to obtain tweet clusters belonging to the same event category;

and respectively generating an event summary of each tweet cluster as the local event corresponding to that tweet cluster, wherein the event summary comprises the texts of a preset first number of pieces of tweet data and a preset second number of keywords in the corresponding tweet cluster.

In a second aspect, the present invention further provides a local event detection apparatus, including:

a first acquisition module, used for acquiring tweet data of a preset region in real time, wherein the tweet data of the preset region comprises tweet data whose check-in information is a point of interest in the preset region and tweet data containing keywords corresponding to the preset region;

the classification module is used for performing two-stage classification on each tweet data through a preset two-stage classifier to obtain a first-stage label and a second-stage label of each tweet data, wherein the first-stage label is used for indicating whether the tweet data has a potential event, the second-stage label is used for indicating an event type to which the tweet data belongs when the potential event exists, and the tweet data belonging to the same event type is obtained according to the first-stage label and the second-stage label of each tweet data;

the second acquisition module is used for respectively acquiring the position information of each piece of tweet data belonging to the same event type;

the clustering module is used for clustering the tweet data belonging to the same event category according to the text, the release time and the position information of the tweet data to obtain tweet clusters belonging to the same event category;

the generation module is used for respectively generating an event summary of each tweet cluster as the local event corresponding to that tweet cluster, wherein the event summary comprises the texts of a preset first number of pieces of tweet data and a preset second number of keywords in the corresponding tweet cluster.

In a third aspect, the present invention also provides an electronic device, including:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the local event detection method as provided in the first aspect.

In a fourth aspect, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the local event detection method as provided in the first aspect.

The beneficial effects of the invention are as follows: classifying the tweet data through the two-stage classifier effectively removes tweets irrelevant to any event, so that a large amount of irrelevant tweet data is kept out of the subsequent clustering process; clustering the tweet data within each event category means that each piece of tweet data only needs a similarity judgment against the tweet clusters of the same event category, which greatly reduces the required computing resources, speeds up processing, improves the efficiency of local event detection and ensures real-time detection; and clustering according to the text, the release time and the position information ensures that each piece of tweet data added to a tweet cluster belongs to the same local event as that cluster, which ensures the accuracy of local event detection.

Drawings

FIG. 1 is a flow chart of a local event detection method according to the present invention;

fig. 2 is a schematic structural diagram of a local event detection device according to the present invention;

fig. 3 is a schematic structural diagram of an electronic device provided in the present invention;

fig. 4 is a flowchart of a local event detection method according to a first embodiment of the present invention;

fig. 5 is a schematic structural diagram of a named entity recognition model according to a first embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a sub computer program, or the like.

Furthermore, the terms "first," "second," and the like may be used herein to describe various orientations, actions, steps, elements, or the like, but the orientations, actions, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. For example, the first information may be referred to as second information, and similarly, the second information may be referred to as first information, without departing from the scope of the present application. The first information and the second information are both information, but they are not the same information. The terms "first", "second", etc. are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

As shown in fig. 1, a local event detection method includes:

S101: acquiring tweet data of a preset region in real time, wherein the tweet data of the preset region comprises tweet data whose check-in information is a point of interest in the preset region and tweet data containing keywords corresponding to the preset region;

S102: performing two-stage classification on each piece of tweet data through a preset two-stage classifier to obtain a first-stage label and a second-stage label of each piece of tweet data, wherein the first-stage label is used for indicating whether the tweet data has a potential event, the second-stage label is used for indicating the event category to which the tweet data belongs when the potential event exists, and the tweet data belonging to the same event category is obtained according to the first-stage label and the second-stage label of each piece of tweet data;

S103: respectively acquiring the position information of each piece of tweet data belonging to the same event category;

S104: clustering the tweet data belonging to the same event category according to the text, the release time and the position information of the tweet data to obtain tweet clusters belonging to the same event category;

S105: respectively generating an event summary of each tweet cluster as the local event corresponding to that tweet cluster, wherein the event summary comprises the texts of a preset first number of pieces of tweet data and a preset second number of keywords in the corresponding tweet cluster.

In this method, the two-stage classifier is used for filtering, and the tweet data of each event category is separated out and clustered on its own, which greatly reduces the computing resources required, speeds up processing and improves detection efficiency. Three elements (text, release time and position information) are used in the clustering process to ensure that each piece of tweet data added to a tweet cluster belongs to the same local event as that cluster, thereby ensuring the authenticity of the event represented by each tweet cluster that is finally output.

In an optional embodiment, before performing two-stage classification on each piece of tweet data through a preset two-stage classifier to obtain a first-stage label and a second-stage label of each piece of tweet data, the method further includes:

building a BERT text classifier;

obtaining sample data, labeling the sample data with a label to obtain training data, wherein the label comprises a primary label and a secondary label, the value of the primary label is a first value representing that no potential event exists or a second value representing that the potential event exists, and the value of the secondary label is a preset event type;

and training the BERT text classifier according to the training data to obtain a two-stage classifier, wherein the two-stage classifier comprises a first-stage classifier and a second-stage classifier, the first-stage classifier is used for classifying the tweet data into tweet data with potential events and tweet data without potential events, and the second-stage classifier is used for determining the event category to which the tweet data with potential events belong.

In this method, the two-stage classifier is trained by fine-tuning a pre-trained BERT model on a labeled data set. It effectively removes tweets irrelevant to any event, so that far fewer useless tweets take part in the subsequent clustering process, and the tweet data in each event category is clustered separately, which further reduces the complexity of the whole process and makes the whole system faster and more accurate.
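The routing performed by the two stages can be sketched as follows. This is a minimal illustration, not the patented implementation: the label values `NO_EVENT` and `HAS_EVENT`, the function names, and the keyword-based toy classifiers standing in for the fine-tuned BERT models are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Hypothetical label values; the text only speaks of a "first value" and a "second value".
NO_EVENT, HAS_EVENT = 0, 1


def group_by_event_category(
    tweets: Iterable[str],
    stage_one: Callable[[str], int],
    stage_two: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Apply both stages in order and bucket tweets by their second-stage label."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in tweets:
        if stage_one(text) != HAS_EVENT:
            continue  # no potential event: the tweet never reaches clustering
        groups[stage_two(text)].append(text)
    return dict(groups)


# Toy keyword rules standing in for the trained classifiers (assumption).
def toy_stage_one(text: str) -> int:
    return HAS_EVENT if any(w in text for w in ("accident", "concert")) else NO_EVENT


def toy_stage_two(text: str) -> str:
    return "traffic" if "accident" in text else "entertainment"


grouped = group_by_event_category(
    ["accident on the ring road", "nice lunch today", "concert at the bay tonight"],
    toy_stage_one,
    toy_stage_two,
)
```

Because the second stage runs only on tweets that pass the first, the more expensive per-category work downstream sees only event-related tweets.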

In an optional embodiment, the respectively obtaining position information of each piece of tweet data belonging to the same event category includes:

if the tweet data contain the check-in information, acquiring the position information of the interest point in the check-in information as the position information of the tweet data;

if the tweet data does not contain the check-in information, identifying the potential address of the tweet data through a preset named entity recognition model, and acquiring the position information of the potential address through a map as the position information of the tweet data.

Since local events are events that occur within a given city, a precise address location is required; an accurate potential address can be obtained from the point of interest in the check-in information or through the named entity recognition model. If no potential address is identified, the tweet data is considered to carry no useful address information and is discarded.
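The fallback order described above (check-in first, then recognized address, otherwise discard) might look like the sketch below. The tweet dictionary layout, the function names, the toy gazetteer and its coordinates are all hypothetical; `extract_address` stands in for the named entity recognition model and `geocode` for the map lookup.

```python
from typing import Callable, Optional, Tuple

LatLon = Tuple[float, float]


def resolve_location(
    tweet: dict,
    extract_address: Callable[[str], Optional[str]],
    geocode: Callable[[str], Optional[LatLon]],
) -> Optional[LatLon]:
    """Return (lat, lon) for a tweet, or None if the tweet should be discarded."""
    checkin = tweet.get("checkin")
    if checkin is not None:
        # Check-in present: use the point of interest's coordinates directly.
        return (checkin["lat"], checkin["lon"])
    address = extract_address(tweet["text"])
    if address is None:
        return None  # no potential address identified: discard
    # Assumption: a failed map lookup is also treated as "no usable address".
    return geocode(address)


# Toy stand-ins with made-up coordinates (assumption).
GAZETTEER = {"Central Library": (22.55, 114.06)}


def toy_extract_address(text: str) -> Optional[str]:
    for name in GAZETTEER:
        if name in text:
            return name
    return None


def toy_geocode(address: str) -> Optional[LatLon]:
    return GAZETTEER.get(address)


with_checkin = resolve_location(
    {"text": "great show", "checkin": {"lat": 22.53, "lon": 113.93}},
    toy_extract_address, toy_geocode,
)
from_text = resolve_location(
    {"text": "fire alarm at Central Library"}, toy_extract_address, toy_geocode
)
discarded = resolve_location({"text": "what a day"}, toy_extract_address, toy_geocode)
```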

In an optional embodiment, the clustering the tweet data belonging to the same event category according to the text, the publishing time, and the location information of the tweet data to obtain a tweet cluster belonging to the same event category includes:

respectively generating the text vector code of each piece of tweet data according to the text of the tweet data;

sequentially taking, in order of release time, one piece of tweet data from the tweet data belonging to the same event category as the current tweet data;

judging whether the current tweet data is the first piece of tweet data;

if so, establishing a tweet cluster, adding the current tweet data into the tweet cluster, and setting the time, the address and the text vector code of the tweet cluster according to the release time, the position information and the text vector code of the current tweet data;

if not, judging whether there exists a tweet cluster for which the distance between the position information of the current tweet data and the address of the tweet cluster is smaller than or equal to a preset first distance threshold, the time difference between the release time of the current tweet data and the time of the tweet cluster is smaller than a preset first time threshold, and the distance between the text vector code of the current tweet data and the text vector code of the tweet cluster is smaller than a preset second distance threshold;

if such a tweet cluster exists, adding the current tweet data into the tweet cluster, and updating the time, the address and the text vector code of the tweet cluster according to the release time, the position information and the text vector code of each piece of tweet data in the tweet cluster;

and if no such tweet cluster exists, establishing a new tweet cluster, adding the current tweet data into the new tweet cluster, and setting the time, the address and the text vector code of the new tweet cluster according to the release time, the position information and the text vector code of the current tweet data.

As can be seen from the above description, a piece of tweet data is added to a tweet cluster only when three conditions hold at once: the distance between the position information of the tweet data and the address of the tweet cluster does not exceed a preset distance, the time difference between the release time of the tweet data and the time of the tweet cluster is less than a preset time, and the distance between the text vector code of the tweet data and the text vector code of the tweet cluster is less than a preset distance.
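The three-condition membership test can be sketched as below. The text does not say how the geographic or text-vector distances are measured, so the haversine formula and Euclidean distance used here are assumptions, as are the threshold values, the dictionary layout and the function names.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical thresholds; the text only calls them "preset".
MAX_KM = 1.0            # first distance threshold (geographic)
MAX_SECONDS = 3600.0    # first time threshold
MAX_TEXT_DIST = 0.5     # second distance threshold (text vectors)


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def matches(tweet: dict, cluster: dict) -> bool:
    """True only if all three conditions (place, time, text) hold together."""
    return (
        haversine_km(tweet["loc"], cluster["loc"]) <= MAX_KM
        and abs(tweet["time"] - cluster["time"]) < MAX_SECONDS
        and math.dist(tweet["vec"], cluster["vec"]) < MAX_TEXT_DIST
    )


def find_cluster(tweet: dict, clusters: List[dict]) -> Optional[int]:
    """Index of the first matching cluster, or None if a new cluster is needed."""
    for i, cluster in enumerate(clusters):
        if matches(tweet, cluster):
            return i
    return None


clusters = [{"loc": (22.54, 114.06), "time": 0.0, "vec": [1.0, 0.0]}]
near = {"loc": (22.541, 114.061), "time": 600.0, "vec": [0.9, 0.1]}
far = {"loc": (22.70, 114.06), "time": 600.0, "vec": [0.9, 0.1]}
late = {"loc": (22.541, 114.061), "time": 7200.0, "vec": [0.9, 0.1]}
```

Here `find_cluster(near, clusters)` finds the existing cluster, while `far` and `late` each fail one condition and would start new clusters.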

In an optional embodiment, the updating the time, the address, and the text vector code of the tweet cluster according to the publishing time, the location information, and the text vector code of each tweet data in the tweet cluster includes:

acquiring the latest release time according to the release time of each piece of tweet data in the tweet cluster, and updating the time of the tweet cluster according to the latest release time;

calculating a central point according to the position information of each tweet data in the tweet cluster, and updating the address of the tweet cluster according to the position information of the central point;

and calculating the average value of the text vector codes of all tweet data in the tweet cluster, and updating the text vector codes of the tweet cluster according to the average value.

As can be seen from the above description, the time of a tweet cluster is defined as the release time of the piece of tweet data in the cluster that is closest to the current time, that is, the latest release time among all tweet data in the cluster; the address of a tweet cluster is defined as the central point on the map of the position information of all tweet data in the cluster, namely the average of their longitudes and latitudes; and the text vector code of a tweet cluster is defined as the average of the text vector codes of all tweet data in the cluster.
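These three definitions translate directly into a small helper. The sketch below assumes each member is a dictionary with `time`, `loc` (latitude, longitude) and `vec` keys; that layout and the function name are hypothetical.

```python
from typing import List


def summarize_cluster(members: List[dict]) -> dict:
    """Recompute a cluster's time, address and text vector from its members."""
    n = len(members)
    latest = max(m["time"] for m in members)        # latest release time
    lat = sum(m["loc"][0] for m in members) / n     # mean latitude
    lon = sum(m["loc"][1] for m in members) / n     # mean longitude
    dim = len(members[0]["vec"])
    vec = [sum(m["vec"][k] for m in members) / n for k in range(dim)]
    return {"time": latest, "loc": (lat, lon), "vec": vec}


state = summarize_cluster([
    {"time": 100.0, "loc": (22.50, 114.00), "vec": [1.0, 0.0]},
    {"time": 250.0, "loc": (22.54, 114.10), "vec": [0.0, 1.0]},
])
```

Averaging latitude and longitude is a reasonable centre only over short, city-scale distances, which fits the local setting described here.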

In an optional embodiment, after updating the time, the address, and the text vector encoding of the tweet cluster according to the publishing time, the location information, and the text vector encoding of each tweet data in the tweet cluster, the method further includes:

respectively calculating the distance between the text vector code of the tweet cluster and the text vector codes of other tweet clusters belonging to the same event category;

and if the distance between the text vector code of the tweet cluster and the text vector code of another tweet cluster belonging to the same event type is smaller than a preset third distance threshold, merging the tweet cluster and the another tweet cluster, and updating the time, the address and the text vector code of the merged tweet cluster according to the release time, the position information and the text vector code of each tweet data in the merged tweet cluster.

Because an event may be discussed from different angles, tweet clusters that actually correspond to the same event may initially be separate. As more and more tweet data is added to them, it gradually becomes clear that two tweet clusters discuss the same event; the two clusters then need to be merged into a new tweet cluster, and the time, address and text vector code of the merged cluster are updated at the same time.
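A merge check run after a cluster has been updated could be sketched as follows. Clusters are represented simply as lists of members so that the merged cluster's statistics can be recomputed from all of its tweets; the Euclidean metric, the choice to merge with the first close cluster found, and all names are assumptions.

```python
import math
from typing import List


def mean_vector(members: List[dict]) -> List[float]:
    """Average text vector of a cluster's members."""
    n = len(members)
    dim = len(members[0]["vec"])
    return [sum(m["vec"][k] for m in members) / n for k in range(dim)]


def merge_after_update(
    clusters: List[List[dict]], idx: int, max_dist: float
) -> List[List[dict]]:
    """Merge cluster `idx` with the first other cluster closer than `max_dist`.

    Returns a new list of clusters; the input list is left unchanged.
    """
    current = clusters[idx]
    for j, other in enumerate(clusters):
        if j == idx:
            continue
        if math.dist(mean_vector(current), mean_vector(other)) < max_dist:
            merged = current + other
            rest = [c for k, c in enumerate(clusters) if k not in (idx, j)]
            return rest + [merged]
    return clusters


same_category = [
    [{"vec": [1.0, 0.0]}],
    [{"vec": [0.0, 1.0]}],
    [{"vec": [0.9, 0.1]}],
]
after = merge_after_update(same_category, idx=2, max_dist=0.3)
```

Only clusters of the same event category should be passed in, since the text restricts the comparison to that category.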

In an optional embodiment, after clustering the tweet data belonging to the same event category according to the text, the publishing time, and the location information of the tweet data to obtain tweet clusters belonging to the same event category, the method further includes:

and if the time difference between the time of a text pushing cluster and the current time exceeds a preset second time threshold, setting the text pushing cluster as a history cluster.

Because local events are time-sensitive, when no new tweet data has been added to a tweet cluster within a certain time, the life cycle of the tweet cluster is considered to have ended; the tweet cluster is retired as a history cluster and no longer participates in subsequent clustering judgments.
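Retiring idle clusters is a simple partition on the cluster time. In this sketch the field name `time`, the function name and the threshold value are hypothetical; times are plain seconds for brevity.

```python
from typing import List, Tuple


def retire_stale(
    clusters: List[dict], now: float, max_idle: float
) -> Tuple[List[dict], List[dict]]:
    """Split clusters into (active, history) by time since their latest tweet."""
    active: List[dict] = []
    history: List[dict] = []
    for cluster in clusters:
        if now - cluster["time"] > max_idle:
            history.append(cluster)   # exceeded the second time threshold
        else:
            active.append(cluster)
    return active, history


active, history = retire_stale(
    [{"id": "a", "time": 0.0}, {"id": "b", "time": 900.0}],
    now=1000.0,
    max_idle=500.0,
)
```

Only the `active` list would then be searched when the next tweet arrives, which keeps the per-tweet comparison count bounded.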

In an optional embodiment, the generating the event summary of each tweet cluster as the local event corresponding to each tweet cluster includes:

calculating the influence factor of each tweet cluster according to the total number of pieces of tweet data, the total number of comments, the total number of reposts and the total number of words in the tweet cluster, and the score corresponding to the event category to which the tweet cluster belongs;

respectively generating an event summary of each tweet cluster, wherein the event summary comprises the texts of a preset first number of pieces of tweet data and a preset second number of keywords in the corresponding tweet cluster;

receiving a first request sent by a client, wherein the first request comprises an event category;

and acquiring N tweet clusters which belong to the event category in the first request and have the highest influence factor, and returning event summaries of the N tweet clusters to the client, wherein N is a preset natural number.

According to the description, the user can define one or more event types to be checked, the system returns the event summaries of the N tweet clusters with the highest influence factors in the event types to be checked by the user, and the user can conveniently and quickly check more important events.
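The text lists the inputs to the influence factor but not how they are combined. The sketch below assumes a plain weighted sum with made-up weights and category scores, purely to show how a top-N query by category could be served; none of these numbers or names come from the source.

```python
import heapq
from typing import Dict, List

# Hypothetical weights and per-category scores (assumptions).
WEIGHTS: Dict[str, float] = {"tweets": 1.0, "comments": 0.5, "reposts": 0.5, "words": 0.01}
CATEGORY_SCORE: Dict[str, float] = {"traffic": 2.0, "entertainment": 1.0}


def influence(cluster: dict) -> float:
    """Assumed form: weighted sum of the cluster totals plus the category score."""
    base = sum(weight * cluster[key] for key, weight in WEIGHTS.items())
    return base + CATEGORY_SCORE.get(cluster["category"], 0.0)


def top_n(clusters: List[dict], category: str, n: int) -> List[dict]:
    """The n clusters of the requested category with the highest influence factor."""
    same = [c for c in clusters if c["category"] == category]
    return heapq.nlargest(n, same, key=influence)


all_clusters = [
    {"id": 1, "category": "traffic", "tweets": 10, "comments": 4, "reposts": 2, "words": 300},
    {"id": 2, "category": "traffic", "tweets": 3, "comments": 1, "reposts": 0, "words": 50},
    {"id": 3, "category": "entertainment", "tweets": 40, "comments": 9, "reposts": 7, "words": 900},
]
best_traffic = top_n(all_clusters, "traffic", 1)
```

In a service, the event summaries of the clusters returned by `top_n` would be sent back to the client that issued the first request.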

In an optional embodiment, the generating the event summary of each tweet cluster as the local event corresponding to each tweet cluster further includes:

receiving a second request sent by a client, wherein the second request comprises a keyword;

and matching to obtain the tweet clusters whose event summaries contain the keyword in the second request, and returning the event summaries of the matched tweet clusters to the client.

As can be seen from the above description, the user can also customize the keywords to query for relevant local events.
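A keyword query over event summaries can be sketched as a filter. The summary layout (`texts` and `keywords` lists) mirrors the description of the event summary, but the exact field names, the substring matching rule and the sample data are assumptions.

```python
from typing import List


def search_summaries(clusters: List[dict], keyword: str) -> List[dict]:
    """Return the event summaries that contain `keyword` in their keywords or texts."""
    hits = []
    for cluster in clusters:
        summary = cluster["summary"]
        in_keywords = keyword in summary["keywords"]
        in_texts = any(keyword in text for text in summary["texts"])
        if in_keywords or in_texts:
            hits.append(summary)
    return hits


summarized = [
    {"summary": {"texts": ["loud noise from the building site"], "keywords": ["noise", "construction"]}},
    {"summary": {"texts": ["concert starts at eight"], "keywords": ["concert"]}},
]
noise_hits = search_summaries(summarized, "noise")
```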

As shown in fig. 2, the present invention also provides a local event detection apparatus, including:

the first obtaining module 201 is configured to obtain tweet data of a preset region in real time, where the tweet data of the preset region includes tweet data whose sign-in information is an interest point of the preset region and tweet data including a keyword corresponding to the preset region;

the classification module 202 is configured to perform two-stage classification on each piece of tweet data through a preset two-stage classifier to obtain a first-stage label and a second-stage label of each piece of tweet data, where the first-stage label is used to indicate whether a potential event exists in the tweet data, the second-stage label is used to indicate an event category to which the tweet data belongs when the potential event exists, and the tweet data belonging to the same event category is obtained according to the first-stage label and the second-stage label of each piece of tweet data;

a second obtaining module 203, configured to obtain location information of each piece of tweet data belonging to the same event category;

the clustering module 204 is configured to cluster the tweet data belonging to the same event category according to the text, the publishing time, and the location information of the tweet data, so as to obtain tweet clusters belonging to the same event category;

the generating module 205 is configured to generate an event summary of each tweet cluster as a local event corresponding to each tweet cluster, where the event summary includes a text of tweet data of a preset first quantity and a keyword of a preset second quantity in the corresponding tweet cluster.

As shown in fig. 3, the present invention also provides an electronic device, including:

one or more processors 301;

a storage device 302 for storing one or more programs;

the one or more programs, when executed by the one or more processors 301, cause the one or more processors 301 to implement the local event detection method as described above.

The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the local event detection method as described above.

Example one

Referring to figs. 4-5, a first embodiment of the present invention is a local event detection method based on the real-time information stream of online social media. It can help city managers monitor and review city events in real time and help city residents follow city developments in real time.

This embodiment is described using the example of detecting local events in Shenzhen through a microblog platform. As shown in fig. 4, the method of this embodiment includes the following steps:

S401: Acquire tweet data of a preset region in real time and clean the tweet data. The tweet data of the preset region includes tweet data whose check-in information refers to a point of interest in the preset region and tweet data containing a keyword corresponding to the preset region.

Specifically, the public tweet stream on the platform is obtained in real time through the APIs of the online social media's open platform. Depending on the type of API, public tweets can be obtained in several ways.

a) Monitor the points of interest (POIs) across Shenzhen (about 20,000 distinct points). When a user attaches geographic location information to a microblog post, that is, checks in at a point of interest, the corresponding API can be used to obtain that post.

b) Set trigger words, such as "Shenzhen + noise", and use the search API of the online social media's open platform to search for tweets containing the keywords.

Through these two approaches, tweets from across Shenzhen can be obtained in real time.

The obtained tweet data can be given a preliminary text cleaning using regular expressions, for example removing URL links and emoticons.
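As an illustration of this cleaning step, the following is a minimal sketch using Python's standard `re` module. The function name `clean_tweet`, the emoji code-point ranges, and the bracketed emoticon convention (such as `[doge]`) are assumptions made for the example; the embodiment does not specify the exact patterns.

```python
import re

# URLs: restricted to ASCII URL characters so that adjacent Chinese text is kept.
URL_RE = re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]+")
# Rough emoji coverage (assumed ranges, not an exhaustive list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Bracketed emoticon codes such as "[doge]" (assumed platform convention).
BRACKET_EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,8}\]")


def clean_tweet(text: str) -> str:
    """Remove URL links and emoticons, then normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = BRACKET_EMOTICON_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_tweet("Loud noise again http://t.cn/abc123 [doge]")` keeps only the sentence text.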

S402: Construct and train the two-stage classifier.

Specifically, first, a BERT text classifier is constructed.

The large volume of tweet data obtained in step S401 contains many useless tweets (such as spam advertisements and users' personal ramblings). A classifier therefore needs to be trained to quickly divide the tweet data into two classes: tweets with a potential event and tweets without one. This embodiment adopts a BERT (Bidirectional Encoder Representations from Transformers) text classifier. BERT is a pre-trained model built on a bidirectional Transformer encoder, and a BERT model pre-trained on Chinese performs well on downstream tasks.

Then, sample data is acquired and labeled to obtain training data.

Before training the classifier, sample data needs to be acquired and labeled to obtain training data. In this embodiment, the first-stage label takes the value 0 or 1: 0 indicates that no potential event exists, and 1 indicates that a potential event exists. The second-stage label is a multi-class label covering a number of preset event categories, such as fire events, drowning events, noise complaints, and concerts; it assigns tweet data whose first-stage label is 1 to a specific event category. For tweet data with no potential event, the second-stage label may be null.

Finally, the BERT text classifier is trained on the training data to obtain a two-stage classifier. The two-stage classifier comprises a first-stage classifier, which classifies tweet data into tweet data with potential events and tweet data without potential events, and a second-stage classifier, which determines the event category of tweet data with potential events.

Using the training data, two different classifiers are trained by fine-tuning a pre-trained BERT model. The first classifier determines whether tweet data contains a potential local event; the second assigns the tweet data that the first classifier judged to contain a potential local event to its event category. Classifying tweet data into specific event categories with the two-stage classifier improves classification precision, makes it convenient to cluster the tweet data of each event category afterwards, and reduces the complexity of the subsequent clustering.

That is, the first classifier is a binary classifier and the second classifier is a multi-class classifier.

Further, cross entropy (CE) is used as the loss function during training.

S403: Perform two-stage classification on each piece of tweet data with the trained two-stage classifier to obtain its first-stage label and second-stage label, and obtain the tweet data belonging to the same event category according to these labels.

The first-stage label indicates whether a potential event exists in the tweet data, and the second-stage label indicates the event category to which the tweet data belongs when a potential event exists. The tweet data with potential events can therefore be selected according to the first-stage labels, and the tweet data belonging to the same event category can then be selected according to the second-stage labels of the tweet data with potential events.

After the two-stage classifier is trained, the tweet data arriving over time is fed into the two-stage classifier to obtain its two labels. If the first-stage label of a piece of tweet data is 0, no potential local event exists in it and it is discarded. If the first-stage label is 1, a potential local event exists, and the event category is determined from the second-stage label.

After the tweet data of each event category is obtained, clustering is subsequently performed within each event category separately.
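The routing logic of step S403 can be sketched as follows. This is only an illustration: the two classifiers are passed in as plain callables, and the keyword-based `toy_has_event` and `toy_event_category` functions are hypothetical stand-ins for the fine-tuned BERT classifiers, which are not reproduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_event_category(
    tweets: Iterable[str],
    has_event: Callable[[str], int],       # first-stage classifier: returns 0 or 1
    event_category: Callable[[str], str],  # second-stage classifier: returns a category
) -> Dict[str, List[str]]:
    """Discard tweets labeled 0 and group the rest by event category."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in tweets:
        if has_event(text) == 0:
            continue  # first-stage label 0: no potential event, discard
        groups[event_category(text)].append(text)
    return dict(groups)


# Hypothetical keyword rules standing in for the trained classifiers.
def toy_has_event(text: str) -> int:
    return int(any(k in text for k in ("fire", "noise")))


def toy_event_category(text: str) -> str:
    return "fire" if "fire" in text else "noise complaint"
```

Each list in the returned dictionary is then clustered independently, which is what keeps the later similarity comparisons confined to one event category.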

S404: Obtain the position information of the tweet data of each event category.

Specifically, if the tweet data contains check-in information, the position information of the point of interest in the check-in information is used as the position information of the tweet data. If the tweet data does not contain check-in information, a potential address in the tweet data is identified by a preset named entity recognition model, and the position information of the potential address is obtained from a map and used as the position information of the tweet data. If there is no potential address, the tweet data is discarded.

Before the tweet data of each event category is clustered, the position information of each piece of tweet data needs to be recorded. Since local events occur at specific places within a city, a very precise address, such as a street number, is required. In this embodiment, an address is extracted from tweet data in one of the following two ways:

a) If the user checks in when posting the tweet, the check-in information is tied to a point of interest (POI), so the user can be considered to be located at that POI when posting. Every point of interest has explicit longitude and latitude information, so the longitude and latitude of the point of interest where the user is located can be used as the address of the tweet.

b) If the user does not attach check-in information when posting the tweet, potential addresses in the text are extracted using an address-extraction named entity recognition model trained on a pre-trained BERT model, and the addresses are converted into longitude and latitude using a high-resolution geocoding API. If the named entity recognition model cannot extract a potential address from the tweet text, the tweet is considered to carry no useful address information and is discarded.
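The two-way address lookup above can be sketched as a small fallback function. The dictionary keys (`checkin_poi`, `lat`, `lon`, `text`) and the two injected callables are hypothetical names chosen for the example; `extract_addresses` stands in for the NER model and `geocode` for the geocoding API. Returning `None` when geocoding fails is also an assumption, since the embodiment only specifies discarding tweets with no potential address.

```python
from typing import Callable, List, Optional, Tuple

LatLon = Tuple[float, float]


def resolve_location(
    tweet: dict,
    extract_addresses: Callable[[str], List[str]],  # stand-in for the NER model
    geocode: Callable[[str], Optional[LatLon]],     # stand-in for the geocoding API
) -> Optional[LatLon]:
    """Return (latitude, longitude) for a tweet, or None if it should be discarded."""
    poi = tweet.get("checkin_poi")
    if poi is not None:
        # Way a): the check-in point of interest carries explicit coordinates.
        return (poi["lat"], poi["lon"])
    # Way b): extract potential addresses from the text and geocode them.
    for address in extract_addresses(tweet["text"]):
        coords = geocode(address)
        if coords is not None:
            return coords
    return None  # no usable address information: the caller discards the tweet
```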

Named entity recognition (NER) refers to recognizing entities with specific meaning in text, mainly including person names, place names, organization names, proper nouns, times, quantities, currencies, and percentages. The models that currently perform best on NER are based on deep learning or statistical learning methods.

Each training sample for named entity recognition consists of a sentence and its corresponding label sequence. The label set follows the BIOES scheme (B denotes the beginning of an entity, I the inside of an entity, E the end of an entity, S a single-token entity, and O a non-entity), and sentences are separated by an empty line. For example:

美国的华莱士访问中国 (Wallace of the United States visits China; one label per character)

B-LOC E-LOC O B-PER I-PER E-PER O O B-LOC E-LOC

Since the present embodiment only needs to recognize address information, only the labels for places are needed.

Since the detection of local events requires precise addresses, and the address descriptions in tweets posted by ordinary users often differ greatly in distribution from the addresses found in news reports, a model trained on news text is a poor fit. Omissions, abbreviations and similar shortcuts are common on social media. The best choice is therefore to label tweet data for training.

Each token in the tweet data used for training is labeled O, B-LOC, I-LOC or E-LOC; the latter three labels denote the beginning, middle and end of an address segment, respectively.

In this embodiment, two fully connected layers are added after the output of the BERT model, and training uses a cross-entropy loss function; the model structure is shown in fig. 5.
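To show how such a label sequence is turned back into address strings, here is a minimal decoding sketch. The function name `extract_locations` and the sample sentence are hypothetical; tokens are joined without spaces, which assumes character-level Chinese input.

```python
from typing import List


def extract_locations(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the token spans labeled B-LOC ... E-LOC (or S-LOC) as address strings."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "S-LOC":            # single-token address
            spans.append(token)
            current = []
        elif tag == "B-LOC":          # start a new address segment
            current = [token]
        elif tag == "I-LOC" and current:
            current.append(token)
        elif tag == "E-LOC" and current:
            current.append(token)     # close the segment
            spans.append("".join(current))
            current = []
        else:                         # O or an ill-formed sequence: reset
            current = []
    return spans
```

For the hypothetical input `list("南山公园着火了")` with tags `B-LOC I-LOC I-LOC E-LOC O O O`, the function returns the single address `南山公园`.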

S405: Cluster the tweet data belonging to the same event category according to the text, release time and position information of the tweet data to obtain the tweet clusters belonging to that event category.

First, a text vector code is generated for each piece of tweet data from its text.

After the address acquisition operation, all retained tweet data has position information (longitude and latitude). Each piece of tweet data can now be described by three elements: text, release time, and position information. The text of the tweet data is converted into vector form to generate a text vector code; that is, a one-dimensional vector is used to represent the semantic information of a tweet's text.

Then, the tweet data in each event category is clustered separately according to the text vector codes, release times and position information to obtain the tweet clusters belonging to that event category. In this embodiment, a tweet cluster must satisfy the following requirements: the time of the tweet cluster is within 24 hours of the current time, and all tweets in the cluster are located within 2 kilometers of one another.

Accordingly, this step specifically comprises the following sub-steps:

S4051: In order of release time, take the next piece of tweet data from the tweet data belonging to the same event category as the current tweet data.

S4052: Determine whether the current tweet data is the first piece of tweet data; if so, execute step S4053, and if not, execute step S4054.

S4053: Establish a tweet cluster, add the current tweet data to it, and set the time, address and text vector code of the tweet cluster according to the release time, position information and text vector code of the current tweet data.

At this point the first tweet cluster is established; that is, the first piece of tweet data automatically forms the first tweet cluster.

The time of a tweet cluster is defined as the release time of the piece of tweet data in the cluster that is closest to the current time, i.e., the latest release time among all tweet data in the cluster. The address of a tweet cluster is defined as the center point on the map of the position information of all tweet data in the cluster, i.e., the average longitude and latitude. The text vector code of a tweet cluster is defined as the average of the text vector codes of all tweet data in the cluster.

At this point the tweet cluster contains only one piece of tweet data, so the time of the cluster is set to the release time of the current tweet data, the address of the cluster to its position information, and the text vector code of the cluster to its text vector code. When other tweet data is later added to the cluster, the time, address and text vector code of the cluster are updated.

Then the next piece of tweet data belonging to the same event category is obtained, i.e., step S4051 is executed again.
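The three cluster attributes defined above can be sketched as derived properties of the member list, so they are recomputed whenever tweet data is added. The class and field names (`Tweet`, `TweetCluster`, `vector`, and so on) are hypothetical, and times are assumed to be numeric timestamps.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Tweet:
    text: str
    time: float          # release time as a numeric timestamp (seconds)
    lat: float
    lon: float
    vector: List[float]  # text vector code


@dataclass
class TweetCluster:
    tweets: List[Tweet] = field(default_factory=list)

    def add(self, tweet: Tweet) -> None:
        self.tweets.append(tweet)

    @property
    def time(self) -> float:
        # Latest release time among all members.
        return max(t.time for t in self.tweets)

    @property
    def address(self) -> Tuple[float, float]:
        # Center point: average latitude and longitude of all members.
        n = len(self.tweets)
        return (sum(t.lat for t in self.tweets) / n,
                sum(t.lon for t in self.tweets) / n)

    @property
    def vector(self) -> List[float]:
        # Element-wise average of the members' text vector codes.
        n = len(self.tweets)
        return [sum(col) / n for col in zip(*(t.vector for t in self.tweets))]
```

With a single member, each property simply equals that member's own value, matching the initialization described for step S4053.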

S4054: Determine whether there exists a tweet cluster for which the distance between the current tweet data and the cluster is less than or equal to a preset first distance threshold, the time difference between the release time of the current tweet data and the time of the cluster is less than a preset first time threshold, and the distance between the text vector code of the current tweet data and the text vector code of the cluster is less than a preset second distance threshold. If so, execute step S4055; if not, execute step S4056.

Preferably, in this embodiment, the first distance threshold is 2 km and the first time threshold is 24 h.

That is, the tweet data is considered to belong to a tweet cluster, and is added to it, if the following three conditions are satisfied simultaneously:

1. the distance between the position information (longitude and latitude) of the tweet data and the address of the tweet cluster is no more than 2 km;

2. the time difference between the release time of the tweet data and the time of the tweet cluster is less than 24 hours;

3. the distance between the text vector code of the tweet data and the text vector code of the tweet cluster is less than the second distance threshold. In this embodiment, the text vector codes have length 512, and Euclidean distance is not very discriminative in such a high-dimensional space, so the distance between text vector codes is calculated using cosine similarity.

If the current tweet data cannot satisfy all three conditions with any existing tweet cluster, it is considered not to belong to any existing tweet cluster, and a new tweet cluster is created for it.
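The three-condition test can be sketched as below. Several details are assumptions not fixed by the embodiment: the great-circle (haversine) formula for the geographic distance, the definition of text distance as one minus cosine similarity, and the illustrative default of 0.3 for the second distance threshold.

```python
import math
from typing import Sequence, Tuple


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers (mean Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def belongs_to_cluster(
    tweet_pos: Tuple[float, float], tweet_time: float, tweet_vec: Sequence[float],
    cluster_pos: Tuple[float, float], cluster_time: float, cluster_vec: Sequence[float],
    max_km: float = 2.0,                 # first distance threshold (from the embodiment)
    max_seconds: float = 24 * 3600,      # first time threshold (from the embodiment)
    max_text_dist: float = 0.3,          # second distance threshold (assumed value)
) -> bool:
    """True only if all three conditions hold simultaneously."""
    return (
        haversine_km(*tweet_pos, *cluster_pos) <= max_km
        and abs(tweet_time - cluster_time) < max_seconds
        and cosine_distance(tweet_vec, cluster_vec) < max_text_dist
    )
```

A tweet is tested against each existing cluster of its event category in turn; if no cluster passes, a new cluster is created as in step S4056.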

S4055: Add the current tweet data to the tweet cluster, and update the time, address and text vector code of the tweet cluster according to the release time, position information and text vector code of each piece of tweet data in the cluster.

Specifically, the latest release time is obtained from the release times of the tweet data in the cluster, and the time of the cluster is updated to it; the center point is calculated from the position information of the tweet data in the cluster, and the address of the cluster is updated to it; and the average of the text vector codes of the tweet data in the cluster is calculated, and the text vector code of the cluster is updated to it.

S4056: Establish a new tweet cluster, add the current tweet data to it, and set the time, address and text vector code of the new tweet cluster according to the release time, position information and text vector code of the current tweet data. For details of this step, refer to step S4053.

Further, because local events are time-sensitive, if the time difference between the time of a tweet cluster and the current time exceeds a preset second time threshold, the tweet cluster is set as a history cluster. In this embodiment, the second time threshold is 24 h.

That is, if no new tweet data has been added to a tweet cluster within 24 hours, the life cycle of the cluster is considered to have ended; the cluster automatically expires, becomes a history cluster, and no longer takes part in subsequent clustering decisions. If a user wants to view historical local events, however, history clusters can still be retrieved and viewed.

Meanwhile, as step S4055 shows, the text vector code of a tweet cluster is updated whenever new tweet data is added to it. At that point, the distance between the cluster's text vector code and those of the other tweet clusters in the same event category is also calculated; if the distance is smaller than a preset threshold, the two tweet clusters are considered to discuss the same event. Because the event was initially discussed from different angles, the two clusters start out separate; as more and more tweets are added, it gradually becomes apparent that they concern the same thing. The two clusters then need to be merged into a new tweet cluster, whose time, address and text vector code are updated accordingly.
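The expiry rule can be sketched as a simple partition of the clusters by age. The function name `split_expired` and the dictionary key `time` are hypothetical; times are assumed to be numeric timestamps in seconds.

```python
from typing import List, Tuple


def split_expired(
    clusters: List[dict],
    now: float,
    max_age_seconds: float = 24 * 3600,  # second time threshold from the embodiment
) -> Tuple[List[dict], List[dict]]:
    """Partition clusters into (active, history) by the age of their latest tweet."""
    active: List[dict] = []
    history: List[dict] = []
    for cluster in clusters:
        if now - cluster["time"] > max_age_seconds:
            history.append(cluster)   # expired: kept only for retrieval
        else:
            active.append(cluster)    # still takes part in clustering
    return active, history
```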

Specifically, after step S4055, the method further includes: calculating the distance between the text vector code of the tweet cluster and the text vector code of each other tweet cluster belonging to the same event category; and, if the distance between the text vector code of the tweet cluster and that of another tweet cluster belonging to the same event category is smaller than a preset third distance threshold, merging the two tweet clusters and updating the time, address and text vector code of the merged tweet cluster according to the release time, position information and text vector code of each piece of tweet data in the merged cluster.
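A minimal sketch of the merge check follows. Clusters are represented here as plain lists of tweet dictionaries with a hypothetical `vector` key, text distance is taken as one minus cosine similarity, and merging with only the first sufficiently close cluster is an assumption of the example; the third distance threshold is passed in because the embodiment gives no value.

```python
import math
from typing import List, Sequence, Tuple

Cluster = List[dict]  # each tweet dict carries a "vector" entry


def mean_vector(cluster: Cluster) -> List[float]:
    """Text vector code of a cluster: element-wise average over its tweets."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*(t["vector"] for t in cluster))]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def merge_similar(
    updated: Cluster, others: List[Cluster], max_dist: float
) -> Tuple[Cluster, List[Cluster]]:
    """Merge `updated` with the first other cluster closer than max_dist, if any."""
    for i, other in enumerate(others):
        if cosine_distance(mean_vector(updated), mean_vector(other)) < max_dist:
            merged = updated + other  # attributes are recomputed from all members
            return merged, others[:i] + others[i + 1:]
    return updated, others
```

Because the merged cluster is just the combined member list, its time, address and text vector code follow from the same definitions used for any other cluster.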

S406: Calculate the influence factor of each tweet cluster according to the total number of pieces of tweet data in the cluster, the total number of comments, the total number of forwards, the total word count, and the score corresponding to the event category to which the cluster belongs.

An influence factor is calculated for each tweet cluster to judge its importance. Specifically, the calculation uses the following formula:

Score=a1×Nall+a2×Ncomments+a3×Nforward+a4×Nwords+a5×Nclass

where Nall represents the total number of pieces of tweet data in the tweet cluster, Ncomments represents the total number of comments on all tweet data in the cluster, Nforward represents the total number of forwards of all tweet data in the cluster, Nwords represents the total number of words in the texts of all tweet data in the cluster, and Nclass represents the score corresponding to the event category to which the cluster belongs. This score is preset according to the importance of the event category; for example, categories such as serious traffic accidents and knife attacks are assigned high fixed scores. a1, a2, a3, a4 and a5 are preset weight coefficients that can be determined according to different requirements.

S407: Generate an event summary for each tweet cluster, where the event summary includes the texts of a preset first number of pieces of tweet data and a preset second number of keywords from the corresponding tweet cluster.

Each tweet cluster represents one local event. In this embodiment, a local event is represented jointly by 5 pieces of tweet data and 10 keywords. When the tweet cluster contains fewer than 5 pieces of tweet data, all of them are added to the event summary of the cluster. When it contains more than 5, the TextRank algorithm is used to extract the 5 most important sentences from the text set formed by the texts of all tweet data in the cluster, and these 5 sentences are added to the event summary. The TextRank algorithm is also used to extract 10 words from the same text set as the keywords of the event represented by the cluster.
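The weighted sum can be sketched directly from the formula. The dictionary keys (`text`, `comments`, `forwards`) are hypothetical, and counting words by whitespace splitting is a simplification for the example; Chinese text would need a word segmenter instead.

```python
from typing import List, Sequence


def influence_factor(
    tweets: List[dict],
    class_score: float,          # Nclass: preset score of the event category
    weights: Sequence[float],    # (a1, a2, a3, a4, a5), chosen per deployment
) -> float:
    """Score = a1*Nall + a2*Ncomments + a3*Nforward + a4*Nwords + a5*Nclass."""
    a1, a2, a3, a4, a5 = weights
    n_all = len(tweets)
    n_comments = sum(t["comments"] for t in tweets)
    n_forward = sum(t["forwards"] for t in tweets)
    n_words = sum(len(t["text"].split()) for t in tweets)  # whitespace word count
    return a1 * n_all + a2 * n_comments + a3 * n_forward + a4 * n_words + a5 * class_score
```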

TextRank treats linguistic units in the text (such as words) as nodes of a graph. If two units are related in some way (for example, they co-occur), an edge connects them in the graph. After a number of iterations the nodes acquire different weights, and the units with high weights can be taken as keywords. Likewise, if the unit is taken to be a whole sentence, TextRank extracts key sentences, which can be combined to form a text summary. In addition, TextRank needs no background corpus and can analyze a single document on its own, which makes it well suited to processing tweet clusters in real time.

Steps S406 and S407 need not be performed in any particular order.

S408: Return the event summaries of the relevant tweet clusters, i.e., the relevant local events, according to the user's request.

Specifically, a first request sent by a client is received, where the first request includes an event category; the N tweet clusters that belong to the event category in the first request and have the highest influence factors are obtained, and their event summaries are returned to the client. Alternatively, a second request sent by the client is received, where the second request includes a keyword; the tweet clusters whose event summaries contain the keyword in the second request are found by matching, and their event summaries are returned to the client.
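For keyword extraction, the graph-ranking idea can be sketched in a few lines. This is a simplified, unweighted variant under stated assumptions: the input is an already segmented and stop-word-filtered token list, edges link words that co-occur within a small window, and the window size, damping factor and iteration count are conventional illustrative values rather than values given in the embodiment.

```python
from collections import defaultdict
from typing import Dict, List, Set


def textrank_keywords(
    tokens: List[str],
    top_k: int = 10,
    window: int = 2,
    damping: float = 0.85,
    iterations: int = 50,
) -> List[str]:
    """Rank words by an unweighted TextRank over a co-occurrence graph."""
    # Build an undirected graph: words within `window` positions are neighbors.
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1: i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # Iterate the PageRank-style update; each node shares its score with its neighbors.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

A word that co-occurs with many different words ends up with the highest weight. Replacing words with whole sentences, and co-occurrence with a sentence similarity measure, gives the key-sentence variant used for the summary texts.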

In actual use, if a user wants to view the leaderboard of the top N local events across all event categories in the city at a given moment, the N tweet clusters with the highest influence factors are selected from all tweet clusters and their event summaries are returned.

If the user only wants to see the top N events of a certain event category, the tweet clusters belonging to that category are sorted by influence factor in descending order, and the event summaries of the N highest-ranked clusters are returned.

The user can also specify custom event categories or keywords to retrieve the desired events. For a custom event category query, the tweet clusters in the specified event categories are ranked by influence factor and the event summaries of the N highest-ranked clusters are returned. For a custom keyword query, the keyword is matched against the event summaries of all tweet clusters, and the event summary of every cluster whose summary contains the keyword is returned.

In this embodiment, the two-stage classifier is trained by fine-tuning a pre-trained BERT model on a labeled dataset, which removes tweets unrelated to events and keeps a large number of useless tweets out of the subsequent clustering process. Clustering the tweet data of each event category separately further reduces the complexity of the whole process, so that the whole system runs faster and more accurately.

This embodiment realizes local event detection based on real-time tweet data from online social media. A city manager or citizen can obtain and view the latest local events in real time, view only the events of one or several categories according to their own interests or purposes, and also retrieve and view events by custom keywords.
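The query types described above can be sketched with two small functions. The cluster representation (a dictionary with hypothetical `category`, `impact` and `summary` keys, the summary holding `texts` and `keywords`) is an assumption for the example, as is treating a keyword match as a substring match against summary texts or an exact match against summary keywords.

```python
from typing import Collection, List, Optional


def top_n_events(
    clusters: List[dict], n: int, categories: Optional[Collection[str]] = None
) -> List[dict]:
    """Summaries of the n clusters with the highest influence factor.

    With categories=None all event categories are ranked together; otherwise
    only clusters in the given categories are considered.
    """
    pool = [c for c in clusters if categories is None or c["category"] in categories]
    ranked = sorted(pool, key=lambda c: c["impact"], reverse=True)
    return [c["summary"] for c in ranked[:n]]


def search_events(clusters: List[dict], keyword: str) -> List[dict]:
    """Summaries of all clusters whose event summary contains the keyword."""
    return [
        c["summary"]
        for c in clusters
        if keyword in c["summary"]["keywords"]
        or any(keyword in text for text in c["summary"]["texts"])
    ]
```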

Example two

Referring to fig. 2, a second embodiment of the present invention is a local event detection device that can execute the local event detection method provided by the embodiments of the invention and has the functional modules and beneficial effects corresponding to that method. The device can be implemented by software and/or hardware, and specifically comprises:

In an optional embodiment, the local event detection apparatus further includes:

the building module is used for building a BERT text classifier;

the labeling module is used for obtaining sample data and labeling the sample data to obtain training data, wherein the label comprises a primary label and a secondary label, the value of the primary label is a first value indicating that no potential event exists or a second value indicating that the potential event exists, and the value of the secondary label is a preset event type;

and the training module is used for training the BERT text classifier according to the training data to obtain a two-stage classifier, the two-stage classifier comprises a first-stage classifier and a second-stage classifier, the first-stage classifier is used for classifying the tweet data into tweet data with potential events and tweet data without potential events, and the second-stage classifier is used for determining the event category to which the tweet data with potential events belong.

In an optional embodiment, the second obtaining module comprises:

the first acquisition unit is used for acquiring the position information of the point of interest in the check-in information as the position information of the tweet data if the tweet data contains check-in information;

and the second acquisition unit is used for identifying the potential address of the tweet data through a preset named entity recognition model if the tweet data does not contain check-in information, and acquiring the position information of the potential address through a map as the position information of the tweet data.

In an optional embodiment, the clustering module comprises:

the first generating unit is used for generating text vector codes of all the tweet data according to the texts of all the tweet data;

the third acquisition unit is used for sequentially acquiring, in order of release time, one piece of tweet data from the tweet data belonging to the same event category as the current tweet data;

the first judgment unit is used for judging whether the current tweet data is the first piece of tweet data;

the first establishing unit is used for establishing a tweet cluster if the current tweet data is the first piece of tweet data, adding the current tweet data to the tweet cluster, and setting the time, the address and the text vector code of the tweet cluster according to the release time, the position information and the text vector code of the current tweet data;

a second judging unit, configured to judge, if the current tweet data is not the first piece of tweet data, whether there exists a tweet cluster for which the distance between the current tweet data and the tweet cluster is less than or equal to a preset first distance threshold, the time difference between the release time of the current tweet data and the time of the tweet cluster is less than a preset first time threshold, and the distance between the text vector code of the current tweet data and the text vector code of the tweet cluster is less than a preset second distance threshold;

the updating unit is used for adding the current tweet data to the tweet cluster if such a tweet cluster exists, and updating the time, the address and the text vector code of the tweet cluster according to the release time, the position information and the text vector code of each piece of tweet data in the tweet cluster;

and the second establishing unit is used for establishing a new tweet cluster if no such tweet cluster exists, adding the current tweet data to the new tweet cluster, and setting the time, the address and the text vector code of the new tweet cluster according to the release time, the position information and the text vector code of the current tweet data.

In an optional embodiment, the updating unit includes:

the first updating subunit is used for acquiring the latest release time according to the release time of each piece of tweet data in the tweet cluster, and updating the time of the tweet cluster according to the latest release time;

the second updating subunit is used for calculating a central point according to the position information of each piece of tweet data in the tweet cluster and updating the address of the tweet cluster according to the position information of the central point;

and the third updating subunit is used for calculating an average value of the text vector codes of all the tweet data in the tweet cluster, and updating the text vector codes of the tweet cluster according to the average value.

In an optional embodiment, the clustering module further comprises:

the calculation unit is used for calculating the distance between the text vector code of the tweet cluster and the text vector codes of other tweet clusters belonging to the same event category;

and the merging unit is used for merging the tweet cluster and another tweet cluster if the distance between the text vector code of the tweet cluster and the text vector code of another tweet cluster belonging to the same event type is smaller than a preset third distance threshold, and updating the time, the address and the text vector code of the merged tweet cluster according to the issuing time, the position information and the text vector code of each tweet data in the merged tweet cluster.

In an optional embodiment, further comprising:

and the expiry module is used for setting a tweet cluster as a history cluster if the time difference between the time of the tweet cluster and the current time exceeds a preset second time threshold.

In an optional embodiment, the generating module comprises:

the second calculating unit is used for calculating the influence factors of the tweet clusters according to the total number of tweet data in the tweet clusters, the total number of comments, the total number of forwarding, the total number of words and the corresponding scores of the event types;

the second generation unit is used for respectively generating event summaries of all the tweet clusters, and the event summaries comprise texts with a preset first number of tweet data and a preset second number of keywords in the corresponding tweet clusters;

the first receiving unit is used for receiving a first request sent by a client, and the first request comprises an event category;

and the first returning unit is used for acquiring N tweet clusters which belong to the event category in the first request and have the highest influence factor, and returning the event summaries of the N tweet clusters to the client, wherein N is a preset natural number.

In an optional embodiment, the generating module further comprises:

the second receiving unit is used for receiving a second request sent by the client, and the second request comprises a keyword;

and the second returning unit is used for finding, by matching, the tweet clusters whose event summaries contain the keyword in the second request, and returning the event summaries of the matched tweet clusters to the client.

Example three

Referring to fig. 3, a third embodiment of the present invention is: an electronic device, the electronic device comprising:

one or more processors 301;

a storage device 302 for storing one or more programs;

when the one or more programs are executed by the one or more processors 301, the one or more processors 301 implement the processes in the local event detection method embodiment as described above, and can achieve the same technical effect, and details are not described here to avoid repetition.

Example four

A fourth embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements each process in the local event detection method embodiment described above, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

In summary, in the local event detection method, device, equipment and storage medium provided by the invention, the tweet data is first classified by the two-stage classifier, and the tweet data belonging to the same event category is then clustered. Meaningless tweets can thus be screened out and discarded, and each piece of tweet data only needs its similarity computed against the tweet clusters of the same event category to judge whether it belongs to a given local event. This greatly reduces the number of calculations and the computing resources required, increases the processing speed, and improves the reliability of real-time perception. In addition, clustering is performed according to three elements, namely time, text and position information, which ensures that each piece of tweet data added to a tweet cluster belongs to the same local event as that cluster, so that the event represented by each finally output tweet cluster corresponds to an event in the real world. The invention can realize local event detection based on real-time tweet data from online social media, so that a city manager or citizen can obtain and view the latest local events in real time.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims

1. A method for local event detection, comprising:

2. The local event detection method according to claim 1, wherein before the two-stage classification of each piece of tweet data is performed by a preset two-stage classifier to obtain the first-stage label and the second-stage label of each piece of tweet data, the method further comprises:

building a BERT text classifier;

3. The local event detection method according to claim 1, wherein the obtaining the location information of each piece of tweet data belonging to the same event category respectively comprises:

4. The local event detection method according to claim 1, wherein the clustering the tweet data belonging to the same event category according to the text, the publishing time, and the location information of the tweet data to obtain the tweet clusters belonging to the same event category comprises:

judging whether the current tweet data is the first tweet data or not;

5. The local event detection method according to claim 4, wherein the updating the time, the address, and the text vector code of the tweet cluster according to the publishing time, the location information, and the text vector code of each tweet data in the tweet cluster comprises:

6. The local event detection method according to claim 4, wherein after updating the time, the address, and the text vector code of the tweet cluster according to the publishing time, the location information, and the text vector code of each tweet data in the tweet cluster, the method further comprises:

7. The local event detection method according to any one of claims 1 and 4 to 6, wherein after clustering the tweet data belonging to the same event category according to the text, the publishing time and the location information of the tweet data to obtain a tweet cluster belonging to the same event category, the method further comprises:

8. The local event detection method according to claim 1, wherein the generating the event summary of each tweet cluster as the local event corresponding to each tweet cluster comprises:

9. The local event detection method according to claim 8, wherein the generating the event summary of each tweet cluster as the local event corresponding to each tweet cluster further comprises:

10. A local event detection device, comprising:

11. An electronic device, characterized in that the electronic device comprises:

one or more processors;

storage means for storing one or more programs;

when executed by the one or more processors, cause the one or more processors to implement a local event detection method as claimed in any one of claims 1-9.

12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the local event detection method according to any one of claims 1 to 9.