CN112597269A - Stream data event text topic and detection system - Google Patents

Info

Publication number
CN112597269A
Authority
CN
China
Prior art keywords
topic
event
text
module
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011566187.3A
Other languages
Chinese (zh)
Inventor
庄旭
袁鑫
贾莹
尹可鑫
张乾君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc filed Critical Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202011566187.3A priority Critical patent/CN112597269A/en
Publication of CN112597269A publication Critical patent/CN112597269A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

The disclosed streaming-data event text topic detection system eliminates redundancy in intermediate processing and reduces detection time. The invention is realized by the following technical scheme: the topic detection module constructs a topic and event detection algorithm model and uses crawler technology to crawl text data in real time from major online media and social platforms; the topic tracking module produces a topic summary and keyword information from the text summary information provided by the text summarization module; the association detection module examines texts in all directions, assigns texts to events, sets a time window of fixed length, and sends the detected topic clustering results to the topic tracking module to obtain finer-grained clustering results; the event identification module identifies events by hierarchical clustering, produces topic summaries and keyword information using the designed topic summarization and event extraction algorithms, and sends them to the event extraction module, which analyzes the topic and event keyword information to obtain a large topic set.

Description

Stream data event text topic and detection system
Technical Field
The invention belongs to the technical field of topic detection and tracking and event detection and extraction, and particularly relates to a topic and event detection system for streaming text data.
Background
In recent years, with the rapid development of the internet and the internet of things, a large amount of information is generated on the internet every day, the explosion of the large amount of information on the internet is increased, people are difficult to quickly and accurately retrieve high-quality useful news information from the internet, and the large amount of data appears in many applications, wherein a large part of data exists in the form of streaming data. Streaming data is characterized by being fast, massive, out-of-order, and requires fast response. Various emergencies are frequently occurred at home and abroad, news websites and social platforms are used as the most direct and rapid ways for people to exchange information, and the information contained in the news websites and the social platforms has important value for identifying the emergencies. However, as internet information resources have the characteristics of heterogeneous, dispersive and repetitive information phenomena, uniform formal expression is lacked, various information islands are formed, information resources are difficult to integrate and utilize, most of information is invalid information for specific users, particularly hot topics closely related to life and important major events are often initiated, transmitted and diffused through a network, the information often has a great influence on social public safety, on one hand, people increasingly ask for, reveal emotion and comment on the network, on the other hand, the network also becomes a platform for spreading and diffusing false information, and can be utilized by lawless persons. How to capture hot topics and events from network data streams efficiently, in real time and in all directions and quickly present the content of interest of users becomes important research content of network public opinion monitoring and social public safety analysis. 
How to satisfy the processing requirement of streaming data is also a hot topic of current research. Events are often extracted from a streaming data processing system, and then prediction analysis processing and expression of the events and topics are performed on the events occurring on the streaming data in the future, so that problems to be known can be conveniently and effectively obtained, and related application requirements are met. The method can be used for quickly and effectively detecting the events and extracting the related features, and is the key for improving the grasping capability of the emergency and analyzing the information of the emergency. The data-driven dominant calculable method will face a great challenge as the data becomes larger and the data itself becomes more complex. Therefore, the deep knowledge is rapidly mined from the network cross-media data and naturally presented, and the bottleneck problem of cross-media data processing in the public security field is broken through. Accordingly, mass data computability becomes a significant issue. The industry has also invested large amounts of manpower and material resources in the problem of large-scale data computation. How to mine mass cross-media data to obtain hot spots, sensitive topics and major events contained in the mass cross-media data, and then efficiently and naturally present the hot spots, the sensitive topics and the major events is a challenge to public safety under the background that the current network cross-media data is closely related to the real life of people. In a large-data-flow computing environment, data flow is often computed and used immediately after arrival, only a few data are persistently stored, and most data are often directly discarded. The use of data is often one-time, volatile, and even if replayed, the resulting data stream and the previous data stream are often different. 
It is desirable that the system have a certain fault tolerance capability, and that only one data computer is fully utilized to obtain valuable information from the data stream as comprehensively, accurately and effectively as possible. In a large data stream type computing environment, the generation of data is completely determined by data sources, and the speed of the data stream presents a sudden characteristic because the states of different data sources in different space-time ranges are not uniform and change dynamically. The data rate at the previous moment and the data rate at the later moment may have a huge difference, so that the system is required to have good scalability, be capable of dynamically adapting to uncertain inflow data streams, and have strong system computing capability and capability of dynamically matching large data flow. On the one hand, in case of sudden high data flow rates, it is guaranteed that no data is discarded, or that partially unimportant data is identified and optionally discarded; on the other hand, in case of low data rates, it is guaranteed that system resources are not occupied too long or too much. In a big-data-flow computing environment, data elements are unordered between data flows and within the same data flow: on one hand, because the data sources are independent from each other and the spatio-temporal environments are different, the relative sequence of each data element among the data streams cannot be ensured; the system is required to have good capability of analyzing data and finding rules in the data calculation process, and cannot depend too much on the internal logic between data streams or the internal logic inside the data streams.
Against this research background, the definitions of themes, topics and events and the relationships among them are determined first. Traditional topic detection and tracking research distinguishes events from topics and defines each explicitly. A theme summarizes the characteristics of a specific category under a document classification standard; the classification standard, and hence the characteristics of a theme, differ from application to application. A theme can be divided into several sub-themes according to the classification standard, and the theme formed by those sub-themes is their parent theme. An event is defined as a series of activities that unfold around a certain theme at a particular time and place with the participation of certain people or similar entities. An event can be defined as "an object with a particular theme, time and place"; from a narrative perspective, it can be defined as "an object that makes a meaningful change to the survival state of a character". With events and themes defined, we turn to topics. A topic is people's description of and commentary on an event or theme. Generally, events and themes are objective, whereas topics are subjective. A topic is not necessarily a description of or comment on a single event: some topics discuss several events, and others address a broad theme. Generally, a topic is composed of several events; the topics in news reports are mostly based on events, and in topic detection and tracking technology a topic is often defined as a seed event or activity together with the events or activities related to it. However, these definitions are still insufficient to define topics in this system. To analyze how an event evolves, the relationships between different sub-events within the same event, between different events, and between an event and a topic need to be determined in detail.
A topic is an attribute of an event, which explains what happens in the event, in effect identifying the attributes of the event. The topic contains an event. The creation of a theme is graphically related to the user. A topic may have multiple events. An event may be associated with multiple topics. Generally, an event or document will have one topic, but it is also possible that an event or document is associated with multiple topics at the same time. Both topics and events may form topics. Documents may be divided into event documents and topic documents by their narrative nature, with documents that are for a particular event in the topic of discussion being referred to as event documents and documents that are not for a particular event being referred to as topic text. An event is defined as an occurrence that describes in a set of documents a change that causes a topic change at a particular point in time, which occurrence has topic and time components and can often be associated with entities such as people and places. Corresponding to the topic is a topic, and a Topic Detection and Tracking (TDT) assessment conference defines "topic": the so-called Topic (Topic) is a core event or activity and events or activities directly related to it. An Event (Event) is usually caused by some reason and condition, occurs at a specific time and place, involves some objects (people or things), and may be accompanied by some necessary result. In general, topics can be simply thought of as a collection of series-related event reports. In a practical application scenario, for a specific type of topic, the event composition thereof needs to be analyzed from a finer granularity.
For the topic, a quite effective topic model is available at present to solve the problem, and for the event, a plurality of detection algorithms are available, but the detection algorithm can be rarely connected with the topic to detect the event from the topic, so that the advantage of detecting the event from the document stream directly is that the detection algorithm can provide richer associated information and more accurate expression to people and can better control the quantity and quality of the generated event. In addition, due to the fact that large data and real-time requirements are met, data are input in a streaming mode, and requirements which cannot be met by a common offline algorithm are met. The invention mainly tries how to mine the theme based on the document flow and detect the event based on the theme model under the real-time condition. The topic mining work specifically comprises topic evolution, topic property judgment and the like. The method is characterized in that the related speculation of the theme development is carried out through information such as a historical theme model, and the theme is tracked, so that the evolution and evolution of the theme are researched. Finally, there are some effective solutions, although there have been many research achievements in real-time topic mining, event detection, topic evolution, etc. However, a unified system which takes public safety oriented as an application field and takes topics as clues, and carries out mining such as real-time topic mining and event detection and topic modeling is not found at present. Due to the explosive growth of cross-media information in the internet, it is increasingly difficult to obtain desired information from a large amount of data. A new tool is needed to help us organize and understand this information. This requires a topic model. The topic model is mainly used for discovering and labeling large-data-volume documents with topics. 
It finds, mainly by analyzing words of the original text, the intrinsic topics and their associations and variations, etc. It does not require prior knowledge, nor manual labeling, which can be good for understanding and organizing information in unknown domains. In addition, some recent studies may also have topic models applied to streaming data, and may also have it act on cross-media data, such as microblogs, pictures, and so on. But the topic distribution of each document is different. The main function of the topic model is to discover topics from a document set, documents are known per se, and the structure of the topics and the documents. Although some algorithms can solve problems well in a certain application scenario, some algorithms lack support for a real-time streaming data source, and others cannot embody rich association relations between events and topics. Topic mining algorithm based on topic. Subject discovery, event detection and related mining work need to be performed on the basis of meeting the real-time requirement of real-time streaming processing. The problems of overlarge calculation amount, more used data nodes, larger error and the like in the construction process of the event template in the streaming data system are solved. In the aspect of data complexity, the data is various, the data has multi-level and multi-aspect difference in the aspects of coding mode, storage format, application characteristics and the like, structured, semi-structured and unstructured data coexist, and the proportion of the semi-structured and unstructured data is continuously increased. The value of knowledge hidden in the data is increased, and in addition, the big data often presents the characteristics of individuation, incompletion, sparse value, cross multiplexing and the like. 
At the same time, big data often hides great total value, yet its value density is extremely low, its distribution highly irregular and its information deeply buried, so useful value is extremely difficult to find. Because streaming big data in the new era is real-time, volatile, bursty, unordered and unbounded, it places many new and higher demands on the system. In 2011, Twitter introduced the Storm stream computing system, which promoted the development and application of big-data stream computing technology to some extent. However, such systems still have significant shortcomings in scalability, fault tolerance, state consistency, load balancing and data throughput, especially for topic and event detection on streaming text data. Traditional text detection methods, and some based on deep learning, are mostly multi-stage; several stages must be optimized during training, which degrades the final model and is very time-consuming. The size of text in natural scenes varies greatly, and regressing text regions directly with an L1 or L2 loss biases the loss toward larger and longer text regions. As in general object detection, the thresholded results must pass through non-maximum suppression (NMS) to obtain the final result. The conventional methods are keyword recognition and algorithm rules.
Algorithm rules, also called rule engines, are essentially a set of expressions used to improve recognition accuracy. However, junk content evolves too quickly, and such review methods are too easily defeated by complex character recombination, special symbols and the like. The traditional retrieval mode assumes that users understand their query requirements thoroughly: users must translate the requirement into a query expression accurately when retrieving news information, and any deviation in that translation seriously affects the retrieval result. A search engine based on keyword search can hardly satisfy users who cannot define their search requirement precisely and can only describe it abstractly. In the field of topic detection and tracking, common evaluation indexes include precision (P), recall (R) and the F1 value. Because topic and event detection is unsupervised, it cannot be determined directly which cluster corresponds to which class in the labeled data.
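One way to compute P, R and F1 without deciding which cluster corresponds to which labeled class is to score pairs of documents rather than individual assignments. The sketch below illustrates that pairwise scheme; it is an assumption chosen for illustration, not a scoring method stated in this document, and the function name `pairwise_prf` is hypothetical.

```python
from itertools import combinations


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall and F1 for two labelings of the same items.

    A pair of items counts as a true positive when both labelings put the
    two items in the same group, so no cluster-to-class mapping is needed.
    (Illustrative assumption, not the source's evaluation procedure.)
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1  # clustered together, but labeled apart
        elif same_gold:
            fn += 1  # labeled together, but clustered apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Cluster ids are arbitrary; only co-membership matters.
p, r, f1 = pairwise_prf([0, 0, 1, 1], ["a", "a", "a", "b"])
```

Because only co-membership is compared, renaming the cluster ids leaves the three scores unchanged.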
Disclosure of Invention
To solve the above problems, the present invention provides a topic and event detection system for streaming text data that eliminates redundancy in intermediate processing, reduces detection time, and improves both the accuracy of detecting text topics in all directions and the ability to track topics and events. The technical scheme adopted by the invention is as follows. A streaming data event text topic and detection system comprises: a topic detection module that receives the information stream; a topic tracking module, an event identification module and an event extraction module connected in series in that order; an association detection module connected to the topic tracking module; and a text summarization module that exchanges information with the topic tracking module and the event identification module. The system is characterized in that: without definite background information about topics and events as a reference, the topic detection module autonomously analyzes the content discussed in reports, judges whether two reports or two sets of reports belong to the same topic or event, designs a social media streaming text information processing system, and constructs a topic and event detection algorithm model; without knowing all topics in advance, it detects topics and events in the information stream that were not previously known to the system, crawls text data in real time from major online media and social platforms using crawler technology, and cleans the crawled data; the cleaned text data is processed by the social media streaming text information processing system to distinguish the topics and events of the text data; the topic tracking module organizes texts with the topic as the basic unit and produces a topic summary and keyword information from the text summary information provided by the text summarization module; the association detection module examines texts in all directions, processes the text information belonging to a topic, assigns texts to events, extracts the keywords and structured information of the events, processes the cleaned text data with the social media streaming text information processing system to distinguish the topics and events of the text data, sets a time window of fixed length, clusters the text data within the time window by hierarchical clustering, sets a threshold at which clustering ends, and sends the detected topic clustering results to the topic tracking module to obtain finer-grained clustering results, which are sent to the event identification module; the event identification module identifies events by hierarchical clustering from the topic or event streaming text data provided by the topic tracking module and the text summarization module, uniquely distinguishes text similarity during clustering, organizes texts and encodes document-level content with the topic as the basic unit, produces topic summaries and keyword information using the designed topic summarization and event extraction algorithms, and sends them to the event extraction module; the event extraction module assigns the text information belonging to the same topic to events, extracts the keywords and structured information of the events, analyzes the key information of topics and events, calculates and normalizes the weight of each dimension based on a TF-IWF fast keyword extraction algorithm for domain documents, a hierarchical clustering method and an incremental TF-IWF model, and performs graph path expansion through a memory mechanism to obtain a large topic set T.
Compared with the prior art, the invention has the following beneficial effects:
the invention adopts a topic detection module for receiving information flow, a topic tracking module, an event identification module and an event extraction module which are connected in series in sequence, an association detection module connected with the topic tracking module, and a text summarization module for exchanging information with the topic tracking module and the event identification module, and constructs a whole flow system from original information flow to topic and event description and storage. The dynamic life cycle calculation mode provided by the invention fully considers the periodicity of news topics and news events in social media transmission, and effectively improves the ability of topic and event tracking.
Under the condition that no specific topic and event background information is used as reference, the topic detection module autonomously analyzes the contents of report discussion, judges whether two reports or two report sets belong to a topic or an event or not, designs a social media streaming text information processing system, and constructs a topic and event detection algorithm model; the technology for automatically detecting news topics organizes mass news information according to topics and displays the information to users in a certain mode to meet the requirements of the users. In order to relieve errors in remote supervision labeling, trigger word labeling is omitted, so that the system is easier to use.
The invention detects the unknown special subjects and events of the information flow and organization system in advance under the condition of not knowing all the special subjects in advance, crawls text data from various large network media and social platforms in real time by adopting a crawler technology, and performs data cleaning on the crawled data; processing the cleaned text data by using a social media streaming text information processing system, and distinguishing the special subjects and events of the text data; new topics and events can be detected in real time for incoming text information.
The invention sets a time window with determined length, clusters the text data in the time window by using a hierarchical clustering method, sets a threshold value to finish clustering, sends the detected special topic clustering result to a special topic tracking module to obtain a clustering result with smaller granularity, sends the clustering result to an event identification module, the event identification module adopts a hierarchical clustering method to identify events according to special topic or event stream text data provided by the special topic tracking module and a text abstract module, uniquely distinguishes the similarity of the texts during clustering, organizes the texts by taking special topics as a basic unit, gives a designed special topic abstract and event extraction algorithm, a special topic abstract and keyword information, considers the clustered events as documents related to event description, and encodes the content at the document level, the clustering of the documents according to the theme can be divided into backtracking detection and online detection, the text information is hierarchically classified according to the expressed special topic, the organization is convenient for the retrieval and browsing of the user and the selective use, and the special topic information can also be actively pushed to the user, so that the personalized service is realized. The application is mainly embodied in a content management system taking the processing of massive texts as a core, and can detect the topics of news report streams in real time and automatically collect the news report streams from news websites through web crawlers to obtain thematic topic clusters on different granularity levels.
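The windowed clustering step described above can be illustrated with a minimal sketch. It assumes bag-of-words term counts, cosine similarity between cluster centroids, and a similarity threshold below which merging stops; the text does not specify the similarity measure or linkage, so these choices and the names `cosine` and `cluster_window` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_window(docs, start, length, threshold):
    """Agglomerative clustering of the documents inside one time window.

    docs: list of (timestamp, tokens). Returns clusters as lists of indices
    into docs. Merging ends when the best similarity drops below threshold.
    (Assumed design: centroid linkage with cosine similarity.)
    """
    in_window = [i for i, (ts, _) in enumerate(docs)
                 if start <= ts < start + length]
    clusters = [[i] for i in in_window]
    # Summed term counts act as centroids: cosine ignores vector scale.
    centroids = [Counter(docs[i][1]) for i in in_window]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(centroids[a], centroids[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # the threshold ends clustering
        a, b = pair
        clusters[a] += clusters[b]
        centroids[a] = centroids[a] + centroids[b]
        del clusters[b], centroids[b]
    return clusters


docs = [
    (0, ["flood", "river", "rescue"]),
    (1, ["flood", "rescue", "team"]),
    (2, ["election", "vote"]),
    (3, ["election", "vote", "result"]),
    (100, ["flood", "river"]),  # falls outside the window below
]
clusters = cluster_window(docs, start=0, length=10, threshold=0.5)
```

In this sketch a higher threshold stops merging earlier and so yields finer-grained clusters, which is one way to obtain groupings at different granularity levels from the same window.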
The event extraction module divides the attribution of the event for the text information belonging to the same topic, extracts the key words and the structural information of the event, analyzes the key information of the topic and the event, calculates each dimension weight and normalizes the dimension weight based on the field document key word fast extraction algorithm and the hierarchical clustering method of TF-IWF and the incremental TF-IWF model, and performs graph path expansion through a memory mechanism to obtain a topic set T with a large number, thereby effectively improving the analysis capability of the topic and the event. The problems of overlarge calculated amount, more used data nodes, larger error and the like in the event construction process are effectively reduced, and the method has high usability.
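The weighting and normalization step can be sketched as follows. The sketch assumes the commonly cited TF-IWF form, in which the term frequency of word w in text d is multiplied by the logarithm of the total number of words in the corpus divided by the number of occurrences of w in the corpus, followed by L2 normalization per document. The patent's own formula may differ, and the function name `tf_iwf` is hypothetical.

```python
import math
from collections import Counter


def tf_iwf(docs):
    """Return one {word: weight} dict per document, L2-normalized.

    docs: list of token lists.
    weight(d, w) = TF(d, w) * log(total_words / corpus_count(w))
    (assumed form of TF-IWF; not confirmed by the source text).
    """
    corpus_counts = Counter(w for d in docs for w in d)
    total_words = sum(corpus_counts.values())
    weighted = []
    for d in docs:
        tf = Counter(d)
        raw = {
            w: (c / len(d)) * math.log(total_words / corpus_counts[w])
            for w, c in tf.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({w: v / norm for w, v in raw.items()} if norm else raw)
    return weighted


weights = tf_iwf([["flood", "flood", "the"], ["the", "vote", "the"]])
```

Since the weights depend only on running word counts, an incremental variant could update `corpus_counts` and `total_words` as new texts arrive instead of recomputing them from scratch; that update policy is likewise an assumption.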
The method simplifies and optimizes a plurality of complicated calculation processes, ensures the accuracy and reliability of the calculation result on the premise of (quasi-) real time, and is more suitable for processing various social media information without any prior knowledge.
Drawings
FIG. 1 is a schematic diagram of the organizational structure of a topic or event streaming text data detection system according to the present invention;
FIG. 2 is a data processing flow diagram of FIG. 1;
FIG. 3 is a schematic view of the operating principle of FIG. 1.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.
Detailed Description
Refer to FIGS. 1 and 2. In the preferred embodiment described below, a topic and event detection system for streaming text data comprises: a topic detection module that receives the information stream; a topic tracking module, an event identification module and an event extraction module connected in series in that order; an association detection module connected to the topic tracking module; and a text summarization module that exchanges information with the topic tracking module and the event identification module. The system is characterized in that: without definite background information about topics and events as a reference, the topic detection module autonomously analyzes the content discussed in reports, judges whether two reports or two sets of reports belong to the same topic or event, designs a social media streaming text information processing system, and constructs a topic and event detection algorithm model; without knowing all topics in advance, it detects topics and events in the information stream that were not previously known to the system, crawls text data in real time from major online media and social platforms using crawler technology, and cleans the obtained data; the cleaned text data is processed by the social media streaming text information processing system to distinguish the topics and events of the text data; the topic tracking module organizes texts with the topic as the basic unit and produces a topic summary and keyword information from the text summary information provided by the text summarization module; the association detection module examines texts in all directions, processes the text information belonging to a topic, assigns texts to events, extracts the keywords and structured information of the events, processes the cleaned text data with the social media streaming text information processing system to distinguish the topics and events of the text data, sets a time window of fixed length, clusters the text data within the time window by hierarchical clustering, sets a threshold at which clustering ends, and sends the detected topic clustering results to the topic tracking module to obtain finer-grained clustering results, which are sent to the event identification module; the event identification module identifies events by hierarchical clustering from the topic or event streaming text data provided by the topic tracking module and the text summarization module, uniquely distinguishes text similarity during clustering, organizes texts and encodes document-level content with the topic as the basic unit, produces topic summaries and keyword information using the designed topic summarization and event extraction algorithms, and sends them to the event extraction module; the event extraction module assigns the text information belonging to the same topic to events, extracts the keywords and structured information of the events, analyzes the key information of topics and events, calculates and normalizes the weight of each dimension based on a TF-IWF fast keyword extraction algorithm for domain documents, a hierarchical clustering method and an incremental TF-IWF model, and performs graph path expansion through a memory mechanism to obtain a large topic set T.
In the topic detection stage, the event identification module sets a higher threshold θ_1, and the event extraction module obtains a large number of topic sets T using a traditional hierarchical clustering method and an incremental TF-IWF model. The TF-IWF (term frequency, inverse word frequency) weighting algorithm computes each weight from the frequency of occurrence of word w, the frequency tf(d, w) of word w in text d, and the total number of words in the corpus. The weights are calculated using the formula shown below,
Figure BDA0002861781740000071
where wf_t(w) denotes the total number of occurrences of word w in the corpus up to time t.
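The incremental weighting described above can be sketched in Python. This is a minimal illustration, not the formula of the patent figure, which did not survive extraction: it assumes the common TF-IWF form tf(d, w) · log(total words / wf_t(w)), and the add-one smoothing, the L2 normalization, and the class name `IncrementalTfIwf` are all assumptions made for the example.

```python
import math
from collections import Counter


class IncrementalTfIwf:
    """Running corpus statistics for TF-IWF weighting over a text stream."""

    def __init__(self):
        self.word_freq = Counter()  # wf_t(w): occurrences of w seen so far
        self.total_words = 0        # total word occurrences seen so far

    def update(self, tokens):
        """Fold one new tokenized text into the corpus statistics."""
        self.word_freq.update(tokens)
        self.total_words += len(tokens)

    def weights(self, tokens):
        """Return L2-normalized TF-IWF weights for one tokenized text."""
        tf = Counter(tokens)
        raw = {}
        for word, count in tf.items():
            wf = self.word_freq.get(word, 0) + 1         # assumed smoothing
            iwf = math.log((self.total_words + 1) / wf)  # inverse word frequency
            raw[word] = (count / len(tokens)) * iwf
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        return {word: v / norm for word, v in raw.items()}
```

Because only two running counters are kept, each new text updates the statistics in time proportional to its length, which is what makes the model usable on a stream.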
In the topic tracking stage, the topic tracking module adds a term that suppresses the merging of large topics and sets a smaller threshold θ_2. For two topics T_1 and T_2 containing n_1 and n_2 texts respectively, it calculates the maximum similarity sim_{i,j}:
Figure BDA0002861781740000086
where α is the attenuation base applied to the cosine value, and n_1 and n_2 are the numbers of texts in the two topics being compared.
The topic tracking module takes the maximum similarity sim_{i,j} and the two corresponding topics T_1 and T_2 and, using the text counts n_1 and n_2 and the attenuation base α, amplifies the maximum similarity sim_{i,j} to calculate the similarity
Figure BDA0002861781740000081
This keeps the threshold within a stable interval.
If the similarity obtained by the topic tracking module
Figure BDA0002861781740000082
is greater than the threshold θ_2, the two topics are merged to form a new topic T_k, whose vector is the mean of all text vectors in the topic; if the similarity
Figure BDA0002861781740000083
is less than the threshold θ_2, the topic set is output.
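The merge loop of the tracking stage can be sketched as follows. The exact similarity formulas are in figures that did not survive extraction, so `damped_similarity` is a hypothetical stand-in that merely shrinks the cosine value as the two topics grow; the default `alpha=1.05` and all function names are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Topic vector: the mean of all text vectors in the topic."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def damped_similarity(cos_sim, n1, n2, alpha):
    # Stand-in penalty (assumption): larger topics are harder to merge.
    return cos_sim * alpha ** -(n1 + n2 - 2)


def merge_topics(topics, theta2, alpha=1.05):
    """Merge the most similar pair repeatedly until none exceeds theta2.

    `topics` is a list of topics; each topic is a list of text vectors.
    """
    topics = [list(t) for t in topics]
    while len(topics) > 1:
        best, pair = -1.0, None
        for i in range(len(topics)):
            for j in range(i + 1, len(topics)):
                sim = damped_similarity(
                    cosine(mean_vector(topics[i]), mean_vector(topics[j])),
                    len(topics[i]), len(topics[j]), alpha)
                if sim > best:
                    best, pair = sim, (i, j)
        if best <= theta2:
            break  # no pair is similar enough: output the topic set
        i, j = pair
        topics[i] = topics[i] + topics[j]  # new topic T_k
        del topics[j]
    return topics
```

With three single-text topics `[[1, 0]]`, `[[0.9, 0.1]]`, and `[[0, 1]]` and `theta2=0.8`, the first two are merged and the third stays separate.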
See Fig. 3. Like other regularities of human activity and society, news has a life cycle, or life span. The life cycle of topics should therefore be considered when determining the relevance between topics. Different types of news topics have different life cycles; judged by the number of related reports, the more popular a topic is, the longer its life cycle. The topic tracking module stipulates that a news topic that has received no update for N days has reached the end of its life cycle; the topic is frozen, moved into the historical topic set, and no longer participates in topic merging. The topic detection module performs detection on the input information stream of reports 1, 2, 3, …, N-1, N and sends the detection result to the association detection module, which performs association detection on topics 1, 2, …, m. The topic set N within the life cycle is defined as:
Figure BDA0002861781740000087
That is, a news topic that has not been updated for n days is taken to have ended its life cycle: it is frozen, moved into the historical topic set, and no longer takes part in topic merging.
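The freezing rule can be illustrated with a short Python sketch. The `Topic` class, its field names, and the function `freeze_stale_topics` are hypothetical; the sketch only shows the idle-time test and does not model the popularity-dependent life cycle or the decay function.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Topic:
    name: str
    last_update: datetime  # time the last report was clustered into the topic


def freeze_stale_topics(active, history, now, max_idle_days):
    """Move topics idle for more than `max_idle_days` from `active` to `history`.

    Returns the topics that remain active; frozen topics are appended to
    `history` and should be skipped by any later merge step.
    """
    limit = timedelta(days=max_idle_days)
    still_active = []
    for topic in active:
        if now - topic.last_update > limit:
            history.append(topic)
        else:
            still_active.append(topic)
    return still_active
```

Keeping frozen topics in a separate list means the pairwise merge step only ever iterates over live topics, so its cost does not grow with the full history of the stream.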
The topic tracking module tracks topics 1, 2, …. Using the shortest topic life cycle δ, the topic creation time t_0, the current time t_s0, the time t_c at which the last report was clustered into the topic, and two hyper-parameters α and β, it adjusts the life cycle N and obtains the decay function of importance over time:
Figure BDA0002861781740000084
Figure BDA0002861781740000085
When the current time is the time at which the last report was clustered into the topic, the life cycle N of the topic depends only on the number num of reports related to the topic. For the topic set within the life cycle, the text summary module counts the frequency of keywords in the sentences of summary 1, summary 2, …, summary k, ranks the sentences, scores the sentences in the document one by one against the ranked word-frequency list, selects the high-scoring sentences to form the text summary, and sends the summary to the event identification module for identification. If no new report enters the topic, the topic is moved into the historical topic set: when t_s0 - t_c > N, the topic, whose life cycle is judged by the number of related reports, has gradually declined over time and also enters the historical topic set.
Some sentences carry more information than others: the more keywords a sentence contains, the more important it is. The event extraction module adopts an extractive model. Based on word-frequency statistics of the topic summary, it extracts events from the reports of events 1, 2, … within the topic, finding the sentences that carry the most information, that is, the sentences containing the most keywords. The amount of topic information in a sentence is measured by its keywords, and the automatic summary is used for keyword extraction. The structured event information, comprising keyword 1, keyword 2, …, keyword i, ID addresses, time, place, trigger action, participants, and so on, is sent to a recommendation system or a search system.
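One way to picture the structured event information is as a small record type. The sketch below is illustrative only: the class `EventRecord`, its field names, and the sample values are placeholders invented for the example, not a schema or data taken from the patent.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class EventRecord:
    """Hypothetical container for the structured fields of one event."""
    event_id: str
    keywords: List[str]
    time: str
    place: str
    trigger: str
    participants: List[str] = field(default_factory=list)


# Placeholder values, used only to show the shape of the record.
record = EventRecord(
    event_id="evt-001",
    keywords=["flood", "rescue"],
    time="2020-12-25",
    place="example city",
    trigger="evacuate",
    participants=["residents"],
)
payload = asdict(record)  # plain dict, ready to serialize for a consumer
```

A flat record like this is easy to index by keyword, time, or place, which is the kind of access a recommendation or search system would need.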
The extractive automatic topic summary scores sentences by the importance of their "clusters". A "cluster" is a span of a sentence in which the gap between neighboring keywords stays below a given value; that is, as long as the distance between keywords is less than a "threshold", they are considered to be in the same cluster. The importance of a "cluster" is calculated by formula (6):
Figure BDA0002861781740000091
where n is the number of keywords the cluster contains and l is the length of the cluster. The importance score of every cluster in each sentence is calculated, the sentences containing the highest-scoring clusters (for example, the top 5 sentences) are selected, and these sentences are combined to form the automatic summary of the topic.
The text summary module adopts an extractive model and measures the information content of a sentence by its "keywords" in order to find the most informative sentences. It counts the frequency of keywords in the sentences and ranks the sentences accordingly. Keywords whose distance is less than the threshold are placed in the same cluster, and sentences are scored by the importance of their "clusters": the sentences in the document are scored one by one against the ranked term-frequency (TF) list and the high-scoring sentences are selected. From the keyword density and the frequency TF, the module counts the number n of keywords contained in a cluster, takes l as the length of the cluster, and calculates an importance score for each cluster in the sentence
Figure BDA0002861781740000092
The sentences containing the highest-scoring clusters are then selected, and the topic information they carry is extracted and combined to form the automatic summary of the topic.
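The cluster-based sentence scoring can be sketched in Python. The formula image is not recoverable, so the sketch assumes the widely used Luhn-style score n² / l, which matches the symbols n and l defined in the text; the whitespace tokenization, the default `threshold=5`, and the function names are further assumptions.

```python
def cluster_scores(tokens, keywords, threshold=5):
    """Score the keyword clusters of one sentence as n**2 / l.

    Two neighboring keywords share a cluster when their positions differ
    by less than `threshold`. n is the number of keywords in the cluster
    and l is the cluster length in tokens.
    """
    positions = [i for i, tok in enumerate(tokens) if tok in keywords]
    if not positions:
        return []
    clusters, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] < threshold:
            current.append(pos)
        else:
            clusters.append(current)
            current = [pos]
    clusters.append(current)
    return [len(c) ** 2 / (c[-1] - c[0] + 1) for c in clusters]


def summarize(sentences, keywords, top_k=5):
    """Return the top_k sentences by best cluster score, in document order."""
    scored = []
    for index, sentence in enumerate(sentences):
        scores = cluster_scores(sentence.lower().split(), keywords)
        scored.append((max(scores, default=0.0), index))
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_k]
    return [sentences[i] for _, i in sorted(best, key=lambda s: s[1])]
```

Squaring n rewards dense runs of keywords, while dividing by l penalizes clusters padded with non-keywords. A real pipeline would replace `split()` with a proper tokenizer, especially for Chinese text.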
Furthermore, the text summary module divides the text information belonging to the same topic by event and extracts the keywords and structured information of each event.
The goal of event detection is to detect all events contained in an information stream, without distinguishing their importance or specific type. The event identification module detects events in the same way that topics are detected: hierarchical clustering is still used, and the only difference is that a higher text similarity is required during clustering, so the clustering result has finer granularity. To extract the keywords of an event, the event identification module combines TF-IDF, a weighting technique commonly used in information retrieval and data mining that relies on the inverse document frequency (IDF), with the TF-IWF weighting algorithm, which uses the inverse word frequency.
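The effect of the similarity threshold on granularity can be shown with a small agglomerative clustering sketch. The patent does not state which linkage is used, so average linkage over cosine similarity is an assumption here, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agglomerate(vectors, threshold):
    """Average-link agglomerative clustering that stops at `threshold`.

    Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -2.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # the closest pair is not similar enough: stop merging
        x, y = pair
        clusters[x] += clusters[y]
        del clusters[y]
    return clusters
```

Running the same function twice, once with a looser threshold for topics and once with a stricter one for events, yields the coarse and fine partitions described above: raising the threshold can only stop merging earlier, so the event-level clusters are never larger than the topic-level ones.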
The event extraction module adopts a document-level event extraction model (Doc2EDAG). The core idea of the model is to convert the document-level event table filling task (DEE) into a path expansion task on an entity-based directed acyclic graph (EDAG). The document-level context is encoded, and a memory mechanism is designed for graph path expansion. To reduce errors in distant supervision labeling, the model omits trigger-word labeling; this removes the need to predefine trigger words or to generate them heuristically (that is, for sentences without trigger words, heuristically picking a trigger word from a predefined set).
The event extraction module uses the document-level event extraction model (Doc2EDAG) to extract events: it converts the document-level event table filling task (DEE) into a path expansion task on an entity-based directed acyclic graph (EDAG), designs a memory mechanism to expand the graph path, and weights the graph path using a mechanism that estimates conditional probabilities. Given a known source symbol sequence x_1 … x_n, the weights obtained from the conditional probability distributions estimated by Context model weighting, and the weights corresponding to the shortest code length of a segment of the source sequence, are combined for two groups of Context models, and the current document-level source symbol is encoded. The similarity of the probability distributions is judged from the relation between the description length increment and a threshold. If the distributions are similar, the code length is obtained by weighting; if not, the distribution with the smallest information entropy is selected for coding. The average code length decreases gradually until it approaches the limit entropy of the source. The number of bits of code length after each pixel is encoded is obtained, a better solution for the probability is sought, and a more accurate conditional probability distribution is constructed. The code lengths obtained for different thresholds are tabulated, the best adaptive value and its weight are stored, and the trigger word is generated.
Experimental results show that using the description length to judge whether the probability distributions are similar, and then applying Context weighting selectively, improves the compression efficiency of the target sequence: the code length is effectively reduced and lossless compression is achieved.
In the field of topic detection and tracking, common evaluation metrics include precision (P), recall (R), and the F1 value. Because topic and event detection is unsupervised, the clusters produced cannot be matched in advance with the classes in the labeled data. For each class detected, the event extraction module therefore calculates the precision P_{i,j}, recall R_{i,j}, and F1_{i,j} against each labeled class using formulas (7), (8), and (9), and stores the system's optimal parameter settings when these metrics reach their maximum. The precision of each class is
Figure BDA0002861781740000101
Recall rate
Figure BDA0002861781740000102
F1 value
Figure BDA0002861781740000103
where num*(T) is the number of texts in the set T,
Figure BDA0002861781740000104
is the i-th topic detected by the topic and event detection algorithm, and
Figure BDA0002861781740000105
is the j-th labeled topic in the corpus.
During event extraction, the event extraction module also calculates the precision (P), recall (R), and F1 value for the event components, and stores the system's optimal parameter settings when these metrics reach their maximum.
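The per-class evaluation can be sketched in Python. Formulas (7) to (9) are images that were lost, so the sketch assumes the standard set-overlap definitions: precision is the overlap divided by the size of the detected cluster, recall is the overlap divided by the size of the labeled class, and F1 is their harmonic mean. Matching each detected cluster to the labeled class with the highest F1 is likewise an assumed convention, and the function names are hypothetical.

```python
def pairwise_prf(detected, labeled):
    """Return {(i, j): (P, R, F1)} for detected cluster i vs labeled class j.

    Both arguments are lists of sets of text ids.
    """
    scores = {}
    for i, det in enumerate(detected):
        for j, lab in enumerate(labeled):
            overlap = len(det & lab)
            p = overlap / len(det) if det else 0.0
            r = overlap / len(lab) if lab else 0.0
            f1 = 2 * p * r / (p + r) if p + r else 0.0
            scores[(i, j)] = (p, r, f1)
    return scores


def best_f1_per_cluster(detected, labeled):
    """For each detected cluster, keep the labeled class with the highest F1."""
    best = {}
    for (i, j), (p, r, f1) in pairwise_prf(detected, labeled).items():
        if i not in best or f1 > best[i][3]:
            best[i] = (j, p, r, f1)
    return best
```

For detected clusters `[{1, 2, 3}, {4, 5}]` and labeled classes `[{1, 2}, {3, 4, 5}]`, cluster 0 is matched to class 0 and cluster 1 to class 1. Sweeping the clustering thresholds and keeping the setting with the highest scores corresponds to the parameter selection described above.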
The method simplifies and optimizes several complicated calculation processes, ensures accurate and reliable results while operating in real time, and is well suited to processing all kinds of social media information without any prior knowledge.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A streaming data event text topic and detection system, comprising: a topic detection module that receives an information stream; a topic tracking module, an event identification module, and an event extraction module connected in series in that order; an association detection module connected to the topic tracking module; and a text summary module that exchanges information with the topic tracking module and the event identification module, characterized in that: without background information on topics and events as a reference, the topic detection module autonomously analyzes the content discussed in reports and judges whether two reports or two sets of reports belong to the same topic or event, a social media streaming text information processing system being designed and a topic and event detection algorithm model being constructed; with not all topics known in advance, topics and events in the information stream that were previously unknown are detected and organized, text data is crawled in real time from major online media and social platforms using crawler technology, and the crawled data is cleaned; the cleaned text data is processed by the social media streaming text information processing system, which distinguishes the topics and events of the text data; the topic tracking module organizes the text with the topic as the basic unit and gives the topic summary and keyword information according to the summary information provided by the text summary module; the association detection module examines the texts comprehensively, processes the text information belonging to one topic, divides it by event, and extracts the keywords and structured information of the events; a time window of fixed length is set, the text data in the time window is clustered using a hierarchical clustering method, a threshold is set to terminate clustering, and the detected topic clustering result is sent to the topic tracking module, which obtains a clustering result of finer granularity and sends it to the event identification module; the event identification module identifies events by hierarchical clustering according to the topic or event streaming text data provided by the topic tracking module and the text summary module, differing only in the text similarity required during clustering, organizes texts with topics as the basic unit, encodes document-level content, and sends the topic summary, the event extraction algorithm, and the topic keyword information to the event extraction module; and the event extraction module divides the text information belonging to the same topic by event, extracts the keywords and structured information of the events, analyzes the key information of topics and events, computes and normalizes all dimension weights based on the TF-IWF fast keyword extraction algorithm for domain documents, a hierarchical clustering method, and an incremental TF-IWF model, and performs graph path expansion through a memory mechanism to obtain a topic set T containing a large number of topics.
2. The streaming data event text topic and detection system of claim 1, wherein: in the topic detection stage, the event identification module sets a higher threshold θ_1, and the event extraction module obtains a large number of topic sets T using a traditional hierarchical clustering method and an incremental TF-IWF model, the TF-IWF weighting algorithm computing each weight from the frequency of occurrence of word w, the frequency tf(d, w) of word w in text d, and the total number of words in the corpus.
3. The streaming data event text topic and detection system of claim 2, wherein: the weighting algorithm is calculated using the following formula,
Figure FDA0002861781730000011
where wf_t(w) denotes the total number of occurrences of word w in the corpus up to time t.
4. The streaming data event text topic and detection system of claim 1, wherein: in the topic tracking stage, the topic tracking module adds a term that suppresses the merging of large topics and sets a smaller threshold θ_2, and for two topics T_1 and T_2 containing n_1 and n_2 texts respectively, calculates the maximum similarity
Figure FDA0002861781730000021
where α is the attenuation base applied to the cosine value, and n_1 and n_2 are the numbers of texts in the two topics being compared.
5. The streaming data event text topic and detection system of claim 4, wherein: the topic tracking module takes the maximum similarity sim_{i,j} and the two corresponding topics T_1 and T_2 and, using the text counts n_1 and n_2 and the attenuation base α, amplifies the maximum similarity sim_{i,j} to calculate the similarity
Figure FDA0002861781730000022
which keeps the threshold within a stable interval; if the similarity obtained by the topic tracking module
Figure FDA0002861781730000023
is greater than the threshold θ_2, the two topics are merged to form a new topic T_k, whose vector is the mean of all text vectors in the topic; if the similarity
Figure FDA0002861781730000024
is less than the threshold θ_2, the topic set is output.
6. The streaming data event text topic and detection system of claim 1, wherein: the topic tracking module stipulates that a news topic that has received no update for N days has reached the end of its life cycle; the topic is frozen, moved into the historical topic set, and no longer participates in topic merging.
7. The streaming data event text topic and detection system of claim 1, wherein: the topic detection module performs detection on the input information stream of reports 1, 2, 3, …, N-1, N and sends the detection result to the association detection module, which performs association detection on topics 1, 2, …, m, the topic set N within the life cycle being defined as:
Figure FDA0002861781730000025
and the topic tracking module stipulates that a news topic that has not been updated for n days has ended its life cycle; the topic is frozen, moved into the historical topic set, and no longer participates in topic merging.
8. The streaming data event text topic and detection system of claim 1, wherein: the topic tracking module tracks topics 1, 2, …, and, using the shortest topic life cycle δ, the topic creation time t_0, the current time t_s0, and the time t_c at which the last report was clustered into the topic, adjusts the life cycle N and obtains the decay function of importance over time
Figure FDA0002861781730000026
9. The streaming data event text topic and detection system of claim 1, wherein: when the current time is the time at which the last report was clustered into the topic, the life cycle N of the topic depends only on the number num of reports related to the topic; for the topic set within the life cycle, the text summary module counts the frequency of keywords in the sentences of summary 1, summary 2, …, summary k, ranks the sentences, scores the sentences in the document one by one against the ranked word-frequency list, selects the high-scoring sentences to form the text summary, and sends the summary to the event identification module for identification; if no new report enters the topic, the topic is moved into the historical topic set; and when t_s0 - t_c > N, the topic, whose life cycle is judged by the number of related reports, has gradually declined over time and also enters the historical topic set.
10. The streaming data event text topic and detection system of claim 1, wherein: the event extraction module adopts an extractive model and, based on word-frequency statistics of the topic summary, extracts events from the reports of events 1, 2, … within the topic, finding the sentences that carry the most information, that is, the sentences containing the most keywords; the amount of topic information in a sentence is measured by its keywords, and the automatic summary is used for keyword extraction; and the structured event information, comprising keyword 1, keyword 2, …, keyword i, ID addresses, time, place, trigger action, participants, and so on, is sent to a recommendation system or a search system.
CN202011566187.3A 2020-12-25 2020-12-25 Stream data event text topic and detection system Pending CN112597269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011566187.3A CN112597269A (en) 2020-12-25 2020-12-25 Stream data event text topic and detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011566187.3A CN112597269A (en) 2020-12-25 2020-12-25 Stream data event text topic and detection system

Publications (1)

Publication Number Publication Date
CN112597269A true CN112597269A (en) 2021-04-02

Family

ID=75202706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011566187.3A Pending CN112597269A (en) 2020-12-25 2020-12-25 Stream data event text topic and detection system

Country Status (1)

Country Link
CN (1) CN112597269A (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050198056A1 (en) * 2004-03-02 2005-09-08 Microsoft Corporation Principles and methods for personalizing newsfeeds via an analysis of information novelty and dynamics
US20090292660A1 (en) * 2008-05-23 2009-11-26 Amit Behal Using rule induction to identify emerging trends in unstructured text streams
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
US20130290232A1 (en) * 2012-04-30 2013-10-31 Mikalai Tsytsarau Identifying news events that cause a shift in sentiment
CN104199974A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Microblog-oriented dynamic topic detection and evolution tracking method
WO2015084756A1 (en) * 2013-12-02 2015-06-11 Qbase, LLC Event detection through text analysis using trained event template models
CN105005590A (en) * 2015-06-29 2015-10-28 北京信息科技大学 Method for generating special topic staged abstract of information media
CN105956197A (en) * 2016-06-15 2016-09-21 杭州量知数据科技有限公司 Social media graph representation model-based social risk event extraction method
US20170235820A1 (en) * 2016-01-29 2017-08-17 Jack G. Conrad System and engine for seeded clustering of news events
CN108932311A (en) * 2018-06-20 2018-12-04 天津大学 The method of incident detection and prediction
CN109829089A (en) * 2018-12-12 2019-05-31 中国科学院计算技术研究所 Social network user method for detecting abnormality and system based on association map
CN110147439A (en) * 2018-07-18 2019-08-20 中山大学 A kind of news event detecting method and system based on big data processing technique
CN110162632A (en) * 2019-05-17 2019-08-23 北京百分点信息科技有限公司 A kind of method of Special Topics in Journalism event discovery
CN110232149A (en) * 2019-05-09 2019-09-13 北京邮电大学 A kind of focus incident detection method and system
US20200184151A1 (en) * 2018-11-30 2020-06-11 Thomson Reuters Special Services Llc Systems and methods for identifying an event in data
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN112069383A (en) * 2020-08-31 2020-12-11 杭州叙简科技股份有限公司 News text event and time extraction and normalization system for event tracking

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
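Ding Shengchun (丁晟春) et al.: "Research on microblog burst event detection based on bursty topic words and agglomerative hierarchical clustering", New Technology of Library and Information Service (《现代图书情报技术》) *
Zhang Yangsen (张仰森) et al.: "A microblog burst event detection method based on multiple word features", Acta Electronica Sinica (《电子学报》) *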

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210402