CN110968676A - Text data semantic spatio-temporal mode exploration method based on LDA model and LSTM network

Info

Publication number: CN110968676A
Application number: CN201911234313.2A
Authority: CN
Inventors: 贺一桐; 张康; 李�杰
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2019-12-05
Filing date: 2019-12-05
Publication date: 2020-04-07

Abstract

The invention discloses a method for exploring semantic spatio-temporal patterns in text data based on an LDA (latent Dirichlet allocation) model and an LSTM (long short-term memory) network, comprising the following steps: (1) topic model integration, comprising topic generation, topic quality evaluation and topic dimension-reduction projection: semantics are extracted from the text data with an LDA topic model, multiple topic models are generated by iterating over different parameters, and after quality evaluation of the topic models the preferred topics are selected and integrated, which reduces the influence of parameter settings on model quality; (2) construction of a topic spatio-temporal body: the time, space and text topic data in the text data are converted into a cube data structure; (3) visual interaction and prediction, comprising a topic projection view, a spatio-temporal projection view and a pattern comparison view, which provide visual interactive exploration of the topic spatio-temporal body so that the user can explore the results visually; the LSTM method is used to predict how the values change in a future time period.

Description

Text data semantic spatio-temporal mode exploration method based on LDA model and LSTM network

Technical Field

The invention mainly relates to the fields of natural language processing and data visualization, and in particular to a method for structured representation of massive text data and optimization of topic models.

Background

The amount of text data worldwide has grown exponentially in recent years, creating an urgent need to mine new knowledge and new viewpoints from it. Processing text data has become more important than ever, in applications ranging from social media analysis to risk management and cyber-crime prevention. Because text data usually carries temporal and spatial information, spatio-temporal attributes are often taken into account when processing it.

Many studies of the spatial and temporal distribution of text data focus on finding keywords in the text. A common approach is to analyze the text to detect associated events that occur at specific times and places, identifying events from groups of texts that share the same location and the same or overlapping times [2]. Marcus et al. [3] detect hot events from peaks in Twitter data and label them semantically with keywords from the tweets. Zhou et al. [4] identify events using a machine-learning burst detection technique. However, these methods present text content only at the keyword level and do not involve semantic analysis of the text. With the spread of topic models, they have in recent years been applied to the visualization of text data. Chen et al. [5] summarize the workflow of visual analysis of social media data. Xu et al. [6] propose a topic competition model to represent public attention to multiple topics.

The work above mainly uses social media text and has several drawbacks. First, it fixes the topic model parameters in advance and extracts semantics from the data with the trained model; because topic models are very sensitive to their parameters, the quality of the model cannot be guaranteed, which impairs its ability to extract semantics. Second, it mainly uses static data and places no requirements on processing or query speed, so it cannot respond to massive text data in time. Finally, it offers only a query function when displaying data and cannot incorporate the user's decisions into the result.

Disclosure of Invention

An object of the present invention is to solve the following problems in the prior art.

1. The LDA topic model replaces the traditional method of classifying text by field, thereby reducing the loss of semantics.

2. Common topic models are very sensitive to parameter settings [8]: slight parameter variations may produce completely different results, which makes it difficult to set topic model parameters reasonably without prior knowledge. The invention generates multiple models iteratively and integrates them, which reduces the influence of parameters on text processing quality.

3. Topic spatio-temporal patterns involve a large number of interactive exploration tasks [9]. For example, the user may need to analyze not only the temporal trend or spatial variation of a topic of interest but also, by comparison, the specific topic content at a given spatio-temporal coordinate, so a consistent way of organizing tasks is needed to better support different analysis scenarios.

Therefore, the invention provides a visual analysis framework for interactively exploring the semantic spatio-temporal patterns of massive text data. First, the framework uses a topic extraction method based on model integration to extract semantics from the text data. Second, the framework integrates a data and task organization structure based on the data cube [7], which enables quick response to various interactive exploration tasks. Finally, a visual interface is designed that supports fast query of, and interaction with, the results.

The purpose of the invention is realized by the following technical scheme:

The method for exploring semantic spatio-temporal patterns of text data based on the LDA model and the LSTM network comprises the following steps:

(1) Topic model integration, comprising topic generation, topic quality evaluation and topic dimension-reduction projection: semantics are extracted from the text data with an LDA topic model, multiple topic models are generated by iterating over different parameters, and after quality evaluation of the topic models the preferred topics are selected and integrated, which reduces the influence of parameter settings on model quality;

(2) Construction of the topic spatio-temporal body: the time, space and text topic data in the text data are converted into a cube data structure, which organizes and stores the results and supports subsequent real-time interaction. Specifically, a cube data structure is designed that supports user query tasks on the data; the discretized time, space and topic serve as the three dimensions of the data cube, and the semantic information extracted from the text data by the topic model is stored in its cells, so that the text data can be explored at the semantic spatio-temporal level;

(3) Visual interaction and prediction, comprising a topic projection view, a spatio-temporal projection view and a pattern comparison view, which provide visual interactive exploration of the topic spatio-temporal body so that the user can explore the results visually; the LSTM method is used to predict how the values change in a future time period.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

1. The accuracy of the analysis result is increased. In existing work, when analyzing the spatio-temporal distribution of a certain type of text, an analyst usually takes text keywords as the basis for classification and analyzes their distribution in space and time. If one text belongs to several case types, it is difficult to obtain all the corresponding types automatically, which results in loss of semantics. The LDA topic model extracts the distribution of a text over multiple types and thus reduces the error introduced by text classification.

2. Predicting the time series of topic values with the LSTM model can improve the speed and accuracy of prediction. Prediction is an important requirement of text data analysis; the traditional approach predicts on the text directly, which makes the model hard to train and prediction slow. Converting a large amount of text into numerical values avoids the effect of the text's high-dimensional sparsity on prediction accuracy, and because the topic value is numerical, prediction is faster, system response time is shorter, and the requirement for real-time interaction can be met.

3. A semantic spatio-temporal body with the three dimensions of time, space and topic is designed. The structure supports various interactive exploration tasks in a consistent way, and semantic information at different spatio-temporal coordinates is stored in advance so that user queries for different spatio-temporal semantic content are answered quickly.

4. A data-cube-based visual interactive system for alarm-receiving log data is realized. In traditional methods the user cannot inspect or adjust the internal structure of the data model. This method displays the content of the model visually, so the user can examine the specific information in the model and refine it, which effectively addresses the difficulty of determining reasonable parameters for a traditional topic model. The visualization also helps the user explore the text data from the angles of time, space and text type.

Drawings

Fig. 1 is a general block diagram of the proposed method.

Fig. 2 is a diagram of LDA topic model integration, in which (1) denotes the corpus, (2) topic model integration and (3) the visual topic projection; ① denotes iterating over different parameters multiple times to obtain the integrated topic models, ② dimension-reduction projection of the topics, and ③ the user selecting topics interactively through the topic projection.

Fig. 3 is a diagram of the construction of the topic spatio-temporal body, in which (1) denotes the police data, (2) topic model integration, (3) the data cube and (4) the visual interface; ① denotes extraction of topic distribution values, ② spatial clustering, ③ temporal clustering, ④ visualization of query results, and ⑤ interactive query.

Fig. 4 is a diagram of the visualization interface, in which (1) denotes the LDA topic distribution view, (2) the spatio-temporal projection view and (3) the pattern comparison view; (a) denotes the spatial distribution sub-view, (b) the query conditions, (c) the ranking chart of regional topic values, (d) the distribution of topic values by hour, (e) the distribution of topic values by day of the week, and (f) the distribution of topic values by date.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a visual analysis framework for interactively exploring the semantic spatio-temporal patterns of massive text data. First, the framework uses a topic extraction method based on model integration: the topics extracted by topic models with different parameters are projected onto a plane, so that the user can see the differences between topics intuitively and select the topics of interest, which addresses the difficulty of setting topic model parameters accurately. Second, the framework integrates a data and task organization structure based on the data cube: two indexes of each topic are stored in advance for the different time and space coordinates, and the different analysis tasks are mapped to projection and slicing operations on the data body, which gives quick response to various interactive exploration tasks. Specifically, as shown in Fig. 1, the method mainly comprises the following steps:

Step one: topic model integration (Fig. 2). The operations comprise topic generation, topic quality evaluation and topic dimension-reduction projection. Multiple topic models are generated by iterating over different parameters; after quality evaluation of the topic models, the preferred topics are selected and integrated; all integrated topics are projected onto a two-dimensional plane by dimension reduction, and the content of each topic is shown as a word cloud so that the user can understand the topics and see the differences between them.

This embodiment uses the LDA topic model, a probabilistic topic modeling method commonly used for processing large amounts of text, to extract topics from the alarm-receiving log. The alarm content field of the alarm-receiving log serves as the original corpus of the LDA model, and the alarm content of each record is treated as one document (Fig. 2, (1)). Because the alarm content contains a large amount of noise that would lower the quality of the generated topic model, the text must be cleaned before the model is trained: after word segmentation and stop-word removal, the main steps are removing useless characters, replacing synonyms and counting word frequencies.
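The cleaning steps named above can be sketched roughly as follows. This is a minimal sketch, assuming each record has already been segmented into words by a separate word-segmentation tool; the stop-word list, the synonym table and the function names `preprocess` and `word_frequencies` are hypothetical and are not taken from the patent.

```python
import re
from collections import Counter

# Hypothetical resources; a real system would load domain-specific lists.
STOP_WORDS = {"the", "a", "of", "at"}
SYNONYMS = {"stolen": "theft", "robbed": "robbery"}

def preprocess(tokens):
    """Clean one segmented record: drop useless characters and stop words,
    then map synonyms to a canonical term."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(r"[^\w]", "", tok).lower()  # strip punctuation and symbols
        if not tok or tok.isdigit() or tok in STOP_WORDS:
            continue
        cleaned.append(SYNONYMS.get(tok, tok))
    return cleaned

def word_frequencies(documents):
    """Count word frequency over the whole cleaned corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    return counts

docs = [preprocess(["Wallet", "stolen", "at", "the", "station", "!!"]),
        preprocess(["Phone", "theft", ",", "station", "110"])]
freq = word_frequencies(docs)
```

The word frequencies could then be used, for instance, to drop very rare or very common words before training, although the source does not say how the counts are applied.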

After this processing, the original corpus can be used to train the model. To free the quality of the topic model from its dependence on parameters, the invention uses an iterative method: multiple topic models are generated by varying the parameters that have the greater influence, with the number of topics iterated over a reasonable interval, and the generated models are then integrated so that the user can view and select topics for extracting the semantic information in the alarm-receiving text.
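The iteration over topic counts can be sketched as below. This is a minimal sketch under stated assumptions: the patent does not name an inference algorithm or library, so a small collapsed Gibbs sampler stands in for LDA training here; the names `train_lda` and `build_topic_ensemble`, the hyperparameters `alpha` and `beta`, and the toy corpus are all hypothetical. The quality evaluation step is left out because the source does not specify a metric.

```python
import random

def train_lda(docs, k, alpha=0.1, beta=0.01, iters=30, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns k topics as word->weight dicts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * k for _ in docs]             # count of topic t in document d
    topic_word = [[0] * n_words for _ in range(k)]  # count of word w in topic t
    topic_total = [0] * k                           # total words assigned to topic t
    assign = []

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    for d, doc in enumerate(docs):                  # random initial assignment
        assign.append([rng.randrange(k) for _ in doc])
        for word, t in zip(doc, assign[d]):
            update(d, index[word], t, 1)

    for _ in range(iters):                          # Gibbs sweeps over every token
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = index[word]
                update(d, w, assign[d][i], -1)
                weights = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                           / (topic_total[t] + n_words * beta) for t in range(k)]
                assign[d][i] = rng.choices(range(k), weights=weights)[0]
                update(d, w, assign[d][i], 1)

    return [{vocab[w]: (topic_word[t][w] + beta) / (topic_total[t] + n_words * beta)
             for w in range(n_words)} for t in range(k)]

def build_topic_ensemble(docs, topic_counts):
    """Train one model per topic count and pool all topics into one list."""
    ensemble = []
    for k in topic_counts:
        for j, topic in enumerate(train_lda(docs, k)):
            ensemble.append({"model_k": k, "topic_id": j, "words": topic})
    return ensemble

corpus = [["theft", "wallet", "station"], ["theft", "phone", "station"],
          ["fight", "bar", "injury"], ["fight", "street", "injury"]]
ensemble = build_topic_ensemble(corpus, topic_counts=range(2, 5))
```

Each pooled topic keeps a record of the model it came from, so a later quality filter or the projection view could still trace a topic back to its parameter setting.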

Step two: construction of the topic spatio-temporal body (Fig. 3). The time, space and text data in the alarm-receiving data are converted into a cube data structure. Specifically, a cube data structure is designed that supports query tasks; the discretized time, space and topic serve as the three dimensions of the data cube, and the semantic information extracted from the text data by the topic model is stored in its cells, so that the alarm-receiving log can be explored at the semantic spatio-temporal level.

In the alarm-receiving data the time and space attributes are continuous, whereas building a data cube requires discrete time and space information, so the continuous data are discretized. Time is divided by natural time units, and a reasonable minimum time precision is chosen according to the volume of police data; testing showed the hour to be the most suitable precision. At query time, a query range longer than an hour is split into hours.
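The hour-level discretization and the splitting of a longer query range can be illustrated as follows. This is a minimal sketch; the function names `hour_bucket` and `split_into_hours` and the half-open range convention are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def hour_bucket(ts):
    """Discretize a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

def split_into_hours(start, end):
    """List the hourly buckets covering the half-open range [start, end)."""
    buckets = []
    current = hour_bucket(start)
    while current < end:
        buckets.append(current)
        current += timedelta(hours=1)
    return buckets

hours = split_into_hours(datetime(2019, 12, 5, 22, 30), datetime(2019, 12, 6, 1, 0))
```

A query over the range is then answered by looking up each hourly bucket in the cube and combining the stored values.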

In terms of space, a location has two attributes, longitude and latitude, and regions can affect one another, so position information should be preserved in storage and the spatial attribute is represented by two dimensions in the data cube. One option is to partition space into subspaces and aggregate the data by subspace. The difficulty is choosing the subspace size: if the subspaces are too large the resolution is too low, and if they are too small a great deal of storage is consumed. Because alarm-receiving log records are unevenly distributed in space, a quadtree is used to store them.
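A quadtree adapts cell size to record density, which is the property the paragraph above relies on. The following is a minimal sketch of a point quadtree; the class name, the `capacity` and `max_depth` parameters and the toy coordinates are hypothetical, and the patent does not describe its own quadtree implementation.

```python
class QuadTree:
    """Point quadtree over a lon/lat bounding box; a leaf splits into four
    children once it holds more than `capacity` records."""

    def __init__(self, x0, y0, x1, y1, capacity=4, max_depth=8):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.max_depth = max_depth
        self.points = []      # records stored in this leaf
        self.children = None  # four sub-quadrants after a split

    def insert(self, lon, lat, record):
        if self.children is not None:
            self._child_for(lon, lat).insert(lon, lat, record)
            return
        self.points.append((lon, lat, record))
        if len(self.points) > self.capacity and self.max_depth > 0:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.max_depth - 1)
        self.children = [QuadTree(x0, y0, mx, my, *args), QuadTree(mx, y0, x1, my, *args),
                         QuadTree(x0, my, mx, y1, *args), QuadTree(mx, my, x1, y1, *args)]
        points, self.points = self.points, []
        for lon, lat, record in points:  # push stored records down one level
            self._child_for(lon, lat).insert(lon, lat, record)

    def _child_for(self, lon, lat):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if lon >= mx else 0) + (2 if lat >= my else 0)]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

tree = QuadTree(0.0, 0.0, 16.0, 16.0, capacity=2)
for i, (lon, lat) in enumerate([(1, 1), (1.5, 1.2), (1.2, 1.8), (12, 12), (14, 3)]):
    tree.insert(lon, lat, f"record-{i}")
```

In this toy run the three clustered points force several splits in one corner, while the two isolated points stay in large leaves, so storage is spent only where records are dense.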

Semantics are then extracted from the text in the alarm data: for a point P(p, s, t), the topic model is used to calculate the topic distribution value PS and the keyword weight kw, and both indexes are stored in the cell of the corresponding point of the semantic spatio-temporal body.

Calculating the topic distribution value PS(p, s, t). All records in space s and time t are collected to form the document subset D(s, t); the topic model then gives the value of each record on topic p, and the sum of these values over all records is the topic distribution value of the point P(p, s, t), as shown in formula (1), where d is a record in the document subset D(s, t) and $v_{dp}$ is the value of record d on topic p, which can be read from the document-topic distribution of the topic model.

$$PS(p,s,t) = \sum_{d \in D(s,t)} v_{dp} \qquad (1)$$

Calculating the keyword weight kw(p, s, t). The PS value obtained above is multiplied by $v_{pk}$, the weight of keyword k in topic p of the topic model, to give the weight of the keyword at the point P; $v_{pk}$ can be read from the topic-word distribution of the topic model. As formula (2) shows, if a topic has a high distribution value at the point and a keyword has a high weight in that topic, the kw value of the keyword at the point is high.

$$kw(p,s,t) = PS(p,s,t) \cdot v_{pk} \qquad (2)$$
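The two indexes can be computed directly from the document-topic and topic-word outputs of a topic model. The sketch below assumes a simple in-memory layout: `records`, `doc_topic`, `topic_word` and the dictionary used as the cube are hypothetical structures chosen for illustration, with toy values.

```python
def topic_distribution_value(records, doc_topic, p, s, t):
    """PS(p, s, t): sum of v_dp over all records d in the subset D(s, t)."""
    return sum(doc_topic[r["id"]][p] for r in records
               if r["region"] == s and r["hour"] == t)

def keyword_weight(ps_value, topic_word, p, k):
    """kw(p, s, t) = PS(p, s, t) * v_pk for keyword k of topic p."""
    return ps_value * topic_word[p][k]

records = [{"id": 0, "region": "A", "hour": 9},
           {"id": 1, "region": "A", "hour": 9},
           {"id": 2, "region": "B", "hour": 9}]
doc_topic = {0: [0.75, 0.25], 1: [0.5, 0.5], 2: [0.25, 0.75]}   # v_dp per record
topic_word = [{"theft": 0.5, "wallet": 0.25}, {"fight": 0.5}]   # v_pk per topic

ps = topic_distribution_value(records, doc_topic, p=0, s="A", t=9)  # 0.75 + 0.5
kw = keyword_weight(ps, topic_word, p=0, k="theft")

# One cell of the cube, keyed by (topic, region, hour), stores both indexes.
cube = {(0, "A", 9): {"PS": ps, "kw": {"theft": kw}}}
```

Precomputing every cell in this way is what allows later queries to be answered by lookup and summation rather than by rerunning the topic model.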

The user selects topics interactively and enters a query range through the visual interface; according to the query conditions, the system looks up the results stored in the topic spatio-temporal body at the corresponding scale and displays them visually (Fig. 3, step ⑤).

Step three: visual interaction and prediction (Fig. 4). Visual interactive exploration is provided for the text data cube. The topic projection view is used to select the topics of interest interactively; the user can pick a subset of topics in it for subsequent operations. The spatio-temporal projection view comprises two sub-views, data projection and hot-area ranking; data projection displays the query result on a map according to the time, space and topic conditions selected by the user, so that the distribution over regions can be examined. The pattern comparison view stores the user's query results and shows them in detail in three line charts of different time precision, so that the user can compare queries and find the patterns in which cases occur under different query conditions. The invention also provides a semantic-level prediction method, which converts a region's historical topic values into a time series and uses the LSTM [1] method to predict how the values change over a future period.
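For the prediction part, the gating arithmetic of a single LSTM cell with a forget gate, in the style of reference [1], can be illustrated as follows. This is only a sketch of one forward step on scalar values: the patent does not describe its network architecture, the weight layout `w` and the function name `lstm_step` are hypothetical, and a practical system would train a full network with a deep learning framework.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One forward step of a scalar LSTM cell with a forget gate.
    w[gate] holds (input weight, recurrent weight, bias) for that gate."""
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))      # how much of the old cell state to keep
    i = sigmoid(pre("input"))       # how much of the candidate to write
    o = sigmoid(pre("output"))      # how much of the cell state to expose
    cand = math.tanh(pre("cell"))   # candidate cell content
    c = f * c_prev + i * cand
    h = o * math.tanh(c)
    return h, c

# Untrained all-zero weights, used only to make the arithmetic easy to follow.
weights = {gate: (0.0, 0.0, 0.0) for gate in ("forget", "input", "output", "cell")}
h, c = lstm_step(x=0.3, h_prev=0.0, c_prev=1.0, w=weights)
```

With all-zero weights every gate evaluates to 0.5 and the candidate to 0, so the cell simply halves its previous state; training adjusts the weights so that the gates retain or discard history as the series requires.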

LDA topic selection view.

The user first browses and selects topics of interest in the topic projection view, which applies dimension-reduction projection to the topics of the integrated topic models. The distance between topics is calculated from the similarity of the words in each topic, and collision detection [10] is added to the distance calculation so that topics do not overlap in the projection. Each generated topic corresponds to a certain type of case, so cases can be classified by the selected topics. In this embodiment, 56 topics are obtained by varying the parameters; the user can freely select topics of interest among them, and the corresponding topic values are shown in the other views, where they are used to classify cases and to calculate the topic values of the corresponding types. The user selects a topic by clicking on it, and the selected topic and its topic words are shown in the candidate topic list below.
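The pairwise topic distances that feed the projection can be sketched as below. The source says only that distance follows from the similarity of the words in each topic, so cosine distance over word weights is an assumption here, as are the function names and toy topics. The dimension-reduction step itself (for example multidimensional scaling on this matrix) and the collision handling are not shown.

```python
import math

def cosine_distance(topic_a, topic_b):
    """1 - cosine similarity between two topics given as word->weight dicts."""
    shared = set(topic_a) & set(topic_b)
    dot = sum(topic_a[w] * topic_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in topic_a.values()))
    norm_b = math.sqrt(sum(v * v for v in topic_b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(topics):
    """Symmetric matrix of pairwise topic distances."""
    n = len(topics)
    return [[cosine_distance(topics[i], topics[j]) for j in range(n)] for i in range(n)]

topics = [{"theft": 0.6, "wallet": 0.4},
          {"theft": 0.5, "phone": 0.5},
          {"fight": 0.7, "injury": 0.3}]
dist = distance_matrix(topics)
```

Topics that share heavily weighted words end up close together, while topics with no words in common sit at the maximum distance of 1.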

The candidate topic list (Fig. 4(1)) shows word clouds for any number of topics selected by the user. When the user selects a topic in the projection view, its content is displayed as a word cloud; the content comes from the topic-word weights of the topic model, and the font size of a word represents its weight in the topic, which reflects the importance of the word to the topic. The user can remove unsatisfactory topics from the list and choose one of the remaining topics for the subsequent exploration of spatio-temporal distribution.
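One simple way to turn topic-word weights into word-cloud font sizes is a linear mapping, sketched below. The linear scale, the pixel range and the function name `font_sizes` are assumptions; the patent states only that font size represents the weight.

```python
def font_sizes(topic_words, min_px=12, max_px=48):
    """Map each word's weight linearly onto a font-size range in pixels."""
    lo, hi = min(topic_words.values()), max(topic_words.values())
    span = hi - lo
    sizes = {}
    for word, weight in topic_words.items():
        scale = (weight - lo) / span if span else 1.0  # equal weights share max size
        sizes[word] = round(min_px + scale * (max_px - min_px))
    return sizes

sizes = font_sizes({"theft": 0.5, "wallet": 0.25, "station": 0.125})
```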

Historical statistics view.

For the topic selected in the candidate topic list, the spatio-temporal distribution of the corresponding topic values is projected in the spatio-temporal projection view (Fig. 4(2)). This view is used to explore how the topic distribution values of the alarm data are distributed in space and time, and the user can choose the query time freely. The time range of the projection is selected by adjusting the time axis below (Fig. 4(b)); by choosing start and end times the user restricts the statistics to the topic data of the specified period within the selected range. For each region, the system sums the values of the selected topic over the alarm data in the specified time range to obtain the topic value of the spatio-temporal query, and the results are displayed, ranked by size, in the regional topic value ranking chart (Fig. 4(c)).
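The per-region summation and ranking described above can be sketched as a filter over precomputed cube cells. The cube layout (a dictionary keyed by topic, region and hour) and the function name `rank_regions` are hypothetical, and the values are toy data.

```python
from collections import defaultdict
from datetime import datetime

def rank_regions(cube, topic, start, end):
    """Sum PS over the hourly cells of `topic` within [start, end) per region,
    then rank regions from highest to lowest topic value."""
    totals = defaultdict(float)
    for (p, region, hour), ps in cube.items():
        if p == topic and start <= hour < end:
            totals[region] += ps
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

cube = {
    (0, "A", datetime(2019, 12, 5, 9)): 1.25,
    (0, "A", datetime(2019, 12, 5, 10)): 0.5,
    (0, "B", datetime(2019, 12, 5, 9)): 2.0,
    (1, "B", datetime(2019, 12, 5, 9)): 3.0,   # other topic, ignored by the query
    (0, "B", datetime(2019, 12, 6, 9)): 4.0,   # outside the time range
}
ranking = rank_regions(cube, topic=0,
                       start=datetime(2019, 12, 5), end=datetime(2019, 12, 6))
```

A full scan is used here for brevity; an indexed cube would look up only the cells that fall inside the query range.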

If the user needs to compare several regions, the specified spatio-temporal query conditions can be saved as a template in the statistical comparison view (Fig. 4(3)). The view shows the temporal distribution of cases of the specified topic type in three line charts of different time precision: a distribution over the hours of the day, a distribution over the days of the week, and a distribution over the dates of the month. Fig. 4(3) shows the templates generated from two different spatio-temporal ranges stored in the system. The left side of the view shows the spatio-temporal query conditions of each template, and the right side shows the three line charts. In chart (d) the abscissa is the hour and the ordinate is the topic value, giving the distribution of topic values over the hours of the day; in chart (e) the abscissa is the day of the week and the ordinate is the topic value, giving the distribution over the days of the week; in chart (f) the abscissa is the date and the ordinate is the topic value, giving the distribution of topic values over the whole period.
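The three series behind the line charts can be derived from the same hourly cells by grouping on different parts of the timestamp. The sketch below assumes the query result is available as (timestamp, topic value) pairs; the function name `time_profiles` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def time_profiles(cells):
    """Aggregate (timestamp, topic value) pairs into the three series drawn
    as line charts: by hour of day, by day of week, and by calendar date."""
    by_hour = defaultdict(float)
    by_weekday = defaultdict(float)
    by_date = defaultdict(float)
    for ts, value in cells:
        by_hour[ts.hour] += value
        by_weekday[ts.weekday()] += value   # 0 = Monday ... 6 = Sunday
        by_date[ts.date()] += value
    return dict(by_hour), dict(by_weekday), dict(by_date)

cells = [(datetime(2019, 12, 5, 9), 1.0),    # a Thursday
         (datetime(2019, 12, 5, 21), 0.5),
         (datetime(2019, 12, 6, 9), 2.0)]    # a Friday
by_hour, by_weekday, by_date = time_profiles(cells)
```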

Text topic prediction.

Arranging the topic values of a region in the topic spatio-temporal body in time order gives a sequence of topic values that varies over time, the topic value time series, which is then analyzed for the subsequent prediction. The invention processes each sub-region independently: the daily distribution values of a sub-region's alarm records on the same topic are arranged in time order, giving a distribution value time series for each topic of the region. The LSTM model is then used to predict the processed topic time series for testing.
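Building the daily series and turning it into training samples can be sketched as follows. The 137-day length and the 133/4 split follow the evaluation described in this document; the cube layout, the start date, the synthetic values and the window length of 7 are assumptions made for illustration.

```python
from datetime import date, timedelta

def daily_series(cube, topic, region, first_day, n_days):
    """Topic values of one region in day order; days without records count as 0."""
    return [cube.get((topic, region, first_day + timedelta(days=i)), 0.0)
            for i in range(n_days)]

def sliding_windows(series, window):
    """Turn a series into (input window, next value) training samples."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]

# Synthetic cube with a weekly pattern, keyed by (topic, region, day).
start = date(2019, 1, 1)
cube = {(0, "A", start + timedelta(days=i)): float(i % 7) for i in range(137)}

series = daily_series(cube, topic=0, region="A", first_day=start, n_days=137)
train, test = series[:133], series[133:]
samples = sliding_windows(train, window=7)
```

Each sample pairs a week of history with the value of the following day, which is the usual input shape for a sequence model such as an LSTM.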

The invention uses root mean square error (RMSE), mean absolute error (MAE) and error rate as evaluation indexes. The evaluation data are police alarm data of city A covering 137 days; city A is divided into 16 × 16 sub-regions, and 6 topic models with 56 topics in total are generated for the performance evaluation. Six representative topics are selected; the topic value time series of these 6 topics in all sub-regions are the input of the model, with the first 133 days of each series as training data and the last 4 days as test data. To analyze the prediction results qualitatively, an error interval is set for each time series, and a prediction is considered accurate if the error between the predicted value and the true value lies within that interval. The invention sets the error interval to the standard deviation of the time series, calculated as in formula (4), where x_i is the sample value at time i, μ is the mean of the time series and N is the number of samples in the time series. A prediction is considered accurate when the absolute difference between the predicted value and the true value is smaller than the standard deviation. Finally, the predictions for all regions over the 4 days are counted, and the ratio of the number of accurate predictions C(correct) to the total number C(total) is taken as the accuracy. The accuracy of the prediction results is as follows:

TABLE 1 prediction results for each topic
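The evaluation indexes described above can be computed as in the following sketch. It assumes the population form of the standard deviation (dividing by N), which matches the symbols defined in the text but is not confirmed by a surviving formula; the function names and the sample values are hypothetical and are not results from the patent.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def std_dev(series):
    """Population standard deviation (divides by N); an assumption, see above."""
    mu = sum(series) / len(series)
    return math.sqrt(sum((x - mu) ** 2 for x in series) / len(series))

def accuracy(y_true, y_pred, history):
    """C(correct) / C(total): share of predictions whose absolute error is
    smaller than the standard deviation of the time series."""
    sigma = std_dev(history)
    correct = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) < sigma)
    return correct / len(y_true)

history = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative series
y_true = [5.0, 6.0, 7.0, 8.0]                        # last 4 days, illustrative
y_pred = [6.0, 6.0, 4.0, 9.0]

scores = {"rmse": rmse(y_true, y_pred),
          "mae": mae(y_true, y_pred),
          "accuracy": accuracy(y_true, y_pred, history)}
```

In the toy data three of the four predictions fall within one standard deviation of the true value, so they count as accurate under the rule stated above.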

The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.

References:

[1] Gers F A, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with LSTM. 1999.

[2] Wang X, Dou W, Ma Z, Villalobos J, Chen Y, Kraft T, Ribarsky W. I-SI: Scalable Architecture for Analyzing Latent Topical-Level Information From Social Media Data. Computer Graphics Forum. Oxford, UK: Blackwell Publishing Ltd, 2012, 31(3pt4): 1275-1284.

[3] Marcus A, Bernstein M S, Badar O, Karger D R, Madden S, Miller R C. Twitinfo: aggregating and visualizing microblogs for event exploration. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011: 227-236.

[4] Zhou X, Xu C. Tracing the spatial-temporal evolution of events based on social media data. ISPRS International Journal of Geo-Information, 2017, 6(3): 88.

[5] Chen S, Lin L, Yuan X. Social media visual analytics. Computer Graphics Forum, 2017, 36(3): 563-587.

[6] Xu P, Wu Y, Wei E, Peng T Q, Liu S, Zhu J J, Qu H. Visual analysis of topic competition on social media. IEEE Transactions on Visualization and Computer Graphics, 2013, 19(12): 2012-2021.

[7] Gray J, Chaudhuri S, Bosworth A, Layman A, Reichart D, Venkatrao M, Pellow F, Pirahesh H. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1997, 1(1): 29-53.

[8] Papanikolaou Y, Foulds J R, Rubin T N, Tsoumakas G. Dense distributions from sparse samples: improved Gibbs sampling parameter estimators for LDA. The Journal of Machine Learning Research, 2017, 18(1): 2058-2115.

[9] Ibrahim Y. Temporality, space and technology: time-space discourses of call centres. New Technology, Work and Employment, 2012, 27(1): 23-35.

[10] Fang Z W, Wan H G, Gao S M. A fast collision detection algorithm in image space. Journal of Computer-Aided Design & Computer Graphics, 2002, 14(9): 805-809.

Claims

1. A method for exploring semantic spatio-temporal patterns of text data based on the LDA model and the LSTM network, characterized by comprising the following steps: