KR101613397B1 - Method and apparatus for associating topic data with numerical time series - Google Patents
- Publication number
- KR101613397B1, KR1020150076402A
- Authority
- KR
- South Korea
- Prior art keywords
- time
- topic
- series
- data
- text data
Classifications
- G06F17/2745
- G: PHYSICS
- G06: COMPUTING; CALCULATING OR COUNTING
- G06F: ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/277
Abstract
Embodiments relate to a method and apparatus for associating time-series text data with time-series numerical data. The method, performed by a computing device, comprises: obtaining a data set including time-series text data and time-series numerical data that correspond to each other in time; and applying an Associative Topic Model (ATM) to the time-series text data and the time-series numerical data. Applying the ATM comprises calculating the trajectory of topic proportions over time from the time-series text data, and correlating the topic proportions along that trajectory with the time-series numerical data on the time axis.
Description
The disclosed technique relates to big data processing, and in particular to a technique for associating big-data-based time-series text data with time-series numerical data.
In recent years, as the amount of data generated and exchanged on the Internet has grown, big data mining techniques have been proposed that collect and analyze such online data to extract useful information. For example, studies are under way to forecast and prepare for economic conditions and stock price fluctuations by synthesizing and analyzing public information from social network services (SNS) such as Twitter or Facebook.
Typically, when a series of events occurs, various types of time series data are generated, such as information generated by individuals on the SNS, analysis articles on specialized sites, articles on media, and statistical numerical information. For example, a series of economic events affect not only stock market indices but also economic news streams.
Accordingly, methods for associating different types of data with each other have been proposed. For example, since 2010 some researchers have proposed techniques for predicting whether the next day's stock market will rise or fall from financial information extracted from big data collected on SNS. Bollen, for instance, published a paper that predicted the DOW index (that is, the Dow Jones Industrial Average, DJIA) using OpinionFinder and Google-Profile of Mood States (GPOMS). That paper analyzed Twitter-based big data with sentiment analysis systems such as OpinionFinder and GPOMS, and classified the up/down movement of the DOW index with 87% accuracy using Granger causality and a neural network classifier. Various other papers have been published, but there has been little research on professional analytical techniques that apply big data analysis technology in practice.
It is an object of the disclosed technique to provide a method and an apparatus for associating text data with time-series numerical data so as to improve understanding of a specific event.
In particular, the disclosed technique aims to provide a method of associating text data with time-series numerical data by means of a topic model that finds topics affected by both the time-series numerical data and the text data, namely an Associative Topic Model (ATM).
The disclosed technique also aims to provide a method and apparatus for associating text data and time-series numerical data that identify topics associated with time-series characteristics from various types of data and predict time-series numerical data with higher accuracy than an iterative model.
The above objects are provided by a method and apparatus for associating text data and time series numerical data provided according to embodiments.
A method of associating time-series text data and time-series numerical data, provided according to an aspect of the embodiments and performed by a computing device, includes obtaining a data set including time-series text data and time-series numerical data that correspond to each other in time, and applying an Associative Topic Model (ATM) to the time-series text data and time-series numerical data. Applying the ATM comprises calculating the trajectory of topic proportions over time from the time-series text data, and correlating the topic proportions along the trajectory with the time-series numerical data on the time axis.
The step of calculating the trajectory of the topic proportions over time from the time-series text data may be based on a time-series numerical variable generated from the same prior information as the topic proportions of a Dynamic Topic Model (DTM).
The step of correlating the topic proportions along the trajectory with the time-series numerical data on the time axis may further comprise estimating, by a variational method, the corpus-level state of the topic proportions at time t and the document-level state of the topic proportions at time t.
In addition, the step of correlating the topic proportions along the trajectory with the time-series numerical data on the time axis may further comprise using a Kalman filter to estimate the dynamics of the topic proportions in the corpus over time.
In addition, the step of obtaining the data set may include finding and removing stop words and terms that appear in the data set at a frequency below a predefined ratio.
An apparatus for associating time-series text data and time-series numerical data, provided according to another aspect of the embodiments, comprises: a data acquisition module for obtaining a data set of time-series text data and time-series numerical data that correspond to each other in time; and an Associative Topic Model (ATM) module for calculating the trajectory of topic proportions over time from the time-series text data and correlating the topic proportions along the trajectory with the time-series numerical data on the time axis.
In addition, the ATM module can be configured to calculate the trajectory of the topic proportions over time from the time-series text data based on a time-series numerical variable generated from the same prior information as the topic proportions of a Dynamic Topic Model (DTM).
The ATM module can also be configured to estimate, by a variational method, the corpus-level state of the topic proportions at time t and the document-level state of the topic proportions at time t in order to correlate the topic proportions along the trajectory with the time-series numerical data on the time axis.
The ATM module can further be configured to estimate the dynamics of the topic proportions in the corpus over time in order to correlate the topic proportions along the trajectory with the time-series numerical data on the time axis, and to use a Kalman filter for this estimation.
Further, the data acquisition module can be configured to find and remove stop words and terms that appear in the data set at a frequency below a predefined ratio.
According to another aspect of the embodiments, there is provided a computer-readable medium having a program recorded thereon. When executed by a computer, the program performs the steps of obtaining a data set comprising time-series text data and time-series numerical data that correspond to each other in time, and applying an Associative Topic Model (ATM) to the time-series text data and time-series numerical data. Applying the ATM comprises calculating the trajectory of topic proportions over time from the time-series text data, and correlating the topic proportions along the trajectory with the time-series numerical data on the time axis.
In addition, the step of calculating the trajectory of the topic proportions over time from the time-series text data may be based on a time-series numerical variable generated from the same prior information as the topic proportions of a Dynamic Topic Model (DTM).
Also, the step of correlating the topic proportions along the trajectory with the time-series numerical data on the time axis may comprise estimating, by a variational method, the corpus-level state of the topic proportions at time t and the document-level state of the topic proportions at time t.
In addition, the step of correlating the topic proportions along the trajectory with the time-series numerical data on the time axis may further comprise using a Kalman filter to estimate the dynamics of the topic proportions in the corpus over time.
Further, in the step of obtaining the data set, the program may perform a step of finding and removing stop words and terms that appear in the data set at a frequency below a predefined ratio.
An apparatus provided according to yet another aspect of the embodiments includes a memory for storing time-series text data and time-series numerical data that correspond to each other in time, and a processor. The processor is configured to apply an Associative Topic Model (ATM) to the data stored in the memory, that is, to calculate the trajectory of topic proportions over time from the time-series text data and to correlate the topic proportions along the trajectory with the time-series numerical data stored in the memory on the time axis.
The processor may further be configured to calculate the trajectory of the topic proportions over time from the time-series text data based on a time-series numerical variable generated from the same prior information as the topic proportions of a Dynamic Topic Model (DTM).
The processor may also be configured to estimate, by a variational method, the corpus-level state of the topic proportions at time t and the document-level state of the topic proportions at time t in order to correlate the topic proportions along the trajectory with the time-series numerical data on the time axis.
The processor may also be configured to estimate the dynamics of the topic proportions in the corpus over time in order to correlate the topic proportions along the trajectory with the time-series numerical data on the time axis, and to use a Kalman filter for this estimation.
An apparatus provided according to still another aspect includes a memory for storing time-series text data and time-series numerical data that correspond to each other in time, and a processor, wherein the processor comprises an Associative Topic Model (ATM) module configured to calculate the trajectory of topic proportions over time from the time-series text data and to correlate the topic proportions along the trajectory with the time-series numerical data stored in the memory on the time axis.
The features and advantages of the embodiments will become more apparent from the following detailed description based on the accompanying drawings.
According to the embodiments, it is possible to provide a method and apparatus for associating text data and time-series numerical data that can improve understanding of a specific event by associating text data with time-series numeric data.
The disclosed technique provides a method and apparatus that associate text data with time-series numerical data by means of a topic model that finds topics affected by both numerical and textual time-series data, namely an Associative Topic Model (ATM).
The disclosed technique also identifies topics associated with time-series characteristics from various types of data, and provides a method and an apparatus for associating text data and time-series numerical data that predict numerical time-series data with higher accuracy than an iterative model.
FIG. 1 is a schematic diagram showing a general dynamic topic model (DTM).
FIG. 2 is a schematic diagram showing an Associative Topic Model (ATM) according to an embodiment.
FIG. 3 is a flowchart showing the process of an ATM according to an embodiment.
FIG. 4 is a diagram showing the top eight associated topics obtained by applying ATM to stock returns according to an embodiment.
FIG. 5(a) is a graph showing the dynamics of the topic proportions among the results analyzed for stock returns in the example of FIG. 4.
FIG. 5(b) is a graph comparing the actual stock returns in the example of FIG. 4 with the stock returns predicted by ATM.
FIG. 6 is a schematic diagram showing the associated topics among the results analyzed for stock volatilities according to an embodiment.
FIG. 7(a) is a graph showing the dynamics of the topic proportions among the results analyzed for stock volatility in the example of FIG. 6.
FIG. 7(b) is a graph comparing the actual stock volatility in the example of FIG. 6 with the stock volatility predicted by ATM.
FIG. 8 is a schematic diagram showing the associated topics among the results analyzed for the Obama approval index according to an embodiment.
FIG. 9(a) is a graph showing the dynamics of the topic proportions among the results analyzed for the Obama approval rating in the example of FIG. 8.
FIG. 9(b) is a graph comparing the actual Obama approval rating in the example of FIG. 8 with the approval rating predicted by ATM.
FIG. 10 is a schematic diagram showing the associated topics among the results analyzed for stock returns according to an embodiment.
FIG. 11(a) is a graph showing the dynamics of the topic proportions among the results analyzed for stock returns in the example of FIG. 10.
FIG. 11(b) is a graph comparing the actual stock returns in the example of FIG. 10 with the stock returns predicted by ATM.
FIG. 12 is a schematic diagram showing the associated topics among the results analyzed for stock volatility according to an embodiment.
FIG. 13(a) is a graph showing the dynamics of the topic proportions among the results analyzed for stock volatility in the example of FIG. 12.
FIG. 13(b) is a graph comparing the actual stock volatility in the example of FIG. 12 with the stock volatility predicted by ATM.
FIG. 14 is a schematic diagram showing the associated topics among the results analyzed for the Obama approval index according to an embodiment.
FIG. 15(a) is a graph showing the dynamics of the topic proportions among the results analyzed for the Obama approval rating in the example of FIG. 14.
FIG. 15(b) is a graph comparing the actual Obama approval rating in the example of FIG. 14 with the approval rating predicted by ATM.
FIG. 16 is a graph showing the log likelihood among the results analyzed for stock returns according to an embodiment, comparing various existing models with ATM over the entire test period.
FIG. 17 is a graph comparing various existing models with ATM on the log likelihood estimated for the last test period, among the results analyzed for stock returns in the example of FIG. 16.
FIG. 18 is a graph showing the log likelihood among the results analyzed for stock volatility according to an embodiment, comparing various existing models with ATM over the entire test period.
FIG. 19 is a graph comparing various existing models with ATM on the log likelihood estimated for the last test period, among the results analyzed for stock volatility in the example of FIG. 18.
FIG. 20 is a graph showing the log likelihood among the results analyzed for the Obama approval rating according to an embodiment, comparing various existing models with ATM over the entire test period.
FIG. 21 is a graph comparing various existing models with ATM on the log likelihood estimated for the last test period, among the results analyzed for the Obama approval rating in the example of FIG. 20.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a schematic diagram showing a general dynamic topic model (DTM), and FIG. 2 is a schematic diagram showing an Associative Topic Model (ATM) according to an embodiment.
Referring to FIGS. 1 and 2, each circle represents a random variable. A gray filled circle is an observed variable, and an unfilled circle is a hidden variable. A large rectangle containing several random variables is called a plate, and the number specified in its corner (for example, D_t or N) means that the set of random variables contained in the plate is replicated that many times. An arrow indicates a statistical relationship, expressed by a probability distribution, between the two variables it connects.
The ATM (see FIG. 2) learns topics extracted from time-series text data that are associated with time-series numerical values. It computes the trajectory of topic proportions over time, where the topics are those correlated with the numerical variable over time. ATM can be viewed as a combination of a Kalman filter and latent Dirichlet allocation (LDA), and it resembles the DTM (see FIG. 1) in several respects. The DTM models both the dynamics of the topics over time and the dynamics of how frequently each topic appears. ATM differs from the DTM in that 1) it adds a time-series variable influenced by the topic proportions, and 2) it simplifies the word distributions of the topics.

Because ATM must include a time-series numerical variable, a simple link from the topic proportions to the numerical variable is needed. To this end, the latent topics extracted from the corpus are assumed to generate the numerical values as well as the words in the corpus. To make the topics easier to interpret, a single set of topic word distributions is used within the period, which is a simplification of the DTM. In ATM, the corpus-level topic proportion at time t, $\alpha_t$, follows a Gaussian distribution and influences both the numerical variable $y_t$ and the document-level topic proportion $\theta_{t,d}$.

FIGS. 1 and 2 are graphical representations of the DTM and ATM. Compared to the DTM, ATM has the additional variable $y_t$. In both the DTM and ATM, $\alpha_t$ is modeled as a Gaussian distribution and represents the dynamics of the frequency of occurrence of the topics over time. Here, $\alpha_t$ is the prior mean of the document-level topic proportion $\theta_{t,d}$, which has the same type of distribution, as shown in Equations (1) and (2):

$$\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \sigma^2 I) \quad (1)$$

$$\theta_{t,d} \mid \alpha_t \sim \mathcal{N}(\alpha_t, a^2 I) \quad (2)$$
Here, $I$ is the K-dimensional identity matrix. Each of the above covariance matrices is modeled as a scalar matrix to reduce the computational cost. Within a document, the topic $z_{t,d,n}$ of each word is assigned according to the topic proportion $\theta_{t,d}$ of the document, as expressed in Equation (3):

$$z_{t,d,n} \mid \theta_{t,d} \sim \mathrm{Mult}(\pi(\theta_{t,d})) \quad (3)$$
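In Equation (3), $z_{t,d,n}$ is the topic of the n-th word in document d at time t. To turn the Gaussian-distributed vector into the parameter of a multinomial distribution, the softmax function $\pi(\cdot)$ defined in Equation (4) is used:

$$\pi(x)_k = \frac{\exp(x_k)}{\sum_{j=1}^{K} \exp(x_j)} \quad (4)$$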
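The softmax step can be illustrated with a minimal NumPy sketch. The number of topics, the standard deviation, and all variable names below are illustrative assumptions rather than values from the embodiment.

```python
import numpy as np

def softmax(x):
    """Map a real-valued K-vector to proportions that sum to 1."""
    shifted = x - np.max(x)   # subtracting the max avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
K = 4                            # hypothetical number of topics
corpus_state = np.zeros(K)       # hypothetical corpus-level state at time t
doc_state = rng.normal(corpus_state, 0.5)      # Gaussian document-level state
proportions = softmax(doc_state)               # multinomial parameter
topic_of_word = rng.choice(K, p=proportions)   # topic drawn for one word
```

Only the relative sizes of the components of `doc_state` matter here: adding a constant to every component leaves `proportions` unchanged.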
In the softmax function, the relative magnitude of each component of $\theta_{t,d}$ determines the probability that the corresponding topic is chosen for a word. Each word is then drawn from the vocabulary according to the word distribution of its assigned topic. Up to this point, ATM's modeling approach is very similar to that of the DTM.

The difference between ATM and the DTM is the time-series numerical variable $y_t$. So that $y_t$ is generated from the same prior information, and uses the same information as the text data does when topics are assigned to words, the same softmax function as in Equation (4) is applied to the corpus-level state $\alpha_t$. Equation (5) models the relationship between $\alpha_t$ and $y_t$:

$$y_t \mid \alpha_t \sim \mathcal{N}\left(b^\top \pi(\alpha_t), \delta^2\right) \quad (5)$$
Here, $b$ is a K-dimensional vector of linear coefficient parameters. Previous studies that integrate word-level features into the DTM use a simple linear combination of Gaussian variables. In a similar manner, the corpus-level numerical feature $y_t$ could be modeled as a linear combination of the corpus-level state $\alpha_t$. However, the raw magnitude of $\alpha_t$ is not what generates the corpus. To model an influence in both directions, $\alpha_t$ is rescaled in the same way as in the topic selection procedure, and the mean of $y_t$ is the linear combination of the rescaled topic proportion $\pi(\alpha_t)$ and $b$. This assumption is the same as a linear regression that uses the topic proportions as independent variables.

In linear regression, the appropriate number of coefficients depends on the number of observed values, so as to avoid over-fitting. Likewise, an appropriate number of topics must be selected to prevent over-fitting. In Equation (5), the variance of $y_t$ acts as a factor controlling how strongly the numerical values are fitted by the topic proportions. Learning this variance means that ATM accepts an appropriate level of time-series error when explaining the relationship between the text and the numerical values. If, on the other hand, the variance is fixed to a small value, the dynamics of the topic proportions inferred by ATM become highly related to $y_t$. That is, with a small fixed variance, ATM finds only those topics that closely explain the trajectory of the numerical time series during the training period. Hereinafter, an ATM with this setting is referred to as a fixed-ATM.

FIG. 3 is a flowchart showing the process of the ATM according to the embodiment; it shows the generative process that summarizes the assumptions described above.

Posterior inference in ATM is intractable. A variational method is therefore used to approximate the posterior of the ATM. The idea of variational methods is to optimize the free parameters of a distribution over the latent variables by minimizing the Kullback-Leibler (KL) divergence to the true posterior.
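The generative assumptions summarized above can be sketched as a small simulation. This is a minimal illustration, not the embodiment's implementation: the function name `simulate_atm`, the parameter names `sigma`, `a`, and `delta`, the Dirichlet draw for the topic word distributions, and the Gaussian draw for the coefficient vector are all assumptions made only so the example runs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def simulate_atm(T, D, N, K, V, sigma=0.1, a=0.1, delta=0.1, seed=0):
    """Simulate T time steps, D documents per step, N words per document."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.ones(V), size=K)  # one fixed word distribution per topic (assumed prior)
    b = rng.normal(size=K)                    # regression coefficients (assumed values)
    alpha = np.zeros(K)                       # corpus-level state before the first step
    docs, y = [], []
    for _ in range(T):
        alpha = rng.normal(alpha, sigma)      # Gaussian random walk of the corpus-level state
        step_docs = []
        for _ in range(D):
            theta = rng.normal(alpha, a)      # document-level state around the corpus state
            z = rng.choice(K, size=N, p=softmax(theta))              # topic per word
            words = np.array([rng.choice(V, p=beta[k]) for k in z])  # word per topic
            step_docs.append(words)
        docs.append(step_docs)
        # the numerical value is a noisy linear function of the rescaled corpus state
        y.append(rng.normal(b @ softmax(alpha), delta))
    return docs, np.array(y)

docs, y = simulate_atm(T=3, D=2, N=5, K=3, V=10)
```

The same `alpha` drives both the words and the numerical value at each step, which is the link that lets text help explain the numerical series.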
In ATM, the latent variables are 1) the corpus-level latent state of the topic proportion, 2) the document-level latent state of the topic proportion, and 3) the topic indicator of each word. Equation (6) below represents the factorization assumed for the variational distribution.
Here, the free parameters are the variational parameters of the assumed variational distribution over the latent variables. The variational distributions of the document-level state and of the topic indicator are fully factorized. The variational distribution of the corpus-level state, however, is conditioned on Gaussian variational observations and retains the sequential structure of the corpus-level topic representation. In the DTM, a variational Kalman filter model is used to represent topic dynamics. ATM likewise uses a variational Kalman filter model to estimate the dynamics of the topic proportions in the corpus over time. The main idea of a variational Kalman filter is that the variational observations are treated like the observations in a standard Kalman filter, and the posterior distribution of the latent state in the standard Kalman filter model is taken as the variational distribution.

The variational Kalman filter according to the embodiment is expressed by the following Equations (7) and (8).
Using standard Kalman filter calculations, the forward mean and variance are given by Equation (9) below, together with fixed initial conditions.
The backward mean and variance are given by Equation (10) below.
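The forward and backward passes can be illustrated with a textbook scalar Kalman filter and smoother for a Gaussian random walk. This is a generic sketch under stated assumptions, not a reconstruction of Equations (9) and (10): the function name and the parameters `q` (state variance), `r` (observation variance), `m0`, and `v0` are hypothetical.

```python
import numpy as np

def kalman_smooth(obs, q, r, m0=0.0, v0=1.0):
    """Forward filter and backward smoother for x_t = x_{t-1} + noise(q), obs_t = x_t + noise(r)."""
    T = len(obs)
    m = np.zeros(T)   # forward (filtered) means
    v = np.zeros(T)   # forward (filtered) variances
    prev_m, prev_v = m0, v0
    for t in range(T):
        p = prev_v + q                    # predicted variance after one random-walk step
        gain = p / (p + r)                # Kalman gain
        m[t] = prev_m + gain * (obs[t] - prev_m)
        v[t] = (1.0 - gain) * p
        prev_m, prev_v = m[t], v[t]
    ms, vs = m.copy(), v.copy()           # backward pass starts from the last filtered state
    for t in range(T - 2, -1, -1):
        p = v[t] + q
        j = v[t] / p                      # smoother gain
        ms[t] = m[t] + j * (ms[t + 1] - m[t])
        vs[t] = v[t] + j ** 2 * (vs[t + 1] - p)
    return m, v, ms, vs

m, v, ms, vs = kalman_smooth(np.ones(20), q=0.01, r=0.1)
```

In a model with K topics and scalar covariance matrices, a recursion of this kind can be run independently for each of the K dimensions.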
The backward quantities in Equation (10) are derived from the variational Kalman filter calculations.

Using this variational posterior and Jensen's inequality, a lower bound on the log likelihood is obtained, as shown in Equation (11).
This bound contains four expectation terms over the observed data and latent states. The first term relates to the latent state shared by both data sources. The second and third expectation terms are associated with the text data. The fourth term is associated with the continuous time-series data. The first term on the right side of Equation (11) is expressed by Equation (12).
Equation (12) uses the Gaussian quadratic form identity of Equation (13) below.
The second term on the right side of Equation (11) is shown in Equation (14).
The third term on the right side of Equation (11) is expressed by Equation (15).
Equation (15) contains an expectation of the softmax function whose input variables follow a Gaussian distribution. The closed form of this expectation term cannot be computed, but a lower bound on it can be found. Using this lower bound preserves the lower bound on the log likelihood.

The fourth term on the right side of Equation (11) is expressed by Equation (16).
Because the regression on the numerical values is modeled from discretized topic proportions, the softmax function is applied to the corpus-level state. The rationale for this discretization is that the topics extracted from the text must influence the regression, so the same softmax processing is applied to both the document-level and the corpus-level states. The discretization for topic extraction in the DTM combines a multinomial and a Gaussian distribution, whereas ATM differs in that it combines two Gaussian distributions. Combining two Gaussian distributions through a softmax function makes the expectation non-trivial to compute. In Equation (16), finding a lower bound in terms of the variational parameters is intractable because of the non-concavity caused by the opposite signs within this expectation term. In addition, the closed form of the expectation cannot be computed exactly because of log-normality and the difficulty of handling the ratio of two random variables. ATM therefore uses an approximation that computes local expectations using Taylor expansions of the ratio. Inference for a probabilistic graphical model with an approximate expectation of a softmax function under a Gaussian prior is a distinctive feature of ATM. The expectation of a single softmax function is expressed in Equation (17).
In Equation (17), a new symbol is introduced. The combined expectation of the two softmax functions is consequently approximated as follows.
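The difficulty of taking the expectation of a softmax under a Gaussian input can be seen in a small numerical sketch. The approximation below is a simple first-order stand-in, assumed here for illustration only and not the approximation of Equations (17) onward: it uses the exact log-normal mean E[exp(x_k)] = exp(mean_k + var_k / 2) and then replaces the expectation of the ratio by the ratio of expectations. All function and variable names are hypothetical.

```python
import numpy as np

def approx_expected_softmax(mean, var):
    """First-order approximation of E[softmax(x)] for independent Gaussian components."""
    log_num = mean + 0.5 * var            # log of the log-normal mean of exp(x_k)
    log_num = log_num - np.max(log_num)   # stabilize before exponentiating
    num = np.exp(log_num)
    return num / num.sum()                # ratio of expectations, not expectation of ratio

def monte_carlo_expected_softmax(mean, var, n=200_000, seed=0):
    """Reference estimate of E[softmax(x)] by sampling."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mean, np.sqrt(var), size=(n, len(mean)))
    x -= x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return (e / e.sum(axis=1, keepdims=True)).mean(axis=0)

mean = np.array([0.5, -0.2, 0.1])
var = np.array([0.05, 0.10, 0.02])
approx = approx_expected_softmax(mean, var)
reference = monte_carlo_expected_softmax(mean, var)
gap = np.max(np.abs(approx - reference))
```

The gap between `approx` and `reference` grows with the variances, which is why higher-order Taylor terms are useful when the variational variances are not small.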
The last term of Equation (11) is an entropy term and is expressed by the following Equation (20).
Using the expectation terms described above, an approximate lower bound on the log likelihood can be found.
Model parameter learning
By using the variational distribution, which approximates the posterior distribution of the latent variables, update equations for the model parameters can be found. The topic representation, that is, the word distribution of each topic, is updated as shown in Equation (21).
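Here, the indicator function equals 1 when the word in question matches the given vocabulary term and is 0 otherwise. The variance parameters for the text data and for the numerical time-series data may be updated as shown in Equation (22) and Equation (23), respectively.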
The document-level latent state of each topic proportion is distributed with its own variance around the time-series latent state of the topic proportion, while the numerical time-series variable is derived from the same latent state with a separate variance. By learning these variances, the degree of association between the text data and the time-series numerical data can be found. A low variance of the numerical variable means that the two data sources (that is, the text data and the time-series numerical data) are strongly correlated and that the text data can help predict the time-series variable. A high value, on the other hand, means that the two data sources are not related to each other. If the variance is fixed at a low value, ATM tends to learn topics with high explanatory power for the trajectory of the time-series values in the training data set.

As mentioned above, the mean of the numerical variable is the linear combination of the rescaled topic proportion and the coefficient vector. The coefficient vector is updated to maximize the lower bound of the log likelihood. Equation (24) below is the update equation, involving the softmax function, for inferring the coefficient vector.
In Equation (24), the terms are matrices defined element-wise over the time steps of the training period. This update equation is similar to that of the Gaussian response in supervised latent Dirichlet allocation (sLDA).
Prediction
After all the parameters have been learned, ATM can be used as a prediction model that takes newly observed text data and predicts the future value of the time-series variable. The prediction can be expressed by the following Equation (25).
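The regression view of the prediction can be sketched as follows. This is a simplified illustration that uses point estimates of the corpus-level state instead of variational expectations, and ordinary least squares instead of the update in Equation (24); the function names and the synthetic data are assumptions.

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax of a (T, K) array."""
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fit_coefficients(alpha_train, y_train):
    """Least-squares coefficients with rescaled topic proportions as regressors."""
    X = softmax_rows(alpha_train)
    b, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    return b

def predict(alpha_new, b):
    """Predicted numerical value for one new corpus-level state."""
    return softmax_rows(alpha_new[None, :])[0] @ b

rng = np.random.default_rng(1)
alpha_train = rng.normal(size=(50, 3))        # hypothetical inferred states, 50 steps, 3 topics
b_true = np.array([1.0, -2.0, 0.5])           # hypothetical coefficients for the synthetic data
y_train = softmax_rows(alpha_train) @ b_true  # noise-free synthetic targets
b_hat = fit_coefficients(alpha_train, y_train)
y_next = predict(alpha_train[0], b_hat)
```

Because the regressors are proportions that sum to 1, each coefficient can be read as the numerical value expected when the corresponding topic dominates the corpus at that time step.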
To compute the expectation of the softmax function for the new time step, the posterior distribution of the corpus-level state is inferred using the posterior distributions of the new documents and the learned model parameters. The inference at this stage is based on Equation (11) without the fourth expectation term. After sufficient iterations of variational inference, the resulting expectation can be used to predict the numerical value of the next time step.

An apparatus and method for associating text data and time-series numerical data using the ATM described above are provided according to an embodiment.
The apparatus for associating text data and time-series numerical data using ATM can be implemented as a computing device. The computing device includes, without limitation, any apparatus having a processor for performing data processing and a memory for storing programs and data, such as a personal computer, a server computer, a desktop, a laptop, or a palmtop. The computing device may be a single independent device, or it may be implemented as a distributed computing system in which a plurality of devices connected by a data communication network cooperate with one another.
In an embodiment, the text data from which topics are extracted may be collected from SNS such as Twitter and Facebook. In another embodiment, text data may also be collected from sources such as shopping malls, newsgroups, and news media. The numerical time-series data is a series of numerical values announced at regular time intervals. The text data and the time-series numerical data share the same total collection period, and their collection periods per time unit correspond to each other. For example, if the time-series numerical data is the closing stock return published every Friday and collected over a total period of one year, it is time-series data with a period of one week. The corresponding text data is likewise collected over the one-year total period as time-series data with a period of one week. In this case, the text data collected from the previous Saturday through Friday forms one unit. That is, the set of texts collected between the times at which the numerical data y(t-1) and y(t) are generated corresponds to y(t).
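The alignment rule above, in which texts collected after y(t-1) and up to y(t) belong to y(t), can be sketched with the standard library. The function name, the input layout, and the sample dates are illustrative assumptions.

```python
from bisect import bisect_left
from datetime import date

def group_by_period(docs, release_dates):
    """Assign each (day, text) pair to the first release date on or after its day.

    release_dates must be sorted. Texts dated after the last release date have
    no matching numerical value yet and are skipped. Texts dated before the
    first release date fall into the first group.
    """
    groups = [[] for _ in release_dates]
    for day, text in docs:
        idx = bisect_left(release_dates, day)   # first release date >= day
        if idx < len(release_dates):
            groups[idx].append(text)
    return groups

release_dates = [date(2015, 1, 2), date(2015, 1, 9)]   # two consecutive Fridays
docs = [
    (date(2015, 1, 2), "a"),    # Friday: belongs to the first value
    (date(2015, 1, 3), "b"),    # Saturday: starts the next unit
    (date(2015, 1, 9), "c"),    # next Friday: closes that unit
    (date(2015, 1, 10), "d"),   # no numerical value released yet
]
groups = group_by_period(docs, release_dates)
```

Using `bisect_left` keeps a text published on the release day itself in that day's unit, matching the Saturday-through-Friday grouping.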
The text data and the time series numerical data thus collected may be stored in a memory or a hard disk built in the computing device and then provided to be usable by the processor. Big data collected in other manners may be stored in an optical disk or a portable memory, and then provided to the computing device. Big data collected in another manner may be stored in another remote computing device or a cloud server, and then provided to be usable through a data communication network such as the Internet.
Meanwhile, the method of associating text data and time-series numerical data using ATM can be performed in corresponding modules of such a device. For example, it may be implemented as a software program that is installed on a general-purpose computing device including a processor and a memory and is executed by the processor of that device.
Experiment
Hereinafter, actual examples of associating text time series data with time series numerical data using ATM according to the embodiment are exemplified. Illustrated are examples of ATMs applied to financial news corpus (textual data) and stock price indexes (numerical data), examples of news corpus related to Obama (textual data) and President Obama's approval rating (numerical data). Table 1 shows the data set used in this experiment.
As shown in Table 1. Experimental Example 1 is a data set in which text data obtained by collecting financial news articles of Bloomberg for 120 weeks on a weekly basis and numerical data obtained by collecting stock returns of the weekly foreground Dow index corresponding thereto are stored in a data set do. Experimental Example 2 is the data set of the text data of Experimental Example 1 and numerical data obtained by collecting the changes of the weekly long-term Dow Index (stock volatilities) corresponding thereto. Experimental Example 3 shows that the gathering of news articles extracted from the Washington Post articles using President Obama's name in a weekly 284-week period, The collected numerical data is regarded as a data set.
For each of these data sets, the text data was randomly sampled, and the number of documents (i.e., news articles) collected per collection unit period, i.e., one week, was preset. For example, the Bloomberg articles were set at 500 per week, and the Washington Post articles did not exceed 100 per week. Before analysis by ATM, the text data was also processed to remove stop words, such as articles and pronouns that carry no significant meaning (for example, 'the' and 'I'm'). In addition, words appearing in only a small number of documents, such as person names that do not significantly affect the semantic distinction of each document, were removed for efficiency of analysis. Specifically, words appearing in fewer than 2% of the documents were removed.
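The preprocessing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the experiment: the function names, the whitespace tokenization, and the short `STOP_WORDS` set are hypothetical, and only the weekly cap and the 2% document-frequency threshold come from the text.

```python
import random
from collections import Counter

# Hypothetical stop-word list; the description names only 'the' and "I'm" as examples.
STOP_WORDS = {"the", "i'm", "a", "an", "it", "he", "she"}


def sample_week(docs, cap, seed=0):
    """Randomly keep at most `cap` documents collected in one week."""
    if len(docs) <= cap:
        return list(docs)
    return random.Random(seed).sample(docs, cap)


def build_vocabulary(docs, min_doc_ratio=0.02):
    """Keep words that are not stop words and that appear in at least
    `min_doc_ratio` of the documents (2% by default)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    min_docs = min_doc_ratio * len(docs)
    return {w for w, n in doc_freq.items()
            if w not in STOP_WORDS and n >= min_docs}


def filter_doc(doc, vocab):
    """Tokenize one document and drop every word outside the vocabulary."""
    return [w for w in doc.lower().split() if w in vocab]
```

In this sketch, `sample_week(articles, 500)` would correspond to the weekly limit mentioned for the Bloomberg articles, and `build_vocabulary` would be applied to the whole sampled corpus before the documents are passed on for topic modeling.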
By applying the above-described ATM to each of the above data sets, the text data was associated with numerical data such as stock returns, stock volatilities, and approval ratings. In these examples, the topics were initialized by randomly perturbing uniform topics, one parameter was set to the sample variance of the time series data, two further parameters were each set to 0.1, and one parameter was initialized to a zero vector with one entry per topic. The results are shown in FIGS. 4 to 9.
FIG. 4 is a schematic diagram showing the associated topics among the results analyzed for stock returns according to an embodiment using ATM, and FIG. 5 shows graphs of the topic proportion dynamics among the results analyzed for stock returns in the example of FIG. 4, together with a comparison of the actual stock returns and the stock returns predicted by ATM.
FIG. 6 is a schematic diagram showing the associated topics among the results analyzed for stock volatilities according to the embodiment using ATM, and FIG. 7 shows graphs of the topic proportion dynamics among the results analyzed for stock volatilities in the example of FIG. 6, together with a comparison of the actual stock volatilities and the stock volatilities predicted by ATM.
Meanwhile, FIG. 8 is a schematic diagram showing the associated topics among the results analyzed for the Obama approval rating according to an embodiment using ATM, and FIG. 9 shows graphs of the topic proportion dynamics among the results analyzed for the Obama approval rating in the example of FIG. 8, together with a comparison of the actual approval rating and the approval rating predicted by ATM.
FIGS. 4, 6, and 8 show the related topics, each of which is represented by eight words selected in order of occurrence frequency. The upper graphs of FIGS. 5, 7, and 9 show the topic proportion dynamics in the same order as the topics in FIGS. 4, 6, and 8, respectively. In these upper graphs, the color of each topic proportion represents the effect of the topic: blue, toward the top of the graph, indicates a positive effect, and red, toward the bottom, indicates a negative effect. FIGS. 5, 7, and 9 also compare the actual stock returns, stock volatilities, and approval ratings with the stock returns, stock volatilities, and approval ratings predicted by ATM, respectively.
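Representing each topic by its eight most frequent words can be sketched as below. This is only an illustration: the function name `top_words`, the array layout (one row of word weights per topic), and the use of NumPy are assumptions that do not come from the description.

```python
import numpy as np


def top_words(topic_word, vocab, n_top=8):
    """Return, for each topic, the `n_top` words with the largest weight.

    topic_word: array of shape (K, V), one row of word weights per topic
                (hypothetical layout).
    vocab:      list of V words; vocab[j] is the word for column j.
    """
    topic_word = np.asarray(topic_word)
    # Sort each row in descending order of weight and keep the first n_top columns.
    order = np.argsort(-topic_word, axis=1)[:, :n_top]
    return [[vocab[j] for j in row] for row in order]
```

Each returned list could then be drawn as one topic box, in the manner of the topic diagrams described above.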
Referring to FIGS. 4 to 7, which show the results of analyzing stock returns and stock volatilities, the results are derived from different time series, yet some related topics are shared because of the nature of the financial sector, for example topics on tax cuts and economic reports. Other related topics differ. For example, topics on Asia's energy and economy are related to stock returns, while topics on federal laws are related to stock volatilities. Because different topics were identified from the same text data, the topic proportion dynamics also differed considerably. These results qualitatively demonstrate that ATM identifies different topics associated with different time series values, even when the text data and the initial settings are the same. Comparing the actual and predicted values of stock returns and stock volatilities in FIGS. 5 and 7, it can be seen that ATM predicts stock volatilities better than stock returns. This result is consistent with the efficient market theory, which holds that stock prices and returns are difficult to predict. In the financial sector, forecasting stock volatility is a common task.
Referring now to FIGS. 8 and 9, the analysis of the text data and the approval rating data for President Obama is shown. The results show that topics related to family life are positively associated with the approval rating, whereas some topics, such as Romney party, war, and policy, are negatively associated. Other topics, such as education, tax, and agency, do not show high relevance. These results qualitatively demonstrate that ATM identifies reasonably relevant topics.
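The idea of a topic having a positive or negative effect on the numerical series can be illustrated with a deliberately simplified stand-in. ATM learns the association inside the model itself; the sketch below instead fits an ordinary least-squares regression of the series on given topic proportions and reads the sign of each weight. The function names, the tolerance, and the use of `numpy.linalg.lstsq` are assumptions made for illustration only.

```python
import numpy as np


def topic_effects(theta, y):
    """Least-squares weights b that minimise ||theta @ b - y||^2.

    theta: array of shape (T, K), topic proportions at each of T time steps.
    y:     array of shape (T,), the numerical time series.
    """
    theta = np.asarray(theta, dtype=float)
    y = np.asarray(y, dtype=float)
    b, *_ = np.linalg.lstsq(theta, y, rcond=None)
    return b


def label_effects(b, tol=1e-6):
    """Label each topic weight as positive, negative, or neutral."""
    return ["positive" if w > tol else "negative" if w < -tol else "neutral"
            for w in b]
```

With weights obtained this way, a topic labeled "positive" would play the role of the positively associated topics in the description, and a topic labeled "negative" the role of the negatively associated ones.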
As can be seen from the above, the technique of associating text data and numerical data using ATM according to the embodiment shows how these two kinds of data are related to each other, and finds related topics across the different types of data collected from two different sources.
ATM, on the other hand, can also be used to analyze existing data rather than to make predictions. In the ATM modeling process, the time series numerical data and DTM are integrated. In the time series modeling, the numerical values are generated from the hidden states with a Gaussian error. If this error is fixed, ATM finds only the relevant topics that describe the exact trajectory of the time series during the learning period. As already mentioned above, the ATM in this case is referred to as fixed-ATM.
For the application of fixed-ATM, the same data sets as shown in Table 1 were used. Also, the topics were initialized by randomly perturbing uniform topics, two parameters were each set to 0.1, and one parameter was initialized to a zero vector with one entry per topic. The number of topics was set to 10 for stock returns and approval ratings, and to 5 for stock volatilities. FIGS. 10 to 15 show these results.
FIG. 10 is a schematic diagram showing the associated topics among the results analyzed for stock returns according to the embodiment using fixed-ATM, and FIG. 11 shows graphs of the topic proportion dynamics among the results analyzed for stock returns in the example of FIG. 10, together with a comparison of the actual stock returns and the stock returns predicted by ATM.
FIG. 12 is a schematic diagram showing the associated topics among the results analyzed for stock volatilities according to the embodiment using fixed-ATM, and FIG. 13 shows graphs of the topic proportion dynamics among the results analyzed for stock volatilities in the example of FIG. 12, together with a comparison of the actual stock volatilities and the stock volatilities predicted by ATM.
FIG. 14 is a schematic diagram showing the associated topics among the results analyzed for the Obama approval rating according to the embodiment using fixed-ATM, and FIG. 15 shows graphs of the topic proportion dynamics among the results analyzed for the Obama approval rating in the example of FIG. 14, together with a comparison of the actual approval rating and the approval rating predicted by ATM.
FIGS. 10, 12, and 14 show the related topics, each of which is represented by eight words selected in order of occurrence frequency. The upper graphs of FIGS. 11, 13, and 15 show the topic proportion dynamics in the same order as the topics in FIGS. 10, 12, and 14, respectively. In these upper graphs, the color of each topic proportion represents the effect of the topic: blue, toward the top of the graph, indicates a positive effect, and red, toward the bottom, indicates a negative effect. FIGS. 11, 13, and 15 also compare the actual stock returns, stock volatilities, and approval ratings with the stock returns, stock volatilities, and approval ratings predicted by ATM, respectively.
In the fixed-ATM analysis shown, the topic proportion dynamics all change sharply because they must absorb the time series errors. Fixed-ATM is not useful for prediction, but it can be used to analyze topics that are highly relevant during the learning period. The results of applying ATM and of applying fixed-ATM may differ. For example, in the case of the stock returns shown in FIGS. 10 and 11, topics related to Bloomberg editors and stories showed the most negative impact, while in the case of ATM in FIGS. 4 and 5, the result was different.
Evaluation of prediction performance
To evaluate the predictive performance of ATM quantitatively, each model was trained on the preceding data and then used to predict the value at the next time step. For the data sets of Table 1, prediction was performed for 21 weeks in the case of stock returns and stock volatilities, and for 25 weeks in the case of approval ratings. The comparison models (AR, LDA-LR, IT-LDA, DTM-LR, IT-DTM) and ATM inferred five topics, and all models were set to have the same initial parameters and were approximated under the same criteria. For AR(p), p was chosen as the value that performed best on the test data. For the ITMTF model, the prior feedback loop was repeated three times with a confidence threshold of 70%. After prediction, the mean squared error (MSE) and the mean absolute error (MAE) were measured:

$$\mathrm{MSE} = \frac{1}{M}\sum_{m=1}^{M}\left(y_m - \hat{y}_m\right)^2, \qquad \mathrm{MAE} = \frac{1}{M}\sum_{m=1}^{M}\left|y_m - \hat{y}_m\right|$$

where $M$ is the number of test points, $y_m$ is the actual value, and $\hat{y}_m$ is the predicted value.
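The two error measures, MSE and MAE, follow their standard definitions and can be computed as in the short sketch below. The function names and the sample numbers in the usage note are hypothetical and are not values from the experiment.

```python
def _check(actual, predicted):
    """Both sequences must be non-empty and of equal length M."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")


def mean_squared_error(actual, predicted):
    """MSE: average of the squared differences over the M test points."""
    _check(actual, predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """MAE: average of the absolute differences over the M test points."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

For example, `mean_squared_error(weekly_actual, weekly_predicted)` would be called once per model over the 21 or 25 test weeks, where `weekly_actual` and `weekly_predicted` are placeholder names for the observed and predicted series.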
Table 2 above shows the predictive performance of the comparison models and the proposed model (ATM) in terms of MSE and MAE. Bold font marks the best model among those that analyze text and numbers simultaneously and thereby produce both interpretation and prediction. Underlined font marks cases in which a numeric-only model, such as AR, performs best. ATM has the best overall performance.
Performance comparison between ATM and the comparison models was also carried out using the log likelihood metric under the same conditions as the prediction performance evaluation, and the results are shown in FIGS. 16 to 21.
FIG. 16 is a graph comparing various existing models with ATM in terms of the log likelihood over the entire test period, among the results analyzed for stock returns in a performance test using the log likelihood metric according to the embodiment, and FIG. 17 is a graph comparing various existing models with ATM in terms of the log likelihood estimated for the last test period, among the results analyzed for stock returns in the example of FIG. 16.
FIG. 18 is a graph comparing various existing models with ATM in terms of the log likelihood over the entire test period, among the results analyzed for stock volatilities in a performance test using the log likelihood metric according to the embodiment, and FIG. 19 is a graph comparing various existing models with ATM in terms of the log likelihood estimated for the last test period, among the results analyzed for stock volatilities in the example of FIG. 18.
Meanwhile, FIG. 20 is a graph comparing various existing models with ATM in terms of the log likelihood over the entire test period, among the results analyzed for the Obama approval rating in a performance test using the log likelihood metric according to the embodiment, and FIG. 21 is a graph comparing various existing models with ATM in terms of the log likelihood estimated for the last test period in the example of FIG. 20.
Referring to FIGS. 16, 18, and 20, which show the results for the entire test period, all experimental models have similar values. These results show that ATM achieves similar likelihood performance despite the strong assumption that the two different sources are generated from the same hidden state.
Referring to FIGS. 17, 19, and 21, which show the results for the last unit period, ATM does not have the best performance. However, even though ATM incorporates the time series values into its probabilistic model, its performance is similar to that of the comparison models.
The new topic model according to the above embodiment, i.e., ATM, can find the relationship between numerical data and a corpus collected over time. ATM can be useful in a wide variety of applications in which varying numerical data and text data are collected from the crowd. For example, the model can be applied not only to political SNS messages and approval ratings, but also to product reviews and sales records.
Various modified configurations are possible by referring to and combining the features described herein. Accordingly, it should be pointed out that the scope of the embodiments is not limited to the described embodiments, but should be construed in accordance with the appended claims.
Claims (20)
Obtaining a data set including time-series text data and time-series numeric data corresponding to each other in time, and
And applying an associative topic model (ATM) to the time-series text data and the time-series numeric data corresponding to each other in time,
Wherein applying the ATM comprises: calculating a time-based trajectory of topic proportions from the time-series text data; and correlating topic ratios according to the trajectory with the time-series numerical data in a time axis,
Wherein calculating the time-based trajectory of the topic proportions from the time-series text data comprises calculating the trajectory on the basis of a time-series numerical variable generated from the same prior information as the topic proportions of a Dynamic Topic Model (DTM).
The method of associating time-series text data and time-series numerical data, wherein correlating the topic proportions according to the trajectory with the time-series numerical data on the time axis comprises comparing, by a variational method, the document-level state of the topic proportions at time t with the corpus-level state of the topic proportions at time t.
The method of associating time-series text data and time-series numerical data, wherein correlating the topic proportions according to the trajectory with the time-series numerical data on the time axis further comprises using a Kalman filter to estimate the dynamics of the topic proportions in the corpus over time.
The method of associating time-series text data and time-series numerical data, wherein obtaining the data set including time-series text data and time-series numerical data corresponding to each other in time comprises finding and removing, from the data set, stop words or terms that appear less frequently than a predefined ratio.
A data acquisition module for acquiring a data set including time-series text data and time-series numerical data corresponding to each other in time, and
And an ATM (Associative Topic Model) module for calculating a locus of the topic ratio with respect to time from the time series text data and correlating the topic ratios according to the locus with the time series numerical data on a time axis,
Wherein the ATM module is configured to calculate the time-based trajectory of the topic proportions from the time-series text data on the basis of a time-series numerical variable generated from the same prior information as the topic proportions of a dynamic topic model (DTM).
The apparatus for associating time-series text data and time-series numerical data, wherein the ATM module is configured to compare, according to a variational method, the document-level state of the topic proportions at time t with the corpus-level state of the topic proportions at time t in order to correlate the topic proportions according to the trajectory with the time-series numerical data on the time axis.
The apparatus for associating time-series text data and time-series numerical data, wherein the ATM module is further configured to estimate the dynamics of the topic proportions in the corpus over time in order to correlate the topic proportions according to the trajectory with the time-series numerical data on the time axis, and to use a Kalman filter for the estimation.
The apparatus for associating time-series text data and time-series numerical data, wherein the data acquisition module, which acquires the data set including time-series text data and time-series numerical data corresponding to each other in time, is configured to find and remove, from the data set, stop words or terms that appear less frequently than a predefined ratio.
Obtaining a data set including time-series text data and time-series numeric data corresponding to each other in time, and
And applying an associative topic model (ATM) to the time-series text data and the time-series numeric data corresponding to each other in time,
Wherein the step of applying the ATM comprises the steps of calculating a locus of the topic proportions in time from the time series text data and correlating the topic ratios according to the locus with the time series numerical data on a time axis,
Wherein, in the program, calculating the time-based trajectory of the topic proportions from the time-series text data comprises calculating the trajectory on the basis of a time-series numerical variable generated from the same prior information as the topic proportions of a dynamic topic model (DTM).
The computer-readable recording medium storing the program, wherein, in the program, correlating the topic proportions according to the trajectory with the time-series numerical data on the time axis comprises comparing, by a variational method, the document-level state of the topic proportions at time t with the corpus-level state of the topic proportions at time t.
The computer-readable recording medium storing the program, wherein, in the program, correlating the topic proportions according to the trajectory with the time-series numerical data on the time axis further comprises using a Kalman filter to estimate the dynamics of the topic proportions in the corpus over time.
The computer-readable recording medium storing the program, wherein, in the program, obtaining the data set including time-series text data and time-series numerical data corresponding to each other in time comprises finding and removing, from the data set, stop words or terms that appear less frequently than a predefined ratio.
A memory for storing time-series text data and time-series numeric data in a temporally corresponding manner;
A processor,
Wherein the processor calculates, according to an associative topic model (ATM), a time-based trajectory of topic proportions from the time-series text data stored in the memory, and correlates the topic proportions according to the trajectory with the time-series numerical data on a time axis,
And wherein the processor is configured to calculate the time-based trajectory of the topic proportions from the time-series text data on the basis of a time-series numerical variable generated from the same prior information as the topic proportions of a dynamic topic model (DTM).
The device for associating time-series text data and time-series numerical data, wherein the processor is configured to compare, according to a variational method, the document-level state of the topic proportions at time t with the corpus-level state of the topic proportions at time t in order to correlate the topic proportions according to the trajectory with the time-series numerical data on the time axis.
The device for associating time-series text data and time-series numerical data, wherein the processor is configured to estimate the dynamics of the topic proportions in the corpus over time in order to correlate the topic proportions according to the trajectory with the time-series numerical data on the time axis, and to use a Kalman filter for the estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150076402A KR101613397B1 (en) | 2015-05-29 | 2015-05-29 | Method and apparatus for associating topic data with numerical time series |
Publications (1)
Publication Number | Publication Date |
---|---|
KR101613397B1 (en) | 2016-04-18 |
Family
ID=55916954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150076402A KR101613397B1 (en) | 2015-05-29 | 2015-05-29 | Method and apparatus for associating topic data with numerical time series |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101613397B1 (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment | Payment date: 20190402; Year of fee payment: 4 |