KR20160136014A - Method and system for topic clustering of big data

Method and system for topic clustering of big data

Info

Publication number
KR20160136014A
Authority
KR
South Korea
Prior art keywords
level node
ghtm
big data
topic
domain knowledge
Prior art date
Application number
KR1020150069641A
Other languages
Korean (ko)
Inventor
Moon Il-Chul (문일철)
Shin Su-Jin (신수진)
Original Assignee
Korea Advanced Institute of Science and Technology (KAIST)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Korea Advanced Institute of Science and Technology (KAIST)
Priority to KR1020150069641A
Publication of KR20160136014A

Links

Images

Classifications

    • G06F17/30318
    • G06F17/2745
    • G06F17/30705

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method and apparatus for topic clustering of big data. A big data topic clustering method performed by a computing device comprises: obtaining big data; causing a user to input domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data with the domain knowledge as input.

Description

METHOD AND SYSTEM FOR TOPIC CLUSTERING OF BIG DATA

BACKGROUND OF THE INVENTION [0001]

The disclosed technique relates to a topic clustering technique for big data, and more particularly to an improved big data topic clustering method and apparatus based on a hierarchical topic model incorporating domain knowledge.

In recent years, as the amount of data generated and exchanged online has increased, big data mining techniques have been proposed which extract useful information by collecting and analyzing online data. For example, studies are under way to synthesize and analyze public opinion expressed on social network services (SNS) such as Twitter or Facebook, in order to anticipate and prepare for economic conditions and stock price fluctuations.

However, to extract financial information from the big data collected on SNS and analyze it to predict the economic situation or stock price fluctuations, the finance-related information must be extracted accurately. The technique of extracting topic information from big data is known as a topic model, and whether the correct information has been extracted is determined by whether the extracted topics are correctly clustered.

The topic model Latent Dirichlet Allocation (LDA) is well known as a probabilistic model for extracting latent topics from a large corpus. LDA extracts topics under the assumptions that a corpus contains many latent topics and that each document is a mixture of those topics. Here, a 'topic' is a probability distribution over words, not a short phrase that can be interpreted by humans. It is therefore difficult to interpret the topics extracted by the probabilistic model as having a meaning understood by human intuition, and hence hard to tell whether the accurate information has been extracted.

LDA observes only independent topics; the Hierarchical Topic Model (HTM) was proposed to extend it and observe topics connected by hierarchical relationships. The HTM has the advantage of showing categorized content, since the topics extracted from the corpus are related in a tree structure, that is, a hierarchical structure. However, the topic tree of the HTM was often different from the interpretation given by human intuition, because the HTM extracts only data-driven information that does not reflect human knowledge at all. Therefore, there is still a need for a topic model that makes the topics extracted from the corpus correspond to meanings understood by human intuition.

Accordingly, the disclosed technique is directed to a big data topic clustering method and apparatus, based on a hierarchical topic model incorporating domain knowledge, with improved accuracy of the hierarchical clustering of topics extracted from big data.

The disclosed technique is also directed to a method and apparatus for clustering big data topics using a Guided Hierarchical Topic Model (GHTM), which improves the interpretability of topics by letting the user define sets of words that reveal the characteristics of the categories the user wishes to see, and applying those sets to the HTM.

The disclosed technique further aims to provide a method and apparatus for clustering big data topics with improved accuracy of the hierarchical clustering of topics extracted from big data, by applying the Dirichlet Forest prior when incorporating domain knowledge into hierarchical topic modeling.

The object of the present invention is achieved by the big data topic clustering method and apparatus provided in accordance with the embodiments.

A big data topic clustering method provided in accordance with an aspect of the embodiments is performed by a computing device and comprises: obtaining big data; causing a user to input domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data with the domain knowledge as input.

Applying the GHTM may include creating a root level node and creating at least one sub-root level node.

Creating the root level node may include stochastically generating a topic distribution from a Dirichlet distribution at the root level node.

Creating the at least one sub-root level node may include generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.

Applying the GHTM may include applying the at least one category and the at least one seed word corresponding to each of the at least one category as the Dirichlet Forest prior.

Applying the GHTM may include setting a parameter of the GHTM, and setting the parameter of the GHTM may include causing the user to input the parameter or setting the parameter to a pre-stored value.

An apparatus for big data topic clustering provided in accordance with another aspect of the embodiments comprises: a data storage for storing big data; a domain knowledge input unit for receiving domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and a GHTM module configured to perform topic clustering by applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data based on the domain knowledge.

The GHTM module may be further configured to create a root level node and at least one sub-root level node.

The GHTM module may be further configured to stochastically generate a topic distribution from a Dirichlet distribution at the root level node and to generate a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.

The GHTM module may be further configured to apply the domain knowledge, including the at least one category and the at least one seed word corresponding to each of the at least one category, as the Dirichlet Forest prior.

The GHTM module may be further configured to set a parameter of the GHTM, and the parameter of the GHTM may be a value input from the user or a pre-stored value.

A big data topic clustering method provided in accordance with another aspect of the embodiments is performed by a computing device and comprises: obtaining big data; causing a user to input domain knowledge including at least one corresponding seed word; and applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data based on the domain knowledge.

Applying the GHTM may include creating a root level node and at least one sub-root level node.

Creating the root level node and the at least one sub-root level node may include stochastically generating a topic distribution from a Dirichlet distribution at the root level node and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.

A big data topic clustering method provided in accordance with another aspect of the embodiments is performed by a computing device and comprises: obtaining big data; causing a user to input domain knowledge including at least one corresponding seed word; and performing big data topic clustering by applying a domain-knowledge-applied GHTM to the big data based on the domain knowledge.

Performing the big data topic clustering may include creating a root level node and at least one sub-root level node.

Creating the root level node and the at least one sub-root level node may include stochastically generating a topic distribution from a Dirichlet distribution at the root level node and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.

A big data topic clustering method provided in accordance with another aspect of the embodiments is performed by a computing device and comprises: preparing to be able to access big data; causing a user to input domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and accessing the big data and performing big data topic clustering on the big data according to a domain-knowledge-applied GHTM based on the domain knowledge.

Performing the big data topic clustering may include creating a root level node and at least one sub-root level node.

Creating the root level node and the at least one sub-root level node may include stochastically generating a topic distribution from a Dirichlet distribution at the root level node and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.

The features and advantages of the embodiments will become more apparent from the following detailed description based on the accompanying drawings.

According to the embodiments, it is possible to provide a method and apparatus for clustering big data topics in which the accuracy of hierarchical clustering of topics extracted from big data is improved based on hierarchical topic models including domain knowledge.

According to the embodiments, it is possible to provide a method and apparatus for clustering big data topics using a Guided HTM (GHTM), which improves the interpretability of topics by defining sets of words that reveal the characteristics of the categories the user wishes to see and applying them to the HTM.

According to the embodiments, a hierarchical topic structure is provided, and domain knowledge about the corpus can be integrated. The hierarchical topic model groups similar topics by category, and intensive sampling of predefined keywords is enabled by using the domain knowledge. The embodiments use a hierarchical topic clustering model named by the inventors the Guided Hierarchical Topic Model (GHTM). The basis of the GHTM used in the examples is the HTM; compared to LDA, the HTM detects hierarchical topic structures as well as topics. In the GHTM, the domain knowledge is encoded in the prior of the hierarchical topic model, and by this adaptation of the prior a topic tree guided by the domain knowledge is obtained. Accordingly, it is possible to provide a method and an apparatus for clustering big data topics with improved accuracy of the hierarchical clustering of topics extracted from big data.

FIG. 1 is a schematic diagram showing a general Dirichlet distribution and a Dirichlet tree distribution.
FIG. 2 is a schematic diagram illustrating a hierarchical topic model applying the Dirichlet Forest prior according to one embodiment.
FIG. 3 is a schematic diagram illustrating a general hierarchical topic model (HTM).
FIG. 4 is a schematic diagram illustrating a hierarchical topic model applying the Dirichlet Forest prior, i.e., a Guided HTM (GHTM), according to one embodiment.
FIG. 5 is a block diagram illustrating a big data topic clustering apparatus using the GHTM according to an embodiment.
FIG. 6 is a flowchart illustrating a big data topic clustering method using the GHTM according to an embodiment.
FIG. 7 is a schematic diagram illustrating a topic cluster structure using a general HTM.
FIG. 8 is a schematic diagram illustrating a topic cluster structure using the GHTM according to one embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a schematic diagram showing a general Dirichlet distribution and a Dirichlet tree distribution.

The Dirichlet distribution shown in FIG. 1(a) is a Dirichlet distribution over the seven words A, B, C, D, E, F, and G. The Dirichlet tree distribution shown in FIG. 1(b) encodes the domain knowledge {{A}, {B, C}, {D, E, F}} over the words A, B, C, D, E, and F; the strength parameter of the domain knowledge (denoted η here) is assigned to the grouped word sets such as {D, E, F}.

In an embodiment, it is assumed that the user provides keywords to hierarchically separate the topics. The keywords provided by the user are seed words used to integrate domain knowledge, and the seed words are converted into the parameters of the Dirichlet tree distribution. As can be seen in FIG. 1, the Dirichlet distribution (FIG. 1(a)) is simply a Dirichlet tree distribution of depth one. When a split operation is applied to the Dirichlet tree distribution, the constraint is encoded in the distribution as illustrated in FIG. 1(b). The parameter η denotes the strength of the domain knowledge: the larger η is, the stronger the separation tendency.
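For concreteness, the split operation of FIG. 1(b) can be sketched in code. The following is a minimal illustration (not from the patent) of drawing a word distribution from a two-level Dirichlet tree, assuming NumPy; the grouping and the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dirichlet_tree(groups, beta=1.0, eta=50.0):
    """Draw a word distribution from a two-level Dirichlet tree.

    groups: list of word-index lists; each multi-word group is one set
    produced by a split operation.  Edges into a group carry weight
    beta * |group| (preserving marginal pseudo-counts), while edges
    inside a group carry the domain-knowledge strength eta.
    """
    n_words = sum(len(g) for g in groups)
    top = rng.dirichlet([beta * len(g) for g in groups])  # root -> group branches
    phi = np.zeros(n_words)
    for mass, group in zip(top, groups):
        inner = rng.dirichlet([eta] * len(group))         # within-group proportions
        for w, p in zip(group, inner):
            phi[w] = mass * p                             # multiply edge probabilities to the leaf
    return phi

# Words A..G indexed 0..6 with domain knowledge {{A}, {B, C}, {D, E, F}};
# the leftover word G forms its own singleton branch.
print(sample_dirichlet_tree([[0], [1, 2], [3, 4, 5], [6]]).round(3))
```

Because eta is large relative to beta, the words inside {B, C} or inside {D, E, F} receive similar shares of their group's mass in every draw, while the groups themselves can separate freely, which is the stronger separation tendency described above.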

FIG. 2 is a schematic diagram illustrating a hierarchical topic model applying the Dirichlet Forest prior according to one embodiment.

The LDA uses two types of Dirichlet priors: one for the document-topic distribution and the other for the topic-word distribution. Previously, an asymmetric Dirichlet prior that gives more weight to particular words was proposed. However, the asymmetric prior affects parameter inference for other words, and it cannot implement a correlation between two specific words. For example, if two words are strongly positively correlated, such a Dirichlet prior could not be applied without affecting other words. This problem is solved by the Dirichlet Forest prior, which is a collection of Dirichlet tree distributions. The Dirichlet tree distribution allows the encoding of complex constraints, i.e., split operations. The GHTM according to an embodiment applies the Dirichlet Forest prior to a hierarchical topic model.

FIG. 2 schematically illustrates a hierarchical topic model using such a Dirichlet Forest prior. Here, only the root node is associated with a plain Dirichlet prior, so that it captures the most common topic in the corpus regardless of domain knowledge. A Dirichlet Forest prior is assigned to the other nodes. The Dirichlet Forest prior holds a collection of Dirichlet tree distributions that encode the split operations, and thus it guides parameter estimation for the sub-trees.

FIG. 3 is a schematic diagram illustrating a general hierarchical topic model, i.e., an HTM, and FIG. 4 is a schematic diagram illustrating a hierarchical topic model using the Dirichlet Forest prior, i.e., a Guided HTM (GHTM), according to an embodiment.

Referring to FIGS. 3 and 4, each circle represents a random variable. A gray filled circle is an observed variable, and an unfilled circle is a hidden variable. A large rectangle containing several random variables is called a plate, meaning that the contained set of random variables is replicated the number of times specified in its corner (D, N_d, ∞). The arrows indicate a statistical relationship, represented by a probability distribution, between the two connected variables.

Here (symbol names follow standard hLDA notation, since the original symbols appear only as images): γ is a parameter that controls how often a document selects a new path; c_d is the path of document d; α is the Dirichlet prior for the document-level ratio; θ_d is the level ratio of document d; z_{d,n} is the level assignment of the word w_{d,n}; w_{d,n} is the n-th word in document d; β_k is the k-th topic; η is the strength parameter of the domain knowledge; β is the Dirichlet prior of the topic-word distribution; N_d is the number of words in document d; D is the number of documents; T is the number of sub-Dirichlet trees; and Ω denotes the user-defined domain knowledge.

Basically, the GHTM 40 assigns, like the HTM 30, a level assignment to each word and a document-specific path. At the same time, the GHTM 40 differs from the HTM 30 in the prior parts 41 and 43.

In the generative process, first, the topic distribution at the root level is drawn stochastically from a Dirichlet distribution. Each sub-root node then draws a Dirichlet tree distribution from the Dirichlet Forest prior; this step amounts to selecting the branch of the Dirichlet tree corresponding to a particular split operation. Then each document generates words by selecting a particular path in the global tree and sampling a level index for each word. As a result, the root topic contains the most common content of the corpus, because the root topic is shared by all documents, while deeper nodes focus on more specific topics.
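A toy sketch of this generative story follows (an illustration under simplifying assumptions, not the patent's reference implementation): the root topic is shared by every path; the sub-root topics, which in the full model are drawn from Dirichlet trees under the Dirichlet Forest prior, are replaced here by plain Dirichlet draws for brevity; and the path choice is uniform rather than the nCRP described next.

```python
import numpy as np

rng = np.random.default_rng(1)

V, DEPTH = 7, 3            # toy vocabulary size and tree depth
alpha = np.ones(DEPTH)     # Dirichlet prior over the document's level ratio

root_topic = rng.dirichlet(np.full(V, 1.0))   # root: plain Dirichlet, no domain knowledge

# Two root-to-leaf paths sharing the root topic; deeper topics would come
# from the Dirichlet Forest prior in the full model.
paths = [
    [root_topic] + [rng.dirichlet(np.full(V, 0.5)) for _ in range(DEPTH - 1)],
    [root_topic] + [rng.dirichlet(np.full(V, 0.5)) for _ in range(DEPTH - 1)],
]

def generate_document(n_words=20):
    path = paths[rng.integers(len(paths))]        # c_d: choose a path (uniform here)
    theta = rng.dirichlet(alpha)                  # theta_d: per-document level ratio
    doc = []
    for _ in range(n_words):
        level = rng.choice(DEPTH, p=theta)        # z_{d,n}: level assignment
        doc.append(rng.choice(V, p=path[level]))  # w_{d,n}: word from that level's topic
    return doc

print(generate_document())
```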

The process by which a document selects a particular path uses the nested Chinese restaurant process (nCRP). Each document uses a Markov process to select a single path from the root to a leaf. The predictive probability of the nCRP follows the Dirichlet process, which is known to have a rich-get-richer property. Alternatively, a document path may be selected using a uniform process, known to have no such adverse property. That is, the path of document d is generated level by level by either the Dirichlet process or the uniform process, as represented by the following Equations (1) and (2).

p(c_{d,l} = k | c_{-d}) ∝ m_k for an existing topic k, and ∝ γ for a new topic   (1)

p(c_{d,l} = k | c_{-d}) ∝ 1 for an existing topic k, and ∝ γ for a new topic   (2)

Here, c_{d,l} denotes the topic selected at level l of the path of document d, and m_k is the number of documents assigned to topic k. At each level, the probability of selecting an existing topic is proportional to the number of documents with topic k (for the Dirichlet process) or to 1 (for the uniform process), while the probability of selecting a new topic is governed by the parameter γ. The formal generative process listing (see the symbols in FIG. 3 and FIG. 4) appears as an image in the source and is not reproduced here.
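The per-level choice in Equations (1) and (2) can be sketched as follows (a minimal illustration assuming NumPy; doc_counts[k] plays the role of m_k, and a returned index equal to len(doc_counts) means a new topic is created).

```python
import numpy as np

rng = np.random.default_rng(2)

def choose_child(doc_counts, gamma, uniform=False):
    """One nCRP step at one level of the topic tree.

    doc_counts[k] is the number of documents already assigned to existing
    child k.  Returns a child index; len(doc_counts) means 'new child'.
    """
    if uniform:
        weights = np.ones(len(doc_counts))       # uniform process: Eq. (2)
    else:
        weights = np.asarray(doc_counts, float)  # Dirichlet process: Eq. (1), rich get richer
    weights = np.append(weights, gamma)          # mass reserved for a brand-new child
    return rng.choice(len(weights), p=weights / weights.sum())

# Three existing subtopics holding 5, 2, and 1 documents; gamma = 1.0.
print(choose_child([5, 2, 1], gamma=1.0))                 # biased toward the popular child
print(choose_child([5, 2, 1], gamma=1.0, uniform=True))   # no rich-get-richer bias
```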

Inference

The Dirichlet tree distribution is, like the Dirichlet distribution, a conjugate prior of the multinomial distribution. Thus the variables q (the index of the branch in the Dirichlet tree distribution), c (the document path), and z (the level index of a word) may be sampled using collapsed Gibbs sampling.

The sampling of q is expressed by Equation (3), which appears as an image in the source. The quantities entering Equation (3) are: the size of the u-th set of words; the set of all topics in the sub-tree t; the internal nodes beneath the sub-Dirichlet tree of topic j for the u-th piece of domain knowledge; the children of node s in the Dirichlet tree of topic j; the edge weight leading into node k in the Dirichlet tree of topic j; and the number of words under node k in the Dirichlet tree of topic j. For each sub-root topic, a set of words is assigned by Equation (3). The second term of Equation (3) is the probability given by the generative process of all sub-topics, including the sub-root topics.

The sampling of c is expressed by Equation (4), which appears as an image in the source. The quantities entering Equation (4) are: the set of all words in document d; the words assigned to the topic at level l of a path; and two word counts over the Dirichlet tree, the former taken over all documents except d and the latter over d only. Document d selects its path according to Equation (4), and the path of d is excluded when counting the paths of the other documents.

The sampling of z is expressed by Equation (5), which appears as an image in the source. The quantities entering Equation (5) include: the number of words in d allocated to topic k, excluding word i; the subset of internal nodes in the Dirichlet tree of topic v that are ancestors of leaf i; and, for each such internal node s, the child node directly under s on the path toward leaf i.

The word-topic distribution φ is not sampled directly; it may be estimated after the Gibbs sampling of q, c, and z has converged. The topic under the Dirichlet prior is obtained by Equation (6), and the topic under the Dirichlet tree prior by Equation (7). As in LDA, the level ratio θ_d of a document can be estimated by Equation (8). Equations (6) through (8) appear as images in the source.
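For orientation only, the standard LDA-style point estimates that such equations typically take are as follows (an assumption, not a reproduction of the patent's exact formulas):

```latex
\hat{\phi}_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w'} \left( n_{k,w'} + \beta_{w'} \right)},
\qquad
\hat{\theta}_{d,\ell} = \frac{n_{d,\ell} + \alpha_\ell}{\sum_{\ell'} \left( n_{d,\ell'} + \alpha_{\ell'} \right)}
```

where n_{k,w} counts the assignments of word w to topic k and n_{d,ℓ} counts the words of document d assigned to level ℓ; under the Dirichlet tree prior of Equation (7), the φ estimate instead multiplies smoothed edge proportions along each root-to-leaf path of the tree.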

The GHTM described above performs topic clustering on a given data set. The GHTM classifies the topics extracted from the data set into a hierarchical tree structure and utilizes user-defined domain knowledge, thereby improving the accuracy of topic clustering.

FIGS. 5 and 6 illustrate, respectively, a big data topic clustering apparatus and method using the GHTM according to an embodiment: FIG. 5 is a block diagram of the apparatus, and FIG. 6 is a flowchart of the method.

The big data topic clustering apparatus 50 shown in FIG. 5 may be implemented as a computing device. The computing device may be any device having a memory for storing data, a processor for performing data processing, and user interface devices such as a keyboard and a display, for example a personal computer, a server computer, a desktop, a laptop, a palmtop, or a smartphone. The computing device may be one independent device, or a distributed computing system in which a plurality of devices connected by a data communication network cooperate with each other.

Meanwhile, the big data topic clustering method 600 shown in FIG. 6 may be performed by an apparatus as illustrated in FIG. 5, or may be implemented as a software program installed and run on a general-purpose computing device including, by way of example, a processor, a memory, and a user interface.

Referring to FIG. 5, as an example, the big data topic clustering device 50 includes a GHTM module 52, a data storage 54, a domain knowledge input unit 56, and a parameter setting unit 58.

The data storage 54 may be a memory device that stores the big data, i.e., the target data from which the big data topic clustering device 50 extracts and hierarchically clusters topics. The big data may be collected from various sources. For example, the big data may be collected from a social network service (SNS). As other examples, the big data may be collected from newsgroup data, user review data of home shopping channels, or media such as newspapers or broadcast news.

The domain knowledge input unit 56 may be a user interface device that allows the user to interact with the big data topic clustering device 50. The user inputs, as domain knowledge, a plurality of categories and a plurality of seed words corresponding to each category. For example, the user may want to extract topics of a desired category (e.g., financial information associated with a stock index) from target big data (e.g., data collected on Twitter over a certain period). The user thus specifies the desired categories and inputs, as seed words, the keywords that best represent each category.

The parameter setting unit 58 may be configured to set the parameters of the GHTM, the domain-knowledge-applied hierarchical topic model executed in the GHTM module 52. The parameter setting unit 58 may be a user interface device through which the user sets the parameters. Alternatively, the parameter setting unit 58 may be implemented as a part of a processor that automatically supplies parameter values to the GHTM module 52, for example values optimized in advance based on test results.

The GHTM module 52 may be implemented as a processor that actually performs the operations of extracting and clustering topics from the big data. To this end, the GHTM module 52 receives the big data from the data storage 54, applies the domain knowledge input through the domain knowledge input unit 56, and executes the GHTM with the parameters set through the parameter setting unit 58, thereby hierarchically clustering the topics extracted from the big data.

The big data topic clustering method 600 according to the embodiment shown in FIG. 6 includes an SNS-based big data collection step 610, a step 630 of having the user designate domain knowledge for the collected big data, a step 650 of setting the parameters of the GHTM, and a step 670 of performing hierarchical topic clustering.

The SNS-based big data collection step 610 may collect data on SNS such as Twitter and Facebook. In other embodiments, the big data may be collected not only from SNS but also from sources such as shopping malls, newsgroups, and other media. The collected big data may be stored in a memory or hard disk built into the computing device and then made available to the processor. Big data collected in other manners may be stored on an optical disk or portable memory and then provided to the computing device, or stored on another remote computing device or a cloud server and then made available through a data communication network such as the Internet.

In step 630, in which the user designates domain knowledge for the collected big data, the user may use a user interface provided by the computing device, such as a keyboard and a display device, to input to the processor of the computing device a plurality of categories and a plurality of corresponding seed words as domain knowledge. The domain knowledge designated by the user may be a set of keywords selected by the user as being associated with the information that the user intends to extract from the big data.

For example, a user may wish to extract finance-related information that will help predict a stock price index from big data collected on Twitter. In this case, the user inputs one or more categories related to 'financial information' and, as seed words, the keywords that best express each category. These categories and seed words are used as the domain knowledge input of the hierarchical topic clustering model.
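As a concrete illustration, such domain knowledge can be represented as a simple mapping from categories to seed words. The category names below are hypothetical; the two seed-word sets are the ones actually used in the review-data experiment described later (see FIG. 8).

```python
# Hypothetical category names; the seed-word sets are those of the
# review-data experiment (FIG. 8) later in this document.
domain_knowledge = {
    "apple":   ["appl", "ipod", "iphon", "mac", "safari"],
    "samsung": ["samsung", "android", "galaxy", "jelli", "bean"],
}
```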

Step 650 of setting the parameters of the GHTM sets the parameters of the GHTM to specific numerical values. The parameters may be set by the user, or the GHTM may be tested in advance and the parameters set to optimized values according to the results.

In step 670 of performing hierarchical topic clustering, the processor of the computing device, or the GHTM module 52 of FIG. 5, hierarchically clusters the topics of the big data using the GHTM to which the user-specified domain knowledge is applied.

Step 670 of hierarchically clustering topics using the GHTM operates differently at the root level node and at the sub-root level nodes. At the root level node, the topic distribution is drawn stochastically from a Dirichlet distribution, whereas at the sub-root level nodes a Dirichlet tree distribution is drawn from the Dirichlet Forest prior. Finally, hierarchical topic clustering is completed by selecting the path of each document, which may proceed by either the Dirichlet process or the uniform process of the nCRP.

Experimental Example

To compare the conventional HTM with the GHTM proposed above, experiments were conducted on concrete big data: the 20 Newsgroups data, RCV1_v2, and Amazon.com product review data. To evaluate the topic interpretability of the hierarchical clustering of the topics extracted by HTM and GHTM, the hierarchical F-measures proposed by Kiritchenko et al. were calculated.

Table 1 below shows, for the big data used in the experiments, the number of documents in each corpus, the number of distinct words, and the total number of words.

(Table 1 appears as an image in the source.)

The 20 Newsgroups corpus of Table 1 is classified into six main categories, each subdivided into sub-categories. The six main categories are: 1) religion-related topics such as atheism and Christianity; 2) computer-related topics such as graphics, operating systems, and hardware; 3) items-for-sale topics; 4) recreation topics such as automobiles and motorcycles; 5) scientific topics such as medicine and space science; and 6) political topics. (For more information, see http://qwone.com/~jason/20Newsgroups/)

Meanwhile, RCV1 is a data set of about 800,000 news articles issued by Reuters over one year (1996-08-20 to 1997-08-19), and RCV1_v2 is the original RCV1 data processed for analysis. RCV datasets are classified in three broad ways: Topics, Industries, and Regions; this experiment used the Topics classification. The Topics classification has four categories, business/industry (CCAT), economy (ECAT), market (MCAT), and government/society (GCAT), of which three categories were used. Unlike 20 Newsgroups, where documents and categories correspond one-to-one, RCV articles can belong to several categories at the same time. (For more information, see http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/)

The review data in Table 1 are a collection of Amazon review data related to the two products 'Apple iPad' and 'Samsung Galaxy Tab'.

Table 2 below shows the 10 seed words assigned by the user for each of the six categories of the 20 Newsgroups corpus of Table 1.

(Table 2 appears as an image in the source.)

The experiments comparing HTM with GHTM were run as follows: 1) the depth of the topic hierarchy was set to 2, 3, and 4; 2) paths were selected using both the Dirichlet process and the uniform process; and 3) the strength parameter η and the number of seed words were varied, the number of seed words being set to 1, 5, and 10 (the η settings appear as an image in the source). In addition to HTM and GHTM, a variant GHTM† was also tested, in which the edge weights into the leaf nodes of the Dirichlet tree distribution are modified (the specific setting appears as an image in the source); under this setting, the correlation of the seed words within the same topic is weakened while the different topic sets remain separated from each other. Also, the document-level prior was set to 0.5; the per-level topic-word prior was set to [2.0, 1.0, 0.5] or [2.0, 1.0, 0.5, 0.1] depending on the depth; the Dirichlet process prior was set to 0.1; and the uniform process prior was set to 0.001.

To measure the validity of GHTM and HTM, hierarchical F-measures were calculated. Although a topic model is not a classifier, it is important to evaluate the document clusters, because 1) accurate clustering affects the topic hierarchy, and 2) ground-truth category labels are available in the datasets. Accordingly, the macro-averaged F-measure as well as the micro-averaged F-measure were calculated. Equation (9) gives the micro-averaged F-measure, and Equation (10) the macro-averaged F-measure.

(Equations (9) and (10) appear as images in the source.) Here, μ and M denote the micro-average and the macro-average, respectively; hP denotes the hierarchical precision and hR the hierarchical recall; T is the correct class and P the predicted class; and L is the number of leaf nodes (these symbol names are supplied here, since the originals appear only as images).
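Since Equations (9) and (10) appear only as images, the following sketch shows a Kiritchenko-style micro-averaged hierarchical F-measure of the kind the text describes, assuming each document's class set has already been extended with its ancestors in the hierarchy; it is an illustration of the metric, not the patent's exact formula.

```python
def hier_f1_micro(true_sets, pred_sets):
    """Micro-averaged hierarchical F-measure.

    Each element of true_sets / pred_sets is the set of classes of one
    document, already extended with all ancestor classes.
    """
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    h_precision = tp / sum(len(p) for p in pred_sets)
    h_recall = tp / sum(len(t) for t in true_sets)
    return 2 * h_precision * h_recall / (h_precision + h_recall)

# Toy hierarchy comp > comp.graphics: predicting the sibling leaf comp.os
# still earns partial credit for the shared ancestor "comp".
truth = [{"comp", "comp.graphics"}]
pred = [{"comp", "comp.os"}]
print(round(hier_f1_micro(truth, pred), 3))  # 0.5
```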

Table 3 below shows the hierarchical F-measures of the tested models. From these results it can first be seen that the F-measures of GHTM and GHTM† are significantly higher than those of HTM when the hierarchy depth exceeds 2. This implies that GHTM classifies documents hierarchically better than HTM, and therefore GHTM is a better model for identifying the word-topic distributions of hierarchical topics.

The second observation is that the uniform process prior is superior on the micro F-measure, while the Dirichlet process is superior on the macro F-measure. This indicates that the macro F-measure underperforms because the contribution of small clusters is underestimated. Therefore, if the user cares about the accuracy of all branches in the topic hierarchy, it is desirable to use the uniform process.

(Table 3 appears as an image in the source.)

FIG. 7 is a schematic view illustrating a topic cluster structure using a general HTM, and FIG. 8 is a schematic view illustrating a topic cluster structure using the GHTM according to an embodiment. The results of FIGS. 7 and 8 are for the review data of Table 1, i.e., the corpus of Amazon review data related to the two products 'Apple iPad' and 'Samsung Galaxy Tab'. In this experiment, the domain knowledge specified was {appl, ipod, iphon, mac, safari} and {samsung, android, galaxy, jelli, bean}. The path selection process used the uniform process prior.

The HTM-based topic hierarchy 70 of FIG. 7 does not consider domain knowledge, while the GHTM topic hierarchy 80 of FIG. 8 incorporates the domain knowledge {appl, ipod, iphon, mac, safari} and {samsung, android, galaxy, jelli, bean}. The word set {appl, ipod, iphon, mac, safari} is marked with an ellipse, and the word set {samsung, android, galaxy, jelli, bean} with a rectangle. Each topic shows the 20 words with the highest probability in the corresponding topic distribution. As can be seen, the topics are clustered more accurately in the GHTM hierarchy 80 than in the HTM hierarchy 70.

It is practically impossible to read all the documents of a large corpus in order to understand its thematic structure. HTM can cluster documents hierarchically, but offered no way to integrate domain knowledge into its prior. The GHTM according to the present disclosure therefore 1) applies domain knowledge as a prior and 2) retains the hierarchical structure of HTM. Thus, the accuracy of the hierarchical clustering of documents can be increased.

Various modified configurations are possible by referring to and combining the various features described herein. Accordingly, the scope of the embodiments should not be interpreted as limited to the described embodiments, but should be construed according to the appended claims.

30: HTM
40: GHTM
41: Domain knowledge
43: Dirichlet prior
50: Big data topic clustering device
52: GHTM module
54: Data storage
56: Domain knowledge input unit
58: Parameter setting unit
70, 80: hierarchical structure

Claims (20)

1. A big data topic clustering method performed by a computing device, comprising:
obtaining big data;
causing a user to input domain knowledge comprising at least one category and at least one seed word corresponding to each of the at least one category; and
applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data with the domain knowledge as input.
2. The method according to claim 1,
wherein applying the GHTM comprises creating a root level node and creating at least one sub-root level node.
3. The method of claim 2,
wherein creating the root level node comprises stochastically generating a topic distribution from a Dirichlet distribution at the root level node.
4. The method of claim 3,
wherein creating the at least one sub-root level node comprises generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
5. The method of claim 4,
wherein applying the GHTM comprises applying the at least one category and the at least one seed word corresponding to each of the at least one category as the Dirichlet Forest prior.
6. The method of claim 5,
wherein applying the GHTM comprises setting a parameter of the GHTM, and
setting the parameter of the GHTM comprises causing the user to input the parameter or setting the parameter to a pre-stored value.
7. An apparatus for big data topic clustering, comprising:
a data storage configured to store big data;
a domain knowledge input unit configured to receive domain knowledge comprising at least one category and at least one seed word corresponding to each of the at least one category; and
a GHTM module configured to perform topic clustering by applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data, based on the domain knowledge comprising the at least one category and the at least one seed word corresponding to each of the at least one category.
8. The apparatus of claim 7,
wherein the GHTM module is further configured to generate a root level node and at least one sub-root level node.
9. The apparatus of claim 8,
wherein the GHTM module is further configured to stochastically generate a topic distribution from a Dirichlet distribution at the root level node and to generate a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
10. The apparatus of claim 9,
wherein the GHTM module is configured to apply the domain knowledge comprising the at least one category and the at least one seed word corresponding to each of the at least one category as the Dirichlet Forest prior.
11. The apparatus of claim 7,
wherein the GHTM module is further configured to set a parameter of the GHTM, and
the parameter of the GHTM is a value input from the user or a pre-stored value.
12. A big data topic clustering method performed by a computing device, comprising:
obtaining big data;
causing a user to input domain knowledge; and
applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data based on the domain knowledge.
13. The method of claim 12,
wherein applying the GHTM comprises generating a root level node and at least one sub-root level node.
14. The method of claim 13,
wherein generating the root level node and the at least one sub-root level node comprises stochastically generating a topic distribution from a Dirichlet distribution at the root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
15. A big data topic clustering method performed by a computing device, comprising:
obtaining big data;
causing a user to input domain knowledge comprising at least one category and at least one seed word corresponding to each of the at least one category; and
performing big data topic clustering by applying a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) to the big data based on the domain knowledge.
16. The method of claim 15,
wherein performing the big data topic clustering comprises creating a root level node and at least one sub-root level node.
17. The method of claim 16,
wherein creating the root level node and the at least one sub-root level node comprises stochastically generating a topic distribution from a Dirichlet distribution at the root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
18. A big data topic clustering method performed by a computing device, comprising:
preparing to be able to access big data;
causing a user to input domain knowledge comprising at least one category and at least one seed word corresponding to each of the at least one category; and
accessing the big data and performing big data topic clustering on the big data according to a domain-knowledge-applied Guided Hierarchical Topic Model (GHTM) based on the domain knowledge.
19. The method of claim 18,
wherein performing the big data topic clustering comprises creating a root level node and at least one sub-root level node.
20. The method of claim 19,
wherein creating the root level node and the at least one sub-root level node comprises stochastically generating a topic distribution from a Dirichlet distribution at the root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
KR1020150069641A 2015-05-19 2015-05-19 Method and system for topic clustering of big data KR20160136014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150069641A KR20160136014A (en) 2015-05-19 2015-05-19 Method and system for topic clustering of big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150069641A KR20160136014A (en) 2015-05-19 2015-05-19 Method and system for topic clustering of big data

Publications (1)

Publication Number Publication Date
KR20160136014A true KR20160136014A (en) 2016-11-29

Family

ID=57706228

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150069641A KR20160136014A (en) 2015-05-19 2015-05-19 Method and system for topic clustering of big data

Country Status (1)

Country Link
KR (1) KR20160136014A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596202A (en) * 2018-03-08 2018-09-28 Tsinghua University Method for calculating personal commuting time based on mobile terminal GPS positioning data
CN109710728A (en) * 2018-11-26 2019-05-03 Southwest China Institute of Electronic Technology (10th Research Institute of China Electronics Technology Group Corporation) Automatic news topic discovery method
CN109684480A (en) * 2018-12-30 2019-04-26 Hangzhou Yitu Network Technology Co., Ltd. Industry-based clustering method
CN109684480B (en) * 2018-12-30 2021-01-05 Beijing People's Online Network Co., Ltd. Industry-based clustering method


Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application