KR20160136014A - Method and system for topic clustering of big data - Google Patents
Method and system for topic clustering of big data
- Publication number
- KR20160136014A KR1020150069641A KR20150069641A
- Authority
- KR
- South Korea
- Prior art keywords
- level node
- ghtm
- big data
- topic
- domain knowledge
- Prior art date
Classifications
- G06F17/30318—
- G06F17/2745—
- G06F17/30705—
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method and apparatus for clustering big data topics. Provided is a big data topic clustering method performed by a computing device, comprising the steps of: obtaining big data; causing a user to input domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data with the domain knowledge as input.
Description
The disclosed technique relates to a topic clustering technique for big data, and more particularly to an improved big data topic clustering method and apparatus based on a hierarchical topic model incorporating domain knowledge.
In recent years, as the volume of data generated and exchanged over networks such as the Internet has increased, big data mining techniques have been proposed that extract useful information by collecting and analyzing online data. For example, studies are under way to aggregate and analyze public opinion expressed through SNS (social network services) such as Twitter or Facebook, in order to anticipate and prepare for economic conditions and stock price fluctuations.
However, in order to extract financial information from big data collected on SNS and analyze it to predict the economic situation or stock price fluctuations, there has been a technical demand that the financial information to be extracted be extracted accurately. The technique of extracting topic information from big data is known as a topic model. Whether the correct information has been extracted is determined by whether the extracted topics are correctly clustered.
Latent Dirichlet Allocation (LDA) is well known as a probabilistic topic model for extracting latent topics from a large corpus. It extracts topics under the assumption that many latent topics exist in the corpus and that each document is a mixture of topics. Here, a 'topic' is a probability distribution over words, not a short phrase that can be directly interpreted by humans. Therefore, there is a problem that it is difficult to interpret the topics extracted by the probabilistic model as having a meaning understood by human intuition, and thus it cannot be known whether the accurate information has been extracted.
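The notion that each document is a mixture of topics, each topic being a probability distribution over words, can be sketched with a toy generative simulation (the vocabulary, topic values, and hyperparameters below are illustrative, not taken from the patent):

```python
import random

random.seed(0)

def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution via normalized Gammas."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Two toy topics: probability distributions over a shared vocabulary.
vocab = ["stock", "price", "bank", "game", "team", "score"]
topics = [
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],  # a "finance"-like topic
    [0.02, 0.02, 0.01, 0.35, 0.30, 0.30],  # a "sports"-like topic
]

def generate_document(n_words, alpha=(1.0, 1.0)):
    """LDA-style generation: per-document topic mixture, then word draws."""
    theta = sample_dirichlet(alpha)  # document-topic mixture
    words = []
    for _ in range(n_words):
        k = random.choices(range(len(topics)), weights=theta)[0]
        w = random.choices(vocab, weights=topics[k])[0]
        words.append(w)
    return theta, words

theta, doc = generate_document(10)
print(theta, doc)
```

The learned "topic" is only the list of word probabilities, which is why interpretability is an issue.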
While the LDA only observes independent topics, the Hierarchical Topic Model (HTM) has been proposed to extend it and observe topics that form hierarchical relationships. The HTM has the advantage of showing categorized content, since the topics extracted from the corpus are related in a tree structure, that is, a hierarchical structure. However, the topic tree of the HTM often differed from the interpretation given by human intuition. This is because the HTM extracts only data-driven information that does not reflect human knowledge at all. Therefore, there is still a need for a topic model in which the topics extracted from the corpus correspond to meanings understood by human intuition.
In particular, the disclosed technique aims to provide a big data topic clustering method and apparatus, based on a hierarchical topic model incorporating domain knowledge, with improved accuracy of hierarchical clustering of topics extracted from big data.
The disclosed technique is also directed to a method and apparatus for clustering big data topics using a Guided Hierarchical Topic Model (GHTM), which improves the interpretability of topics by letting the user define sets of words that reveal the characteristics of the categories the user desires to see, and applying them to the HTM.
The disclosed technique further aims to provide a method and apparatus for clustering big data topics with improved accuracy of hierarchical clustering of topics extracted from big data, by applying the Dirichlet Forest prior when incorporating domain knowledge into hierarchical topic modeling.
The object of the present invention is achieved by the big data topic clustering method and apparatus provided in accordance with the embodiments.
A big data topic clustering method provided in accordance with an aspect of the embodiments is a big data topic clustering method performed by a computing device, comprising the steps of: obtaining big data; causing a user to input domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data with the domain knowledge as input.
Applying the GHTM may include creating a root level node and creating at least one sub-root level node.
The step of creating a root level node may include the step of stochastically generating a topic distribution from a Dirichlet distribution at the root level node.
Creating at least one sub-root level node may include generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
Applying the GHTM may include applying the at least one category and the at least one seed word corresponding to each of the at least one category as the Dirichlet forest prior.
The step of applying the GHTM may include setting a parameter of the GHTM, and the step of setting the parameter of the GHTM may include causing the user to input the parameter or setting the parameter to a pre-stored value.
An apparatus for big data topic clustering provided in accordance with another aspect of the embodiments includes: a data storage for storing big data; a domain knowledge input section for receiving domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and a GHTM module configured to perform topic clustering by applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data based on the domain knowledge including the at least one category and the at least one seed word corresponding to each of the at least one category.
The GHTM module may be further configured to create a root level node and at least one sub-root level node.
The GHTM module may be further configured to stochastically generate a topic distribution from a Dirichlet distribution at the root level node and to generate a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
The GHTM module may be further configured to apply the domain knowledge, including the at least one category and the at least one seed word corresponding to each of the at least one category, as the Dirichlet forest prior.
The GHTM module may be further configured to set the parameters of the GHTM, and the parameters of the GHTM may be values input by the user or pre-stored values.
A big data topic clustering method provided in accordance with another aspect of the embodiments is a big data topic clustering method performed by a computing device, comprising the steps of: obtaining big data; causing a user to input domain knowledge including at least one category and at least one corresponding seed word; and applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data based on the domain knowledge.
Applying the GHTM may include creating a root level node and at least one sub-root level node.
The step of generating a root level node and at least one sub-root level node may comprise stochastically generating a topic distribution from a Dirichlet distribution at the root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
A big data topic clustering method provided in accordance with another aspect of the embodiments is a big data topic clustering method performed by a computing device, comprising the steps of: obtaining big data; causing a user to input domain knowledge including at least one category and at least one corresponding seed word; and performing big data topic clustering by applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data based on the domain knowledge.
Performing the big data topic clustering may include creating a root level node and at least one sub-root level node.
The step of generating a root level node and at least one sub-root level node may comprise stochastically generating a topic distribution from a Dirichlet distribution at the root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
A big data topic clustering method provided in accordance with another aspect of the embodiments is a big data topic clustering method performed by a computing device, comprising the steps of: preparing to be able to access big data; causing a user to input domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category; and accessing the big data and performing big data topic clustering on the big data in accordance with a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) based on the domain knowledge.
Performing the big data topic clustering may include creating a root level node and at least one sub-root level node.
The step of generating a root level node and at least one sub-root level node may comprise stochastically generating a topic distribution from a Dirichlet distribution at the root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
The features and advantages of the embodiments will become more apparent from the following detailed description based on the accompanying drawings.
According to the embodiments, it is possible to provide a method and apparatus for clustering big data topics in which the accuracy of hierarchical clustering of topics extracted from big data is improved, based on a hierarchical topic model incorporating domain knowledge.
According to the embodiments, it is possible to provide a method and apparatus for clustering big data topics using a GHTM (Guided HTM), which improves the interpretability of topics by letting the user define sets of words that reveal the characteristics of the categories the user desires to see and applying them to the HTM.
According to the embodiments, a hierarchical topic structure is provided and domain knowledge about the corpus can be integrated. A hierarchical topic model can classify similar topics by category, and intensive sampling of predefined keywords is enabled by using domain knowledge. The embodiments use a hierarchical topic clustering model named by the inventors GHTM (Guided Hierarchical Topic Model). The basis of the GHTM used in the examples is the HTM. Compared to LDA, the HTM detects hierarchical topic structures as well as topics. In the GHTM, the domain knowledge is encoded in a Dirichlet tree distribution, and the Dirichlet tree distribution serves as the prior of the hierarchical topic model. By this prior adaptation, a topic tree guided by the domain knowledge can be obtained. Accordingly, it is possible to provide a method and an apparatus for clustering big data topics with improved accuracy of hierarchical clustering of topics extracted from big data.
Figure 1 is a schematic diagram showing a general Dirichlet distribution and a Dirichlet tree distribution;
Figure 2 is a schematic diagram illustrating a hierarchical topic model applying the Dirichlet Forest prior according to one embodiment;
Figure 3 is a schematic diagram illustrating a general hierarchical topic model, i.e., an HTM;
Figure 4 is a schematic diagram illustrating a hierarchical topic model applying the Dirichlet Forest prior, i.e., GHTM (Guided HTM), according to one embodiment;
Figure 5 is a block diagram illustrating a big data topic clustering apparatus using GHTM according to an embodiment;
Figure 6 is a flowchart illustrating a method of clustering big data topics using GHTM according to an embodiment;
Figure 7 is a schematic diagram illustrating a topic cluster structure using a general HTM; and
Figure 8 is a schematic diagram illustrating a topic cluster structure using GHTM according to one embodiment.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a schematic diagram showing a general Dirichlet distribution and a Dirichlet tree distribution.
The distribution shown in FIG. 1(a) is a Dirichlet distribution over the words A, B, C, D, E, F, and G. The Dirichlet tree distribution shown in FIG. 1(b) reflects the domain knowledge {{A}, {B, C}, {D, E, F}} for the words A through G; a split operation is applied to {D, E, F}. In an embodiment, it is assumed that the user provides keywords to hierarchically separate the topics. The keywords provided by the user are seed words used to integrate the domain knowledge. The seed words are converted into the parameters of the Dirichlet tree distribution. As can be seen in FIG. 1, the Dirichlet distribution (see FIG. 1(a)) is a Dirichlet tree distribution with a tree depth of one (see FIG. 1(b)). When a split operation is applied to the Dirichlet tree distribution, the constraint is encoded in the distribution as illustrated in FIG. 1(b). The parameter shown denotes the strength of the domain knowledge; the larger it is, the stronger the separation tendency.
FIG. 2 is a schematic diagram illustrating a hierarchical topic model applying the Dirichlet Forest prior according to one embodiment.
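The Dirichlet tree of FIG. 1(b) can be sketched as a two-level sampler (a minimal illustration; the tree layout and the way the strength parameter scales internal-node edge weights are simplifying assumptions, not the patent's exact parameterization):

```python
import random

random.seed(1)

def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution via normalized Gammas."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_dirichlet_tree(tree, eta=1.0):
    """Sample a word distribution from a two-level Dirichlet tree.

    `tree` maps each root branch to either a single word (a leaf) or a
    list of words grouped under an internal node; `eta` scales the edge
    weights of internal nodes (the domain-knowledge strength parameter).
    """
    probs = {}
    # Root-level Dirichlet: internal nodes get weight eta, leaves weight 1.
    branch_alpha = [eta if isinstance(b, list) else 1.0 for b in tree]
    branch_p = sample_dirichlet(branch_alpha)
    for p, branch in zip(branch_p, tree):
        if isinstance(branch, list):  # internal node: a second Dirichlet
            inner = sample_dirichlet([1.0] * len(branch))
            for q, word in zip(inner, branch):
                probs[word] = p * q  # probability flows down the tree
        else:
            probs[branch] = p
    return probs

# Domain knowledge {{A}, {B, C}, {D, E, F}} over words A..G, as in FIG. 1(b):
tree = ["A", ["B", "C"], ["D", "E", "F"], "G"]
dist = sample_dirichlet_tree(tree, eta=10.0)
print(dist)
```

With a large `eta`, the grouped words tend to receive correlated probability mass, which is the intended effect of the seed-word constraint.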
The LDA uses two kinds of Dirichlet priors: one is a prior for the document-topic distribution, and the other is a prior for the topic-word distribution. Previously, there was a proposal for an asymmetric Dirichlet prior that gives more weight to particular words. However, the asymmetric prior affects parameter inference for other words, and it cannot express the correlation between two specific words. For example, if two words are strongly positively correlated, the Dirichlet prior could not be set to reflect this without affecting other words. This problem is solved by the Dirichlet Forest prior. The Dirichlet Forest prior is a collection of Dirichlet tree distributions. The Dirichlet tree distribution allows the encoding of complex constraints, i.e., split operations. The GHTM according to an embodiment applies the Dirichlet forest prior to a hierarchical topic model.
FIG. 2 schematically illustrates a hierarchical topic model using such a Dirichlet forest prior. Here, only the root node is associated with a Dirichlet prior, so as to capture the most common topic in the corpus regardless of the domain knowledge. On the other hand, a Dirichlet forest prior is assigned to the other nodes. The Dirichlet forest prior holds a collection of Dirichlet tree distributions, as it encodes the split operations. Thus, it provides guidance for parameter estimation in the sub-trees.
FIG. 3 is a schematic diagram illustrating a general hierarchical topic model, i.e., an HTM, and FIG. 4 is a schematic diagram illustrating a hierarchical topic model using the Dirichlet forest prior, i.e., the GHTM (Guided HTM), according to an embodiment.
Referring to FIGS. 3 and 4, each circle represents a random variable. A gray filled circle is an observed variable, and an unfilled circle is a hidden variable. A large rectangle containing several random variables is called a plate, meaning that the contained set of random variables is replicated the number of times specified in its corner (D, N_d, ∞). The arrows indicate that there is a statistical relationship, represented by a probability distribution, between the two connected variables.
Here, the symbols shown in FIGS. 3 and 4 denote, in turn: the parameter that controls how often a document selects a new path; the path of document d; the Dirichlet prior for the document level ratio; the level ratio of document d; the level assignment of each word; the n-th word in document d; the k-th topic; the strength parameter of the domain knowledge; the Dirichlet prior for the topic-word distribution; the number of words in document d; D, the number of documents; T, the number of sub-Dirichlet trees; and the user-defined domain knowledge.
In the generative process, first, the topic distribution is generated stochastically from the Dirichlet distribution at the root level. The sub-root nodes then generate Dirichlet tree distributions from the Dirichlet forest prior. The generative process at a sub-root node amounts to selecting the branch of the Dirichlet tree distribution corresponding to a particular split operation. Then, each document generates words by selecting a particular path in the global tree and sampling a level index. As a result, the root topic contains the most common content of the corpus, because the root topic is shared by all the documents, while deeper-level nodes focus on more specific topics.
The process by which a document selects a particular path uses the nested Chinese restaurant process (nCRP). Each document uses a Markov process to select a single path from the root to a leaf. The internal prediction probability of the nCRP follows the Dirichlet process, which is known to have the rich-get-richer property. Alternatively, a document path may be selected using a uniform process, which is known not to have this adverse property. That is, the path of the document d is sequentially generated by either the Dirichlet process or the uniform process, which is represented by the following Equations (1) and (2).
Here, the selected variable represents a topic at the given level. At each level, the probability of selecting an existing topic is proportional to the number of documents assigned to topic k (for the Dirichlet process) or to 1 (for the uniform process). The probability of selecting a new topic depends on the parameter controlling new-path selection. The following is the generative process (see the symbols in FIG. 3 and FIG. 4).
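The two path-selection rules described above can be sketched as follows (an illustrative computation of the per-level selection probabilities; the value of `gamma` and the document counts are made-up):

```python
def child_selection_probs(child_doc_counts, gamma, process="dirichlet"):
    """Probability of choosing each existing child topic, or a new one,
    at one level of the path: the nested CRP ("dirichlet") weights
    children by document count, while the uniform alternative avoids
    the rich-get-richer effect by weighting every child equally."""
    if process == "dirichlet":
        weights = [float(n) for n in child_doc_counts]  # proportional to n_k
    else:
        weights = [1.0] * len(child_doc_counts)         # every child equal
    weights.append(gamma)  # probability mass reserved for a new topic
    total = sum(weights)
    return [w / total for w in weights]

# Three existing child topics holding 8, 1, and 1 documents:
crp = child_selection_probs([8, 1, 1], gamma=0.1, process="dirichlet")
uni = child_selection_probs([8, 1, 1], gamma=0.1, process="uniform")
print(crp, uni)
```

Under the Dirichlet process the crowded child dominates; under the uniform process all existing children are equally likely, which is why the uniform process is preferred when every branch of the hierarchy matters.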
Inference
The Dirichlet tree distribution is a conjugate prior to the multinomial distribution, like the Dirichlet distribution. Thus, the variables q (the index of the branch in the Dirichlet tree distribution), c (the document path), and z (the level index of a word) may be sampled using collapsed Gibbs sampling.
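As a concrete but heavily simplified illustration of collapsed Gibbs sampling, the following resamples only topic assignments z in a flat LDA setting, omitting the Dirichlet tree walk and the path variable of Equations (3) to (5); the hyperparameters and toy documents are illustrative:

```python
import random
from collections import defaultdict

random.seed(2)

def gibbs_resample_z(docs, K, alpha=0.5, beta=0.1, iters=20):
    """Collapsed Gibbs sampling of topic assignments z for flat LDA."""
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    z = [[random.randrange(K) for _ in d] for d in docs]
    ndk = defaultdict(int)  # (document, topic) counts
    nkw = defaultdict(int)  # (topic, word) counts
    nk = defaultdict(int)   # per-topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Exclude the current word from all counts ("collapsed" step).
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                weights = [
                    (ndk[d, t] + alpha) * (nkw[t, w] + beta) / (nk[t] + V * beta)
                    for t in range(K)
                ]
                k = random.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z

docs = [["stock", "price", "stock"], ["team", "score", "team"]]
z = gibbs_resample_z(docs, K=2)
print(z)
```

The GHTM's sampler has the same remove-reweight-reassign structure, but the word weights additionally walk the Dirichlet tree edges as in Equation (5).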
The sampling of q is expressed by Equation (3).
Here, the symbols denote, in turn: the size of the u-th set of words; the set of all the topics in sub-tree t; the internal nodes beneath the sub-Dirichlet tree of topic j for the u-th domain knowledge; the children of node s in the Dirichlet tree of topic j; the edge weight leading into node k in the Dirichlet tree of topic j; and the number of words under node k in the Dirichlet tree of topic j. For each sub-root topic, a set of words is assigned by Equation (3). The second term in Equation (3) is the probability given by the generative form of all sub-topics, including the sub-root topics. The sampling of c is expressed by Equation (4).
Here, the first factor involves all the words in document d, and the second the words assigned to the topics on the path of document d. The sampling of z is expressed by Equation (5).
Here, the symbols denote: the number of words in document d allocated to topic k, excluding word i; the subset of the internal nodes in the Dirichlet tree of topic v that are ancestors of leaf i; and the child node directly under s on that path. The word-topic distribution φ may be estimated after convergence of the Gibbs sampling of q, c, and z. A topic with the Dirichlet prior is obtained by Equation (6), and a topic with the Dirichlet tree prior is obtained by Equation (7).
As in the LDA, the document's level ratio can be estimated by Equation (8).
The GHTM as described above performs topic clustering on a specific data set. At this time, the GHTM can classify the topics extracted in the data set into a hierarchical tree structure and utilize user defined domain knowledge, thereby improving the accuracy of topic clustering.
FIGS. 5 and 6 illustrate an apparatus and a method for clustering big data topics using the GHTM according to an embodiment.
The Big Data
Meanwhile, the Big Data
5, as an example, the big data
The
The domain
The
The
The Big Data
The SNS-based big
In
For example, a user may wish to extract financial related information that will help predict a stock price index from big data collected on Twitter. In this case, the user can input one or more categories related to 'financial information' and keywords that can express each category as a seed word. These categories and seed words are used as the domain knowledge input value of the hierarchical topic clustering model.
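Such user input might take the following shape (the category names and seed words below are hypothetical examples, not values from the patent):

```python
# Hypothetical domain knowledge for extracting finance-related topics:
# each category name is paired with seed words that should cluster together.
domain_knowledge = {
    "stock_market": ["stock", "index", "kospi", "nasdaq", "share"],
    "banking":      ["bank", "interest", "loan", "deposit", "rate"],
    "currency":     ["dollar", "won", "exchange", "currency", "forex"],
}

# Each seed-word set becomes one split operation in the Dirichlet forest prior.
seed_sets = list(domain_knowledge.values())
print(len(seed_sets))  # → 3
```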
The
In
The
Experimental Example
In order to compare the general HTM with the GHTM proposed above, experiments were conducted on concrete big data. The big data used in the experiments were the 20 Newsgroups data, RCV1_v2, and Amazon.com product review data. To evaluate the topic interpretability of the hierarchical clustering of topics extracted by the HTM and the GHTM, the hierarchical F-measures proposed by Kiritchenko et al. were calculated.
Table 1 below shows, for the big data used in the experiments, the number of documents in each corpus, the number of distinct words, and the total number of words.
The 20 Newsgroups corpus of Table 1 is classified into six main categories, and each main category is subdivided into sub-categories. The six main categories are: 1) religion-related topics such as atheism and Christianity; 2) computer-related topics such as graphics, operating systems, and hardware; 3) commodity-related topics; 4) recreation-related topics such as automobiles and motorcycles; 5) scientific topics related to medicine and space science; and 6) political topics. (For more information, see http://qwone.com/~jason/20Newsgroups/)
On the other hand, the dataset collecting about 800,000 news articles issued by Reuters over one year (1996-08-20 to 1997-08-19) is referred to as RCV1, and RCV1_v2 is a version of the original RCV1 data corrected for analysis. The RCV datasets are largely classified along three dimensions: Topics, Industries, and Regions. In this experiment, the Topics classification was used. The Topics classification has four categories, business/industry (CCAT), economy (ECAT), market (MCAT), and government/society (GCAT), of which three categories were used. Unlike 20 Newsgroups, where documents and categories correspond one-to-one, articles in RCV1 can belong to several categories at the same time. (For more information, see http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/)
The Review Data in Table 1 is a collection of Amazon review data related to the two products 'Apple iPad' and 'Samsung Galaxy Tab'.
Table 2 below shows the 10 seed words assigned by the user for each of the six categories of the 20 Newsgroups corpus of Table 1.
In order to carry out the comparison experiment between the HTM and the GHTM, 1) the depth of the topic hierarchy was set from 2 to 4, 2) paths were selected using both the Dirichlet process and the uniform process, and 3) the number of seed words was set to 1, 5, and 10. In addition to the HTM and the GHTM, a variant GHTM† was further tested, in which the edge weights of the leaf nodes in the Dirichlet tree distribution are adjusted so that the probability that the seed words are correlated within the same topic is reduced while the different topic sets are still separated from each other. Also, the strength parameter was set to [2.0, 1.0, 0.5] or [2.0, 1.0, 0.5, 0.1] depending on the level, the Dirichlet process prior was set to 0.1, and the uniform process prior to 0.001. To measure the validity of the GHTM and the HTM, hierarchical F-measures were calculated. Although a topic model is not a classifier, it is important to evaluate the document clusters, because 1) accurate clustering affects the topic hierarchy, and 2) ground-truth category labels can be obtained from the datasets. Thus, the macro-averaged F-measure as well as the micro-averaged F-measure were calculated. Equation (9) gives the micro-averaged F-measure, and Equation (10) the macro-averaged F-measure.
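The micro-averaged hierarchical F-measure of Kiritchenko et al. can be sketched as follows (the toy topic paths are illustrative; the macro variant would average a per-class version of the same computation):

```python
def ancestors(path):
    """All prefixes of a path in the topic tree, e.g. ('sci', 'med')
    yields {('sci',), ('sci', 'med')}."""
    return {path[:i] for i in range(1, len(path) + 1)}

def micro_hf(pairs):
    """Micro-averaged hierarchical F-measure over (true, predicted) paths:
    precision and recall are computed on the ancestor sets, so a partially
    correct path (right parent, wrong leaf) still earns partial credit."""
    inter = sum(len(ancestors(t) & ancestors(p)) for t, p in pairs)
    hp = inter / sum(len(ancestors(p)) for _, p in pairs)  # hierarchical precision
    hr = inter / sum(len(ancestors(t)) for t, _ in pairs)  # hierarchical recall
    return 2 * hp * hr / (hp + hr)

pairs = [
    (("sci", "med"), ("sci", "med")),    # exact match
    (("sci", "space"), ("sci", "med")),  # right parent, wrong leaf
    (("rec", "autos"), ("sci", "med")),  # completely wrong
]
print(micro_hf(pairs))  # → 0.5
```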
Here, the superscripts denote the micro-average and the macro-average, respectively; the equations involve the hierarchical precision, the hierarchical recall, the true class, the predicted class, and L, the number of leaf nodes.
Table 3 below shows the hierarchical F-measures of the tested models. It can be seen from this result, first, that the F-measures of the GHTM and GHTM† are significantly higher than those of the HTM when the hierarchical level exceeds 2. This implies that the GHTM classifies documents more hierarchically than the HTM, and therefore the GHTM is a better model for identifying the word-topic distributions of hierarchical topics.
The second point of this result is that the uniform process prior is superior on the micro F-measure, while the Dirichlet process is superior on the macro F-measure. This indicates that the macro F-measure underperforms because the contribution of small clusters is underestimated. Therefore, if the user cares about the accuracy of all branches in the topic hierarchy, it is desirable to use the uniform process.
FIG. 7 is a schematic diagram illustrating a topic cluster structure using a general HTM, and FIG. 8 is a schematic diagram illustrating a topic cluster structure using the GHTM according to an exemplary embodiment. The results of FIGS. 7 and 8 are for the review data of Table 1, that is, the corpus collecting Amazon review data related to the two products 'Apple iPad' and 'Samsung Galaxy Tab'. In this experiment, the user-specified domain knowledge is {appl, ipod, iphon, mac, safari} and {samsung, android, galaxy, jelli, bean}. The path selection process used the uniform process prior.
The HTM-based topic hierarchy 70 of FIG. 7 does not consider domain knowledge, while the GHTM topic hierarchy 80 of FIG. 8 incorporates the domain knowledge {appl, ipod, iphon, mac, safari} and {samsung, android, galaxy, jelli, bean}. The word set {appl, ipod, iphon, mac, safari} is marked with an ellipse, and the word set {samsung, android, galaxy, jelli, bean} with a rectangle. Each topic contains the 20 words with the highest probability in the corresponding topic distribution. As can be seen, the topics are clustered more accurately in the hierarchical structure 80 of the GHTM than in the hierarchical structure 70 of the HTM.
It is practically impossible to read all the documents of a large corpus in order to understand its contextual structure. The HTM can cluster documents hierarchically, but there was no way to integrate domain knowledge into its prior. Therefore, the GHTM according to the present disclosure 1) applies domain knowledge as a prior and 2) retains the hierarchical structural properties of the HTM. Thus, the accuracy of the hierarchical clustering of documents can be increased.
Various modified configurations are possible by referring to and combining the various features described herein. Accordingly, it should be pointed out that the scope of the embodiments is not to be interpreted as limited to the described embodiments, but should be construed according to the appended claims.
30: HTM
40: GHTM
41: Domain knowledge
43: Dirichlet fryer
50: Big data topic clustering device
52: GHTM module
54: Data storage
56: Domain knowledge input unit
58: Parameter setting section
70, 80: hierarchical structure
Claims (20)
Obtaining big data,
Causing the user to enter domain knowledge comprising at least one category and at least one seed word corresponding to each of the at least one category, and
And applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data with the domain knowledge as input.
Wherein applying the GHTM comprises creating a root level node and creating at least one sub-root level node.
The step of generating the root level node
Comprises stochastically generating a topic distribution from a Dirichlet distribution at said root level node.
The step of generating the at least one sub-root level node
Comprises generating a Dirichlet tree distribution from a Dirichlet Forest prior at the at least one sub-root level node.
The step of applying the GHTM
Comprises applying the at least one category and the at least one seed word corresponding to each of the at least one category as the Dirichlet forest prior.
Wherein applying the GHTM comprises setting parameters of the GHTM,
The step of setting the parameters of the GHTM
Causing the user to input the parameter;
Or setting the parameter to a pre-stored value.
Data storage to store big data,
A domain knowledge input unit for receiving a domain knowledge including at least one category and at least one seed word corresponding to each of the at least one category,
And a GHTM module configured to perform topic clustering by applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data based on the domain knowledge including the at least one category and the at least one seed word corresponding to each of the at least one category.
Wherein the GHTM module is further configured to generate a root level node and at least one sub-root level node.
Wherein the GHTM module is further configured to stochastically generate a topic distribution from the Dirichlet distribution at the root level node and to generate a Dirichlet tree distribution from the Dirichlet Forest prior at the at least one sub-root level node, in the big data topic clustering apparatus.
The GHTM module
And apply the domain knowledge comprising the at least one category and the at least one seed word corresponding to each of the at least one category as the Dirichlet forest prior.
Wherein the GHTM module is further configured to set a parameter of the GHTM,
Wherein the parameter of the GHTM is a value input from the user or a previously stored value.
Obtaining big data,
Causing the user to enter domain knowledge, and
And applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data based on the domain knowledge.
Wherein applying the GHTM comprises generating a root level node and at least one sub-root level node.
Wherein generating the root level node and at least one sub-root level node comprises:
Stochastically generating a topic distribution from a Dirichlet distribution at said root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at said at least one sub-root level node, in the method of clustering big data topics.
Obtaining big data,
Causing the user to enter domain knowledge comprising at least one category and at least one seed word corresponding to each of the at least one category, and
And performing big data topic clustering by applying a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) to the big data based on the domain knowledge.
Wherein performing the big data topic clustering comprises creating a root level node and at least one sub-root level node.
Wherein generating the root level node and at least one sub-root level node comprises:
Stochastically generating a topic distribution from a Dirichlet distribution at said root level node, and generating a Dirichlet tree distribution from a Dirichlet Forest prior at said at least one sub-root level node, in the method of clustering big data topics.
Preparing access to big data,
Receiving, from a user, domain knowledge comprising at least one category and at least one seed word corresponding to each of the at least one category, and
Accessing the big data and performing big data topic clustering on the big data in accordance with a domain-knowledge-applied GHTM (Guided Hierarchical Topic Model) based on the domain knowledge.
Wherein performing the big data topic clustering comprises generating a root level node and at least one sub-root level node,
Wherein generating the root level node and the at least one sub-root level node comprises:
Stochastically generating a topic distribution from a Dirichlet distribution at said root level node and generating a Dirichlet-tree distribution from a Dirichlet Forest prior at said at least one sub-root level node, the method of clustering big data topics.
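The three claimed steps — accessing big data, receiving category/seed-word domain knowledge from a user, and clustering the data guided by that knowledge — can be illustrated end to end. The sketch below is a deliberately simple seed-word-overlap clusterer, not the claimed GHTM; the documents, categories, and seed words are invented for illustration.

```python
# Hypothetical toy corpus standing in for "big data"
docs = [
    "the stock price fell",
    "the team scored a goal",
    "market price up",
]

# Hypothetical domain knowledge: categories, each with seed words
domain_knowledge = {
    "finance": ["stock", "price", "market"],
    "sports": ["goal", "match", "team"],
}

def cluster_by_seed_overlap(docs, domain_knowledge):
    """Assign each document to the category whose seed words it shares most.

    A crude guided-clustering baseline; the patent instead feeds the seed
    words into a hierarchical topic model as a Dirichlet Forest prior.
    """
    labels = []
    for doc in docs:
        words = set(doc.split())
        scores = {cat: len(words & set(seeds))
                  for cat, seeds in domain_knowledge.items()}
        labels.append(max(scores, key=scores.get))
    return labels

labels = cluster_by_seed_overlap(docs, domain_knowledge)
```

Even this baseline shows why the domain knowledge matters: without the seed words there is no signal tying a document to a user-meaningful category, which is the gap the claimed guided model fills.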
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150069641A KR20160136014A (en) | 2015-05-19 | 2015-05-19 | Method and system for topic clustering of big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150069641A KR20160136014A (en) | 2015-05-19 | 2015-05-19 | Method and system for topic clustering of big data |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20160136014A true KR20160136014A (en) | 2016-11-29 |
Family
ID=57706228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150069641A KR20160136014A (en) | 2015-05-19 | 2015-05-19 | Method and system for topic clustering of big data |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20160136014A (en) |
2015-05-19: KR application KR1020150069641A published as KR20160136014A/en; not active (Application Discontinuation)
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108596202A (en) * | 2018-03-08 | 2018-09-28 | 清华大学 | The method for calculating personal commuting time based on mobile terminal GPS positioning data |
CN109710728A (en) * | 2018-11-26 | 2019-05-03 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | News topic automatic discovering method |
CN109684480A (en) * | 2018-12-30 | 2019-04-26 | 杭州翼兔网络科技有限公司 | A kind of clustering method based on industry |
CN109684480B (en) * | 2018-12-30 | 2021-01-05 | 北京人民在线网络有限公司 | Industry-based clustering method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kalmegh | Analysis of weka data mining algorithm reptree, simple cart and randomtree for classification of indian news | |
US9542477B2 (en) | Method of automated discovery of topics relatedness | |
De Battisti et al. | A decade of research in statistics: A topic model approach | |
Roll et al. | Using machine learning to disentangle homonyms in large text corpora | |
US20090327259A1 (en) | Automatic concept clustering | |
KR102334236B1 (en) | Method and application of meaningful keyword extraction from speech-converted text data | |
Özdağoğlu et al. | Topic modelling-based decision framework for analysing digital voice of the customer | |
KR102334255B1 (en) | Text data collection platform construction and integrated management method for AI-based voice service | |
Kandylas et al. | Analyzing knowledge communities using foreground and background clusters | |
Ebadi et al. | Application of machine learning techniques to assess the trends and alignment of the funded research output | |
US9230210B2 (en) | Information processing apparatus and method for obtaining a knowledge item based on relation information and an attribute of the relation | |
KR20160136014A (en) | Method and system for topic clustering of big data | |
Nashipudimath et al. | An efficient integration and indexing method based on feature patterns and semantic analysis for big data | |
Ramathulasi et al. | Augmented latent Dirichlet allocation model via word embedded clusters for mashup service clustering | |
WO2020095357A1 (en) | Search needs assessment device, search needs assessment system, and search needs assessment method | |
KR101055363B1 (en) | Apparatus and method for providing search information based on multiple resource | |
Ashoori et al. | Using clustering methods for identifying blood donors behavior | |
Hess et al. | C-salt: Mining class-specific alterations in boolean matrix factorization | |
Sang et al. | Faceted subtopic retrieval: Exploiting the topic hierarchy via a multi-modal framework | |
WO2018086518A1 (en) | Method and device for real-time detection of new subject | |
US11822609B2 (en) | Prediction of future prominence attributes in data set | |
Ma et al. | API prober–a tool for analyzing web API features and clustering web APIs | |
McGee et al. | Towards visual analytics of multilayer graphs for digital cultural heritage | |
Suresh et al. | A fuzzy based hybrid hierarchical clustering model for twitter sentiment analysis | |
Abinaya et al. | Effective Feature Selection For High Dimensional Data using Fast Algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E601 | Decision to refuse application |