KR20150057497A - Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach - Google Patents

Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach Download PDF

Info

Publication number
KR20150057497A
Authority
KR
South Korea
Prior art keywords
cluster
word
topic
similarity
document
Prior art date
Application number
KR1020130140913A
Other languages
Korean (ko)
Inventor
김한준
현만
Original Assignee
서울시립대학교 산학협력단 (University of Seoul Industry Cooperation Foundation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서울시립대학교 산학협력단
Priority to KR1020130140913A
Publication of KR20150057497A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for detecting the topics of on-line text documents based on a hierarchical tree. More specifically, the set of main keywords present in a collection of big-data web documents is extracted and organized into a hierarchical tree: the main keywords are extracted from the document collection in near real time, and a topic tree is generated by automatically computing the hierarchical relations among those keywords.

Description

[0001] The present invention relates to a hierarchical tree-based topic detection method and system for on-line text documents.

More particularly, the present invention relates to a method for detecting topics in big data such as on-line documents by building a topic tree over the medoid-based clusters newly proposed in the present invention.

The present invention was derived from research conducted as part of the Basic Research Program (Mid-career Researcher Program, Core Research/Individual) of the Ministry of Science, ICT and Future Planning and the National Research Foundation of Korea [grant number: 2013R1A2A2A01017030, title: A Study on a Large-Scale Text Mining Framework Based on Semantic Text Cuboids].

The modern Internet holds a great deal of information on a wide variety of subjects. A user can retrieve this information by issuing queries to a search engine. However, it is not easy to formulate a query for each information need and then quickly find the desired information among a large number of results. To date, the most common way to organize large amounts of document information is to classify and index it hierarchically by topic. Korean Patent No. 10-0902673, entitled "Document Clustering Based Document Retrieval Service Providing Method and System," has been proposed as an example of a system that builds a conventional topic tree. This prior art provides a document retrieval service based on title clustering: clusters are formed on the basis of titles, and the data is classified by mapping the clusters onto topics.

However, this clustering technique supports only a flat (linear) form of topic-keyword search and maps clusters onto the categories of an existing directory structure, so it cannot easily and efficiently classify the enormous volume of web documents on the Internet.

Most information systems require considerable effort to build document indexes over such a hierarchically classified topic tree. In addition, because the classified topic tree treats each document as independent information, even documents with identical or extremely similar contents are handled separately, so additional effort is needed to improve the efficiency of searching and browsing information.

Korean Patent No. 10-0902673 (registered on June 5, 2009)

SUMMARY OF THE INVENTION The present invention has been made to solve the problems of the prior art described above. An object of the present invention is to extract the main keywords in real time from an incoming document set, automatically compute their hierarchical relations, and thereby automatically identify the hierarchical parent-child relationships among topics.

According to an aspect of the present invention, there is provided a hierarchical tree-based topic detection method for on-line text documents, the method including: calculating the mutual similarity between the documents included in a cluster; setting the document whose summed distance to the other documents is smallest as the medoid; calculating the similarity between a newly received document and the medoid; merging a received document whose distance to the medoid is below a threshold value into the cluster to which the medoid belongs; extracting topic words from the cluster to generate a topic tree of the cluster; and generating a topic tree that establishes the hierarchical relations between the topic words using probability information computed over the cluster.

In this case, when a received document is merged into the cluster, the method may further include calculating the similarity between the documents included in the merged cluster and resetting the medoid.

The method may further include generating the received document as an independent cluster if the distance between the received document and the medoid is greater than or equal to the threshold value.

In addition, the topic tree may include a hierarchical relationship of the topic words extracted from the generated cluster or the independent cluster.

The topic word extraction may include extracting topic words using at least one of the frequency of a word appearing in the cluster, the frequency of documents containing the word, and the number of clusters containing the word.

In addition, the topic word extraction may use the following CTF-CDF-ICF expression.

CTF-CDF-ICF(t, c) = [tc(t, c) / Σ_{t'} tc(t', c)] × [dc(t, c) / D] × log(C / cc(t))

(Here tc(t, c) is the occurrence frequency of the word t in one cluster c, and Σ_{t'} tc(t', c) is the total number of occurrences of all words in c. dc(t, c) is the number of documents in cluster c that contain the word t, and D is the total number of documents included in c. cc(t) is the number of clusters in which the word t appears, and C is the total number of clusters. The words with the largest CTF-CDF-ICF values are extracted as the topic words.)

In addition, the probability information may be calculated using the following expressions:

P(x|y): the probability that the word x occurs in the set of documents in which the word y occurs

w'(t) = 1 + d × (w(t) − w_min) / (w_max − w_min)

(Here w(t) is the CTF-CDF-ICF score of the word t, w_max and w_min are the largest and smallest scores among W, the list of topic words extracted by CTF-CDF-ICF, and d is a constant.)

In addition, when w'(x) × P(x|y) ≥ w'(y) × P(y|x) is satisfied, the word x becomes the parent of the word y, and the method may include constructing a hierarchical topic tree in which x includes y.

A hierarchical tree-based topic detection system for on-line text documents according to an embodiment of the present invention includes: a similarity calculation unit for calculating the mutual similarity between the documents included in a cluster; a medoid setting unit for setting the document having the smallest summed distance as the medoid; a medoid calculation unit for calculating the similarity between a newly received document and the medoid; a cluster merging unit for merging a received document whose distance to the medoid is below a threshold value into the cluster to which the medoid belongs; and a topic tree generation unit for extracting topic words from the cluster and generating a topic tree of the cluster, wherein the topic tree generation unit builds the hierarchical relations between the topic words using probability information computed over the cluster.

In this case, when a received document is merged into the cluster, the cluster merging unit may further calculate the similarity between the documents included in the merged cluster and reset the medoid.

The cluster merging unit may further generate the received document as an independent cluster if the distance between the received document and the medoid is greater than or equal to the threshold value.

In addition, the topic tree generation unit may include a hierarchical relationship of the topic words extracted from the generated cluster or the independent cluster.

The topic tree generation unit may extract the topic words using at least one of the frequency of a word appearing in the cluster, the frequency of documents containing the word, and the number of clusters containing the word.

In addition, the topic tree generation unit may use the following CTF-CDF-ICF expression.

CTF-CDF-ICF(t, c) = [tc(t, c) / Σ_{t'} tc(t', c)] × [dc(t, c) / D] × log(C / cc(t))

(Here tc(t, c) is the occurrence frequency of the word t in one cluster c, and Σ_{t'} tc(t', c) is the total number of occurrences of all words in c. dc(t, c) is the number of documents in cluster c that contain the word t, and D is the total number of documents included in c. cc(t) is the number of clusters in which the word t appears, and C is the total number of clusters. The words with the largest CTF-CDF-ICF values are extracted as the topic words.)

In addition, the probability information may be calculated using the following expressions:

P(x|y): the probability that the word x occurs in the set of documents in which the word y occurs

w'(t) = 1 + d × (w(t) − w_min) / (w_max − w_min)

(Here w(t) is the CTF-CDF-ICF score of the word t, w_max and w_min are the largest and smallest scores among W, the list of topic words extracted by CTF-CDF-ICF, and d is a constant.)

In addition, when w'(x) × P(x|y) ≥ w'(y) × P(y|x) is satisfied, the word x becomes the parent of the word y, and the system may construct a hierarchical topic tree in which x includes y.

According to the hierarchical tree-based topic detection method for on-line text documents of the present invention, the set of main keywords present in a collection of big-data web documents can be extracted and organized into a hierarchical tree: the main keywords are extracted automatically from the large-scale document collection, and the topic tree is generated by automatically computing their hierarchical relations.

FIG. 1 is a diagram illustrating the topic tree generation process of a medoid-based cluster.
FIG. 2 is a diagram showing an example of setting a threshold value in a cluster.
FIG. 3 is a diagram illustrating a topic tree generation system of a medoid-based cluster.
FIG. 4 is a diagram showing a topic tree generated by a conventional topic tree generation technique.
FIG. 5 is a diagram showing a topic tree generated by the proposed topic tree generation technique.

Other objects and features of the present invention will become apparent from the following description of embodiments with reference to the accompanying drawings.

Preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear.

However, the present invention is neither limited to nor restricted by these embodiments. Like reference symbols in the drawings denote like elements.

Hereinafter, a subject search method according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 shows the topic tree generation process of a medoid-based cluster.

Referring to FIG. 1, the similarity between the documents existing in a cluster is calculated (S100). Here, the similarity between documents is given by the distance between them: the shorter the distance between two documents, the greater their similarity. The present invention adopts the concept of a medoid to capture this similarity. The medoid is the document that best describes the characteristics of a cluster, and every cluster has a medoid: among the documents included in the cluster, it is the one closest to the center, i.e., the document whose distance to the other documents is smallest. To select the medoid that represents the cluster, each document is vectorized, the distances are calculated, and the vector nearest to the others is chosen. The distance between document vectors is the usual Euclidean distance.

dist((a1, b1, c1, d1), (a2, b2, c2, d2)) = √[(a1 − a2)² + (b1 − b2)² + (c1 − c2)² + (d1 − d2)²]

Here (a1, b1, c1, d1) and (a2, b2, c2, d2) are the word-count vectors of two documents. For example, suppose the total word set is {a, b, c, d}, document x contains the words a, a, a, b, and document y contains the words c, c, d, d, d. To calculate the distance between the documents, each is expressed as an n-dimensional vector, where the number of dimensions is the number of words, here 4. In other words, each word is treated as a dimension, and a vector is obtained by taking the number of occurrences of each word as the component in that dimension of the n-dimensional space. Document x, with three a's and one b, becomes (3, 1, 0, 0); document y, with two c's and three d's, becomes (0, 0, 2, 3).
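The vectorization and distance computation just described can be sketched as follows (a minimal illustration; the variable names are not from the patent):

```python
import math

# Word set {a, b, c, d} -> a 4-dimensional space; each document becomes the
# vector of its word counts (the two example documents from the text above).
doc_x = [3, 1, 0, 0]  # three a's and one b
doc_y = [0, 0, 2, 3]  # two c's and three d's

def euclidean(u, v):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

print(round(euclidean(doc_x, doc_y), 3))  # sqrt(9 + 1 + 4 + 9) = sqrt(23)
```

Any vocabulary can be handled the same way by fixing a word order once and counting occurrences per document.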

The medoid is the actual vector closest to the midpoint of the n-dimensional vectors; in other words, its summed distance to the other vectors is minimal.

The similarity between the documents included in the cluster is calculated by the above process (S100).

In step S110, the document whose summed distance is smallest is set as the medoid.

In the prior art, a mean-based approach using average values was used. However, the mean vector of a mean-based approach is susceptible to noise, because the average may be a virtual point that does not correspond to any actual document vector, and erroneous clustering caused by noise can make the cluster collapse. In contrast, since the medoid used in the present invention is an actual vector value that is closest to the other vectors, even if it is not the exact center, noise is less likely to be introduced into the clustering process.

New documents are received in real time, and a received document similar to an existing cluster is merged into that cluster, which enables more robust clustering. The similarity between the received document and the medoid is therefore calculated (S120). If the distance between the received document and the medoid of a cluster is less than the threshold value (S130), the received document is merged into the cluster to which the medoid belongs (S140): a received document closer than the threshold is considered close to the medoid and joins that cluster. The threshold value is set as an arbitrary similarity value among the results obtained when the mutual similarities between the documents were calculated. The reason for setting the threshold is to decide whether to merge a received document into a cluster or let it become a new independent cluster, that is, to judge how far the received document is from the cluster. A received document at or beyond the threshold is generated as an independent cluster (S150); since the distance between the received document and the cluster is judged to be large, the document is made an independent cluster, creating room to merge with documents received later and forming a more systematic clustering. In a merged cluster, the documents of the existing cluster and the new document coexist, so the similarities between them are recalculated and the medoid is reset. In this way, whenever a new document arrives, repeating the process of computing its similarity to the reset medoids and either merging it or creating an independent cluster yields more robust clustering.
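Steps S100 through S150 can be sketched as below. This is a simplified illustration, not the patent's implementation: it uses one global threshold where the patent sets a per-cluster threshold, and all helper names are hypothetical.

```python
import math

def dist(u, v):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def medoid(cluster):
    # The medoid is the actual document whose summed distance to the
    # other documents in the cluster is smallest (S100-S110).
    return min(cluster, key=lambda d: sum(dist(d, o) for o in cluster))

def receive(clusters, doc, threshold):
    """Merge doc into a cluster whose medoid is nearer than threshold,
    otherwise start an independent cluster (S120-S150). The medoid is
    implicitly reset because it is recomputed on the merged cluster."""
    for c in clusters:
        if dist(medoid(c), doc) < threshold:
            c.append(doc)          # merge into the medoid's cluster (S140)
            return
    clusters.append([doc])         # independent cluster (S150)

clusters = [[[3, 1, 0, 0], [2, 1, 0, 0]]]
receive(clusters, [3, 0, 0, 0], threshold=2.0)   # near the first medoid
receive(clusters, [0, 0, 2, 3], threshold=2.0)   # far from every medoid
print(len(clusters), len(clusters[0]))
```

Recomputing the medoid on every arrival is quadratic per cluster; an implementation could cache summed distances, but that optimization is outside the sketch.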

FIG. 2 is a diagram showing an embodiment of setting the threshold value.

Referring to FIG. 2, the similarities between the documents 210 existing in the cluster 200 are calculated, and an arbitrary value among the resulting similarities is set as the threshold value 220. The initial threshold value is thus set by the initial cluster, and the threshold value is unique to each cluster.

Generally, the topic words of a cluster appear frequently in that cluster. However, rarely appearing words may have a special meaning, or they may simply be words of general meaning. The present invention proposes three measures for extracting topic words: CTF (Cluster Term Frequency), CDF (Cluster Document Frequency), and ICF (Inverse Cluster Frequency). These are used to extract the key words that best describe each cluster and to create a topic tree, the parent-child structure over words that is the ultimate goal of the present invention.

Cluster term frequency (CTF) is the frequency with which a given term appears in a cluster. It is calculated as follows:

CTF(t, c) = tc(t, c) / Σ_{t'} tc(t', c)

where tc(t, c) is the frequency of occurrence of the term t in cluster c, and Σ_{t'} tc(t', c) is the sum of the occurrences of all terms existing in cluster c.

The frequency of document appearance in a cluster, the cluster document frequency (CDF), also needs to be considered. It is the proportion of documents in the cluster that contain the given term, calculated as follows:

CDF(t, c) = dc(t, c) / D

where dc(t, c) is the number of documents in cluster c that contain the term t, and D is the total number of documents in cluster c.

Finally, the number of clusters in which a given term occurs, the inverse cluster frequency (ICF), should be considered. It shows how widely the given term is used across the entire set of clusters. To make the measure discriminative, the inverse cluster frequency is normalized logarithmically:

ICF(t) = log(C / cc(t))

where cc(t) is the number of clusters in which the term t appears, and C is the total number of clusters.

As described above, the cluster term frequency and the cluster document frequency capture how often the given term appears within a cluster, while the inverse cluster frequency captures in how many clusters the term is used. The present invention combines the three measures into CTF-CDF-ICF to strengthen the extraction of topic words that can represent a cluster:

CTF-CDF-ICF(t, c) = CTF(t, c) × CDF(t, c) × ICF(t)
The candidate topic words are sorted by their CTF-CDF-ICF scores, and the highest-scoring candidates are extracted as the topic words. The number of topic words extracted varies because the number of documents in each cluster differs: a small number of words is enough to represent a cluster with few documents, while a cluster with many documents needs more words. Moreover, even equally important words are affected by the frequencies with which documents and words appear, so the extraction result can vary with appearance frequency. The scores are therefore normalized so that topic words are extracted under the same conditions.
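The combined measure can be sketched as below. This is an illustrative reading, not the patent's code: the logarithmic form log(C / cc(t)) for ICF is assumed from the stated need to normalize the inverse cluster frequency, and the toy clusters are invented.

```python
import math
from collections import Counter

def ctf_cdf_icf(term, cluster, all_clusters):
    """Score = CTF x CDF x ICF for a term that occurs in >= 1 cluster.
    A cluster is a list of documents; a document is a list of words."""
    counts = Counter(w for doc in cluster for w in doc)
    ctf = counts[term] / sum(counts.values())                 # term freq in cluster
    cdf = sum(term in doc for doc in cluster) / len(cluster)  # doc freq in cluster
    cc = sum(any(term in doc for doc in c) for c in all_clusters)
    icf = math.log(len(all_clusters) / cc)                    # inverse cluster freq
    return ctf * cdf * icf

c1 = [["stock", "price", "stock"], ["stock", "market"]]
c2 = [["goal", "match"], ["match", "team"]]
clusters = [c1, c2]
# "stock" is frequent in c1, appears in all its documents, and occurs
# in no other cluster, so it outranks the rarer word "market".
score = ctf_cdf_icf("stock", c1, clusters)
print(score > ctf_cdf_icf("market", c1, clusters))
```

Sorting a cluster's vocabulary by this score and keeping the top entries yields the topic-word candidates described above.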

To construct a topic tree, a generic method of building a conceptual hierarchy over words can be used. In this common approach, for two topic words x and y, if P(x|y) ≥ 0.8 and P(y|x) < 1, then x subsumes y. Here P(x|y) is the probability that x occurs in the set of documents in which y occurs. In that case x is a relatively more general topic word than y, so in the topic tree x can be the parent word of y, and y can be a child word of x.

The above method creates a hierarchical topic tree by establishing subordinate relations between words. The existing method is effective and simple, but it constructs the topic tree without grasping the meaning of the topic words or their ability to describe the clusters. The present invention therefore proposes the following parameters to take into account how well a topic word describes its cluster.

The method proposed in the present invention considers, for two topic words x and y, the weights w'(x) and w'(y): if w'(x) × P(x|y) ≥ w'(y) × P(y|x), then x subsumes y. Here w'(x) and w'(y) are measured by CTF-CDF-ICF and indicate how well each topic word can describe the cluster, and P(x|y) is the probability that x occurs in the documents in which y occurs. In addition, experiments on various data carried out in the present invention determined the probability condition P(x|y) ≥ 0.8.

In using CTF-CDF-ICF to construct the topic tree, it is necessary to consider that the range of CTF-CDF-ICF values differs for each cluster. There are different numbers of words in each cluster, and the ranges for calculating CTF-CDF-ICF differ accordingly. Therefore, it is necessary to normalize CTF-CDF-ICF as follows.

w'(t) = 1 + d × (w(t) − w_min) / (w_max − w_min)

Here w(t) is the CTF-CDF-ICF score of the topic word t from W, the cluster's list of topic words derived from CTF-CDF-ICF; w_max is the largest CTF-CDF-ICF value in the cluster, and w_min is the smallest. Through the above equation, all CTF-CDF-ICF values in the cluster are mapped into the interval from 1 to (1 + d), where d is a positive number that can be set as intended. The equation corrects for the different calculation ranges of CTF-CDF-ICF across clusters and serves as an index of how well each topic word depicts its cluster under the same conditions for all topic words: the larger the value, the better the description.

When w'(x) × P(x|y) ≥ w'(y) × P(y|x) holds for the normalized weights, x becomes the parent of y, and a topic tree is generated in which the upper layer subsumes the lower layer. An upper-layer topic word describes the cluster better than a lower-layer one, so it can include one or more lower layers; each lower layer can in turn include one or more layers below it. This is repeated to create the topic tree. In the process, topic words that cannot be linked above or below any other word may arise; these form independent topic trees, so one or more topic trees can be created within a single cluster. The construction proposed by the present invention thus builds the topic tree by identifying the parent-child relations between the topic words and repeating the process (S160).

FIG. 3 is a diagram illustrating a medoid-based topic tree generation system.

Referring to FIG. 3, the similarity calculation unit 300 calculates the degree of similarity between documents included in the cluster.

At this time, the similarity is the same as in step S100 described above, so a redundant description is omitted. The medoid setting unit 310 sets the document having the smallest summed distance, as calculated by the similarity calculation unit 300, as the medoid. Here the medoid is the document closest to the others, as in step S110; the distance between documents is the Euclidean distance, which has been described in detail above and is therefore omitted. The medoid calculation unit 320 calculates the similarity between a received document 330 and the established medoid. The cluster merging unit 340 merges the received document 330 into the cluster to which the medoid belongs if the distance calculated by the medoid calculation unit 320 is less than the threshold value; a received document 330 at or beyond the threshold is generated as an independent cluster, as in step S150. The threshold value is set as an arbitrary similarity value among the results of calculating the mutual similarities between the documents. When a received document and a cluster are merged, the cluster merging unit 340 recalculates the similarities between the documents included in the merged cluster and resets the medoid. The topic tree generation unit 350 extracts topic words from the cluster merged by the cluster merging unit 340, or from an independent cluster, using the following CTF-CDF-ICF:

CTF-CDF-ICF(t, c) = [tc(t, c) / Σ_{t'} tc(t', c)] × [dc(t, c) / D] × log(C / cc(t))

The extraction sorts the candidate topic words by their CTF-CDF-ICF scores and extracts the highest-scoring candidates as the topic words.

For two topic words x and y extracted by CTF-CDF-ICF, if w'(x) × P(x|y) ≥ w'(y) × P(y|x), then x subsumes y. Considering that the ranges of the CTF-CDF-ICF values differ between clusters, the topic tree is constructed after normalizing the values in the following manner:

w'(t) = 1 + d × (w(t) − w_min) / (w_max − w_min)

Here P(x|y) is the probability that x occurs in the set of documents in which y occurs; W is the cluster's list of topic words derived from CTF-CDF-ICF; w_max is the largest CTF-CDF-ICF value in the cluster, and w_min is the smallest. All CTF-CDF-ICF values in the cluster are mapped into the interval from 1 to (1 + d) through the above equation, where d is a positive number that can be set as intended. The generation of the topic tree using the above equation is the same as in step S160 described above.

To evaluate the proposed topic tree construction method, two data sets were prepared, one from Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/) and one from Google News US (news.google.com). The first data set, Dataset-1, contains 1,000 randomly selected documents from Reuters-21578; the second, Dataset-2, is a collection of 1,000 documents on business, election, technology, entertainment, sports, science, and health gathered from August 12, 2012 through September 24, 2012.

The present invention automatically constructs a topic detection system that grasps the contents of documents and then generates a topic tree. The system can be written in Java, and term-processing steps such as stop-word elimination and morphological analysis can be handled by the R tm package. To evaluate the accuracy of a generated topic tree, each relation is checked for inclusion in the correct topic tree, and the accuracy is calculated as follows:

Accuracy = (number of correct parent-child relations in the generated topic tree) / (total number of parent-child relations in the generated topic tree)


Table 1. Accuracy

Dataset      Baseline    Proposed
Dataset-1    0.743       0.875
Dataset-2    0.769       0.893

Table 1 shows the accuracy of the subordinate relations in the topic trees generated by the conventional method and by the method proposed in the present invention. Accuracy increased by 17.8% on Dataset-1 (the Reuters-21578 data set) and by 16.1% on Dataset-2 (the Google News data set). This confirms that the proposed method not only generates the topic tree logically but also improves the accuracy of the subordinate relations.

FIG. 4 shows part of a topic tree generated by the conventional baseline system, and FIG. 5 shows part of a topic tree generated by the method proposed in the present invention. Both figures depict the main contents, but the method proposed in the present invention generates a more accurate topic tree than the prior art. Referring to FIG. 4, six topic trees are displayed, but an incorrect relationship appears between "win" and "britain". FIG. 5, by contrast, shows that the relations between the main words are captured and displayed within a single topic tree. These embodiments thus show that the topic detection method presented in the present invention yields more accurate results.

The hierarchical tree-based topic detection method for on-line text documents according to an embodiment of the present invention may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be those specially designed and configured for the present invention or those known to persons skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

As described above, the present invention has been described with reference to particular embodiments, specific elements, and drawings. However, it should be understood that the present invention is not limited to the above embodiments, and various modifications and changes may be made by those skilled in the art to which the present invention pertains.

Accordingly, the spirit of the present invention should not be construed as limited to the embodiments described, and not only the following claims but also all equivalents thereof belong to the scope of the present invention.

Claims (16)

A hierarchical tree-based topic search method for online text documents, comprising:
calculating mutual similarities between documents included in a cluster;
setting, as a medoid, the document for which the sum of the calculated similarities is smallest;
calculating the similarity between a newly received document and the medoid;
merging the received document into the cluster to which the medoid belongs when the calculated similarity value is lower than a threshold value;
extracting topic words from the cluster; and
generating a topic tree that constructs a hierarchical relationship between the topic words using probability information on the occurrence of the topic words in the cluster.
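The clustering steps of claim 1 can be illustrated with a short sketch. This is not the patented implementation: the document representation (term-frequency vectors), the cosine-based dissimilarity, and the threshold value are all assumptions, and the names `Cluster`, `receive`, and `THRESHOLD` are illustrative.

```python
import math
from collections import Counter

THRESHOLD = 0.5  # assumed merge threshold; the claims leave the value open

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Cluster:
    def __init__(self, doc):
        self.docs = [doc]
        self.medoid = doc

    def reset_medoid(self):
        # Per the claim, the medoid is the document whose summed
        # dissimilarity (here 1 - cosine) to the other documents is smallest.
        self.medoid = min(
            self.docs,
            key=lambda d: sum(1.0 - cosine(d, o) for o in self.docs),
        )

def receive(clusters, doc):
    """Merge an incoming document into the closest cluster, or start a new one."""
    if clusters:
        best = min(clusters, key=lambda c: 1.0 - cosine(doc, c.medoid))
        if 1.0 - cosine(doc, best.medoid) < THRESHOLD:
            best.docs.append(doc)
            best.reset_medoid()  # recompute similarities and reset the medoid
            return
    # a sufficiently dissimilar document forms an independent cluster (claim 3)
    clusters.append(Cluster(doc))
```

Each incoming document is compared only against cluster medoids, so the online step stays cheap; the medoid itself is recomputed only when its cluster changes.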
The method according to claim 1,
further comprising, when the received document is merged into the cluster, recalculating the similarities between the documents included in the merged cluster and resetting the medoid.
The method according to claim 1,
further comprising generating the received document as an independent cluster when the similarity value between the received document and the medoid is greater than or equal to the threshold value.
The method according to claim 1,
wherein the topic tree represents a hierarchical relationship of the topic words extracted from the merged cluster or from an independent cluster.
The method according to claim 1,
wherein the topic words are extracted using at least one of the frequency of a word appearing in the cluster, the frequency of documents including the word, and the number of clusters including the word.
The method according to claim 5,
wherein the topic words are extracted using the following CTF-CDF-ICF expression:
Figure pat00091

(Here, tc(t, c) is the occurrence frequency of the word t contained in one cluster c,
Figure pat00092
is the frequency of documents containing the word t in the cluster c, and D is the total number of documents included in the cluster c. Also, cc(t) is the number of clusters in which the word t appears, and C is the total number of clusters. The word for which
Figure pat00093
has the highest value is extracted as the topic word.)
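The exact CTF-CDF-ICF expression appears only as an image in this publication, so the following is one plausible reading, not the claimed formula: it multiplies the three quantities the claim names, the cluster term frequency tc(t, c), the cluster document frequency dc(t, c) normalized by D, and an inverse cluster frequency built from cc(t) and C. The smoothing term and all function names are assumptions.

```python
import math

def ctf_cdf_icf(t, cluster_docs, all_clusters):
    """Hypothetical CTF-CDF-ICF score of word t for one cluster.

    cluster_docs: list of documents (each a list of tokens) in the cluster.
    all_clusters: list of all clusters, each a list of token-list documents.
    """
    tc = sum(doc.count(t) for doc in cluster_docs)       # term freq in cluster
    dc = sum(1 for doc in cluster_docs if t in doc)      # docs containing t
    D = len(cluster_docs)                                # docs in the cluster
    cc = sum(1 for c in all_clusters if any(t in d for d in c))  # clusters with t
    C = len(all_clusters)                                # total clusters
    if cc == 0:
        return 0.0
    # +1 smoothing keeps the log positive when t appears in every cluster
    return tc * (dc / D) * math.log((C + 1) / cc)

def topic_words(cluster_docs, all_clusters, k=3):
    """The k words with the highest score are taken as the cluster's topic words."""
    vocab = {t for doc in cluster_docs for t in doc}
    return sorted(vocab,
                  key=lambda t: ctf_cdf_icf(t, cluster_docs, all_clusters),
                  reverse=True)[:k]
```

The intent mirrors TF-IDF at the cluster level: words frequent within a cluster but rare across clusters score highest and become topic-word candidates.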
The method according to claim 1,
wherein the probability information is calculated using the following expression:
Figure pat00094
when
Figure pat00095

Figure pat00096

(Here,
Figure pat00097
is the probability that, in the set of documents in which the word
Figure pat00098
occurs, the word
Figure pat00099
also occurs; W is the occurrence probability
Figure pat00100
;
Figure pat00101
is the list of topic words extracted by CTF-CDF-ICF; and d is a constant.)
The method according to claim 7,
wherein, in the expression, when
Figure pat00102

Figure pat00103
is satisfied,
Figure pat00104
becomes a higher-level node than
Figure pat00105
in the topic tree.
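The probability expression of claims 7 and 8 is likewise only an image here, so the following sketch substitutes a common subsumption-style rule as an illustrative stand-in, not the claimed formula: word x is placed above word y in the topic tree when x appears in most documents that contain y, while y does not do the same for x. The threshold value 0.8 and all names are assumptions.

```python
def cooccur_prob(x, y, docs):
    """P(x | y): fraction of documents containing y that also contain x.

    docs: list of documents, each a set of words.
    """
    with_y = [d for d in docs if y in d]
    if not with_y:
        return 0.0
    return sum(1 for d in with_y if x in d) / len(with_y)

def build_hierarchy(topic_words, docs, threshold=0.8):
    """Return {child: parent} edges among the extracted topic words."""
    parent = {}
    for x in topic_words:
        for y in topic_words:
            if x == y:
                continue
            # x subsumes y: x co-occurs with nearly every y, but not vice versa,
            # so x is placed as the higher-level node in the topic tree
            if (cooccur_prob(x, y, docs) >= threshold
                    and cooccur_prob(y, x, docs) < threshold):
                parent[y] = x
    return parent
```

The asymmetry of the two conditional probabilities is what encodes the hierarchy: a broad topic word co-occurs with its narrower sub-topic words far more often than the reverse.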
A hierarchical tree-based topic search system for online text documents, comprising:
a similarity calculation unit for calculating mutual similarities between documents included in a cluster;
a medoid setting unit for setting, as a medoid, the document for which the sum of the calculated similarities is smallest;
a medoid calculation unit for calculating the similarity between a newly received document and the medoid;
a cluster merging unit for merging the received document into the cluster to which the medoid belongs when the calculated similarity value is lower than a threshold value; and
a topic tree generation unit for extracting topic words from the cluster and generating a topic tree of the cluster,
wherein the topic tree generation unit constructs a hierarchical relationship between the topic words using probability information on the occurrence of the topic words in the cluster.
The system according to claim 9,
wherein the cluster merging unit, when the received document is merged into the cluster, recalculates the similarities between the documents included in the merged cluster and resets the medoid.
The system according to claim 9,
wherein the cluster merging unit generates the received document as an independent cluster when the similarity value between the received document and the medoid is greater than or equal to the threshold value.
The system according to claim 9,
wherein the topic tree generated by the topic tree generation unit represents a hierarchical relationship of the topic words extracted from the merged cluster or from an independent cluster.
The system according to claim 9,
wherein the topic tree generation unit extracts the topic words using at least one of the frequency of a word appearing in the cluster, the frequency of documents including the word, and the number of clusters including the word.
The system according to claim 13,
wherein the topic tree generation unit extracts the topic words using the following CTF-CDF-ICF expression:
Figure pat00106

(Here, tc(t, c) is the occurrence frequency of the word t contained in one cluster c,
Figure pat00107
is the frequency of documents containing the word t in the cluster c, and D is the total number of documents included in the cluster c. Also, cc(t) is the number of clusters in which the word t appears, and C is the total number of clusters. The word for which
Figure pat00108
has the highest value is extracted as the topic word.)
The system according to claim 9,
wherein the probability information is calculated using the following expression:
Figure pat00109
when
Figure pat00110

Figure pat00111

(Here,
Figure pat00112
is the probability that, in the set of documents in which the word
Figure pat00113
occurs, the word
Figure pat00114
also occurs; W is the occurrence probability
Figure pat00115
;
Figure pat00116
is the list of topic words extracted by CTF-CDF-ICF; and d is a constant.)
The system according to claim 15,
wherein, in the expression, when
Figure pat00117

Figure pat00118
is satisfied,
Figure pat00119
becomes a higher-level node than
Figure pat00120
in the topic tree.
KR1020130140913A 2013-11-19 2013-11-19 Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach KR20150057497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020130140913A KR20150057497A (en) 2013-11-19 2013-11-19 Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020130140913A KR20150057497A (en) 2013-11-19 2013-11-19 Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach

Publications (1)

Publication Number Publication Date
KR20150057497A true KR20150057497A (en) 2015-05-28

Family

ID=53392345

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020130140913A KR20150057497A (en) 2013-11-19 2013-11-19 Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach

Country Status (1)

Country Link
KR (1) KR20150057497A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101706300B1 (en) 2015-10-13 2017-02-14 포항공과대학교 산학협력단 Apparatus and method for generating word hierarchy of technology terms
CN111581355A (en) * 2020-05-13 2020-08-25 杭州安恒信息技术股份有限公司 Method, device and computer storage medium for detecting subject of threat intelligence
CN111581355B (en) * 2020-05-13 2023-07-25 杭州安恒信息技术股份有限公司 Threat information topic detection method, device and computer storage medium


Legal Events

Date Code Title Description
E902 Notification of reason for refusal
E601 Decision to refuse application