KR20150057497A

KR20150057497A - Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach

Info

Publication number: KR20150057497A
Application number: KR1020130140913A
Authority: KR
Inventors: 김한준; 현만
Original assignee: 서울시립대학교 산학협력단
Priority date: 2013-11-19
Filing date: 2013-11-19
Publication date: 2015-05-28

Abstract

The present invention relates to a method for searching a subject of an on-line text document based on a hierarchical tree. More specifically, a group of main key words existing in a group of big data web documents can be extracted and be constructed into a hierarchical tree; the main key words are extracted from the document group in almost real-time; and a subject tree is generated by automatically calculating hierarchical relation among the key words.

Description

[0001] The present invention relates to a hierarchical tree-based topic search method and system for online text documents,

The present invention relates to a method for detecting a subject in big data such as an online document, and more particularly, to detecting a subject through a new medium-based cluster proposed in the present invention based on a subject tree.

The present invention has been derived from research conducted as part of basic research projects (major category), middle researcher support projects (subdivision), and core research-individual (small dispute) of the future creation science department and the Korea Research Foundation [assignment number: 2013R1A2A2A01017030 , Title: Semantic Text Based on Queuboid - A Study on Large - Scale Text Mining Framework].

The modern Internet has a lot of information about various subjects. Therefore, the user can acquire the information using the search engine to retrieve the query information in order to obtain the information desired by the user. However, it is not easy to build a query language for the information each time the user desires and quickly find the desired information in a large number of results. To date, the most common way to organize large amounts of document information is to classify and index them hierarchically by topic. Korean Patent No. 10-0902673 entitled " Document Clustering Based Document Retrieval Service Providing Method and System "has been proposed as an example of a system constituting a conventional subject tree. In the prior art, a document search service based on title clustering is provided. Clusters are formed on the basis of a title, and the data is classified by mapping them on a subject basis.

Since the clustering technique has a disadvantage in that it supports only a linear form for the subject keyword search and maps the clustering technique to a category of the existing directory structure, it is possible to efficiently transmit a large amount of web documents on the Internet, It is not easy to classify.

Most information systems require considerable effort to build document indexes through this hierarchically classified topic tree. In addition, since the document information of the classified topic tree provides independent information and treats it as a separate information even if the document has the same or extremely similar contents, additional efforts to improve the efficiency in searching and browsing information .

Korean Patent No. 10-0902673 (registered on June 5, 2009)

SUMMARY OF THE INVENTION The present invention has been made to solve the problems of the prior art as described above, and it is an object of the present invention to extract a main keyword in real time from an incoming document set and automatically calculate a hierarchical relationship thereof, The objective is to automatically identify hierarchical phase relationships.

According to an aspect of the present invention, there is provided a hierarchical tree-based topic search method for an on-line text document, the method including: calculating similarity between documents included in a cluster; Setting a document having the smallest degree of similarity as a medoid; Calculating a similarity of the received document with the received document; Calculating the similarity between the received document and the median and merging the received document having a degree of similarity lower than a threshold value into a cluster to which the medoid belongs; Extracting a subject word from the cluster to generate a topic tree of the cluster; And generating a topic tree for constructing a hierarchical relationship between the topic words using the probability information generated in the cluster by the topic word.

In this case, if the merged cluster is merged into the one cluster, calculating the similarity between the documents included in the merged cluster and resetting the meridian may be performed.

The method may further include generating the received document as an independent cluster if the similarity between the received document and the method is greater than or equal to the threshold value.

In addition, the topic tree may include a hierarchical relationship of the topic words extracted from the generated cluster or the independent cluster.

The subject word extraction may include extracting a subject word using at least one of a frequency of words appearing in the cluster set, a frequency of a document containing the word, and a number of the clusters including the word can do.

In addition, the extraction of the subject word may include the use of the following CTF-CDF-ICF expression.

(Where tc (t, c) is the occurrence frequency of the word t contained in one cluster c,

(T, c) is the frequency of documents having the word t contained in the cluster c, D is the total number of documents included in one cluster c, Number. Also, cc (t) is the frequency of clusters in which the word t appears and C is the total number of clusters. At this time

The word with the largest value is extracted as the subject word)

In addition, the probability information may include what is calculated using the following equation.

when

(only,

Word

The word in the document set that occurs

, W is the probability of occurrence

Lt;

Is a list of topic words extracted from CTF-CDF-ICF, and d is a constant.)

In addition,

when

Is satisfied,

Gt;

And constructing a hierarchical tree of topics to be included.

A hierarchical tree-based topic search system for an online text document according to an embodiment of the present invention includes: a similarity calculation unit for calculating similarities between documents included in a cluster; A medoid setting unit for setting a document having the smallest value of the summed similarity as a medoid; A medoid calculation unit for calculating the similarity between the newly received received document and the medoid; A cluster merger for calculating the similarity between the received document and the median and merging the received document having a degree of similarity lower than the threshold into a cluster to which the medoid belongs; A topic tree generation unit for extracting a subject word from the cluster and generating a topic tree of the cluster; The topic tree generation unit may include building a hierarchical relationship between the topic words using probability information generated in the cluster.

In this case, the cluster merging unit may further include, when merging into the one cluster, calculating the degree of similarity between the documents included in the merged cluster and resetting the menu.

The cluster merging unit may further generate the received document as an independent cluster if the similarity degree between the received document and the mean is greater than or equal to the threshold value.

In addition, the topic tree generation unit may include a hierarchical relationship of the topic words extracted from the generated cluster or the independent cluster.

The topic tree generation unit may include extracting the topic word using at least one of a frequency of words appearing in the cluster set, a frequency of a document including the word, and a number of the clusters including the word have.

In addition, the topic tree generation unit may include using the following CTF-CDF-ICF expression.

The word with the largest value is extracted as the subject word).

when

(only,

Word

The word in the document set that occurs

, W is the probability of occurrence

Lt;

Is a list of topic words extracted from CTF-CDF-ICF, and d is a constant.)

In addition,

when

Is satisfied,

Gt;

And constructing a hierarchical tree of topics to be included.

According to the hierarchical tree-based subject search method of the online text document of the present invention, a main keyword set existing in a big data web document set can be extracted to construct a hierarchical tree of them, and a large- The subject tree can be generated by automatically extracting the main keywords and automatically calculating their hierarchical relationships.

1 is a diagram illustrating a topic tree generation process of a medode-based cluster.
2 is a diagram showing an example of setting a threshold value in a cluster.
Figure 3 is a diagram illustrating a topic tree generation system of a medoid-based cluster.
FIG. 4 is a diagram showing a result of a subject tree generated through a conventional topic tree generation technique.
FIG. 5 is a diagram illustrating a result of a subject tree generated through the proposed topic tree generation technique. Referring to FIG.

Other objects and features of the present invention will become apparent from the following description of embodiments with reference to the accompanying drawings.

Preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear.

However, the present invention is not limited to or limited by the embodiments. Like reference symbols in the drawings denote like elements.

Hereinafter, a subject search method according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 shows a topic tree generation process of a medode-based cluster.

Referring to FIG. 1, similarity between documents existing in a cluster is calculated (S100). Here, the similarity between documents indicates the distance between the document and the document. The shorter the distance between documents, the greater the similarity. The present invention proposes a concept of medoids that can determine the similarity between a document and a document. Medoid is the best description of the characteristics of a cluster, and all cluster enediyoids exist. Also, the distance between each of the documents included in the cluster and the document is the closest to the center document. The median representing the cluster calculates the distance by vectorizing the document, and selects the nearest vector as the medoid. The method of calculating the distance by vectorizing the document uses the usual euclidean distance.

Here, (a1, b1, c1, d1) and (a2, b2, c2, d2) represent words included in each document. For example, if the total word set is {a, b, c, d}, and there are words a, a, a, b in the x documents included in the cluster, d exists. To calculate the distance between documents, it can be expressed as an n-dimensional vector, and the number of dimensions is expressed as 4, the number of words. In other words, the word is assumed to be each dimension and the dimension is generated by the number of words. a vector can be obtained by taking the number of occurrences of a word as a vector factor on the n-dimensional space. In the document, there are 3 a and 1 b (3 1 0 0) in the document, and 2 documents (c) and 3 d (0 0 2 3) can be used as the vector parameters.

The mean is the actual vector with the median midpoint of the n-dimensional vector. In other words, the distance from the other vector is the minimum.

The similarity between the documents included in the cluster is calculated by the above process (S100).

In step S110, a document having the smallest similarity degree is set as a medoid.

In the prior art, a mean base using an average value was used. However, the mean vector of the mean base is susceptible to noise because virtual values that do not exist on the vector may become an average, and the cluster may collapse due to erroneous clustering due to noise. However, since the mean based on the present invention uses an actual vector value that is closest to the other vectors even if the center is not the center of the vector, noise is less likely to be included in the clustering process.

In the cluster, new documents are received in real time, and received documents similar to existing clusters are merged into one cluster, which enables more robust clustering. Therefore, the degree of similarity between the received document and the median is calculated (S120). If the similarity between the received document and the median of the cluster is less than the threshold value (S130), the received document is merged into the cluster to which the median belongs (S140). The received document with similarity less than the threshold value is considered to be close to the meridian and merged into one cluster. At this time, the threshold value is set to a threshold value as an arbitrary similarity among the results obtained by calculating the mutual similarity between documents. The reason for setting the threshold is to determine whether to merge the received document with the cluster or to allow it to become a new independent cluster, to classify received documents of good quality, I can grasp the distant extent. A received document having a threshold value or more is generated as an independent cluster set (S150). This is because it is judged that the distance between the received document and the cluster is long, so that the received document is created as an independent cluster so as to create a space for merging with the received document to be received later, and to form a more systematic clustering. In the merged cluster, the documents included in the existing cluster and the new document coexist, and the similarity between them is calculated and the medoid is reset. In this way, when a new received document is input, it is possible to perform more robust clustering by repeating the process of generating a cluster or an independent cluster by calculating the similarity between the reset menu and the new received document.

FIG. 2 is a diagram showing an embodiment relating to setting a threshold value. FIG.

Referring to FIG. 2, the degree of similarity between documents 210 existing in the cluster 200 is calculated. The calculated degree of similarity is calculated and an arbitrary value of the result of similarity is set as the threshold value 220. Thus, the initial threshold value is set by the initial cluster, and the threshold value is unique for each cluster.

Generally, clustered topic words frequently appear in clusters. However, rarely appearing words may have special meanings, or they may be words that have a general meaning. In the present invention, three methods of extracting a subject word are proposed: a CTF Cluster Term Frequency (CDF Cluster Document Frequency) and an ICF Inverse Cluster Frequency (ICF Inverse Cluster Frequency). This is used as a means to extract key words that best describe the cluster in the cluster and to create a topic tree that is a top-down relationship between words, the ultimate goal of the present invention.

Cluster term frequency (CTF) means the number of frequencies in which a given term appears in a cluster. The cluster term frequency is calculated as follows.

Where tc (t, c) is the frequency of occurrence of the term t in cluster c.

Represents the sum of the occurrences of all the terms existing in the cluster c.

The frequency of document appearance in a cluster called cluster document frequency (CDF) also needs to be considered. This represents the number of occurrences of the document containing the specified term in the cluster. The cluster document frequency is calculated as follows.

Where dc (t, c) is the frequency of the document with the term t contained in cluster C. D is the total number of documents in cluster c.

Finally, the number of clusters in which a given term, referred to as the inverse cluster frequency (ICF), occurs should be considered. This allows you to see how much of the specified term is used in the entire cluster. To identify the specified term, it is necessary to normalize the frequency of the inverse cluster, and the frequency of the inverse cluster is calculated as follows.

cc (t) is the frequency of clusters in which the term t appears, and C is the total number of clusters.

As mentioned above, the appearance frequency of the specified term in the cluster can be confirmed through the cluster term frequency and the cluster document frequency, and the reverse cluster frequency can confirm the number of clusters in which the designated term is used in the entire cluster. The present invention combines the above three methods, represented by CTF-CDF-ICF, to enhance the extraction ability of the subject words, which can represent and represent clusters.

The candidates of the subject words are sorted by the result calculated by the CTF-CDF-ICF,

Extracts the largest word among the candidates of the number of subject words as the subject word.

The reason for extracting a number of subject words is that the number of documents in each cluster is different. That is, a small number of words are used to represent a small number of documents and clusters, and a large number of words Is used. This is because, even if the same important word is influenced by the frequency with which the document and the word appear, the result extracted as the main word may vary depending on the appearance frequency. This is normalized in the same manner as above to extract the subject word under the same conditions.

To construct a topic tree, a generic way of constructing a conceptual hierarchy in words can be used. This is a common way,

end

when

If

The

. here

silver

From the set of documents where

.

Relatively

Compared to the general topic words. So in the topic tree The

Can be the mother word of

The

It can be a child word.

The above method creates a hierarchical topic tree by having a subordinate relationship between words. The existing method is effective and simple, but it does not grasp the meaning of the subject word and constructs the topic tree without the ability of words to describe clusters. Therefore, the present invention proposes the following parameters to consider the ability of the subject word to describe the cluster.

The method proposed by the present invention is based on two words

,

end

when

If

The

. here

Wow

Is measured by CTF-CDF-ICF and is an indicator of how well the subject word can describe the cluster.

The

end

The probability of being involved in

Lt; / RTI > In addition,

In the present invention, experiments on various data have been carried out

The probability condition of.

In using CTF-CDF-ICF to construct the topic tree, it is necessary to consider that the range of CTF-CDF-ICF values differs for each cluster. There are different numbers of words in each cluster, and the ranges for calculating CTF-CDF-ICF differ accordingly. Therefore, it is necessary to normalize CTF-CDF-ICF as follows.

Where w is the cluster

Is a list of topic words derived from CTF-CDF-ICF.

Is the largest value calculated by CTF-CDF-ICF in cluster W,

Is the smallest value calculated by CTF-CDF-ICF in the cluster. All CTF-CDF-ICF values in the cluster are mapped from 1 to (1 + d) through the above equation. Where d is a positive number and can be set according to the intention. The above equation can be used to correct different calculation ranges of CTF-CDF-ICF for each cluster, and is used as an index to show how many clusters are depicted under the same conditions of all the subject words. The larger the value, Able to know. remind

when

sign

The

, A topic tree in which the upper layer of the topic tree subscribes to the lower layer is generated. The upper layer is a topic word that better describes the cluster than the lower layer, so it can include one or more lower layers. The lower hierarchy can be subordinate to the lower hierarchy to construct a topic tree, and similarly, it can include more than one hierarchy lower than the lower hierarchy. This is repeated to create a topic tree. In the process of creating a topic tree, topic words that can not be linked up and down can arise. This topic tree can also be constructed independently. Therefore, one or more topic trees can be created in one cluster. The construction of the topic tree proposed by the present invention constructs a topic tree by identifying the top-to-bottom relationship between the topic words and repeating them (S160).

Figure 3 is a diagram illustrating a method tree based topic tree generation system.

Referring to FIG. 3, the similarity calculation unit 300 calculates the degree of similarity between documents included in the cluster.

At this time, the degree of similarity is the same as the degree of similarity in step S100 described above, and thus redundant description is omitted. The medoid setting unit 310 sets the document having the smallest value of the calculated degree of similarity calculated by the similarity calculation unit 300 as the medoid. In this case, the mean is the closest distance between the documents and is equivalent to the mean in step S110. The distance between documents uses Euclidean distance, and a detailed description thereof has been described above, so it is omitted. The median calculation unit 320 calculates the degree of similarity between the received document 330 and the established medoid. The cluster merging unit 340 merges the received document 330 into the cluster to which the message belongs if the result calculated by the mean calculation unit 320 is less than the threshold value. The received document 330 having a value equal to or greater than the threshold value is generated as an independent cluster set as in step S160. At this time, the threshold value of the threshold value is set to a threshold value as an arbitrary similarity among the results obtained as a result of calculating the degree of similarity between documents. When the received document and the cluster are merged, the cluster merge unit 340 resets the meridian by calculating the similarities between the documents included in the merged clusters. The topic tree generating unit 350 extracts a topic word using the following CTF-CDF-ICF in the cluster merged by the cluster merging unit 340 or in an independent cluster.

The extraction of topic words using CTF-CDF-ICF sorts the candidate words of the topic with the results calculated by CTF-CDF-ICF,

The subject words extracted by CTF-CDF-ICF are two words

,

end

when

If

The

, And constructing a topic tree by normalizing the CTF-CDF-ICF values in the following manner, considering that the ranges of the CTF-CDF-ICF values are different from each other.

here

silver

From the set of documents where

. W is a cluster

Is a list of topic words derived from CTF-CDF-ICF.

The value calculated through CTF-CDF-ICF is the largest value in the cluster,

Is the smallest value calculated by CTF-CDF-ICF in the cluster. All CTF-CDF-ICF values in the cluster are mapped from 1 to (1 + d) through the above equation. Where d is a positive number and can be set according to the intention. The generation of the topic tree using the above equation is the same as that of S160 described above.

In order to evaluate the proposed method for constructing a topic tree, the present invention is described in Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/) and Google News US (news.google.com) I prepared two data sets each. The first dataset, Dataset-1, contains 1000 randomly selected documents from Reuters-21578 and the second dataset, Dataset-2, is available from August 12, 2012 through September 24, 2012 I used a collection of documents containing 1000 documents on business, election, technology, entertainment, sports, science, and health.

The present invention automatically constructs a subject detection system that grasps the contents of a document and then generates a topic tree. The system can be written in Java, and term processing modules such as negation elimination and morphological analysis can be done by the R_TM package. In order to evaluate the accuracy of the generated topic tree, it is necessary to check whether it is included in the correct topic tree and the accuracy is calculated as follows.

Datasets Accuracy Baseline Proposed Dataset-1 0.743 0.875 Dataset-2 0.769 0.893

Table 1 shows the accuracy of the subordinate relations in the topic tree generated by the conventional method and the method proposed in the present invention. The accuracy of Dataset-2 increased by 16.1% for Google News data sets while the accuracy of Dataset-1 increased by 17.8% for Reuters -21578 data sets. We can confirm that the proposed method logically not only generates the topic tree, but also improves the accuracy of the subordinate relation.

FIG. 4 shows a part of a topic tree generated by a conventional baseline system, and FIG. 5 shows a part of a topic tree generated by a method proposed in the present invention. Both FIGS. 4 and 5 illustrate the main contents, but it can be seen that the method proposed in the present invention generates a more accurate topic tree than the prior art. Referring to FIG. 4, six topic trees are displayed, but it shows that there is a misleading relationship between win and britain. However, FIG. 5 shows that the relationship between the main words can be grasped and the main word is displayed in one topic tree in order to grasp the main word. Therefore, it can be seen that the subject search method presented in the present invention shows more accurate subject search results through the above-described embodiments.

The hierarchical tree-based subject search method of an online text document according to an embodiment of the present invention may be implemented in a form of a program command that can be executed through various computer means and recorded in a computer readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be those specially designed and configured for the present invention or may be those known to the person skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks and magnetic tape; optical media such as CD-ROMs and DVDs; magnetic media such as floppy disks; Magneto-optical media, and hardware devices specifically configured to store and execute program instructions such as ROM, RAM, flash memory, and the like. Examples of program instructions include machine language code such as those produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

As described above, the present invention has been described with reference to particular embodiments, such as specific elements, and specific embodiments and drawings. However, it should be understood that the present invention is not limited to the above- And various modifications and changes may be made thereto by those skilled in the art to which the present invention pertains.

Accordingly, the spirit of the present invention should not be construed as being limited to the embodiments described, and all of the equivalents or equivalents of the claims, as well as the following claims, belong to the scope of the present invention .

Claims

Calculating a mutual similarity between documents included in the cluster;
Setting a document having the smallest degree of similarity as a medoid;
Calculating a similarity of the received document with the received document;
Calculating the similarity between the received document and the median and merging the received document having a degree of similarity lower than a threshold value into a cluster to which the medoid belongs;
Extracting a subject word from the cluster to generate a topic tree of the cluster;
And generating a topic tree for constructing a hierarchical relationship between the topic words using probability information generated by the topic word in the cluster.

The method according to claim 1,
And if the merged cluster is merged into the one cluster, calculating the degree of similarity between documents included in the merged cluster and resetting the medoid.

The method according to claim 1,
And generating the received document as an independent cluster if the similarity degree between the received document and the mean is greater than or equal to the threshold value.

The method of claim 1, wherein
Wherein the topic tree is a hierarchical relationship of the topic words extracted from the generated cluster or the independent clusters.

The method of claim 1, wherein
Wherein the extraction of the subject word is performed by using at least one of a frequency of words appearing in the cluster set, a frequency of a document including the word, and a number of the clusters including the word, A hierarchical tree based topic search method for online text documents.

The method of claim 5, wherein
Wherein the topic word is extracted using the following CTF-CDF-ICF expression.

Extract the word with the highest value as the subject word.)

The method of claim 1, wherein
Wherein the probability information is calculated using the following equation.

when

(only,

Word

The word in the document set that occurs

, W is the probability of occurrence

Lt;

Is a list of topic words extracted from CTF-CDF-ICF, and d is a constant.)

8. The method of claim 7,
In the formula

when

Is satisfied,

Gt;

And a hierarchical tree-based topic search method of an online text document.

A similarity calculation unit for calculating similarities between documents included in the cluster;
A medoid setting unit for setting a document having the smallest value of the summed similarity as a medoid;
A medoid calculation unit for calculating the similarity between the newly received received document and the medoid;
A cluster merger for calculating the similarity between the received document and the median and merging the received document having a degree of similarity lower than the threshold into a cluster to which the medoid belongs;
A topic tree generation unit for extracting a subject word from the cluster and generating a topic tree of the cluster;
Wherein the topic tree generation unit includes a hierarchical relationship between the topic words using the probability information generated by the topic word in the cluster.

The method of claim 9, wherein
Wherein the cluster merging unit comprises:
And if the merged cluster is merged into the one cluster, calculating the degree of similarity between documents included in the merged cluster and resetting the menu.

10. The method of claim 9,
Wherein the cluster merging unit comprises:
And generating the received document as an independent cluster if the degree of similarity between the received document and the mean is greater than or equal to the threshold value.

10. The method of claim 9,
Wherein the topic tree generating unit comprises:
Wherein the hierarchical tree-based topic search system is a hierarchical relationship of the topic words extracted from the generated cluster or the independent clusters.

10. The method of claim 9,
The topic tree generation unit,
Wherein the keyword extraction unit extracts a topic word using at least one of a frequency of words appearing in the cluster set, a frequency of a document including the word, and a number of the clusters including the word. Based topic search system.

14. The method of claim 13,
Wherein the topic tree generating unit comprises:
A hierarchical tree-based topic search system for online text documents, characterized by using the following CTF-CDF-ICF formula: