KR101737887B1

KR101737887B1 - Apparatus and Method for Topic Category Classification of Social Media Text based on Cross-Media Analysis

Info

Publication number: KR101737887B1
Application number: KR1020150132590A
Authority: KR
Inventors: 손경아; 조승우; 차문수
Original assignee: 아주대학교산학협력단
Priority date: 2015-09-18
Filing date: 2015-09-18
Publication date: 2017-05-19
Also published as: KR20170034206A

Abstract

The present invention relates to a method and apparatus for automatically classifying the subject category of text included in a web page or social media content created on the Internet.
To this end, the text subject category classification apparatus according to the present invention comprises: a data collection unit that receives a plurality of documents classified in advance by subject category, selects words from the sentences contained in the documents, and collects the words by subject category; a word dictionary generation unit that receives the words collected by subject category from the data collection unit, calculates a weight for each input word, selects, based on the calculated weights, the words to be included in the word dictionary of each subject category, and registers the selected words in the respective word dictionaries; and a subject category classification unit that receives a classification target sentence, selects, from the words in that sentence, the words included in the word dictionary of each subject category, generates a feature vector according to the weights of the selected words, and determines the subject category of the classification target sentence based on the generated feature vector.

Description

Technical Field

The present invention relates to a method and apparatus for automatically classifying subject categories of social media text based on cross-media analysis.

More specifically, the present invention relates to a method and apparatus for automatically classifying the subject category of text included in a web page or social media content created on the Internet.

Due to the proliferation of mobile devices, the volume of web content transmitted over the Internet has been increasing rapidly. As the number of users of social network services such as Twitter and Facebook grows worldwide, the amount of data, such as text and images, that users enter on their mobile devices or computers and transmit over the Internet is also increasing rapidly.

Such web data on the Internet is useful in that it contains information on the status or interests of a large number of people. In particular, web data transmitted through a social network service is useful for grasping the status or information of an individual user, because the data is generated and transmitted by each user. It is also useful for grasping the status and information of a group of users.

Therefore, research has been conducted on analyzing social network data and extracting information from it. For example, "Kwak H, Lee C, Park H, Moon S. What is Twitter, a social network or a news media? In: Proceedings of the 19th international conference on World wide web, 2010. ACM, pp 591-600" analyzed Twitter data to examine when, where, and about what topics users talk.

However, these existing studies mainly focus on specific keywords or topics that are handled at specific times or places, and do not provide a means to analyze the overall subject category in social media.

(Patent Document 0001) Korean Patent Publication No. 10-1480711

The present invention has been made to solve the above-mentioned problems. Its object is to provide a text subject category classification apparatus, system, and method that classify the subject category of sentences generated on the Internet, such as on social media, more reliably than existing sentence topic classification methods, by using a word dictionary for each subject category generated from document data of a different kind of media.

In order to solve the above problems, a text subject category classification apparatus according to one aspect of the present invention comprises: a data collection unit that receives a plurality of documents classified in advance by subject category, selects words from the sentences included in the documents, and collects the words by subject category; a word dictionary generation unit that receives the words collected by subject category from the data collection unit, calculates a weight for each input word, selects, based on the calculated weights, the words to be included in the word dictionary of each subject category, and registers the selected words in the respective word dictionaries; and a subject category classification unit that receives a classification target sentence, selects, from the words included in that sentence, the words included in the word dictionary of each subject category, generates a feature vector according to the weights of the selected words, and determines the subject category of the classification target sentence based on the generated feature vector.

Here, the data collection unit may remove from the sentence special characters, numeric characters, and character strings consisting of a predetermined number of characters or fewer, and then perform morphological analysis to select the words to be input to the word dictionary generation unit.

Here, the data collection unit may receive a news article, a newspaper article, or a magazine article document classified in advance by the subject category, as a plurality of the documents classified in advance by the subject category.

Here, the word dictionary generation unit may include a first word dictionary generation unit that calculates, for the words input from the data collection unit, a TF-IDF weight based on information on the sentences containing each input word and on the subject categories, and selects the words to be included in the word dictionary from among the input words based on the calculated TF-IDF weight.

Here, the first word dictionary generation unit may calculate the TF-IDF weight based on the number of times the input word appears in the documents, the number of sentences in the documents that contain the input word, and the number of subject categories that contain the input word.

Here, the word dictionary generation unit may include a second word dictionary generation unit that performs an LDA analysis on the words input from the data collection unit, calculates an LDA rank weight according to the analysis result, and selects the words to be included in the word dictionary from among the input words based on the calculated LDA rank weight.

Here, the second word dictionary generation unit may calculate, for the words input from the data collection unit, a TF-IDF weight based on information on the sentences containing each input word and on the subject categories, remove words from the input words according to the calculated TF-IDF weight, perform the LDA analysis on the words remaining after the removal, calculate the LDA rank weight according to the analysis result, and select the words to be included in the word dictionary from among the input words based on the LDA rank weight.

Here, the word dictionary generation unit may further include a duplicate word elimination unit for eliminating duplicated words among words included in the word dictionary for each subject category.

Here, for a duplicate word that is commonly included in two or more word dictionaries, the duplicate word elimination unit may select the subject category from which the word is to be removed, based on the TF-IDF weight of the duplicate word or its occurrence frequency in each word dictionary, and remove the duplicate word from the word dictionary of the selected subject category.
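The duplicate removal described above can be illustrated with a minimal Python sketch. The text does not state which category loses a duplicated word, so the rule used here (keep the word only in the category where its weight is highest, with ties going to the first category encountered) is an assumption, as are the function name `remove_duplicate_words` and the nested-dict layout of the dictionaries.

```python
def remove_duplicate_words(
    dictionaries: dict[str, dict[str, float]],
) -> dict[str, dict[str, float]]:
    """Keep each word only in the category where its weight is highest.

    `dictionaries` maps a subject category to {word: weight}.
    Assumption: the category with the largest weight keeps the word;
    the source only says the choice is based on weight or frequency.
    """
    # Record, for every word, the category holding its largest weight.
    best: dict[str, tuple[str, float]] = {}
    for category, words in dictionaries.items():
        for word, weight in words.items():
            if word not in best or weight > best[word][1]:
                best[word] = (category, weight)

    # Rebuild each dictionary, dropping words owned by another category.
    return {
        category: {w: v for w, v in words.items() if best[w][0] == category}
        for category, words in dictionaries.items()
    }


example = {
    "politics": {"election": 0.9, "market": 0.2},
    "economy": {"market": 0.8, "stock": 0.7},
}
deduplicated = remove_duplicate_words(example)
```

In this example 'market' survives only in the 'economy' dictionary, because its weight there is larger than in 'politics'.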

Here, the subject category classification unit may include: a feature vector extraction unit that selects, from the words included in the classification target sentence, the words included in the word dictionary of each subject category, and generates the feature vector by setting each element of the feature vector to a value computed, for each subject category, from the weights of the selected words; and a classification unit that determines the subject category of the classification target sentence based on the generated feature vector.

Here, the feature vector extraction unit may set each element of the feature vector to the sum, for each subject category, of the weights of the selected words.

Here, the classification unit may determine, according to a maximum weight technique, the subject category corresponding to the element with the largest value among the elements of the feature vector as the subject category of the classification target sentence.

Here, the classification unit may classify the subject category of the classification target sentence from the feature vector using a pre-trained classifier based on a support vector machine (SVM).

Here, the word dictionary generation unit may include a non-related word elimination unit that selects, from the words included in the word dictionary of each subject category, words that are not related to the subject category, and removes the selected words from the word dictionary.

Here, the non-related word elimination unit may cluster the words included in each word dictionary into a plurality of subsets based on the number of occurrences of each word in the subject category, the number of documents in the subject category that contain the word, and the frequency of occurrence of the word in the documents, select at least one non-related cluster from among the clustered subsets based on those frequencies, and remove the words belonging to the selected cluster from the word dictionary.

The text subject category classification apparatus may further include a word dictionary database for storing the word dictionary generated by the word dictionary generation unit.

In order to solve the above problems, a text subject category classification system according to one aspect of the present invention may include a service server.

Here, the service server may include: a data collection unit that receives a plurality of documents classified in advance by subject category, selects words from the sentences included in the documents, and collects the words by subject category; and a word dictionary generation unit that receives the words collected for each subject category, calculates a weight for each input word, selects, based on the calculated weights, the words to be included in the word dictionary of each subject category, and registers the selected words in the respective word dictionaries.

Here, the service server may further include a subject category classification unit that receives a classification target sentence, selects, from the words included in the classification target sentence, the words included in the word dictionary of each subject category, generates a feature vector according to the weights of the selected words, and determines the subject category of the classification target sentence based on the generated feature vector.

Here, the text subject category classification system may further include: a word dictionary database that stores the word dictionary generated by the word dictionary generation unit; and a terminal.

Here, the terminal may include a subject category classification unit that receives a classification target sentence, connects to the word dictionary database, selects, from the words included in the classification target sentence, the words included in the word dictionary of each subject category, generates a feature vector according to the weights of the selected words, and determines the subject category of the classification target sentence based on the generated feature vector.

In order to solve the above problems, a text subject category classification method according to one aspect of the present invention comprises: a data collection step in which a service server receives a plurality of documents classified in advance by subject category, selects words from the sentences included in the documents, and collects the words by subject category; a word dictionary generation step in which the service server calculates a weight for each word collected by subject category, selects, based on the calculated weights, the words to be included in the word dictionary of each subject category, and registers the selected words in the respective word dictionaries; and a subject category classification step of receiving a classification target sentence, selecting, from the words included in the classification target sentence, the words included in the word dictionary of each subject category, generating a feature vector according to the weights of the selected words, and determining the subject category of the classification target sentence based on the generated feature vector.

To classify the subject category of sentences from a specific medium, the apparatus and method for classifying a text subject category according to the present invention use word dictionaries for each subject category generated from documents of a different kind of media, preferably newspapers, news, or magazines. As a result, a word dictionary can be generated more efficiently from media already classified by subject, without manually classifying data from the medium targeted for classification in order to build the word dictionary.

In particular, the apparatus and method for classifying text subject categories according to the present invention provide a configuration that uses a clustering analysis method to remove the heterogeneity of data between disparate media, that is, words that do not semantically belong to any subject category. This removes the heterogeneity introduced by generating the word dictionary from data of a heterogeneous medium and allows the subject category of the classification target sentence to be classified more reliably.

In addition, the apparatus and method for classifying text subject categories according to the present invention classify the subject categories of sentences generated on the Internet, such as on social network services, more reliably than existing sentence topic classification methods. Furthermore, the subject category results obtained for each sentence can be used to extract information on the interests or inclinations of a specific user, or of users in a specific group or during a specific period.

FIG. 1 is a block diagram of a text subject category classification apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram of a text subject category classification apparatus according to another embodiment of the present invention.
FIG. 3 is a block diagram of a text subject category classification apparatus according to another embodiment of the present invention.
FIG. 4 is a detailed block diagram of the word dictionary generation unit according to an embodiment of the present invention.
FIG. 5 is a detailed block diagram of the word dictionary generation unit according to another embodiment of the present invention.
FIG. 6 is a reference diagram for explaining the operation of non-related word removal.
FIG. 7 is a detailed block diagram of the subject category classification unit.
FIG. 8 is a block diagram of a text subject category classification system according to the present invention.
FIG. 9 is a block diagram of a text subject category classification system according to another embodiment of the present invention.
FIG. 10 is a flowchart of a text subject category classification method according to another embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals are used to designate the same or similar components throughout the drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear. In addition, the preferred embodiments of the present invention will be described below, but it is needless to say that the technical idea of the present invention is not limited thereto and can be variously modified by those skilled in the art.

Accordingly, the present invention provides an apparatus, system, and method that analyze sentences included in web data, for example sentences generated on social media such as Twitter, and determine which of the predetermined subject categories each sentence belongs to.

In particular, in generating the word dictionary used to classify the subject category of a classification target sentence, the apparatus and method for classifying a text subject category according to the present invention use documents that are classified in advance by subject category, such as newspaper, news, or magazine documents, as the source of the word dictionary.

To classify the subject category of a received classification target sentence, it is necessary to analyze which subject category the words in the sentence correspond to, and this requires a word dictionary that stores the words of each subject category. However, constructing a word dictionary by manually labeling the words to be included in the word dictionary of each subject category is very time-consuming and labor-intensive.

Accordingly, taking into consideration that documents in newspapers, news, or magazines are already classified by subject, the apparatus and method for classifying a text subject category according to the present invention receive newspaper, news, or magazine documents classified by subject category and use them to generate a word dictionary for each subject category. The present invention provides a configuration for classifying the subject category of the classification target sentence using the word dictionaries thus generated. For example, the apparatus and method may receive and analyze newspaper or news articles classified by subject category to generate the word dictionaries, and then receive a classification target sentence and classify its subject category.

As described above, to classify the subject category of sentences from a specific medium, the apparatus and method for classifying a text subject category according to the present invention generate and use a word dictionary for each subject category from document data of a different kind of media, for example newspapers, news, or magazines. By performing subject category classification using heterogeneous media in this way, the text subject category classification apparatus according to the present invention can generate the word dictionary more efficiently from media already classified by subject, rather than by manually classifying data from the medium to be classified.

In addition, the apparatus and method for classifying text subject categories according to the present invention provide a configuration that analyzes the words in documents classified by subject category as described above and uses a word dictionary generated with the TF-IDF method or a word dictionary generated with the LDA method. Here, to refine the words in the word dictionary of each subject category, words that appear repeatedly across several subject categories need to be removed. In particular, when the word dictionary is generated with the LDA method, words with a small TF-IDF weight and stop words are removed.

In particular, to remove the heterogeneity of data between different media, the present invention proposes a method that uses clustering analysis to remove from the word dictionary those words that were registered through the above process as belonging to the word dictionary of a specific subject category but that do not actually belong semantically to any subject category. In this way, the text subject category classification apparatus and method according to the present invention can remove the heterogeneity introduced by generating word dictionaries from data of different media and can classify the subject category of a sentence more reliably.

In addition, the apparatus and method for classifying a text subject category according to the present invention provide a configuration for classifying the subject category of a classification target sentence using a classifier based on the word dictionaries for each subject category generated through the above process. More specifically, the words included in the classification target sentence are looked up in the word dictionary of each subject category, the weights of the words found are combined for each word dictionary, and the subject category with the highest resulting value is determined to be the subject category of the classification target sentence.
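The lookup-and-sum classification described above can be sketched in Python as follows. This is a minimal illustration, assuming each word dictionary is a plain `{word: weight}` mapping and that weights are combined by summation; the function names `feature_vector` and `classify_max_weight`, the example words, and the example weights are all hypothetical.

```python
def feature_vector(
    words: list[str],
    dictionaries: dict[str, dict[str, float]],
    categories: list[str],
) -> list[float]:
    """One element per category: the summed weights of the sentence's
    words that appear in that category's word dictionary."""
    return [
        sum(dictionaries[category].get(word, 0.0) for word in words)
        for category in categories
    ]


def classify_max_weight(
    words: list[str],
    dictionaries: dict[str, dict[str, float]],
    categories: list[str],
) -> str:
    """Return the category whose feature-vector element is largest.
    Ties go to the category listed first (an assumption)."""
    vector = feature_vector(words, dictionaries, categories)
    best_index = max(range(len(vector)), key=vector.__getitem__)
    return categories[best_index]


categories = ["politics", "economy"]
dictionaries = {
    "politics": {"election": 0.5, "vote": 0.25},
    "economy": {"market": 0.5, "vote": 0.125},
}
sentence_words = ["vote", "election", "today"]
vector = feature_vector(sentence_words, dictionaries, categories)
label = classify_max_weight(sentence_words, dictionaries, categories)
```

Words that appear in no dictionary, such as 'today' here, contribute nothing to any element of the feature vector.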

According to an embodiment of the present invention, the text subject category classification apparatus and method receive a classification target sentence generated on the Internet, extract from it a feature vector that enables more effective classification based on the word dictionaries, and can thereby reliably determine the subject category of the sentence. The subject category results determined for each sentence can then be used to extract information on the interests or inclinations of a specific user, or of users in a specific group or during a specific period.

Hereinafter, a text subject category classification apparatus, a method thereof, and a system therefor according to the present invention will be described in detail.

First, a text subject category classification apparatus according to an embodiment of the present invention will be described below.

FIG. 1 is a block diagram of a text subject category classification apparatus according to an embodiment of the present invention.

The text subject category classification apparatus according to the present invention may include a data collection unit 100, a word dictionary generation unit 200, and a subject category classification unit 300.

Here, the text subject category classification apparatus according to the present invention may be implemented as a computer program having program modules in which some or all of the constituent elements are selectively combined to perform some or all of their functions on one or more pieces of hardware. In addition, each component may be implemented as a single independent piece of hardware or included in separate hardware as needed. The text subject category classification apparatus according to the present invention may also be implemented as a software program that runs on a processor or a signal processing module, or may be implemented in hardware form and included in various processors, chips, semiconductors, or devices. Further, the text subject category classification apparatus according to the present invention may be included as a hardware or software module in a computer or in various embedded systems or devices. Preferably, the text subject category classification apparatus according to the present invention may be implemented in, or included in, a server connected to a network. Here, the data collection unit 100, the word dictionary generation unit 200, and the subject category classification unit 300 of the text subject category classification apparatus according to the present invention may all be implemented on a single text subject category classification service server, or some of them may be implemented on different servers or distributed across a plurality of servers as needed. In addition, some components may be implemented in, or included in, a client terminal device rather than a server, if necessary. For example, the data collection unit 100 and the word dictionary generation unit 200 may be included in the service server, and the subject category classification unit 300 may be included in the client terminal device.

The data collecting unit 100 receives a plurality of documents classified in advance by theme category, selects words in the sentence included in the document, and collects words by the subject category.

The word dictionary generation unit 200 receives the words collected by subject category from the data collection unit 100, calculates a weight for each input word, selects, based on the calculated weights, the words to be included in the word dictionary of each subject category, and registers the selected words in the respective word dictionaries.

The subject category classification unit 300 receives a classification target sentence, selects, from the words included in the classification target sentence, the words included in the word dictionary of each subject category, generates a feature vector according to the weights of the selected words, and determines the subject category of the classification target sentence based on the generated feature vector.

FIG. 2 is a block diagram of a text subject category classification apparatus according to another embodiment of the present invention.

As shown in FIG. 2, the text subject category classification apparatus according to the present invention can operate in connection with an external word dictionary database 50. In this case, the text subject category classification apparatus may store the word dictionary generated by the word dictionary generation unit 200 in the word dictionary database 50.

If necessary, the text subject category classification apparatus according to another embodiment of the present invention may include a word dictionary database 50 storing the word dictionary generated by the word dictionary generation unit 200 in the apparatus.

FIG. 3 is a block diagram of a text subject category classification apparatus according to another embodiment of the present invention.

Next, the operation of the data collecting unit 100 will be described in more detail.

Here, the subject categories are a plurality of categories defined in advance in order to classify the subject of a document, for example 'politics', 'economy', 'culture', 'society', 'art', and 'science'. The number and type of subject categories can be set by the user as needed. Here, a document means a set of at least one sentence, and may be a paragraph or a passage. The document is divided into sentences, which are analyzed as described in detail below; this is so that sentences within a certain length, as typically used in social network services, can be analyzed. Here, a sentence refers to a set of one or more words, that is, a string in which one or more words are gathered together, regardless of whether it is syntactically complete or grammatically correct. Thus, a sentence may be a complete sentence such as 'I went to school', but it may also be a grammatically incomplete set of words such as 'school attendance' or, in some cases, a grammatically invalid character string such as 'school'. Here, a word means a set of at least one character defined as having a specific meaning in each language, and may be a set of specific characters defined by the user as needed. For example, a set of characters such as 'school' may be a word.

At this time, the data collection unit 100 may remove from the sentence special characters, numeric characters, and character strings of a predetermined number of characters or fewer, perform morphological analysis, and select the words to be input to the word dictionary generation unit 200. That is, before the words to be included in the word dictionary are selected, the data collection unit 100 may remove special characters, numeric characters, and the like, which cannot be regarded as representing a specific subject category. Character strings of a predetermined number of characters or fewer may also be selected and removed from the sentence. For example, the data collection unit 100 may remove character strings with a length of less than two from the sentence. Here, the data collection unit 100 may use any of various conventional morphological analysis methods to extract and select the words included in a sentence.
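The preprocessing described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: plain whitespace tokenization stands in for morphological analysis (a real system for Korean text would use a morphological analyzer), and the function name `select_candidate_words` and the constant `MIN_TOKEN_LENGTH` are hypothetical.

```python
import re

# Assumed threshold, following the "length less than two" example in the text.
MIN_TOKEN_LENGTH = 2


def select_candidate_words(sentence: str, min_length: int = MIN_TOKEN_LENGTH) -> list[str]:
    """Strip special characters and digits, then drop very short strings."""
    # \W matches any non-word character (including whitespace); \d and _
    # add digits and underscores. Runs of these collapse to one space.
    # Unicode letters such as Hangul are word characters and are kept.
    cleaned = re.sub(r"[\W\d_]+", " ", sentence)
    # Stand-in for morphological analysis: split on whitespace.
    tokens = cleaned.split()
    # Remove strings shorter than the minimum length.
    return [token for token in tokens if len(token) >= min_length]


english_words = select_candidate_words("I voted 2 times @ the #election!")
korean_words = select_candidate_words("선거 2번 투표!")
```

Lowering or raising `min_length` changes how aggressively short fragments are discarded; the text leaves this threshold as a predetermined value.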

At this time, the data collecting unit 100 may receive a news article, a newspaper article, or a magazine article document classified in advance by the subject category, as a plurality of the documents classified in advance by the subject category.

As described above, manually labeling words as the source for a word dictionary for each subject category and registering them in the word dictionary is very time-consuming and labor-intensive. Accordingly, in view of the fact that documents in newspapers, news, or magazines are effectively classified by subject by experts, the data collection unit 100 in the present invention may receive and use newspaper, news, or magazine documents classified by subject category. In this way, to classify the subject category of sentences from a specific medium, the text subject category classification apparatus according to the present invention generates and uses a word dictionary for each subject category from document data of a different kind of media. The word dictionary can therefore be generated more efficiently from other media classified in advance, instead of by manually classifying data from the target medium.

Next, the operation of the word dictionary generation unit 200 will be described in more detail.

The word dictionary generation unit 200 receives the words collected by subject category from the data collection unit 100, calculates a weight for each input word, selects, based on the calculated weights, the words to be included in the word dictionary of each subject category, and registers the selected words in the respective word dictionaries.

Here, the weight is a number indicating the degree to which a particular word is associated with the subject category. Therefore, the word dictionary generation unit 200 can select words based on the magnitude of their weights and register them in the word dictionary of each subject category. For example, the weight of each word may be compared with a predetermined threshold so that words with a weight equal to or greater than the threshold are registered in the word dictionary, or the words may be ranked by weight so that words within a predetermined top rank are selected and registered in the word dictionary.
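The two selection strategies described above, a weight threshold and a rank cutoff, can be sketched in Python as follows. The function names, the example words, and the example weights are hypothetical; the sketch assumes the candidate words of one subject category are held in a `{word: weight}` mapping, so that each selected word is stored together with its weight.

```python
def select_by_threshold(weights: dict[str, float], threshold: float) -> dict[str, float]:
    """Keep words whose weight is equal to or greater than the threshold."""
    return {word: weight for word, weight in weights.items() if weight >= threshold}


def select_top_n(weights: dict[str, float], n: int) -> dict[str, float]:
    """Keep the n highest-weighted words, ranked in descending order of weight."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Hypothetical candidate words and weights for a 'politics' category.
politics_candidates = {"election": 0.9, "vote": 0.7, "party": 0.6, "today": 0.1}

by_threshold = select_by_threshold(politics_candidates, threshold=0.6)
top_two = select_top_n(politics_candidates, n=2)
```

A threshold yields dictionaries of varying size per category, while a rank cutoff fixes the size; which is preferable depends on how evenly the weights are distributed across categories.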

Then, the word dictionary generator 200 also registers the calculated weight for the word when registering the selected word in each word dictionary.

According to the operation of the clue dictionary generation unit 200, the words are selected and registered in the word dictionary in each subject category. For example, words such as 'political party', 'election' and 'vote' can be selected and registered in the word dictionary of the 'political' subject category, , 'Art' can be selected and registered.

FIG. 4 is a detailed block diagram according to an embodiment of the word dictionary generation unit 200. Referring to FIG.

Here, the word dictionary generation unit 200 may include a first word dictionary generation unit 210 or a second word dictionary generation unit 220. The first word dictionary generation unit 210 generates a word dictionary using a TF-IDF algorithm, and the second word dictionary generation unit 220 generates a word dictionary using an LDA (Latent Dirichlet Allocation) algorithm. The word dictionary generation unit 200 can use the word dictionary of either scheme as needed.

For the words input from the data collection unit 100, the first word dictionary generation unit 210 may calculate a TF-IDF (Term Frequency-Inverse Document Frequency) weight based on the sentences including the input word and information about the subject category, and may select the words to be included in the word dictionary from among the input words based on the calculated TF-IDF weight. At this time, the first word dictionary generation unit 210 may calculate the TF-IDF weights of the input words for each subject category, and may select the words to be registered in the word dictionary corresponding to each subject category.

The TF-IDF weight can be calculated according to the TF-IDF algorithm proposed in "Joachims T, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, DTIC Document (1996)". Here, the TF-IDF weight may be a weight value set for each word of a specific document according to the extent to which it is more related to a particular document than to other documents.

In the present invention, the TF-IDF algorithm proposed by Joachims T is modified so that it is better suited to classifying text subject categories using heterogeneous media analysis, and the words to be included in the word dictionary are selected by calculating the modified TF-IDF weight. Hereinafter, the TF-IDF weight refers to the weight calculated according to the modified TF-IDF algorithm proposed by the present invention.

Here, it is preferable that the first word dictionary generation unit 210 calculates the TF-IDF weight so that a word that appears more often in a given subject category than in the other subject categories has a higher TF-IDF weight for that subject category.

To this end, the first word dictionary generation unit 210 can calculate the TF-IDF weight based on the number of times the input word appears in the document, the number of sentences in the document that include the input word, and the number of subject categories that include the input word.

More specifically, assuming that the word w_i is the i-th word of document d among the words collected for each document d by the data collection unit 100, the TF-IDF weight can be a value set according to a TF weight, an SF weight, and an IDF weight. Here, the TF weight is a value according to the frequency with which the word w_i appears in document d, the SF weight is a value according to the frequency with which sentences including the word w_i appear in document d, and the IDF weight is a value according to the proportion of subject categories that include the word w_i.

Here, the first word dictionary generation unit 210 preferably calculates the TF-IDF weight according to Equation (1).

Here, S(i) is the TF-IDF weight, TF(w_i, d) is the TF weight, SF(w_i, d) is the SF weight, and IDF(w_i, d) is the IDF weight. w_i is the i-th word in document d, word_{wi,d} is the number of occurrences of the word w_i in document d, and word_{total,d} is the total number of words in document d. sentence_{wi,d} is the number of sentences in document d that contain the word w_i, and sentence_{total,d} is the total number of sentences in document d. In addition, category_{wi} is the number of subject categories containing the word w_i, and category_{total} is the total number of subject categories.

Here, F_TF, F_SF, and F_IDF can be functions as shown in Equation (2) below.

Here, a, b, and c are parameters set to adjust the value of each weight. For example, a and b may be set to 1, and c may be set to a value between 0 and 1.
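Since Equations (1) and (2) are not reproduced in this text, the following is only a sketch of how a weight of this kind could be computed from the three ratios defined above. The product form of S(i) and the particular functions used for F_TF, F_SF, and F_IDF (linear scaling by a and b, and 1 - c * ratio for the IDF term) are assumptions made for illustration and are not taken from the specification.

```python
def modified_tfidf(word_count, total_words, sent_count, total_sents,
                   cat_count, total_cats, a=1.0, b=1.0, c=0.5):
    """Illustrative TF * SF * IDF weight for one word in one document.

    The combination and the three F functions below are assumptions.
    """
    tf = a * (word_count / total_words)        # assumed F_TF: scaled word ratio
    sf = b * (sent_count / total_sents)        # assumed F_SF: scaled sentence ratio
    idf = 1.0 - c * (cat_count / total_cats)   # assumed F_IDF: smaller when the word
                                               # occurs in many subject categories
    return tf * sf * idf


# A word seen 10 times in 200 words and in 8 of 40 sentences.
specific = modified_tfidf(10, 200, 8, 40, cat_count=1, total_cats=7)
common = modified_tfidf(10, 200, 8, 40, cat_count=7, total_cats=7)
```

Under these assumptions, a word confined to one subject category receives a higher weight than an equally frequent word that occurs in every subject category, which matches the stated preference for words that are characteristic of a single subject category.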

The first word dictionary generation unit 210 selects words for each subject category based on the TF-IDF weight calculated as described above, and registers the selected words in the word dictionary corresponding to each subject category. If necessary, some of the words registered in the word dictionary may then be selected and removed using the redundant word removing unit 230 or the non-related word removing unit 240, as described in detail below.

Next, the second word dictionary generation unit 220 may perform an LDA (Latent Dirichlet Allocation) analysis on the words input from the data collection unit 100, calculate an LDA rank weight according to the analysis result, and select the words to be included in the word dictionary from among the input words based on the calculated LDA rank weight. At this time, the second word dictionary generation unit 220 may calculate the LDA rank weights of the input words for each subject category, and may select the words to be registered in the word dictionary corresponding to each subject category based on the LDA rank weight.

Here, the second word dictionary generation unit 220 can perform the LDA analysis according to the LDA algorithm proposed in "Blei DM, Ng AY, Jordan MI, Latent dirichlet allocation. The Journal of machine learning research 3: 993-1022". The LDA algorithm can distinguish the words that have a stronger association or importance within a particular document. The LDA analysis assumes that a document is a mixture of topics having different word distributions, identifies the topics constituting each document, and represents each topic by the words that characterize it.

In performing the LDA analysis, the second word dictionary generation unit 220 treats the words included in the word dictionary of one subject category as the words of a single document, applies the analysis to the words included in the word dictionary of each subject category, and thereby divides each word dictionary into a plurality of topics. The second word dictionary generation unit 220 then calculates the LDA word weight using the topics thus obtained, and calculates the LDA rank weight from that weight.

The second word dictionary generation unit 220 may calculate the LDA word weight as shown in Equation (3).

Here, W_in is the weight of the n-th word for the i-th topic, that is, the proportion of the n-th word within the topic. In this case, W_in can be the value obtained by dividing the frequency of the word by the sum of the frequencies of the words appearing in the topic. P_i represents the proportion of the i-th topic among the topics, and W_n is the LDA word weight of the n-th word.

Next, the second word dictionary generation unit 220 calculates the LDA rank weights using the LDA word weights calculated as described above. The second word dictionary generation unit 220 sorts the words by LDA word weight, assigns to the word with the highest LDA word weight a number equal to the total number of words included in the document, assigns to each following word a number reduced by a predetermined amount, for example 1, and calculates the LDA rank weight of each word by dividing its assigned number by the total number of words included in the document.
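The LDA word weight and the LDA rank weight can be sketched as follows. Equation (3) is not reproduced in this text, so the sketch assumes W_n is the sum over topics of P_i * W_in, which is one reading consistent with the symbol definitions above. The topic word frequencies and topic proportions are supplied by hand here; in practice they would come from an LDA analysis, which is not shown. All names and the alphabetical tie-breaking rule are hypothetical.

```python
def lda_word_weights(topic_word_freq, topic_ratio):
    """Assumed form of Equation (3): W_n = sum_i P_i * W_in.

    topic_word_freq: one {word: frequency} dict per topic.
    topic_ratio:     proportion P_i of each topic.
    """
    weights = {}
    for freqs, p_i in zip(topic_word_freq, topic_ratio):
        total = sum(freqs.values())
        for word, freq in freqs.items():
            w_in = freq / total  # proportion of the word within topic i
            weights[word] = weights.get(word, 0.0) + p_i * w_in
    return weights


def lda_rank_weights(word_weights):
    """Top word gets N, the next N-1, and so on; each is divided by N."""
    n = len(word_weights)
    ordered = sorted(word_weights, key=lambda w: (-word_weights[w], w))
    return {word: (n - position) / n for position, word in enumerate(ordered)}


topics = [{"election": 6, "vote": 4}, {"vote": 5, "party": 5}]
ratios = [0.6, 0.4]
weights = lda_word_weights(topics, ratios)
ranks = lda_rank_weights(weights)
```

Converting the raw weights to ranks makes the values comparable across subject categories whose word dictionaries differ in size, because every rank weight lies between 1/N and 1.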

The second word dictionary generation unit 220 selects words for each subject category based on the LDA rank weight calculated as described above, and registers the selected words in the word dictionary corresponding to each subject category. If necessary, some of the words registered in the word dictionary may then be selected and removed using the redundant word removing unit 230 or the non-related word removing unit 240, as described in detail below.

Before performing the LDA analysis described above, the second word dictionary generation unit 220 may calculate TF-IDF weights for the words received from the data collection unit 100, based on the sentences including the input word and information about the subject category as described in detail above, and may remove from the input words those whose calculated TF-IDF weight is smaller than a predetermined reference value. Here, the predetermined reference value may be set to a specific value as needed.

When some words are removed according to the TF-IDF weight in this way, the second word dictionary generation unit 220 performs the LDA analysis on the words remaining after the removal, calculates the LDA rank weight according to the analysis result, and selects the words to be included in the word dictionary from among the input words based on the calculated LDA rank weight.

The word dictionary generation unit 200 may further include a component for removing redundant words or a component for removing non-related words.

FIG. 5 is a detailed block diagram of the word dictionary generation unit 200 according to another embodiment of the present invention.

Here, the word dictionary generating unit 200 may further include at least one of the redundant word removing unit 230 and the non-related word removing unit 240.

The redundant word removing unit 230 removes redundant words from among the words included in the word dictionary for each subject category. When the word dictionary generation unit 200 calculates weights for the words input from the data collection unit 100, selects words based on the weights, and registers them in the word dictionary for each subject category, the same word may be registered in two or more word dictionaries, and such redundant words may degrade subject category classification performance. Therefore, it is preferable that the redundant word removing unit 230 leaves a redundant word only in the word dictionary of the subject category to which it is most relevant, and removes it from the word dictionaries of the remaining subject categories.

To this end, for a redundant word commonly included in two or more word dictionaries, the redundant word removing unit 230 can select the subject categories from which the redundant word is to be removed based on the TF-IDF weight of the redundant word in each subject category, and remove the redundant word from the word dictionaries of the selected subject categories. Here, the redundant word removing unit 230 may leave the redundant word only in the word dictionary of the subject category in which its TF-IDF weight is highest, and remove it from the word dictionaries of the remaining subject categories.

Alternatively, for a redundant word commonly included in two or more word dictionaries, the redundant word removing unit 230 may select the subject categories from which the redundant word is to be removed based on the frequency of occurrence of the redundant word in each word dictionary, and remove the redundant word from the word dictionaries of the selected subject categories. Here, the redundant word removing unit 230 may leave the redundant word only in the word dictionary of the subject category in which its frequency of occurrence is highest, and remove it from the word dictionaries of the remaining subject categories.

Preferably, the redundant word removing unit 230 removes redundant words based on the TF-IDF weight as described above, and when two or more subject categories share the highest TF-IDF weight, it removes the redundant words based on the frequency of occurrence of the word.
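The preferred rule above (highest TF-IDF weight first, frequency of occurrence as a tie-breaker) can be sketched as follows. The data layout, the function name, and the behavior when both the weight and the frequency are tied (the first category encountered keeps the word) are assumptions made for illustration.

```python
def remove_redundant_words(dictionaries, frequencies):
    """Leave each word only in the subject category with the highest TF-IDF weight.

    dictionaries: {category: {word: tfidf_weight}}
    frequencies:  {category: {word: occurrence_count}}, used only to break ties.
    """
    owners = {}  # word -> ((weight, frequency), category that keeps the word)
    for category, words in dictionaries.items():
        for word, weight in words.items():
            key = (weight, frequencies.get(category, {}).get(word, 0))
            if word not in owners or key > owners[word][0]:
                owners[word] = (key, category)
    return {
        category: {w: s for w, s in words.items() if owners[w][1] == category}
        for category, words in dictionaries.items()
    }


dictionaries = {
    "politics": {"vote": 0.8, "budget": 0.5},
    "economy": {"budget": 0.5, "market": 0.9, "vote": 0.2},
}
frequencies = {"politics": {"budget": 3}, "economy": {"budget": 12}}
cleaned = remove_redundant_words(dictionaries, frequencies)
```

In this example 'vote' stays in 'politics' because of its higher weight, while 'budget' has equal weights in both categories and therefore stays in 'economy', where it occurs more often.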

Next, the non-related word removal unit 240 selects a word that is not related to the subject category among the words included in the word dictionary for each subject category, and removes the selected words from the word dictionary.

Here, the non-related word removal unit 240 may first cluster the words included in each word dictionary into a plurality of subsets based on the number of occurrences of the word in the subject category, the number of documents in the subject category that include the word, and the TF weight value according to the frequency with which the word appears in the documents. The TF weight value may be calculated in the same manner as described for the first word dictionary generation unit 210, so the value already calculated by the first word dictionary generation unit 210 may be used.

Here, an EM (Expectation-Maximization) clustering algorithm can be used as an algorithm for clustering.

Next, the non-related word removal unit 240 may select at least one non-related cluster based on the TF weight value according to the frequency among the clustered subsets.

Preferably, the non-related word remover 240 may select a cluster including words having a TF weight value smaller than a predetermined reference value as the non-related cluster. For this purpose, representative values of TF weight values representing each cluster can be calculated for each cluster, and non-related clusters among the clusters can be selected based on the representative values. Here, the reference value may be a value that can be set as needed.

Next, the non-related word removal unit 240 may remove words included in the non-related cluster from the word dictionary.
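The selection and removal of non-related clusters can be sketched as follows. The sketch assumes that cluster labels have already been produced by a clustering step such as EM clustering, which is not shown, and it uses the mean TF weight of a cluster as its representative value; the specification does not state which representative value is used, so the mean is an assumption. All names and numbers are hypothetical.

```python
from statistics import mean


def remove_unrelated_words(dictionary, tf_values, cluster_of, reference):
    """Drop words that belong to clusters whose representative TF value is low.

    dictionary: {word: weight} for one subject category.
    tf_values:  {word: TF weight value}.
    cluster_of: {word: cluster label} from a prior clustering step (not shown).
    reference:  clusters with a mean TF value below this are non-related.
    """
    clusters = {}
    for word in dictionary:
        clusters.setdefault(cluster_of[word], []).append(tf_values[word])
    # Assumption: the representative value of a cluster is the mean TF value.
    unrelated = {cid for cid, values in clusters.items() if mean(values) < reference}
    return {w: s for w, s in dictionary.items() if cluster_of[w] not in unrelated}


dictionary = {"election": 0.9, "vote": 0.7, "weather": 0.3, "lunch": 0.2}
tf_values = {"election": 0.05, "vote": 0.04, "weather": 0.002, "lunch": 0.001}
cluster_of = {"election": 0, "vote": 0, "weather": 1, "lunch": 1}
filtered = remove_unrelated_words(dictionary, tf_values, cluster_of, reference=0.01)
```

Removing whole clusters rather than individual words means that a word is judged together with the words that behave like it in the subject category.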

FIG. 6 is a reference diagram for explaining the operation of the non-related word removal unit 240.

FIG. 6 is a graph showing the result of clustering, by the non-related word removal unit 240 according to the present invention, the words included in the word dictionary corresponding to the 'political' subject category according to the method described above. Referring to FIG. 6, the words are divided into a total of 15 clusters. In FIG. 6, the y-axis indicates the number of documents in the subject category that include the word, and the other axis indicates the number of occurrences of the word in the subject category. The clusters shown in red are clusters whose TF weight values are higher than a predetermined criterion, and the clusters shown in blue are clusters whose TF weight values are lower than the predetermined criterion. In this case, the non-related word removal unit 240 selects the clusters in FIG. 6 whose TF weight values are lower than the predetermined criterion as the non-related clusters, and can remove the words included in the selected non-related clusters from the corresponding word dictionary.

In this way, the text subject category classification apparatus according to the present invention removes the heterogeneity of the data that arises from generating the word dictionary with data of heterogeneous media, and can thereby classify subject categories more reliably.

Next, the operation of the subject category classification unit 300 will be described in more detail.

Here, the classification target sentence may be a sentence included in data transmitted over the Internet, preferably a sentence generated in social media. For example, the classification target sentence can be a sentence posted on a social network service such as Twitter or Facebook. Sentences in such social network services are usually shorter than sentences in other general texts and therefore contain few words. Consequently, existing subject classification methods, which are mainly applied to long texts containing many words, are not well suited to this case.

Accordingly, in order to classify the subject category of a short sentence more reliably, the subject category classification unit 300 according to the present invention searches for each word included in the classification target sentence in the word dictionary for each subject category as described above, generates a feature vector according to the weights of the words found, and determines the subject category of the sentence based on the generated feature vector.

More specifically, the subject category classification unit 300 may include a feature vector extracting unit 310 and a classification unit 320.

FIG. 7 is a detailed block diagram of the subject category classification unit 300.

The feature vector extracting unit 310 selects, from among the words included in the classification target sentence, the words included in the word dictionary of each subject category, and can generate the feature vector by setting, as each element of the feature vector, a value computed from the weights of the words selected for the corresponding subject category.

Here, the feature vector may be a vector having as many elements as there are subject categories, and the value of each element may be a value computed from the weights of those words of the classification target sentence that are included in the word dictionary of the corresponding subject category.

In this case, the feature vector extracting unit 310 may set, as each element of the feature vector, the sum of the weights of the words selected for the corresponding subject category.

For example, suppose there are seven subject categories: 'culture', 'economy', 'world', 'politics', 'science', 'society', and 'sports'. The words in the classification target sentence are searched for in the word dictionary of each subject category, and the weights of the words found in each word dictionary are added up per subject category to obtain the sum of weights for each subject category. If, for the subject categories 'culture', 'economy', 'world', 'politics', 'science', 'society', and 'sports', the sums of the weights of the included words are 4, 6, 8, 3, 0, 2, and 0, respectively, the feature vector (4, 6, 8, 3, 0, 2, 0) is generated.
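The construction of the feature vector can be sketched as follows. The word dictionaries, their weights, and the example sentence words are hypothetical and were chosen only so that the result reproduces the vector (4, 6, 8, 3, 0, 2, 0) from the example above.

```python
CATEGORIES = ["culture", "economy", "world", "politics", "science", "society", "sports"]


def feature_vector(sentence_words, dictionaries, categories=CATEGORIES):
    """One element per subject category: the summed weights of the sentence's
    words that are registered in that category's word dictionary."""
    return [
        sum(dictionaries.get(category, {}).get(word, 0) for word in sentence_words)
        for category in categories
    ]


# Hypothetical word dictionaries ({category: {word: weight}}).
dictionaries = {
    "culture": {"festival": 4},
    "economy": {"trade": 6},
    "world": {"summit": 5, "treaty": 3},
    "politics": {"vote": 3},
    "society": {"school": 2},
}
words = ["summit", "treaty", "trade", "festival", "vote", "school"]
vector = feature_vector(words, dictionaries)
```

A subject category with no word dictionary entry for any word of the sentence, such as 'science' or 'sports' here, simply contributes 0 to the vector.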

Next, the classification unit 320 may determine the subject category of the classification target sentence based on the generated feature vector.

The classification unit 320 may determine, according to a maximum weight technique, the subject category corresponding to the element having the largest value among the elements of the feature vector as the subject category of the classification target sentence. For example, when the feature vector extracted from a classification target sentence for the subject categories 'culture', 'economy', 'world', 'politics', 'science', 'society', and 'sports' is (4, 6, 8, 3, 0, 2, 0), the subject category 'world', which corresponds to the highest element value 8, can be determined as the subject category of the classification target sentence.
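The maximum weight technique amounts to taking the index of the largest element of the feature vector. A minimal sketch follows; the function name is hypothetical, and returning the first category when several elements share the maximum is an assumption, since the specification does not say how such ties are resolved.

```python
CATEGORIES = ["culture", "economy", "world", "politics", "science", "society", "sports"]


def classify_max_weight(vector, categories=CATEGORIES):
    """Return the subject category whose feature vector element is largest.

    On a tie, the first category with the maximum value is returned (assumption).
    """
    best_index = max(range(len(vector)), key=lambda i: vector[i])
    return categories[best_index]


predicted = classify_max_weight([4, 6, 8, 3, 0, 2, 0])
```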

Alternatively, the classification unit 320 may classify the subject category of the classification target sentence based on the feature vector, using a classifier based on a support vector machine (SVM) that has been trained in advance.

Here, the classifier based on the support vector machine can be configured and applied according to the method proposed in "Suykens JA, Vandewalle J, Least squares support vector machine classifiers. Neural processing letters 9 (3): 293-300 (1999)". The training of the SVM-based classifier can be performed using training data in which a subject category has been set in advance for each feature vector.

Here, in training the support vector machine-based classifier, the classification unit 320 may divide the training data into sentence units rather than document units and perform training on a sentence-by-sentence basis. It is preferable that the classification unit 320 learns the parameters of the support vector machine-based classifier using training data generated by dividing a plurality of documents classified in advance by subject category into sentences. For example, news articles, newspaper articles, or magazine articles classified in advance by subject category can be used as these documents.
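The preparation of sentence-level training data can be sketched as follows. The sketch covers only the splitting of labeled documents into labeled sentences; computing a feature vector for each sentence and training the SVM-based classifier are not shown. The punctuation-based sentence splitter is a simplifying assumption, and the function name and example documents are hypothetical.

```python
import re


def sentence_training_data(labeled_documents):
    """Split each (text, category) document into (sentence, category) pairs.

    Sentences are split after '.', '!' or '?' followed by whitespace, which is
    a naive rule used here only for illustration.
    """
    pairs = []
    for text, category in labeled_documents:
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                pairs.append((sentence, category))
    return pairs


documents = [
    ("The party won the election. Turnout was high.", "politics"),
    ("The team won the final!", "sports"),
]
training_pairs = sentence_training_data(documents)
```

Each sentence inherits the subject category of the document it came from, so the training examples have a length closer to that of the short social media sentences to be classified.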

Hereinafter, a text subject category classification system according to another embodiment of the present invention will be described.

The text subject category classification system according to the present invention may include a service server 10.

Here, the service server 10 may include a data collection unit 100, a word dictionary generation unit 200, and may further include a subject category classification unit 300 as needed. Here, the service server 10 may be a text subject category classification apparatus according to the present invention described above.

The data collection unit 100 may receive a plurality of documents classified in advance by subject category, select words from the sentences included in the documents, and collect the words by subject category.

The word dictionary generation unit 200 may receive the words collected for each subject category by the data collection unit 100, calculate weights for the input words, select, based on the calculated weights, the words to be included in the word dictionary existing for each subject category from among the input words, and register them in the respective word dictionaries.

The subject category classification unit 300 may receive a classification target sentence, select, for each subject category, the words included in each word dictionary from among the words included in the classification target sentence, generate a feature vector according to the weights of the words selected for each subject category, and determine the subject category of the classification target sentence based on the generated feature vector.

In the text subject category classification system according to the present invention, the service server 10 may include the data collection unit 100 and the word dictionary generation unit 200 described above, and a separate terminal 20 may include the subject category classification unit 300 described above.

FIG. 8 is a block diagram of a text subject category classification system according to another embodiment of the present invention in the case where the terminal 20 exists separately.

In this case, the service server 10 can connect to the external word dictionary database 50 and store therein the word dictionary generated by the word dictionary generation unit 200, and the terminal 20 can connect to the word dictionary database 50 and access the word dictionary stored in advance.

Here, the subject category classification unit 300 included in the terminal 20 receives a classification target sentence, connects to the word dictionary database 50, selects, for each subject category, the words included in each word dictionary from among the words included in the classification target sentence, generates a feature vector according to the weights of the words selected for each subject category, and determines the subject category of the classification target sentence based on the generated feature vector.

In another embodiment, the service server 10 may include a word dictionary database 50 within the server device.

FIG. 9 is a block diagram of a text subject category classification system according to another embodiment in which the word dictionary database 50 is included in the server device.

The data collection unit 100, the word dictionary generation unit 200, and the subject category classification unit 300 can operate in the same manner as the data collection unit 100, the word dictionary generation unit 200, and the subject category classification unit 300 of the text subject category classification apparatus described with reference to FIGS. 1 to 7. Accordingly, the operation of each component has been described briefly, omitting the overlapping parts.

FIG. 10 is a flowchart of a text subject category classification method according to another embodiment of the present invention.

The text subject category classification method according to the present invention may include a data collection step S100, a word dictionary generation step S200, and a subject category classification step S300. The text subject category classification method according to the present invention can operate in the same manner as the text subject category classification apparatus described with reference to FIGS. 1 to 7 above. The overlapping portions will be omitted and briefly described.

In the data collection step S100, the service server 10 receives a plurality of documents classified in advance by subject category, selects words in sentences included in the document, and collects words by the subject category.

In the word dictionary generation step S200, the service server 10 calculates weights for the words collected for each subject category, selects, based on the calculated weights, the words to be included in the word dictionary existing for each subject category from among the collected words, and registers them in the respective word dictionaries.

In the subject category classification step S300, a classification target sentence is received, the words included in each word dictionary are selected for each subject category from among the words included in the classification target sentence, a feature vector is generated according to the weights of the words selected for each subject category, and the subject category of the classification target sentence is determined based on the generated feature vector. Here, the subject category classification step S300 may be performed in the service server 10 or in the terminal 20 as required.

Although all the elements constituting the embodiments of the present invention described above have been described as being combined into one or operating in combination, the present invention is not necessarily limited to these embodiments. That is, within the scope of the present invention, the components may be selectively combined and operated in one or more combinations.

In addition, although each of the components may be implemented as independent hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or a plurality of hardware devices. Such a computer program may be stored in a computer readable medium such as a USB memory, a CD disk, or a flash memory, and read and executed by a computer to implement an embodiment of the present invention. The recording medium of the computer program may include a magnetic recording medium, an optical recording medium, a carrier wave medium, and the like.

Furthermore, all terms including technical or scientific terms have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, unless otherwise defined in the Detailed Description. Commonly used terms, such as predefined terms, should be interpreted to be consistent with the contextual meanings of the related art, and are not to be construed as ideal or overly formal, unless expressly defined to the contrary.

It will be apparent to those skilled in the art that various modifications, changes, and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims. Therefore, the embodiments disclosed in the present invention and the accompanying drawings are intended to illustrate and not to limit the technical spirit of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments and the accompanying drawings. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of equivalents should be construed as falling within the scope of the present invention.

10: Text subject category classification device
20: Terminal
50: Word dictionary database
100: Data collection unit
200: Word dictionary generation unit
210: First word dictionary generation unit
220: Second word dictionary generation unit
230: Redundant word removing unit
240: Non-related word removal unit
300: Subject category classification unit
310: Feature vector extraction unit
320: Classification unit
S100: Data collection step
S200: Word dictionary generation step
S300: Subject category classification step

Claims

A text subject category classification apparatus comprising:
a data collection unit that receives a plurality of documents classified in advance by subject category, selects words from sentences included in the documents, and collects the words by subject category;
a word dictionary generation unit that receives the words collected by subject category from the data collection unit, calculates weights for the input words, selects, based on the calculated weights, words to be included in a word dictionary existing for each subject category, and registers the selected words in each word dictionary; and
a subject category classification unit that receives a classification target sentence, selects, for each subject category, the words included in each word dictionary from among the words included in the classification target sentence, generates a feature vector according to the weights of the words selected for each subject category, and determines the subject category of the classification target sentence based on the generated feature vector,
wherein the word dictionary generation unit comprises a non-related word removal unit that:
clusters the words included in each word dictionary for each subject category into a plurality of subsets based on the number of occurrences of the word in the subject category, the number of documents in the subject category that include the word, and the frequency with which the word appears in the documents;
selects at least one non-related cluster from among the clustered subsets based on the frequency; and
removes the words included in the non-related cluster from the word dictionary.

The apparatus according to claim 1,
wherein the data collection unit removes, from the sentence, character strings composed of a predetermined number of characters or fewer, special characters, and numeric characters, and performs morphological analysis to select from the sentence the words to be input to the word dictionary generation unit.

The apparatus according to claim 1,
wherein the data collection unit receives news articles, newspaper articles, or magazine articles classified in advance by subject category as the plurality of documents classified in advance by subject category.

The apparatus according to claim 1,
wherein the word dictionary generation unit comprises a first word dictionary generation unit that calculates, for the words input from the data collection unit, a TF-IDF weight based on the sentences including the input word and information about the subject category, and selects, based on the calculated TF-IDF weight, words to be included in the word dictionary from among the input words.

The apparatus according to claim 4,
wherein the first word dictionary generation unit calculates the TF-IDF weight based on the number of times the input word appears in the document, the number of sentences in the document that include the input word, and the number of subject categories that include the input word.

The apparatus according to claim 1,
wherein the word dictionary generation unit comprises a second word dictionary generation unit that performs LDA analysis on the words input from the data collection unit, calculates for the words an LDA word weight according to the distribution of the subject category and the distribution of words appearing in the subject category, sorts the words according to the calculated LDA word weight, assigns a predetermined number to each of the words, calculates an LDA rank weight by dividing the assigned number by the total number of the words, and selects, based on the calculated LDA rank weight, words to be included in the word dictionary from among the input words.

The apparatus according to claim 6,
wherein the second word dictionary generation unit calculates, for the words input from the data collection unit, a TF-IDF weight based on the sentences including the input word and information about the subject category, removes from the input words the words whose calculated TF-IDF weight is smaller than a predetermined reference value,
performs the LDA analysis on the words remaining after the removal, calculates the LDA rank weight according to the analysis result, and selects, based on the calculated LDA rank weight, words to be included in the word dictionary from among the input words.

The apparatus according to claim 4,
Further comprising a duplicate word removing unit for removing duplicate words that are commonly included in two or more of the word dictionaries for the respective subject categories.

The apparatus according to claim 8,
Wherein the duplicate word removing unit selects the subject category from which the duplicate word is to be removed based on the TF-IDF weight of the duplicate word or the frequency of occurrence of the duplicate word in the word dictionary, and removes the duplicate word from the word dictionary of the selected subject category.
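
One possible reading of this duplicate-removal step is sketched below in Python. The claim only says that the category to remove from is chosen based on the TF-IDF weight or the occurrence frequency; keeping the word in the category where its weight is highest and removing it everywhere else is an assumption, as is the function name.

```python
def remove_duplicate_words(dictionaries):
    """dictionaries: {category: {word: TF-IDF weight}}; returns a new mapping.

    A word found in several dictionaries is kept only in the category
    where its weight is highest (ties go to the first category seen).
    """
    best = {}  # word -> (category, weight) of the strongest occurrence
    for category, words in dictionaries.items():
        for word, weight in words.items():
            if word not in best or weight > best[word][1]:
                best[word] = (category, weight)
    return {
        category: {w: wt for w, wt in words.items() if best[w][0] == category}
        for category, words in dictionaries.items()
    }


dictionaries = {
    "sports": {"match": 0.75, "record": 0.5},
    "economy": {"market": 0.5, "record": 0.25},
}
deduplicated = remove_duplicate_words(dictionaries)
```

Here "record" stays in the sports dictionary, where its weight is higher, and is dropped from the economy dictionary.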

The apparatus according to claim 1,
Wherein the subject category classifier comprises a feature vector extracting unit that selects, for each subject category, the words included in the word dictionary of that subject category from among the words included in the classification target sentence, and generates the feature vector by setting, as each element of the feature vector, a value calculated for the corresponding subject category from the weights of the selected words; And
A classification unit for determining the subject category of the classification target sentence based on the generated feature vector.

The apparatus according to claim 10,
Wherein the classification unit determines, according to a maximum weight technique, the subject category corresponding to the element having the largest value among the elements of the feature vector as the subject category of the classification target sentence.
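
The feature vector and the maximum weight technique can be sketched in Python as follows. Using the sum of the dictionary weights of the matched words as each element is an assumption (the claims only say the value is calculated from the weights), and the function names and sample dictionaries are hypothetical.

```python
def feature_vector(sentence_words, dictionaries, categories):
    """One element per subject category: here, the sum of the weights of
    the sentence's words found in that category's word dictionary."""
    return [
        sum(dictionaries[c].get(word, 0.0) for word in sentence_words)
        for c in categories
    ]


def classify_max_weight(sentence_words, dictionaries):
    """Return the category whose feature vector element is largest."""
    categories = sorted(dictionaries)
    vector = feature_vector(sentence_words, dictionaries, categories)
    best = max(range(len(vector)), key=vector.__getitem__)
    return categories[best], vector


dictionaries = {
    "economy": {"market": 0.5, "price": 0.25},
    "sports": {"match": 0.75, "goal": 0.5},
}
words = "the match ended with a late goal".split()
category, vector = classify_max_weight(words, dictionaries)
```

Words that appear in no dictionary contribute nothing, so the short sentence above is decided by "match" and "goal" alone.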

The apparatus according to claim 10,
Wherein the classification unit determines the subject category of the classification target sentence based on the feature vector, using a pre-trained classifier based on a support vector machine (SVM).

delete

The apparatus according to claim 1,
Further comprising a word dictionary database for storing the word dictionary generated by the word dictionary generating unit.

A text subject category classification system comprising a service server,
Wherein the service server comprises:
A data collection unit that receives a plurality of documents classified in advance by subject category, selects words from sentences included in the documents, and collects the words by subject category; And
A word dictionary generating unit that receives the words collected by subject category from the data collection unit, calculates a weight for the input words, selects, based on the calculated weight, words to be included in the word dictionary provided for each subject category from among the input words, and registers the selected words in each word dictionary,
Wherein the word dictionary generating unit comprises an unrelated word removing unit that:
Clusters the words included in each of the word dictionaries for the respective subject categories into a plurality of subsets based on the number of the words in the subject category, the number of the documents in the subject category in which the words are included, and the frequency at which the words appear,
Selects at least one unrelated cluster from among the clustered subsets based on the frequency,
And removes the words included in the unrelated cluster from the word dictionary.

A text subject category classification method comprising:
A data collection step in which a service server receives a plurality of documents classified in advance by subject category, selects words from sentences included in the documents, and collects the words by subject category;
A word dictionary generation step in which the service server calculates a weight for the words collected by subject category, selects, based on the calculated weight, words to be included in the word dictionary provided for each subject category from among the collected words, and registers the selected words in each word dictionary; And
A subject category classification step of receiving a classification target sentence, selecting, for each subject category, the words included in the word dictionary of that subject category from among the words included in the classification target sentence, generating a feature vector according to the weights of the selected words for each subject category, and determining the subject category of the classification target sentence based on the generated feature vector,
Wherein the word dictionary generation step comprises:
Clustering the words included in each of the word dictionaries for the respective subject categories into a plurality of subsets based on the number of the words in the subject category, the number of the documents in the subject category in which the words are included, and the frequency at which the words appear,
Selecting at least one unrelated cluster from among the clustered subsets based on the frequency,
And removing the words included in the unrelated cluster from the word dictionary.
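
The unrelated-word removal in the two independent claims above can be illustrated with a small Python sketch. The claims name neither a clustering algorithm nor a rule for picking the unrelated cluster, so the one-dimensional two-means clustering on a single frequency value, the choice of the low-frequency cluster as the unrelated one, and the function names are all assumptions.

```python
def two_means_1d(values, iterations=20):
    """Tiny 1-D k-means with k=2; returns the (low, high) cluster centers."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def remove_unrelated_words(word_freq):
    """word_freq: {word: frequency within one subject category}.

    Drops the words in the low-frequency cluster, which is assumed
    here to be the unrelated cluster.
    """
    if len(word_freq) < 2:
        return dict(word_freq)
    lo, hi = two_means_1d(list(word_freq.values()))
    if lo == hi:  # all frequencies equal: nothing to separate
        return dict(word_freq)
    return {w: f for w, f in word_freq.items() if abs(f - lo) > abs(f - hi)}


frequencies = {"goal": 40, "match": 38, "team": 35, "today": 3, "said": 2}
cleaned = remove_unrelated_words(frequencies)
```

A fuller version would cluster on several features at once (word count, document count, frequency), as the claims list; the single-feature case is used here only to keep the sketch short.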