KR20180059112A - Apparatus for classifying contents and method for using the same

Info

Publication number: KR20180059112A
Application number: KR1020160158289A
Authority: KR
Inventors: 윤여찬
Original assignee: 한국전자통신연구원 (Electronics and Telecommunications Research Institute, ETRI)
Priority date: 2016-11-25
Filing date: 2016-11-25
Publication date: 2018-06-04
Also published as: KR102007437B1

Abstract

A content classification apparatus and method are disclosed. A content classification apparatus according to an embodiment of the present invention includes a content storage unit for receiving and storing content from a content database; a keyword extracting unit for extracting keywords from the stored content in consideration of the position at which each keyword appears; and a content analyzer for linking the content to the extracted keywords, classifying the content by keyword, and analyzing the content based on statistical information from a statistical database.

Description

APPARATUS FOR CLASSIFYING CONTENTS AND METHOD FOR USING THE SAME

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a data analysis technique, and more particularly, to a big data text analysis technique.

Book content is the digital form of documents that were previously published on paper. Therefore, meaningful data can be derived from it by applying techniques such as text analysis and classification.

Text analysis technology can extract meaningful data by analyzing documents written as digital text, such as blogs, social media (SNS) posts, and news. For example, it can analyze blog text to extract and retrieve the main keywords. Big data text technology can extract meaningful information by analyzing a large amount of text. For example, public opinion trends can be analyzed from SNS data. Text classification techniques can group large volumes of text into similar categories. For example, news documents can be automatically classified into economy, politics, culture, and so on.

The prior art has the disadvantage that content can be searched or classified only according to categories specified in advance (for example, self-development, political management, novel, essay, poem), which makes it difficult to classify content in other ways.

Korean Patent Laid-Open Publication No. 10-2013-0104573, entitled "Method and apparatus for classifying morpheme-based contents," discloses a method that analyzes morphemes in the titles of online content such as blogs and news, searches for compound words and the like, applies weights, and maps the content to a predetermined category, thereby classifying the content efficiently in real time.

However, Korean Patent Laid-Open No. 10-2013-0104573 has a limitation in that it consumes much time and expense because it analyzes content based on extracted nouns only.

It is an object of the present invention to extract keywords automatically by analyzing content and to subdivide the content according to various classifications.

In addition, the present invention aims at automatically grasping sales volume, preference, and the like for various categories through such an analysis method.

In addition, the present invention aims to improve the quality of content providing service and improve the efficiency of analysis by classifying the content in various ways.

According to an aspect of the present invention, there is provided a content classification apparatus comprising: a content storage unit for receiving and storing content from a content database; a keyword extracting unit for extracting keywords from the stored content in consideration of the position at which each keyword appears; and a content analyzer for linking the content to the extracted keywords, classifying the content by keyword, and analyzing the content based on statistical information from a statistical database.

The present invention can extract keywords automatically by analyzing content, and can provide the content according to various classifications.

In addition, the present invention can automatically grasp sales volume, preference, and the like for various categories through this analysis method.

In addition, the present invention can classify contents into various categories, thereby improving the quality of the contents providing service and increasing the efficiency of the analysis.

FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a keyword extraction method according to an embodiment of the present invention.
FIG. 3 is a diagram showing a table of contents of book content according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a two-part content classification algorithm according to an embodiment of the present invention.
FIG. 5 is a graph illustrating sales amounts of first contents according to an embodiment of the present invention.
FIG. 6 is a graph illustrating sales amounts of second contents according to an embodiment of the present invention.
FIG. 7 is a table showing preferred keywords according to an embodiment of the present invention.
FIG. 8 is a table showing preferred keywords according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.
FIG. 10 is an operation flowchart showing an example of the content classification step shown in FIG. 9 in detail.
FIG. 11 is a block diagram illustrating a computer system in accordance with an embodiment of the present invention.

The present invention will now be described in detail with reference to the accompanying drawings. Hereinafter, a repeated description, a known function that may obscure the gist of the present invention, and a detailed description of the configuration will be omitted. Embodiments of the present invention are provided to more fully describe the present invention to those skilled in the art. Accordingly, the shapes and sizes of the elements in the drawings and the like can be exaggerated for clarity.

Throughout the specification, when an element is referred to as "comprising" another element, this means that it may further include other elements, rather than excluding them, unless specifically stated otherwise.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention. FIG. 2 is a diagram illustrating a keyword extraction method according to an embodiment of the present invention. FIG. 3 is a diagram showing a table of contents of book content according to an embodiment of the present invention. FIG. 4 is a diagram illustrating a two-part content classification algorithm according to an embodiment of the present invention.

Referring to FIG. 1, a content classification apparatus 100 according to an embodiment of the present invention includes a content storage unit 110, a keyword extracting unit 120, a content classification unit 130, a content analysis unit 140, a content database 10, and a statistical database 20.

In the present invention, book content is described as an example of content, but the invention is equally applicable to any content that includes words, characters, or text.

At this time, the content database 10 and the statistical database 20 may exist outside the apparatus.

The content storage unit 110 may receive and store the content from the content database 10.

The keyword extracting unit 120 may extract keywords from the stored contents in consideration of the appearance position of the keyword.

At this time, the keyword extracting unit 120 can extract keywords using various extraction methods.

At this time, the keyword extracting unit 120 may measure, for each word, a value obtained by multiplying the term frequency (TF) in the document by the inverse document frequency (IDF) over the entire document set, use the measured value as the weight of the word, and define a word having a specific weight or more as a keyword.
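
The TF-IDF weighting described above can be illustrated with a minimal sketch. It assumes that tokenization has already been done, uses the raw term count for TF and a natural-log IDF (one common variant; the source does not state which variant is used), and the function name `tfidf_keywords`, the sample documents, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, threshold):
    """Return, per document, the words whose TF-IDF weight is >= threshold.

    docs: list of token lists (tokenization is assumed to happen upstream).
    The TF and IDF variants used here are illustrative assumptions.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = raw term count * log-scaled inverse document frequency.
        weights = {
            word: count * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Keep only words whose weight reaches the threshold.
        results.append({w: s for w, s in weights.items() if s >= threshold})
    return results


# Hypothetical sample data.
docs = [
    ["family", "cooking", "family", "recipes"],
    ["travel", "europe", "recipes"],
    ["travel", "asia", "budget"],
]
keywords = tfidf_keywords(docs, threshold=1.0)
```

Words that occur in most documents (here "recipes" and "travel") receive a low IDF and fall below the threshold, while words concentrated in one document are kept as keywords.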

At this time, the keyword extracting unit 120 may consider the position where the keyword appears in order to determine the weight of the keyword.

In content that includes text, words can appear at various locations, such as the title, table of contents, body text, and picture labels of the content.

At this time, the keyword extracting unit 120 may treat words from the title and the table of contents as more important keywords. Words in the title and table of contents are typically chosen selectively to represent the content. Thus, the words appearing in the title and table of contents may represent the characteristics of the content and may be more important than the words used in the body text.

In this case, the keyword extracting unit 120 may determine the weight of each word by multiplying the word's weight at each appearance position by a weight corresponding to that position and linearly combining the results.

[Equation 1]

In Equation (1), W_t is the weight of the word W appearing in the title, W_c is the weight of the word W appearing in the table of contents, and W_d is the weight of the word W appearing in the body text.
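
The body of Equation (1) is not reproduced in this text, so the following is only a sketch of one linear combination that is consistent with the definitions of W_t, W_c, and W_d: W = alpha * W_t + beta * W_c + gamma * W_d. The coefficient names `alpha`, `beta`, `gamma`, their default values, and the sample weights are assumptions made for illustration and do not come from the source.

```python
def combine_position_weights(w_title, w_toc, w_body,
                             alpha=3.0, beta=2.0, gamma=1.0):
    """Linearly combine the per-position weights of one word.

    Assumed form: W = alpha * W_t + beta * W_c + gamma * W_d.
    alpha, beta, gamma are hypothetical position coefficients; larger
    values for title and table of contents reflect the idea that words
    found there matter more than words in the body text.
    """
    return alpha * w_title + beta * w_toc + gamma * w_body


# Hypothetical per-position weights: (title, table of contents, body).
position_weights = {
    "family": (0.5, 0.25, 0.25),
    "chapter": (0.0, 0.0, 1.0),
}
combined = {
    word: combine_position_weights(*weights)
    for word, weights in position_weights.items()
}
```

With these illustrative values, a word that appears in the title and table of contents outranks a word that appears only in the body text, even though the body-only word has the larger raw weight at its position.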

At this time, the keyword extracting unit 120 may merge words having similar meaning in the extracted keywords.

For example, the extracted "family" and "home" keywords are very similar in terms of semantics, and therefore it is possible to derive more accurate analysis results by merging and classifying them.

For this, the keyword extracting unit 120 may utilize a vocabulary dictionary and utilize a synonym search technique.

Referring to FIG. 2, it can be seen that, after keywords are extracted from the content, keywords that are similar to each other are merged.
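
A minimal sketch of dictionary-based synonym merging follows. The synonym dictionary, the function name `merge_synonyms`, and the choice to sum the weights of merged keywords are assumptions; the source states only that a vocabulary dictionary and synonym search may be used.

```python
def merge_synonyms(keyword_weights, synonym_to_canonical):
    """Merge keywords that map to the same canonical form.

    keyword_weights: dict of keyword -> weight.
    synonym_to_canonical: hypothetical dictionary of synonym -> canonical word.
    Summing the weights of merged keywords is an illustrative choice.
    """
    merged = {}
    for word, weight in keyword_weights.items():
        # Words without a dictionary entry are kept under their own name.
        canonical = synonym_to_canonical.get(word, word)
        merged[canonical] = merged.get(canonical, 0.0) + weight
    return merged


synonyms = {"home": "family"}  # hypothetical dictionary entry
merged = merge_synonyms({"family": 2.0, "home": 1.5, "travel": 1.0}, synonyms)
```

In this example "home" is folded into "family", so the two keywords are treated as one classification in later steps.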

The content classification unit 130 may classify the content by keyword by connecting the content to the extracted keyword.

Referring to FIG. 3, it can be seen that the content classification unit 130 classifies the content in three steps.

At this time, in step 1, the content classification unit 130 may classify the content by linking it to a keyword only when the keyword appears in the table of contents or title, so that the classification accuracy is high. By classifying in this way, it is possible to select and link related content for each extracted keyword.
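
Step 1 can be sketched as a simple membership test. The record layout (`id`, `title`, `toc`, `body` as token lists) and the function name `link_by_title_or_toc` are hypothetical.

```python
def link_by_title_or_toc(contents, keywords):
    """Step 1 sketch: link a content item to a keyword only if the keyword
    appears in its title or table of contents (body text is ignored)."""
    links = {kw: [] for kw in keywords}
    for item in contents:
        heading_words = set(item["title"]) | set(item["toc"])
        for kw in keywords:
            if kw in heading_words:
                links[kw].append(item["id"])
    return links


# Hypothetical content records with pre-tokenized fields.
contents = [
    {"id": "b1", "title": ["family", "cooking"], "toc": ["recipes"],
     "body": ["travel"]},
    {"id": "b2", "title": ["budget", "travel"], "toc": ["asia"],
     "body": ["family"]},
]
links = link_by_title_or_toc(contents, ["family", "travel"])
```

Item b2 mentions "family" only in its body, so it is not linked to that keyword in this step; this favors accuracy over coverage, which step 2 then addresses.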

In step 2, the content classification unit 130 may add content that is highly similar to the content classified in step 1 to the corresponding classification. The similarity between the content already belonging to a specific classification and an individual content item can be measured with various techniques, such as K-means using cosine similarity. If the similarity between a classification and an individual content item exceeds a certain threshold, the content item is included in that classification. In this way, a larger amount of content can be added to the classifications that were selected for accuracy in step 1.
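
The threshold test in step 2 might look like the following sketch. It does not implement K-means; instead it compares each candidate with a single bag-of-words centroid of the classification using cosine similarity, which is a simplification chosen for illustration. The function names, the sample data, and the threshold value are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_category(members, candidates, threshold):
    """Step 2 sketch: return ids of candidates similar enough to the category.

    members / candidates: dict of content id -> token list.
    The category is represented by the summed word counts of its members.
    """
    centroid = Counter()
    for tokens in members.values():
        centroid.update(tokens)
    return [
        cid for cid, tokens in candidates.items()
        if cosine(centroid, Counter(tokens)) >= threshold
    ]


# Hypothetical data: b1 was linked in step 1; b3 and b4 were not.
members = {"b1": ["family", "cooking", "recipes"]}
candidates = {
    "b3": ["family", "recipes", "kitchen"],
    "b4": ["travel", "asia", "budget"],
}
similar = expand_category(members, candidates, threshold=0.5)
```

Here b3 shares two of three words with the category and is added, while b4 shares none and is left out.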

In step 3, the content classification unit 130 may analyze the similarity between the classified content bundles and combine each pair of bundles whose similarity is equal to or greater than a threshold, so that two similar classifications are merged into one.

At this time, by performing this third step, the content classification unit 130 may merge keyword pairs that were not merged by the keyword extracting unit 120.
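
A sketch of step 3 is given below. The source does not specify how similarity between bundles is measured; this example assumes Jaccard overlap of the member sets and a greedy pairwise merge. The function name, the combined key format, and the threshold are hypothetical.

```python
def merge_similar_bundles(bundles, threshold):
    """Step 3 sketch: greedily merge bundle pairs whose similarity >= threshold.

    bundles: dict of keyword -> set of content ids.
    Similarity is Jaccard overlap of the id sets (an assumed measure).
    """
    merged = {k: set(v) for k, v in bundles.items()}
    changed = True
    while changed:
        changed = False
        keys = sorted(merged)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                union = merged[a] | merged[b]
                sim = len(merged[a] & merged[b]) / len(union) if union else 0.0
                if sim >= threshold:
                    # Replace the pair with one combined bundle, then rescan.
                    merged[a + "+" + b] = union
                    del merged[a], merged[b]
                    changed = True
                    break
            if changed:
                break
    return merged


# Hypothetical bundles produced by steps 1 and 2.
bundles = {
    "family": {"b1", "b2", "b3"},
    "home": {"b1", "b2", "b3", "b5"},
    "travel": {"b4"},
}
result = merge_similar_bundles(bundles, threshold=0.6)
```

The "family" and "home" bundles overlap heavily and are combined, which shows how this step can catch near-synonymous keywords that a dictionary lookup missed.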

The content analyzer 140 may analyze the content classified by the keyword based on the statistical information of the statistical database 20.

The statistical information may correspond to sales amount, sales volume, number of copies published, sales volume by age / sex / region, and the like.

At this time, the content analyzer 140 can analyze the sales volume based on information such as age / sex / region with respect to a specific keyword classification.

At this time, the content analyzer 140 can use the analyzed sales volume to analyze the characteristics, preference trends, and the like of the main user group for the keyword.
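
The per-keyword sales analysis can be sketched as a grouped sum. The record fields (`content_id`, `age_group`, `region`, `units`), the function name `sales_by_group`, and all sample figures are invented for illustration; the schema of the statistical database 20 is not given in the source.

```python
from collections import defaultdict


def sales_by_group(links, sales_records, keyword, group_field):
    """Sum units sold for the content linked to a keyword, per group.

    links: dict of keyword -> list of content ids (classification result).
    sales_records: list of dicts with hypothetical fields.
    group_field: e.g. "age_group" or "region".
    """
    ids = set(links.get(keyword, []))
    totals = defaultdict(int)
    for record in sales_records:
        if record["content_id"] in ids:
            totals[record[group_field]] += record["units"]
    return dict(totals)


# Hypothetical classification result and sales statistics.
links = {"family": ["b1", "b3"], "travel": ["b2"]}
sales_records = [
    {"content_id": "b1", "age_group": "30s", "region": "Seoul", "units": 120},
    {"content_id": "b3", "age_group": "30s", "region": "Busan", "units": 30},
    {"content_id": "b1", "age_group": "20s", "region": "Seoul", "units": 40},
    {"content_id": "b2", "age_group": "20s", "region": "Seoul", "units": 75},
]
family_by_age = sales_by_group(links, sales_records, "family", "age_group")
family_by_region = sales_by_group(links, sales_records, "family", "region")
```

The same function serves age, sex, or region breakdowns by changing `group_field`, so the main user group for a keyword can be read off the largest total.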

FIG. 5 is a graph illustrating sales amounts of first contents according to an embodiment of the present invention. FIG. 6 is a graph illustrating sales amounts of second contents according to an embodiment of the present invention.

Referring to FIGS. 5 and 6, the content classification apparatus 100 according to an embodiment of the present invention analyzes sales amounts based on information such as age / gender / region for a specific keyword classification, and can thereby analyze what characteristics and preferences the main user group has.

FIG. 7 is a table showing preferred keywords according to an embodiment of the present invention. FIG. 8 is a table showing preferred keywords according to an embodiment of the present invention.

Referring to FIGS. 7 and 8, FIG. 7 is a table of the top three preferred content keywords for each age group, and FIG. 8 is a table of the top three preferred content keywords for each region.
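
A table of this kind could be derived as in the following sketch, which ranks keywords by total units sold within each group and keeps the top three. The input layout, the function name `top_keywords_per_group`, and the sample rows are hypothetical; the actual values in FIGS. 7 and 8 are not reproduced here.

```python
from collections import defaultdict


def top_keywords_per_group(keyword_sales, n=3):
    """Return the n best-selling keywords for each group.

    keyword_sales: iterable of (group, keyword, units) tuples, where group
    is an age group or a region. Ties are broken alphabetically.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for group, keyword, units in keyword_sales:
        totals[group][keyword] += units
    return {
        group: [
            kw for kw, _ in
            sorted(kws.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        ]
        for group, kws in totals.items()
    }


# Hypothetical per-group keyword sales.
rows = [
    ("20s", "travel", 90), ("20s", "family", 10),
    ("20s", "budget", 50), ("20s", "cooking", 30),
    ("30s", "family", 80), ("30s", "cooking", 60),
]
preferred = top_keywords_per_group(rows)
```

Passing region names instead of age groups as the first tuple element yields the per-region variant of the table.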

FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.

Referring to FIG. 9, a content classification method according to an embodiment of the present invention may extract a keyword first (S210).

That is, the step S210 may extract the keyword from the stored content.

At this time, the step S210 may receive and store contents from the contents database 10 first.

In this case, the keyword may be extracted in step S210 in consideration of the appearance position of the keyword from the stored content.

In this case, step S210 may extract keywords using various extraction methods.

In this case, in step S210, a value obtained by multiplying, for each word, the term frequency (TF) in the document by the inverse document frequency (IDF) over the entire document set may be measured and used as the weight of the word, and a word having a specific weight or more may be defined as a keyword.

In this case, in step S210, the position where the keyword appears may be considered to determine the weight of the keyword.

At this time, step S210 may treat words from the title and the table of contents as more important keywords. Words in the title and table of contents are typically chosen selectively to represent the content. Thus, the words appearing in the title and table of contents may represent the characteristics of the content and may be more important than the words used in the body text.

In this case, in step S210, the weight of each word may be determined by multiplying the word's weight at each appearance position by a weight corresponding to that position and linearly combining the results.

In Equation (1), W_t is the weight of the word W appearing in the title, W_c is the weight of the word W appearing in the table of contents, and W_d is the weight of the word W appearing in the body text.

At this time, the step S210 may merge words having similar meanings in the extracted keywords.

For example, the keywords "family" and "home" are very similar in terms of semantics, so it is possible to derive more accurate analysis results by merging them.

To this end, step S210 may utilize a vocabulary dictionary and utilize a synonym search technique.

In addition, the content classification method according to an embodiment of the present invention may classify the content (S220).

That is, in step S220, the content may be connected to the extracted keyword, and the content may be classified according to the keyword.

In this case, in step S220, the content may be classified by linking it to a keyword only when the keyword appears in the table of contents or title, so that the classification accuracy is high in step 1 (S221). By classifying in this way, it is possible to select and link related content for each extracted keyword.

In step S220, content that is highly similar to the content classified in step 1 may be added to the corresponding classification (S222). The similarity between the content already belonging to a specific classification and an individual content item can be measured with various techniques, such as K-means using cosine similarity. If the similarity between a classification and an individual content item exceeds a certain threshold, the content item is included in that classification. In this way, a larger amount of content can be added to the classifications that were selected for accuracy in step 1.

In step S220, the similarity between the classified content bundles may be analyzed, and each pair of bundles whose similarity is equal to or greater than the threshold may be combined, so that two similar classifications are merged into one (S223).

At this time, by performing this third step, step S223 may merge keyword pairs that were not merged by the keyword extracting unit 120.

In addition, the content classification method according to an embodiment of the present invention may analyze the content (S230).

That is, in step S230, the content classified by keyword may be analyzed based on the statistical information of the statistical database 20.

At this time, step S230 can analyze the sales volume based on information such as age / sex / region for a specific keyword classification.

At this time, in step S230, the characteristics or preference trends of the main user group for the keyword can be analyzed using the analyzed sales volume.

FIG. 10 is an operation flowchart showing an example of the content classification step shown in FIG. 9 in detail.

Referring to FIG. 10, in step S220, the content may be classified and linked to the keyword only when the keyword appears in the table of contents or the title so that the classification accuracy is high (step S221). By classifying in this way, it is possible to select and link related contents for each extracted keyword.

FIG. 11 is a block diagram illustrating a computer system in accordance with an embodiment of the present invention.

Referring to FIG. 11, embodiments of the present invention may be implemented in a computer system 1100 that includes a computer-readable recording medium. As shown in FIG. 11, the computer system 1100 may include one or more processors 1110, a memory 1130, a user interface input device 1140, a user interface output device 1150, and a storage 1160. In addition, the computer system 1100 may further include a network interface 1170 connected to the network 1180. The processor 1110 may be a central processing unit or a semiconductor device that executes processing instructions stored in the memory 1130 or the storage 1160. The memory 1130 and the storage 1160 can be various types of volatile or non-volatile storage media. For example, the memory may include ROM 1131 or RAM 1132.

As described above, the content classification apparatus and method according to the present invention are not limited to the configurations and methods of the embodiments described above; rather, all or some of the embodiments may be selectively combined so that various modifications can be made.

10: Content database 20: Statistical database
100: Content classification apparatus 110: Content storage unit
120: Keyword extracting unit 130: Content classification unit
140: Content analysis unit
1100: Computer system 1110: Processor
1120: Bus 1130: Memory
1131: ROM 1132: RAM
1140: User interface input device 1150: User interface output device
1160: Storage 1170: Network interface
1180: Network

Claims

A content classification apparatus comprising:
a content storage unit for receiving and storing content from a content database;
a keyword extracting unit for extracting keywords from the stored content in consideration of the position at which each keyword appears;
a content classifying unit for classifying the content by keyword by linking the content to the extracted keywords; and
a content analyzer for analyzing the content classified by keyword based on statistical information of a statistical database.