KR20180059112A - Apparatus for classifying contents and method for using the same - Google Patents

Apparatus for classifying contents and method for using the same

Info

Publication number
KR20180059112A
KR20180059112A KR1020160158289A KR20160158289A KR20180059112A KR 20180059112 A KR20180059112 A KR 20180059112A KR 1020160158289 A KR1020160158289 A KR 1020160158289A KR 20160158289 A KR20160158289 A KR 20160158289A KR 20180059112 A KR20180059112 A KR 20180059112A
Authority
KR
South Korea
Prior art keywords
content
keyword
contents
present
classification
Prior art date
Application number
KR1020160158289A
Other languages
Korean (ko)
Other versions
KR102007437B1 (en)
Inventor
윤여찬
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 filed Critical 한국전자통신연구원
Priority to KR1020160158289A priority Critical patent/KR102007437B1/en
Publication of KR20180059112A publication Critical patent/KR20180059112A/en
Application granted granted Critical
Publication of KR102007437B1 publication Critical patent/KR102007437B1/en

Classifications

    • G06F17/30705
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A content classification apparatus and method are disclosed. A content classification apparatus according to an embodiment of the present invention includes a content storage unit for receiving and storing content from a content database; a keyword extracting unit for extracting keywords from the stored content in consideration of the locations at which the keywords appear; and a content analyzer for classifying the content by keyword by linking the content to the extracted keywords and analyzing the content based on statistical information of a statistical database.

Description

APPARATUS FOR CLASSIFYING CONTENTS AND METHOD FOR USING THE SAME

The present invention relates to a data analysis technique, and more particularly, to a big data text analysis technique.

Book content is the digital form of documents that were previously published on paper. Meaningful data can therefore be derived from it by applying techniques such as the text analysis and classification described in this technical field.

Text analysis technology can extract meaningful data by analyzing documents written as digital text, such as blogs, social networking services (SNS), and news. For example, blog text can be analyzed to extract and retrieve key terms. Big data text technology can extract meaningful information by analyzing a large amount of text. For example, analyzing SNS posts can reveal trends such as shifts in public opinion. Text classification techniques can group large volumes of text into similar categories. For example, news documents can be automatically classified into economy, politics, culture, and so on.

The prior art has the disadvantage that content is searched and classified only according to categories specified in advance (for example, self-development, politics and management, novel, essay, poem), which makes it difficult to classify content in other ways.

Korean Patent Laid-Open Publication No. 10-2013-0104573, entitled "Method and apparatus for classifying morpheme-based contents," discloses a method that analyzes morphemes in online titles, such as those of blogs and news, and maps compound words and the like to a predetermined category by applying search and weighting to the content, thereby classifying the content efficiently in real time.

However, Korean Patent Laid-Open No. 10-2013-0104573 has a limitation in that it consumes much time and expense because it analyzes content based on extracted nouns only.

It is an object of the present invention to extract keywords automatically by analyzing content and to subdivide the content according to various classifications.

In addition, the present invention aims at automatically grasping sales volume, preference, and the like for various categories through such an analysis method.

In addition, the present invention aims to improve the quality of content providing service and improve the efficiency of analysis by classifying the content in various ways.

According to an aspect of the present invention, there is provided a content classification apparatus comprising: a content storage unit for receiving and storing content from a content database; a keyword extracting unit for extracting keywords from the stored content in consideration of the locations at which the keywords appear; and a content analyzer for classifying the content by keyword by linking the content to the extracted keywords and analyzing the content based on statistical information of a statistical database.

The present invention can extract keywords automatically by analyzing content, and can provide the content under various classifications.

In addition, the present invention can automatically grasp sales volume, preference, and the like for various categories through this analysis method.

In addition, the present invention can classify contents into various categories, thereby improving the quality of the contents providing service and increasing the efficiency of the analysis.

FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a keyword extraction method according to an embodiment of the present invention.
FIG. 3 is a table of contents of north-south contents according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a two-part content classification algorithm according to an embodiment of the present invention.
FIG. 5 is a graph illustrating sales amounts of first contents according to an embodiment of the present invention.
FIG. 6 is a graph illustrating sales amounts of second contents according to an embodiment of the present invention.
FIG. 7 is a table showing preferred keywords according to an embodiment of the present invention.
FIG. 8 is a table showing preferred keywords according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.
FIG. 10 is an operation flowchart showing an example of the content classification step shown in FIG. 9 in detail.
FIG. 11 is a block diagram illustrating a computer system in accordance with an embodiment of the present invention.

The present invention will now be described in detail with reference to the accompanying drawings. Hereinafter, a repeated description, a known function that may obscure the gist of the present invention, and a detailed description of the configuration will be omitted. Embodiments of the present invention are provided to more fully describe the present invention to those skilled in the art. Accordingly, the shapes and sizes of the elements in the drawings and the like can be exaggerated for clarity.

Throughout the specification, when a part is said to be "comprising" an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention. FIG. 2 is a diagram illustrating a keyword extraction method according to an embodiment of the present invention. FIG. 3 is a table of contents of north-south contents according to an embodiment of the present invention. FIG. 4 is a diagram illustrating a two-part content classification algorithm according to an embodiment of the present invention.

Referring to FIG. 1, a content classification apparatus 100 according to an embodiment of the present invention includes a content storage unit 110, a keyword extracting unit 120, a content classification unit 130, a content analyzer 140, a content database 10, and a statistical database 20.

In the present invention, book content is described as an example of content, but the invention is also applicable to any content that includes words, characters, or text.

The content database 10 and the book statistics database 20 may reside outside the apparatus.

The content storage unit 110 may receive content from the content database 10 and store it.

The keyword extracting unit 120 may extract keywords from the stored contents in consideration of the appearance position of the keyword.

At this time, the keyword extracting unit 120 can extract keywords using various extraction methods.

For example, the keyword extracting unit 120 may multiply the term frequency (TF) of each word in the document by the inverse document frequency (IDF) computed over the entire document set, use the result as the weight of the word, and define words whose weight is at or above a specific value as keywords.
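The TF-IDF weighting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `log(N / df)` form of IDF, the sample documents, and the `min_weight` cutoff are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {word: tf * idf} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # raw count of each word in this document
        weights.append({
            word: count * math.log(n_docs / doc_freq[word])  # assumed IDF form
            for word, count in term_freq.items()
        })
    return weights

def select_keywords(doc_weights, min_weight):
    """Keep the words whose weight is at or above min_weight."""
    return {word for word, weight in doc_weights.items() if weight >= min_weight}

# Hypothetical, already-tokenized documents.
docs = [
    ["family", "cooking", "family", "recipes"],
    ["travel", "family", "europe"],
    ["cooking", "travel", "budget"],
]
weights = tf_idf_weights(docs)
keywords = select_keywords(weights[0], min_weight=0.5)
```

With these sample documents, "family" and "recipes" pass the cutoff for the first document, while "cooking" does not.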

At this time, the keyword extracting unit 120 may consider the position where the keyword appears in order to determine the weight of the keyword.

In content that includes text, words can appear at various locations, such as the title, table of contents, body text, and picture labels of the content.

The keyword extracting unit 120 may treat words from the title and the table of contents as more important keywords. Titles and tables of contents tend to use words selected to represent the content. Thus, words appearing in the title and table of contents may represent the characteristics of the content and may be more important than words used in the body text.

In this case, the keyword extracting unit 120 may determine the weight of a word by multiplying its weight at each occurrence position by a weight corresponding to that position and linearly combining the results.

[Equation 1]

Figure pat00001

In Equation (1), W_t is the weight of the word W appearing in the title, W_c is the weight of the word W appearing in the table of contents, and W_d is the weight of the word W appearing in the body text.
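The equation itself survives only as an image reference in this text. One form consistent with the description (a position-dependent weight multiplied into each term, then a linear combination) might look like the following. The coefficient names \alpha_t, \alpha_c, \alpha_d and the left-hand-side notation are assumptions, not taken from the source.

```latex
\mathrm{weight}(W) = \alpha_t \, W_t + \alpha_c \, W_c + \alpha_d \, W_d
```

Under this reading, choosing \alpha_t and \alpha_c larger than \alpha_d would express that title and table-of-contents occurrences matter more than body-text occurrences.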

At this time, the keyword extracting unit 120 may merge words having similar meaning in the extracted keywords.

For example, the extracted "family" and "home" keywords are very similar in terms of semantics, and therefore it is possible to derive more accurate analysis results by merging and classifying them.

For this, the keyword extracting unit 120 may utilize a vocabulary dictionary and utilize a synonym search technique.

Referring to FIG. 2, it can be seen that, after keywords are extracted from the content, keywords that are similar to each other are merged.
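A minimal sketch of dictionary-based keyword merging is shown below. The synonym table, the function name, and the choice to add the weights of merged keywords together are all assumptions for illustration; the source states only that a vocabulary dictionary and synonym search may be used.

```python
from collections import defaultdict

# Hypothetical synonym dictionary mapping each variant to a canonical keyword.
SYNONYMS = {"home": "family", "household": "family"}

def merge_synonyms(keyword_weights, synonyms=SYNONYMS):
    """Fold synonymous keywords into their canonical form, summing weights."""
    merged = defaultdict(float)
    for keyword, weight in keyword_weights.items():
        canonical = synonyms.get(keyword, keyword)  # unknown words map to themselves
        merged[canonical] += weight
    return dict(merged)

merged = merge_synonyms({"family": 1.2, "home": 0.7, "travel": 0.9})
```

Here "home" is folded into "family", so the result has two keywords instead of three.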

The content classification unit 130 may classify the content by keyword by connecting the content to the extracted keyword.

Referring to FIG. 3, it can be seen that the content classification unit 130 classifies the content in three steps.

In step 1, the content classification unit 130 may link content to a keyword only when the keyword appears in the title or table of contents, so that classification accuracy is high. Classifying in this way selects and links the related content for each extracted keyword.
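The first step can be sketched as a simple membership test on title and table-of-contents words. The record layout (`id`, `title`, `toc`, `body` fields holding token lists) and the function name are hypothetical.

```python
from collections import defaultdict

def link_by_title_or_toc(contents, keywords):
    """Step 1: link a content item to a keyword only if the keyword
    appears in its title or table of contents (body text is ignored)."""
    classes = defaultdict(list)
    for item in contents:
        heading_words = set(item["title"]) | set(item["toc"])
        for keyword in keywords:
            if keyword in heading_words:
                classes[keyword].append(item["id"])
    return dict(classes)

# Hypothetical book records with pre-tokenized fields.
books = [
    {"id": "b1", "title": ["family", "cooking"], "toc": ["recipes"], "body": ["travel"]},
    {"id": "b2", "title": ["europe"], "toc": ["travel", "budget"], "body": ["family"]},
]
step1 = link_by_title_or_toc(books, ["family", "travel"])
```

Book b1 mentions "travel" only in its body, so it is not linked to that keyword at this stage.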

In step 2, the content classification unit 130 may select documents that are highly similar to the content classified in step 1 and add them to the corresponding classification. The similarity between a content classification and an individual content item can be computed with various techniques, such as K-means using cosine similarity. If the similarity between a classification and an individual content item exceeds a certain threshold, the content is included in the classification; in this way, a larger amount of content can be added to the categories that were selected for accuracy in step 1.
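The threshold-based expansion can be illustrated as follows. This sketch compares each unclassified item with the summed word-count vector (centroid) of a class using cosine similarity; it is a simplification and not a full K-means procedure. The vector representation, function names, sample data, and threshold value are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def expand_class(class_ids, vectors, threshold):
    """Step 2: add items whose similarity to the class centroid exceeds threshold."""
    centroid = Counter()
    for cid in class_ids:
        centroid.update(vectors[cid])  # sum the member vectors
    added = [cid for cid, vec in vectors.items()
             if cid not in class_ids and cosine(centroid, vec) > threshold]
    return list(class_ids) + added

# Hypothetical word-count vectors per content item.
vectors = {
    "b1": Counter({"family": 2, "cooking": 1}),
    "b2": Counter({"travel": 2, "europe": 1}),
    "b3": Counter({"family": 1, "cooking": 1, "kids": 1}),
}
family_class = expand_class(["b1"], vectors, threshold=0.5)
```

Item b3 shares vocabulary with the class seeded by b1 and is added; b2 has no overlap and stays out.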

In step 3, the content classification unit 130 may analyze the similarity between the classified content bundles and combine pairs of bundles whose similarity is equal to or greater than a threshold into one, so that two similar classifications are merged.

By performing this third step, the content classification unit 130 may merge keyword pairs that were not merged by the keyword extracting unit 120.
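A sketch of merging similar content bundles follows. The source does not say how bundle similarity is measured; Jaccard overlap of the member sets is used here purely as an assumed stand-in, and the function names, merged-key naming scheme, and threshold are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two member sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_similar_classes(classes, threshold):
    """Step 3: repeatedly merge a pair of classes whose similarity >= threshold."""
    merged = {key: set(members) for key, members in classes.items()}
    changed = True
    while changed:
        changed = False
        keys = sorted(merged)
        for i, first in enumerate(keys):
            for second in keys[i + 1:]:
                if jaccard(merged[first], merged[second]) >= threshold:
                    # Replace the two classes with one combined class.
                    merged[first + "+" + second] = merged.pop(first) | merged.pop(second)
                    changed = True
                    break
            if changed:
                break
    return merged

classes = {
    "family": {"b1", "b3", "b4"},
    "home": {"b1", "b3"},
    "travel": {"b2"},
}
result = merge_similar_classes(classes, threshold=0.6)
```

In this example the "family" and "home" bundles overlap enough to be merged, which is the kind of keyword pair the dictionary-based merge might have missed.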

The content analyzer 140 may analyze the content classified by the keyword based on the statistical information of the statistical database 20.

The statistical information may include sales amount and sales volume, the number of units published, sales volume by age, sex, and region, and the like.

The content analyzer 140 can analyze the sales volume of a specific keyword classification by age, sex, region, and similar information.

Using the analyzed sales volume, the content analyzer 140 can then analyze the characteristics, preference trends, and the like of the main user group for the keyword.
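The per-group sales analysis can be sketched as a simple aggregation. The record fields (`content_id`, `age`, `sex`, `region`, `units`), the function name, and the sample figures are invented for illustration and do not come from the source.

```python
from collections import defaultdict

def sales_by_group(class_ids, sales_records, group_field):
    """Sum units sold for one keyword classification, grouped by a demographic field."""
    members = set(class_ids)
    totals = defaultdict(int)
    for record in sales_records:
        if record["content_id"] in members:
            totals[record[group_field]] += record["units"]
    return dict(totals)

# Hypothetical rows from a statistics database.
records = [
    {"content_id": "b1", "age": "20s", "sex": "F", "region": "Seoul", "units": 30},
    {"content_id": "b1", "age": "30s", "sex": "M", "region": "Busan", "units": 12},
    {"content_id": "b3", "age": "20s", "sex": "M", "region": "Seoul", "units": 5},
    {"content_id": "b2", "age": "20s", "sex": "F", "region": "Seoul", "units": 40},
]
family_by_age = sales_by_group(["b1", "b3"], records, "age")
family_by_region = sales_by_group(["b1", "b3"], records, "region")
```

The same function can be called with "sex" or "region" as the grouping field to view the classification from a different angle.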

FIG. 5 is a graph illustrating sales amounts of first contents according to an embodiment of the present invention. FIG. 6 is a graph illustrating sales amounts of second contents according to an embodiment of the present invention.

Referring to FIGS. 5 and 6, the content classification apparatus 100 according to an exemplary embodiment of the present invention analyzes the sales amount of a specific keyword classification by age, gender, region, and similar information, and can thereby analyze the characteristics and preferences of the user group.

FIG. 7 is a table showing preferred keywords according to an embodiment of the present invention. FIG. 8 is a table showing preferred keywords according to an embodiment of the present invention.

Referring to FIGS. 7 and 8, FIG. 7 shows the preferred content keywords for each age group, analyzed up to the third rank, and FIG. 8 shows a table of the three preferred content keywords analyzed for each region.
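A ranking of this kind can be sketched as below. The input layout (tuples of group, keyword, and units sold), the tie-breaking rule, and all sample values are assumptions; the actual contents of FIGS. 7 and 8 are not reproduced in this text.

```python
from collections import defaultdict

def top_keywords_per_group(keyword_sales, n=3):
    """Return the n best-selling keywords for each group.

    keyword_sales: iterable of (group, keyword, units) tuples.
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for group, keyword, units in keyword_sales:
        totals[group][keyword] += units
    return {
        group: [kw for kw, _ in sorted(kw_units.items(), key=lambda p: (-p[1], p[0]))[:n]]
        for group, kw_units in totals.items()
    }

# Hypothetical sales rows keyed by age group.
rows = [
    ("20s", "travel", 40), ("20s", "family", 35), ("20s", "cooking", 10), ("20s", "poetry", 3),
    ("30s", "family", 50), ("30s", "finance", 45),
]
ranking = top_keywords_per_group(rows)
```

Passing region names instead of age groups as the first tuple element would give the per-region table.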

FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.

Referring to FIG. 9, the content classification method according to an embodiment of the present invention first extracts keywords (S210).

That is, in step S210, keywords may be extracted from the stored content.

In step S210, content may first be received from the content database 10 and stored.

The keywords may then be extracted from the stored content in consideration of the positions at which they appear.

Step S210 may extract keywords using various extraction methods.

For example, step S210 may multiply the term frequency (TF) of each word in the document by the inverse document frequency (IDF) computed over the entire document set, use the result as the weight of the word, and define words whose weight is at or above a specific value as keywords.

In step S210, the position where a keyword appears may be considered when determining the weight of the keyword.

In content that includes text, words can appear at various locations, such as the title, table of contents, body text, and picture labels of the content.

Step S210 may treat words from the title and the table of contents as more important keywords. Titles and tables of contents tend to use words selected to represent the content. Thus, words appearing in the title and table of contents may represent the characteristics of the content and may be more important than words used in the body text.

In this case, step S210 may determine the weight of a word by multiplying its weight at each occurrence position by a weight corresponding to that position and linearly combining the results.

In Equation (1), W_t is the weight of the word W appearing in the title, W_c is the weight of the word W appearing in the table of contents, and W_d is the weight of the word W appearing in the body text.
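The position-aware weighting can be sketched in a few lines. The coefficient values, the dictionary keys, and the function name are hypothetical; the source states only that position weights are multiplied in and linearly combined.

```python
# Hypothetical position coefficients: title and table of contents count
# more than body text. The actual values are not given in the source.
POSITION_COEFF = {"title": 3.0, "toc": 2.0, "body": 1.0}

def combined_weight(position_weights, coeff=POSITION_COEFF):
    """Linearly combine the per-position weights (W_t, W_c, W_d) of one word."""
    return sum(coeff[position] * weight for position, weight in position_weights.items())

# Example per-position weights for a single word (for instance, TF-IDF values).
word_weights = {"title": 0.8, "toc": 0.5, "body": 0.2}
total = combined_weight(word_weights)
```

A word that appears only in the body would pass a one-entry dictionary and receive a correspondingly smaller total.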

Step S210 may also merge extracted keywords that have similar meanings.

For example, the keywords "family" and "home" are very similar in terms of semantics, so it is possible to derive more accurate analysis results by merging them.

To this end, step S210 may utilize a vocabulary dictionary and a synonym search technique.

Referring to FIG. 2, it can be seen that, after keywords are extracted from the content, keywords that are similar to each other are merged.

In addition, the content classification method according to an embodiment of the present invention may classify the content (S220).

That is, in step S220, content may be linked to the extracted keywords and classified by keyword.

Referring to FIG. 3, it can be seen that the content classification unit 130 classifies the content in three steps.

In step 1 of step S220 (S221), content may be linked to a keyword only when the keyword appears in the title or table of contents, so that classification accuracy is high. Classifying in this way selects and links the related content for each extracted keyword.

In step 2 (S222), documents that are highly similar to the content classified in step 1 may be selected and added to the corresponding classification. The similarity between a content classification and an individual content item can be computed with various techniques, such as K-means using cosine similarity. If the similarity between a classification and an individual content item exceeds a certain threshold, the content is included in the classification; in this way, a larger amount of content can be added to the categories that were selected for accuracy in step 1.

In step 3 (S223), the similarity between the classified content bundles may be analyzed, and pairs of bundles whose similarity is equal to or greater than a threshold may be combined into one, so that two similar classifications are merged.

Step S223 may thereby merge keyword pairs that were not merged by the keyword extracting unit 120.

In addition, the content classification method according to an embodiment of the present invention may analyze the content (S230).

That is, in step S230, the content classified by keyword may be analyzed based on the statistical information of the statistical database 20.

The statistical information may include sales amount and sales volume, the number of units published, sales volume by age, sex, and region, and the like.

Step S230 can analyze the sales volume of a specific keyword classification by age, sex, region, and similar information.

Using the analyzed sales volume, step S230 can then analyze the characteristics or preference trends of the main user group for the keyword.

FIG. 10 is an operation flowchart showing an example of the content classification step shown in FIG. 9 in detail.

Referring to FIG. 10, in step 1 of step S220 (S221), content may be linked to a keyword only when the keyword appears in the title or table of contents, so that classification accuracy is high. Classifying in this way selects and links the related content for each extracted keyword.

In step 2 (S222), documents that are highly similar to the content classified in step 1 may be selected and added to the corresponding classification. The similarity between a content classification and an individual content item can be computed with various techniques, such as K-means using cosine similarity. If the similarity between a classification and an individual content item exceeds a certain threshold, the content is included in the classification; in this way, a larger amount of content can be added to the categories that were selected for accuracy in step 1.

In step 3 (S223), the similarity between the classified content bundles may be analyzed, and pairs of bundles whose similarity is equal to or greater than a threshold may be combined into one, so that two similar classifications are merged.

Step S223 may thereby merge keyword pairs that were not merged by the keyword extracting unit 120.

FIG. 11 is a block diagram illustrating a computer system in accordance with an embodiment of the present invention.

Referring to FIG. 11, embodiments of the present invention may be implemented in a computer system 1100, for example by means of a computer-readable recording medium. As shown in FIG. 11, the computer system 1100 may include one or more processors 1110, a memory 1130, a user interface input device 1140, a user interface output device 1150, and storage 1160. In addition, the computer system 1100 may further include a network interface 1170 connected to the network 1180. The processor 1110 may be a central processing unit or a semiconductor device that executes processing instructions stored in the memory 1130 or the storage 1160. The memory 1130 and the storage 1160 can be various types of volatile or non-volatile storage media. For example, the memory may include ROM 1131 or RAM 1132.

As described above, the content classification apparatus and method according to the present invention are not limited to the configurations and methods of the embodiments described above; rather, all or some of the embodiments may be selectively combined.

10: content database 20: statistics database
100: Content classification apparatus 110: Content storage unit
120: Keyword extracting unit 130: Content classification unit
140: Content analysis section
1100: Computer system 1110: Processor
1120: bus 1130: memory
1131: ROM 1132: RAM
1140: User input device 1150: User output device
1160: Storage 1170: Network Interface
1180: Network

Claims (1)

A content classification apparatus comprising:
a content storage unit for receiving and storing content from a content database;
a keyword extracting unit for extracting keywords from the stored content in consideration of the locations at which the keywords appear;
a content classifying unit for classifying the content by keyword by linking the content to the extracted keywords; and
a content analyzer for analyzing the content classified by keyword based on statistical information of a statistical database.
KR1020160158289A 2016-11-25 2016-11-25 Apparatus for classifying contents and method for using the same KR102007437B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160158289A KR102007437B1 (en) 2016-11-25 2016-11-25 Apparatus for classifying contents and method for using the same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160158289A KR102007437B1 (en) 2016-11-25 2016-11-25 Apparatus for classifying contents and method for using the same

Publications (2)

Publication Number Publication Date
KR20180059112A true KR20180059112A (en) 2018-06-04
KR102007437B1 KR102007437B1 (en) 2019-08-05

Family

ID=62628538

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160158289A KR102007437B1 (en) 2016-11-25 2016-11-25 Apparatus for classifying contents and method for using the same

Country Status (1)

Country Link
KR (1) KR102007437B1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102089666B1 (en) * 2019-03-14 2020-03-16 (주)디에스솔루션즈 Method for automatically aggregating and evaluating seller credit rate using big data and ai auto classification server
KR20210086402A (en) * 2019-12-31 2021-07-08 인천국제공항공사 Apparatus and methods for trend analysis in airport and aviation technology
KR20220070824A (en) * 2020-11-23 2022-05-31 (주)아이브릭스 The keywords extraction method for unstructured data using property dictionary of goods

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100034140A (en) * 2008-09-23 2010-04-01 주식회사 버즈니 System and method for searching opinion using internet
KR20160091756A (en) * 2015-01-26 2016-08-03 (주)해나소프트 Relative quality index estimation apparatus of the web page using keyword search

Also Published As

Publication number Publication date
KR102007437B1 (en) 2019-08-05

Similar Documents

Publication Publication Date Title
Huston et al. Evaluating verbose query processing techniques
JP7164701B2 (en) Computer-readable storage medium storing methods, apparatus, and instructions for matching semantic text data with tags
US20180300315A1 (en) Systems and methods for document processing using machine learning
WO2017097231A1 (en) Topic processing method and device
CN106776574B (en) User comment text mining method and device
CN107577671B (en) Subject term extraction method based on multi-feature fusion
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
US8412703B2 (en) Search engine for scientific literature providing interface with automatic image ranking
CN111368038B (en) Keyword extraction method and device, computer equipment and storage medium
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
US11144723B2 (en) Method, device, and program for text classification
US10572528B2 (en) System and method for automatic detection and clustering of articles using multimedia information
CN113961685A (en) Information extraction method and device
US11893537B2 (en) Linguistic analysis of seed documents and peer groups
CN108491512A (en) The method of abstracting and device of headline
CN108460150A (en) The processing method and processing device of headline
Gunawan et al. Multi-document summarization by using textrank and maximal marginal relevance for text in Bahasa Indonesia
CN108399265A (en) Real-time hot news providing method based on search and device
CN108363700A (en) The method for evaluating quality and device of headline
US20130052619A1 (en) Method for building information on emotion lexicon and apparatus for the same
KR102007437B1 (en) Apparatus for classifying contents and method for using the same
WO2022105178A1 (en) Keyword extraction method and related device
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium
CN108475265B (en) Method and device for acquiring unknown words
CN112559679B (en) Political new media propagation force detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant