KR20180059112A - Apparatus for classifying contents and method for using the same - Google Patents
- Publication number
- KR20180059112A (application KR1020160158289A)
- Authority
- KR
- South Korea
- Prior art keywords
- content
- keyword
- contents
- present
- classification
- Prior art date
Classifications
- G06F17/30705
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A content classification apparatus and method are disclosed. A content classification apparatus according to an embodiment of the present invention includes: a content storage unit for receiving and storing content from a content database; a keyword extracting unit for extracting keywords from the stored content in consideration of the location at which each keyword appears; and a content analyzer for analyzing, based on statistical information of a statistical database, the content that has been linked to the extracted keywords and classified by keyword.
Description
BACKGROUND OF THE INVENTION
Book content is the digital form of documents that were previously published on paper. It is therefore possible to derive meaningful data from it by applying techniques such as text analysis and classification.
Text analysis technology extracts meaningful data by analyzing documents written as digital text, such as blogs, SNS posts, and news. For example, blog text can be analyzed to extract key keywords and make them searchable. Big data text technology extracts meaningful information by analyzing large amounts of text. For example, analyzing SNS data can reveal trends such as shifts in public opinion. Text classification techniques group large collections of text into similar categories. For example, news documents can be automatically classified into economy, politics, culture, and so on.
The prior art has the disadvantage that content can be searched or classified only according to categories specified in advance (for example, self-development, political management, novel, essay, poem, etc.), which makes finer classification of content difficult.
Korean Patent Laid-Open Publication No. 10-2013-0104573, entitled "Method and apparatus for classifying morpheme-based contents," discloses a method that analyzes morphemes in the titles of online content such as blogs and news, and maps compound words and the like to predetermined categories by applying search and weighting, thereby classifying content efficiently in real time.
However, Korean Patent Laid-Open No. 10-2013-0104573 has a limitation in that it consumes much time and expense because it analyzes content based on extracted nouns only.
It is an object of the present invention to automatically extract keywords by analyzing content and to subdivide the content according to various classifications.
Another object of the present invention is to automatically identify sales volume, preference, and the like for various categories through this analysis method.
A further object of the present invention is to improve the quality of content providing services and the efficiency of analysis by classifying content in various ways.
According to an aspect of the present invention, there is provided a content classification apparatus comprising: a content storage unit for receiving and storing content from a content database; a keyword extracting unit for extracting keywords from the stored content in consideration of the location at which each keyword appears; and a content analyzer for analyzing, based on statistical information of a statistical database, the content that has been linked to the extracted keywords and classified by keyword.
The present invention can automatically extract keywords by analyzing content and can provide the content according to various classifications.
In addition, the present invention can automatically identify sales volume, preference, and the like for various categories through this analysis method.
In addition, the present invention can classify content into various categories, thereby improving the quality of content providing services and increasing the efficiency of analysis.
FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a keyword extraction method according to an embodiment of the present invention.
FIG. 3 is a table of contents of north-south contents according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a two-part content classification algorithm according to an embodiment of the present invention.
FIG. 5 is a graph illustrating sales amounts of first contents according to an embodiment of the present invention.
FIG. 6 is a graph illustrating sales amounts of second contents according to an embodiment of the present invention.
FIG. 7 is a table showing preferred keywords by age group according to an embodiment of the present invention.
FIG. 8 is a table showing preferred keywords by region according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.
FIG. 10 is an operation flowchart showing an example of the content classification step shown in FIG. 9 in detail.
FIG. 11 is a block diagram illustrating a computer system in accordance with an embodiment of the present invention.
The present invention will now be described in detail with reference to the accompanying drawings. Hereinafter, repeated descriptions and detailed descriptions of known functions and configurations that may obscure the gist of the present invention are omitted. Embodiments of the present invention are provided to more fully describe the present invention to those skilled in the art. Accordingly, the shapes and sizes of the elements in the drawings may be exaggerated for clarity.
Throughout the specification, when an element is referred to as "comprising" a component, it means that it can include other components as well, rather than excluding them, unless specifically stated otherwise.
Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is a block diagram showing a content classification apparatus according to an embodiment of the present invention. FIG. 2 is a diagram illustrating a keyword extraction method according to an embodiment of the present invention. FIG. 3 is a table of contents of north-south contents according to an embodiment of the present invention. FIG. 4 is a diagram illustrating a two-part content classification algorithm according to an embodiment of the present invention.
1, a
In the present invention, book content is described as an example of content, but the present invention is also applicable to any other content that includes words, characters, or text.
At this time, the
The
The
At this time, the
At this time, the
At this time, the
In content that includes text, words can appear at various locations, such as the title, table of contents, body text, and picture labels of the content.
At this time, the
In this case, the
[Equation 1]
In Equation (1), W_t is the weight of the word W appearing in the title, W_c is the weight of the word W appearing in the table of contents, and W_d is the weight of the word W appearing in the body text.
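The body of Equation 1 did not survive in this text. Going by the three definitions above and the later statement that position weights are multiplied and linearly combined, one plausible form is sketched below. The coefficients \alpha, \beta, \gamma are assumed symbols introduced here for illustration, and the ordering constraint only reflects the statement that title and table-of-contents words are treated as more important than body words; neither is taken from the patent.

```latex
W = \alpha \, W_t + \beta \, W_c + \gamma \, W_d,
\qquad \alpha,\ \beta \ \ge\ \gamma \ >\ 0
```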
At this time, the
For example, the extracted "family" and "home" keywords are very similar in terms of semantics, and therefore it is possible to derive more accurate analysis results by merging and classifying them.
For this, the
Referring to FIG. 2, it can be seen that, after keywords are extracted from the content, keywords that are similar to each other are merged.
The
Referring to FIG. 3, it can be seen that the
At this time, the
Also, in
In
At this time, the
The
The statistical information may correspond to sales / sales volume, number of publishing units, sales volume information by age / sex / area,
At this time, the
At this time, the
FIG. 5 is a graph illustrating sales amounts of first contents according to an embodiment of the present invention. FIG. 6 is a graph illustrating sales amounts of second contents according to an embodiment of the present invention.
5 and 6, the
FIG. 7 is a table showing preferred keywords by age group according to an embodiment of the present invention. FIG. 8 is a table showing preferred keywords by region according to an embodiment of the present invention.
Referring to FIGS. 7 and 8, FIG. 7 shows the top three preferred content keywords for each age group, and FIG. 8 shows the top three preferred content keywords for each region.
FIG. 9 is a flowchart illustrating a content classification method according to an embodiment of the present invention.
Referring to FIG. 9, a content classification method according to an embodiment of the present invention may extract a keyword first (S210).
That is, the step S210 may extract the keyword from the stored content.
At this time, the step S210 may receive and store contents from the
In this case, the keyword may be extracted in step S210 in consideration of the appearance position of the keyword from the stored content.
In this case, step S210 may extract keywords using various extraction methods.
In this case, in step S210, for each word, the product of its appearance frequency in the document (TF) and the inverse of its appearance frequency across the entire document set (IDF) may be computed and used as the weight of the word, and words whose weight is equal to or greater than a certain value may be used as keywords.
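The TF-IDF weighting described here can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `tfidf_keywords`, the sample documents, the threshold of 0.2, and the specific IDF form `log(N / df)` are all assumptions, since the text says only that TF is multiplied by an inverse document-frequency term. Documents are assumed to be non-empty lists of already-tokenized words.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold):
    """Return, per document, the words whose TF * IDF weight is >= threshold."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: relative term frequency times log(N / df).
        weights = {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        result.append({w: s for w, s in weights.items() if s >= threshold})
    return result

# Hypothetical tokenized documents.
docs = [
    ["family", "home", "love", "family"],
    ["economy", "market", "home"],
    ["economy", "policy", "market"],
]
keywords = tfidf_keywords(docs, threshold=0.2)
```

Words that occur in most documents get a small IDF term and fall below the threshold, which is why a word such as "home" is not selected in the first sample document even though it appears there.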
In this case, in step S210, the position where the keyword appears may be considered to determine the weight of the keyword.
In content that includes text, words can appear at various locations, such as the title, table of contents, body text, and picture labels of the content.
At this time, the step S210 may consider the title of the content and the word from the table of contents as more important keywords. In the title and table of contents, words that can represent contents of content can be selectively used. Thus, the words appearing in the title and table of contents may represent the characteristics of the content and may be more important than the words used in the text.
In this case, in step S210, the position at which a keyword appears may be reflected in its weight: the weight of the word at each occurrence position is multiplied by a weight corresponding to that position, and the results are linearly combined.
In Equation (1), W_t is the weight of the word W appearing in the title, W_c is the weight of the word W appearing in the table of contents, and W_d is the weight of the word W appearing in the body text.
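As a rough sketch of the position-aware weighting, the snippet below combines a word's per-position weights (W_t, W_c, W_d) linearly. The coefficient values in `POSITION_COEFF` and the function name `combined_weight` are hypothetical; the source gives no numeric values and only indicates that title and table-of-contents occurrences count for more than body occurrences.

```python
# Hypothetical position coefficients; the source gives no numeric values.
POSITION_COEFF = {"title": 3.0, "toc": 2.0, "body": 1.0}

def combined_weight(position_weights, coeff=POSITION_COEFF):
    """Linearly combine a word's per-position weights (W_t, W_c, W_d)."""
    return sum(coeff[pos] * w for pos, w in position_weights.items())

# A word that appears in the title and table of contents...
w_family = combined_weight({"title": 0.5, "toc": 0.2, "body": 0.1})
# ...versus a word that appears only in the body text.
w_market = combined_weight({"title": 0.0, "toc": 0.0, "body": 0.4})
```

With these assumed coefficients, a word with a modest weight in the title outranks a word with a larger weight that occurs only in the body.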
At this time, the step S210 may merge words having similar meanings in the extracted keywords.
For example, the keywords "family" and "home" are very similar in terms of semantics, so it is possible to derive more accurate analysis results by merging them.
To this end, step S210 may utilize a vocabulary dictionary and utilize a synonym search technique.
Referring to FIG. 2, it can be seen that, after keywords are extracted from the content, keywords that are similar to each other are merged.
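The merging of similar keywords can be illustrated with a small dictionary lookup. The `SYNONYMS` table below is a hypothetical stand-in for the vocabulary dictionary mentioned in the text, and summing the weights of merged keywords is an assumption; the source does not say how the merged weight is computed.

```python
# Hypothetical synonym dictionary mapping each word to a canonical keyword.
SYNONYMS = {"home": "family", "household": "family"}

def merge_synonyms(weights, synonyms=SYNONYMS):
    """Fold the weights of synonymous keywords into one canonical keyword."""
    merged = {}
    for word, weight in weights.items():
        canonical = synonyms.get(word, word)
        merged[canonical] = merged.get(canonical, 0.0) + weight
    return merged

merged = merge_synonyms({"family": 2.0, "home": 1.5, "economy": 0.8})
```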
In addition, the content classification method according to an embodiment of the present invention may classify the content (S220).
That is, in step S220, the content may be connected to the extracted keyword, and the content may be classified according to the keyword.
Referring to FIG. 3, it can be seen that the
In this case, in the first stage of step S220 (S221), content may be classified under and linked to a keyword only when the keyword appears in its table of contents or title, which keeps classification accuracy high. Classifying in this way makes it possible to select and link the related content for each extracted keyword.
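The first-stage linking rule can be sketched as follows. The record layout (`id`, `title`, `toc`, `body` as token lists) and the function name `link_by_title_or_toc` are assumptions made for illustration only.

```python
def link_by_title_or_toc(contents, keywords):
    """Stage 1: map each keyword to the IDs of content whose title or table of contents contains it."""
    bundles = {kw: set() for kw in keywords}
    for content in contents:
        heading_words = set(content["title"]) | set(content["toc"])
        for kw in keywords:
            # Body-only occurrences are deliberately ignored at this stage.
            if kw in heading_words:
                bundles[kw].add(content["id"])
    return bundles

# Hypothetical book records.
contents = [
    {"id": "b1", "title": ["family", "cooking"], "toc": ["breakfast", "dinner"], "body": ["economy"]},
    {"id": "b2", "title": ["market", "basics"], "toc": ["economy", "policy"], "body": ["family"]},
]
bundles = link_by_title_or_toc(contents, ["family", "economy"])
```

In the sample data, "b2" mentions "family" only in its body, so it is not linked to that keyword at this stage.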
In step S220, the documents having high similarity to the contents belonging to the classified content classified in
In step S220, the similarity is analyzed between classified content bundles, and pairs of bundles whose similarity is equal to or greater than a threshold are merged, so that two similar classes are combined into one.
At this time, the step S223 may merge the keyword pairs that are not merged in the
In addition, the content classification method according to an embodiment of the present invention may analyze the content (S230).
That is, the content analysis may be performed based on the statistical information of the
The statistical information may correspond to sales / sales volume, number of publishing units, sales volume information by age / sex / area,
At this time, the step S230 can analyze the sales volume based on information such as age / sex / region for a specific keyword classification.
At this time, in step S230, the characteristics or preference trends of the main consumer group for the keyword can be analyzed using the analyzed sales volume.
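A sketch of this kind of analysis, assuming the statistical database yields flat sales rows: the snippet totals units per keyword within each group (age group or region) and keeps the top three keywords, in the spirit of the tables in FIGS. 7 and 8. The field names, sample values, and the function `top_keywords_by_group` are hypothetical.

```python
from collections import defaultdict

def top_keywords_by_group(sales, group_field, top_n=3):
    """Rank keywords by total units sold within each value of group_field."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in sales:
        totals[row[group_field]][row["keyword"]] += row["units"]
    return {
        # Sort by descending units, breaking ties alphabetically.
        group: [kw for kw, _ in sorted(kws.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for group, kws in totals.items()
    }

# Hypothetical sales rows joined with keyword classification.
sales = [
    {"keyword": "family", "age_group": "30s", "region": "Seoul", "units": 120},
    {"keyword": "economy", "age_group": "30s", "region": "Busan", "units": 200},
    {"keyword": "travel", "age_group": "20s", "region": "Seoul", "units": 90},
    {"keyword": "family", "age_group": "20s", "region": "Seoul", "units": 40},
]
by_age = top_keywords_by_group(sales, "age_group")
by_region = top_keywords_by_group(sales, "region")
```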
FIG. 10 is an operation flowchart showing an example of the content classification step shown in FIG. 9 in detail.
Referring to FIG. 10, in the first stage of step S220 (S221), content may be classified under and linked to a keyword only when the keyword appears in its table of contents or title, which keeps classification accuracy high. Classifying in this way makes it possible to select and link the related content for each extracted keyword.
In step S220, the documents having high similarity to the contents belonging to the classified content classified in
In step S220, the similarity is analyzed between classified content bundles, and pairs of bundles whose similarity is equal to or greater than a threshold are merged, so that two similar classes are combined into one.
At this time, the step S223 may merge the keyword pairs that are not merged in the
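The bundle-merging step can be sketched as a repeated pairwise comparison. The source does not name a similarity measure, so Jaccard similarity over the sets of content IDs is used here purely as an assumption, as are the threshold of 0.7, the sample bundles, and the function names.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_similar_bundles(bundles, threshold):
    """Repeatedly merge keyword bundles whose similarity is >= threshold."""
    merged = [({kw}, set(ids)) for kw, ids in bundles.items()]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i][1], merged[j][1]) >= threshold:
                    # Fold bundle j into bundle i, then restart the scan.
                    kws_j, ids_j = merged.pop(j)
                    merged[i] = (merged[i][0] | kws_j, merged[i][1] | ids_j)
                    changed = True
                    break
            if changed:
                break
    return merged

# Hypothetical keyword -> content ID bundles from the earlier stages.
bundles = {
    "family": {"b1", "b2", "b3"},
    "parenting": {"b1", "b2", "b3", "b4"},
    "economy": {"b7", "b8"},
}
result = merge_similar_bundles(bundles, threshold=0.7)
```

Restarting the scan after each merge keeps the logic simple at the cost of extra comparisons, which is acceptable for a small number of keyword bundles.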
FIG. 11 is a block diagram illustrating a computer system in accordance with an embodiment of the present invention.
Referring to FIG. 11, embodiments of the present invention may be implemented in a
As described above, the content classification apparatus and method according to the present invention are not limited to the configurations and methods of the embodiments described above; all or some of the embodiments may be selectively combined so that various modifications can be made.
10: content database 20: statistics database
100: Content classification apparatus 110: Content storage unit
120: Keyword extracting unit 130:
140: Content analysis section
1100: Computer system 1110: Processor
1120: bus 1130: memory
1131: ROM 1132: RAM
1140: User input device 1150: User output device
1160: Storage 1170: Network Interface
1180: Network
Claims (1)
A content classification apparatus comprising:
a keyword extracting unit for extracting a keyword from the stored content in consideration of the location at which the keyword appears;
a content classifying unit for classifying the content by keyword by linking the extracted keyword to the content; and
a content analyzer for analyzing the content classified by keyword, based on statistical information of a statistical database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160158289A KR102007437B1 (en) | 2016-11-25 | 2016-11-25 | Apparatus for classifying contents and method for using the same |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160158289A KR102007437B1 (en) | 2016-11-25 | 2016-11-25 | Apparatus for classifying contents and method for using the same |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20180059112A (en) | 2018-06-04 |
KR102007437B1 (en) | 2019-08-05 |
Family
ID=62628538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020160158289A KR102007437B1 (en) | 2016-11-25 | 2016-11-25 | Apparatus for classifying contents and method for using the same |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR102007437B1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102089666B1 (en) * | 2019-03-14 | 2020-03-16 | (주)디에스솔루션즈 | Method for automatically aggregating and evaluating seller credit rate using big data and ai auto classification server |
KR20210086402A (en) * | 2019-12-31 | 2021-07-08 | 인천국제공항공사 | Apparatus and methods for trend analysis in airport and aviation technology |
KR20220070824A (en) * | 2020-11-23 | 2022-05-31 | (주)아이브릭스 | The keywords extraction method for unstructured data using property dictionary of goods |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20100034140A (en) * | 2008-09-23 | 2010-04-01 | 주식회사 버즈니 | System and method for searching opinion using internet |
KR20160091756A (en) * | 2015-01-26 | 2016-08-03 | (주)해나소프트 | Relative quality index estimation apparatus of the web page using keyword search |
-
2016
- 2016-11-25 KR KR1020160158289A patent/KR102007437B1/en active IP Right Grant
Also Published As
Publication number | Publication date |
---|---|
KR102007437B1 (en) | 2019-08-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Huston et al. | Evaluating verbose query processing techniques | |
JP7164701B2 (en) | Computer-readable storage medium storing methods, apparatus, and instructions for matching semantic text data with tags | |
US20180300315A1 (en) | Systems and methods for document processing using machine learning | |
WO2017097231A1 (en) | Topic processing method and device | |
CN106776574B (en) | User comment text mining method and device | |
CN107577671B (en) | Subject term extraction method based on multi-feature fusion | |
WO2015149533A1 (en) | Method and device for word segmentation processing on basis of webpage content classification | |
US8412703B2 (en) | Search engine for scientific literature providing interface with automatic image ranking | |
CN111368038B (en) | Keyword extraction method and device, computer equipment and storage medium | |
CN112347778A (en) | Keyword extraction method and device, terminal equipment and storage medium | |
US11144723B2 (en) | Method, device, and program for text classification | |
US10572528B2 (en) | System and method for automatic detection and clustering of articles using multimedia information | |
CN113961685A (en) | Information extraction method and device | |
US11893537B2 (en) | Linguistic analysis of seed documents and peer groups | |
CN108491512A (en) | The method of abstracting and device of headline | |
CN108460150A (en) | The processing method and processing device of headline | |
Gunawan et al. | Multi-document summarization by using textrank and maximal marginal relevance for text in Bahasa Indonesia | |
CN108399265A (en) | Real-time hot news providing method based on search and device | |
CN108363700A (en) | The method for evaluating quality and device of headline | |
US20130052619A1 (en) | Method for building information on emotion lexicon and apparatus for the same | |
KR102007437B1 (en) | Apparatus for classifying contents and method for using the same | |
WO2022105178A1 (en) | Keyword extraction method and related device | |
CN110888983B (en) | Positive and negative emotion analysis method, terminal equipment and storage medium | |
CN108475265B (en) | Method and device for acquiring unknown words | |
CN112559679B (en) | Political new media propagation force detection method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |