KR101664711B1 - Keyword Extraction Method Using Period Weighting Value
- Publication number
- KR101664711B1 (application KR1020150087212A)
- Authority
- KR
- South Korea
- Prior art keywords
- period
- words
- extraction method
- distribution curve
- text data
- Prior art date
Classifications
- G06F17/277
- G06F17/2795
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The present invention relates to a method for extracting key words from text data, and more particularly, to a key word extraction method for extracting key words in consideration of period weight.
Since the era of Web 2.0, the Internet has become more open and interactive, and users share and disseminate their experiences in a variety of ways. Blogs, Twitter, Facebook, and similar services typically account for much of the time users spend online.
Many kinds of content are created in these services, such as videos and photographs, but much of it is text that is not restricted to a specific format or a limited range of topics. These texts provide important data for analyzing the image of companies' brands, products, services, and so on.
Therefore, there are several methods for extracting information from text. For example, there are methods of extracting the frequency of words (keywords) in text, and methods of grasping the importance of text in documents.
However, these methods do not give enough consideration to duration or time. In other words, the frequency of a word is extracted at a single point in time. In such a case, there is a problem that changes in the frequency of the word from period to period cannot be directly confirmed.
Therefore, there is a need for a method for selecting a keyword in consideration of the frequency of words used in accordance with the change of the period.
An object of the keyword extraction method using period weighting according to the present invention is to analyze trends effectively by selecting keywords in consideration of the flow of time.
The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned can be clearly understood by those skilled in the art from the following description.
A keyword extraction method using period weighting according to the present invention includes the steps of: (a) classifying text data constituting a corpus according to detailed periods; (b) assigning a period weight to each of a plurality of words included in the text data; and (c) extracting, from among the plurality of words to which the period weight has been assigned in step (b), words whose weight is equal to or greater than a reference value.
In step (b), the degree to which each word is unevenly distributed over the entire period may be given as its period weight.
In step (b), the period weight may be assigned based on the absolute value of the skewness of the distribution curve of each word.
The skewness of the distribution curve may be calculated by a formula [equation omitted in source].
In step (b), the period weight may be assigned based on the absolute value of the kurtosis of the distribution curve of each word.
The kurtosis of the distribution curve may be calculated by a formula [equation omitted in source].
The key word extraction method using period weighting according to the present invention has the following effects.
First, since keywords can be extracted in consideration of a time period rather than a single predetermined point in time, trends within a given period can be analyzed precisely and effectively.
Second, the method can present important data for analyzing the image of various companies' brands, products, and services.
The effects of the present invention are not limited to those mentioned above, and other effects not mentioned can be clearly understood by those skilled in the art from the following description.
FIG. 1 is a flowchart illustrating a keyword extraction method according to an exemplary embodiment of the present invention.
FIG. 2 is a graph showing a normal distribution curve.
FIG. 3 is a graph showing a distribution curve whose skewness is less than zero.
FIG. 4 is a graph showing a distribution curve whose skewness is greater than zero.
FIG. 5 is a graph showing a distribution curve with a kurtosis greater than 3.
FIG. 6 is a graph showing a distribution curve with a kurtosis less than 3.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings. In describing the present embodiment, the same designations and the same reference numerals are used for the same components, and further description thereof will be omitted.
As shown in FIG. 1, a keyword extraction method according to an embodiment of the present invention includes: (a) classifying text data constituting a corpus according to detailed periods; (b) assigning a period weight to each of a plurality of words included in the text data; and (c) extracting, from among the plurality of words to which the period weight has been assigned in step (b), words whose weight is equal to or greater than a reference value.
That is, the present invention can select key words in consideration of the passage of time, and thereby can effectively analyze trends, interests, and the like within the period.
The conventional keyword derivation method considers only the frequency of words. However, since the frequency of use varies with time, words need to be weighted accordingly.
For example, the frequency of major words varies by day, month, and year, and some words disappear altogether. A buzzword may be used only during a particular period and rarely appear afterwards, so it is difficult for such a word or sentence to represent the whole period.
Therefore, it is very important to take the period into account: it is necessary to identify which words change over time and to select keywords according to those changes.
Further, rather than relying on manual selection of words, the present invention can provide a selection criterion that is applied automatically or derived from a calculated value. In other words, by providing an index for each word, it can suggest a criterion for selecting keywords.
Hereinafter, each of the above-described steps will be described in detail.
First, step (a) of classifying the text data constituting the corpus according to the detailed period is performed. The corpus refers to a set of data, and in the present embodiment refers to a set of text data of an online service such as the Web or SNS.
In this step, the text data existing on the online service is classified according to a predetermined criterion, such as a predetermined unit period. For example, the text data may be classified by month, or a period from one point in time to another may be used as the unit of classification.
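Step (a) can be illustrated with a short Python sketch. It assumes that each piece of text data carries a posting date, that a calendar month is used as the detailed period, and that words are separated by whitespace; the function names and the sample posts are hypothetical and do not come from the patent.

```python
from collections import Counter, defaultdict
from datetime import date

def period_key(posted_on):
    """Map a date to its detailed period; here, a (year, month) pair."""
    return (posted_on.year, posted_on.month)

def count_words_by_period(posts):
    """posts: iterable of (date, text). Returns {period: Counter of words}."""
    counts = defaultdict(Counter)
    for posted_on, text in posts:
        # Assumption: whitespace tokenization; real text may need a tokenizer.
        counts[period_key(posted_on)].update(text.lower().split())
    return counts

def frequency_series(counts, word):
    """Per-period frequency of one word, in chronological order."""
    return [counts[period][word] for period in sorted(counts)]

# Hypothetical sample data for illustration only.
posts = [
    (date(2015, 4, 3), "new phone launch"),
    (date(2015, 4, 20), "phone battery review"),
    (date(2015, 5, 2), "eclipse photos"),
    (date(2015, 5, 9), "eclipse phone photos"),
]
counts = count_words_by_period(posts)
print(frequency_series(counts, "phone"))    # frequencies for April, May
print(frequency_series(counts, "eclipse"))
```

The resulting per-period frequency list for each word is the input on which a period weight can then be computed.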
Next, step (b) of assigning a period weight to a plurality of words included in the text data is performed.
In this step, in order to extract meaningful words from among the plurality of words included in the text data classified by period in step (a), a period weight is given to each word.
In other words, commonly used terms will generally have a distribution in which the frequency of use does not change much regardless of the period, and terms that statistically deviate from such a distribution can be judged to have significance within the period.
Therefore, in this step, the uneven distribution over the entire period is given as a period weight for a plurality of words.
More specifically, in the case of this embodiment, when the frequency of use of the word according to the period is graphed, the period weight can be calculated on the basis of how far the curve is from the normal distribution curve. For this purpose, the present embodiment utilizes the skewness and kurtosis of the curve shown in the graph.
FIG. 2 is a graph showing a normal distribution curve. As shown in FIG. 2, the normal distribution curve is symmetric about its center. In other words, the closer a word's distribution is to this curve, the more it can be judged to be a word in common use regardless of the period.
FIG. 3 shows a distribution curve with a skewness less than zero, and FIG. 4 shows a distribution curve with a skewness greater than zero. In FIGS. 3 and 4, the distribution is shifted to one side of the central axis, and it can be determined that the word is used in a specific period.
In other words, the larger the magnitude of the skewness, the more pronounced this tendency, and therefore the absolute value of the skewness of the distribution curve can be used as the period weight.
In the present embodiment, the skewness used as the period weight can be calculated by the following formula.
That is, the absolute value of the skewness can be calculated by sequentially substituting the corresponding values for each of the classified periods.
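The skewness formula itself is not reproduced in this text. As a sketch only, a common moment-based definition could be written as follows. The symbols are assumptions, not taken from the patent: x_i is the frequency of the word in the i-th detailed period, n is the number of periods, and \bar{x} is the mean frequency. The patent may use a different (for example, sample-corrected) form.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\text{skewness} =
\frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{3}}
     {\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}\right)^{3/2}},
\qquad
w_{\text{skew}} = \left|\text{skewness}\right|
```

Under this definition a symmetric distribution has a skewness of zero, which matches the description of FIGS. 2 to 4.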
On the other hand, FIG. 5 shows a distribution curve with a kurtosis greater than 3, and FIG. 6 shows a distribution curve with a kurtosis less than 3. In FIGS. 5 and 6, the peak of the distribution is higher or lower than that of the normal distribution curve.
That is, the kurtosis is an index indicating how sharply peaked the distribution curve is, and a sharp peak indicates that the word is used intensively in a specific period. Therefore, a word with a high kurtosis can be regarded as an important word, or one that reflects a trend, within that period.
The kurtosis of the normal distribution curve is 3, and the absolute value of the kurtosis of a word's distribution curve can be used as the period weight.
In the present embodiment, the kurtosis as the period weight can be calculated by the following formula.
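The kurtosis formula is likewise not reproduced in this text. A common moment-based definition, given here only as a sketch, is shown below. The symbols are assumptions: x_i is the frequency of the word in the i-th detailed period, n is the number of periods, and \bar{x} is the mean frequency.

```latex
\text{kurtosis} =
\frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{4}}
     {\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}\right)^{2}},
\qquad
w_{\text{kurt}} = \left|\text{kurtosis}\right|
```

This non-excess form gives a value of 3 for a normal distribution, consistent with the reference value of 3 used in FIGS. 5 and 6.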
After step (b) has been performed as described above, step (c) is performed, in which words whose period weight is equal to or greater than a reference value are extracted from among the plurality of words to which the period weight has been assigned.
In this step, among the plurality of words to which the period weight is assigned, words whose weight is equal to or greater than a certain reference value are extracted and selected as keywords. The reference value may be a predetermined absolute value, or it may be a relative standard based on comparison with other words.
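Steps (b) and (c) can be sketched together in Python. The sketch assumes that each word's per-period frequencies (integer counts) are treated as the sample whose skewness or kurtosis is measured, which is one possible reading of the description, and it uses standard moment-based definitions rather than the patent's own formulas. The function names, the zero-variance convention, the sample data, and the reference values are all hypothetical.

```python
def central_moment(xs, k):
    """k-th central moment of the per-period frequencies xs."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)

def skewness(xs):
    m2 = central_moment(xs, 2)
    # Assumption: a perfectly flat series gets 0 (no unevenness to measure).
    return 0.0 if m2 == 0 else central_moment(xs, 3) / m2 ** 1.5

def kurtosis(xs):
    m2 = central_moment(xs, 2)
    # Same zero-variance convention as above (an assumption).
    return 0.0 if m2 == 0 else central_moment(xs, 4) / m2 ** 2

def period_weights(series_by_word, measure=skewness):
    """Step (b): absolute value of the chosen measure for every word."""
    return {w: abs(measure(xs)) for w, xs in series_by_word.items()}

def extract_keywords(series_by_word, reference_value, measure=skewness):
    """Step (c): words whose period weight is at least reference_value."""
    weights = period_weights(series_by_word, measure)
    selected = [w for w, v in weights.items() if v >= reference_value]
    return sorted(selected, key=lambda w: weights[w], reverse=True)

# Hypothetical per-period frequencies for two words over six periods.
series_by_word = {
    "weather": [10, 11, 9, 10, 11, 9],   # used evenly in every period
    "eclipse": [0, 0, 0, 0, 30, 0],      # concentrated in one period
}
print(extract_keywords(series_by_word, reference_value=1.0))
print(extract_keywords(series_by_word, reference_value=3.0, measure=kurtosis))
```

Here the reference value is an absolute threshold. A relative standard could instead be approximated by sorting all words by weight and keeping only the highest-ranked ones.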
As described above, according to the present invention, keywords can be extracted in consideration of a time period rather than a single predetermined point in time. Therefore, trends within a given period can be analyzed precisely and effectively. It is also possible to present important data for analyzing the image of companies' brands, products, and services.
The embodiments and the accompanying drawings described in the present specification are merely illustrative of some of the technical ideas included in the present invention. Accordingly, the embodiments disclosed herein are for the purpose of describing rather than limiting the technical spirit of the present invention, and it is apparent that the scope of the technical idea of the present invention is not limited by these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (6)
A keyword extraction method comprising:
(a) classifying text data constituting a corpus according to detailed periods;
(b) assigning a period weight to each of a plurality of words included in the text data; and
(c) extracting, from among the plurality of words to which the period weight has been assigned in step (b), words whose weight is equal to or greater than a reference value,
wherein, in step (b), the period weight is assigned based on the absolute value of the skewness of the distribution curve of each of the plurality of words, so that the degree of uneven distribution over the entire period is given as the period weight.
A keyword extraction method comprising:
(a) classifying text data constituting a corpus according to detailed periods;
(b) assigning a period weight to each of a plurality of words included in the text data; and
(c) extracting, from among the plurality of words to which the period weight has been assigned in step (b), words whose weight is equal to or greater than a reference value,
wherein, in step (b), the period weight is assigned based on the absolute value of the kurtosis of the distribution curve of each of the plurality of words, so that the degree of uneven distribution over the entire period is given as the period weight.
The keyword extraction method wherein the skewness of the distribution curve is calculated by a formula [equation omitted in source].
The keyword extraction method wherein the kurtosis of the distribution curve is calculated by a formula [equation omitted in source].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150087212A KR101664711B1 (en) | 2015-06-19 | 2015-06-19 | Keyword Extraction Method Using Period Weighting Value |
Publications (1)
Publication Number | Publication Date |
---|---|
KR101664711B1 (en) | 2016-10-10 |
Family
ID=57145634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150087212A KR101664711B1 (en) | 2015-06-19 | 2015-06-19 | Keyword Extraction Method Using Period Weighting Value |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101664711B1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100626817B1 (en) * | 2005-05-13 | 2006-09-20 | 한국과학기술정보연구원 | System for extracting words, system and method for management of words life cycle and medium for storing for program carrying out method of management of words life cycle |
KR20090083747A (en) | 2008-01-30 | 2009-08-04 | 삼성전자주식회사 | User terminal and method for providing summery of web page |
KR20090125559A (en) * | 2008-06-02 | 2009-12-07 | 엔에이치엔(주) | Method and system for providing search service using timeliness query |
KR101401175B1 (en) * | 2012-12-28 | 2014-05-29 | 성균관대학교산학협력단 | Method and system for text mining using weighted term frequency |
KR20150071833A (en) * | 2013-12-19 | 2015-06-29 | 한국전자통신연구원 | Processing Method For Social Media Issue and Server Device supporting the same |
- 2015-06-19: KR application KR1020150087212A, patent KR101664711B1 (active, IP Right Grant)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20190050180A (en) | 2017-11-02 | 2019-05-10 | 서강대학교산학협력단 | keyword extraction method and apparatus for science document |
KR102017227B1 (en) | 2017-11-02 | 2019-09-02 | 서강대학교산학협력단 | keyword extraction method and apparatus for science document |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment | Payment date: 20190905; Year of fee payment: 4 |