KR20170142526A

KR20170142526A - Apparatus and method of generating word set for analyzing text

Info

Publication number: KR20170142526A
Application number: KR1020160076096A
Authority: KR
Inventors: 천세욱
Original assignee: 천세욱
Priority date: 2016-06-17
Filing date: 2016-06-17
Publication date: 2017-12-28

Abstract

Disclosed are a device and a method to generate a word group for text analysis. The present invention comprises: a word extracting module extracting a word with a meaning by conducting morphological analysis on text; a word frequency calculating module calculating the frequency of the same word as the word extracted by the word extracting module; and a word group generating module inserting the word, extracted by the word extracting module, into each pixel of a word group table comprising a plurality of pixels, and inserting the same ones among the extracted words into the same pixel by displaying the frequency calculated by the frequency calculating module. According to the present invention, as a word group table is generated by extracting words from an email or long text file and the frequency and categories of the words are visually displayed, the present invention is capable of increasing work efficiency by enabling a user to recognize categories and summaries beforehand without checking the email or text file. Additionally, the user is able to get information without opening a spam email, thereby easily preventing the spread of computer viruses.

Description

APPARATUS AND METHOD OF GENERATING WORD SET FOR ANALYZING TEXT

The present invention relates to an apparatus and method for text analysis, and more particularly, to an apparatus and method for generating a word set for text analysis.

Patent Publication No. 10-1315734 discloses an arrangement for extracting and analyzing words in text.

Analysis techniques based on extracting words from text are implemented and used in various ways. They are often used for analyzing big data from social network services (SNS).

However, no method has yet been disclosed for analyzing text such as a long document file or an e-mail so that its contents, or whether it is spam mail, can be confirmed visually.

If the content of a document file or e-mail can be checked visually in advance, the user can quickly and accurately grasp the type of the document or e-mail, its content, and its importance.

From the viewpoint of business work or document organization, this can improve work efficiency when a large volume of documents has to be handled.

Prior art document: Patent Publication No. 10-1315734

An object of the present invention is to provide a word set generation apparatus for text analysis.

Another object of the present invention is to provide a method of generating a word set for text analysis.

According to an aspect of the present invention, there is provided an apparatus for generating a word set for text analysis, the apparatus comprising: a word extraction module for extracting words having a meaning by performing morphological analysis on text; a word frequency calculation module for calculating the frequency of identical words among the words extracted by the word extraction module; and a word set generation module configured to insert the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.

Here, the word having a meaning may be a noun or the stem of a predicate.

The word set generation module may be configured to display each pixel in which a word is inserted in gray or in an RGB color, and to vary the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.

The apparatus may further comprise a word filtering module for determining whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word.

If the word filtering module determines that a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word, the word set generation module may be configured to display the corresponding pixel in a predetermined color.

According to another aspect of the present invention, there is provided a method of generating a word set for text analysis, the method comprising: extracting, by a word extraction module, words having a meaning by performing morphological analysis on text; calculating, by a word frequency calculation module, the frequency of identical words among the words extracted by the word extraction module; and inserting, by a word set generation module, the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.

Here, the word having a meaning may be a noun or the stem of a predicate.

The step in which the word set generation module inserts the words extracted by the word extraction module into respective pixels of the word set table, with identical words inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module, may comprise: displaying each pixel in which a word is inserted in gray or in an RGB color, and varying the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.

The method may further comprise: determining, by the word filtering module, whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word; and, if the word is determined to correspond to a predetermined category or a predetermined word, displaying, by the word set generation module, the corresponding pixel in a predetermined color.

According to the apparatus and method for generating a word set for text analysis described above, words are extracted from text such as a long document file or an e-mail to generate a word set table, and the frequency and category of the words are displayed in an easily visualized form. The user can therefore recognize the content, category, subject, and the like of a document or e-mail in advance without directly checking its contents, which improves work efficiency.

In addition, because a spam e-mail can be identified in advance without being opened, the spread of computer viruses can be easily prevented.

FIG. 1 is a block diagram of a word set generation apparatus for text analysis according to an embodiment of the present invention.
FIGS. 2A and 2B are illustrations of word set tables according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of generating a word set for text analysis according to an embodiment of the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail to the concrete inventive concept. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. Like reference numerals are used for like elements in describing each drawing.

The terms first, second, A, B, etc. may be used to describe various elements, but the elements should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, but other elements may also be present in between. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no other elements in between.

The terminology used in this application is used only to describe a specific embodiment and is not intended to limit the invention. The singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, the terms "comprises" or "having" and the like are used to specify that there is a feature, a number, a step, an operation, an element, a component or a combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a word set generation apparatus for text analysis according to an embodiment of the present invention, and FIGS. 2A and 2B are illustrations of word set tables according to an embodiment of the present invention.

Referring to FIG. 1, a word set generation apparatus 100 for text analysis according to an embodiment of the present invention may include a word extraction module 110, a word frequency calculation module 120, a word filtering module 130, and a word set generation module 140.

The word-set generation apparatus 100 for text analysis is configured to automatically extract words from text and visualize them in a word-set table based on the frequency or the informality.

Through the visually displayed word set table, the user can learn what the text is about in advance without reading it, and this visualization can be used to improve work efficiency.

Hereinafter, the detailed configuration will be described.

The word extraction module 110 may be configured to perform morphological analysis on the text to extract a word having a meaning. The word extraction module 110 may utilize analytic techniques such as text mining.

Here, in the case of Korean, a meaningful word means a word excluding particles (postpositions) and predicate endings. Basically, it is a noun; if the scope of word extraction is extended, the stem of a predicate corresponding to a verb or adjective can also qualify.

For example, given the text 'supporting application for domestic industrial property rights', the words 'domestic', 'industrial property rights', 'application', and 'support' can be extracted.

In the case of English, nouns are basically extracted, and verbs and adjectives can be further extracted according to extraction conditions.
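The extraction step for English text can be sketched roughly as follows. This is only an illustration under stated assumptions: real morphological analysis would need a part-of-speech tagger or morphological analyzer, which the description does not name, so a regular-expression tokenizer and a small hand-written stop-word list stand in for it here. The names `extract_words` and `STOP_WORDS` are hypothetical.

```python
from __future__ import annotations

import re

# Hypothetical stop-word list standing in for the function words
# (articles, prepositions, auxiliaries) that morphological analysis would discard.
STOP_WORDS = {"a", "an", "the", "for", "of", "to", "in", "on", "and", "or", "is", "are"}


def extract_words(text: str) -> list[str]:
    """Return lower-cased candidate content words in order of appearance."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(extract_words("Supporting application for domestic industrial property rights"))
# ['supporting', 'application', 'domestic', 'industrial', 'property', 'rights']
```

Unlike the Korean example in the text, this simplified tokenizer neither reduces 'supporting' to a stem nor keeps 'industrial property rights' together as one compound noun; both would require a proper analyzer.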

The word frequency calculation module 120 may be configured to calculate the frequency of identical words among the words extracted by the word extraction module 110, sequentially accumulating the number of appearances of each identical word.
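The sequential accumulation of appearance counts might look like the sketch below. The function name `count_frequencies` is hypothetical, and the input is assumed to be the list of words already produced by the extraction step.

```python
from collections import Counter


def count_frequencies(words):
    """Accumulate, word by word, how often each extracted word appears."""
    counts = Counter()
    for word in words:
        counts[word] += 1  # one more appearance of this word
    return counts


counts = count_frequencies(["application", "support", "application", "rights"])
print(counts["application"])  # 2
```

A `Counter` keeps the words in order of first appearance, which is convenient when the words are later inserted into the table sequentially.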

The word filtering module 130 may be configured to determine whether a word extracted from the word extraction module 110 corresponds to a predetermined category or a predetermined filtering word.

A particular category can be a business-related category, a category for a specific field, and the category can include jargon or idioms in the field.

By the above filtering, the nature and kind of the text can be known in advance.

The categories or predetermined filtering words may also be abusive language or slang, or words that often appear in spam, and such categories or filtering words can be used to filter out text of that kind.
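A filtering check of this kind could be sketched as a lookup against per-category word lists. The category names and vocabularies below are invented purely for illustration; the description does not specify any concrete lists, and `match_category` is a hypothetical helper.

```python
# Hypothetical category vocabularies; a real deployment would supply its own.
FILTER_CATEGORIES = {
    "spam": {"free", "winner", "prize"},
    "business": {"invoice", "contract", "patent"},
}


def match_category(word, categories=FILTER_CATEGORIES):
    """Return the name of the first category whose vocabulary contains the word, else None."""
    lowered = word.lower()
    for name, vocabulary in categories.items():
        if lowered in vocabulary:
            return name
    return None


print(match_category("Prize"))    # spam
print(match_category("meeting"))  # None
```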

The word set generation module 140 may be configured to generate a word set table by inserting the word extracted from the word extraction module 110 into each pixel of a word set table composed of a plurality of pixels.

Here, the word set table can be sized according to the number of words in the text, for example 3 × 3 pixels or 5 × 5 pixels.

Words extracted by the word extraction module 110 may be sequentially inserted into the pixels. Identical words among the extracted words may be inserted into the same pixel together with the frequency calculated by the word frequency calculation module 120. That is, each distinct word is inserted only once, into a single pixel, and its frequency is displayed with it.
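One way to picture the table is as a square grid in which each distinct word occupies exactly one cell together with its count. The sketch below assumes the smallest square that fits all distinct words and leaves unused cells empty; both choices, and the name `build_word_set_table`, are assumptions rather than details from the description.

```python
from __future__ import annotations

import math
from collections import Counter


def build_word_set_table(words: list[str]) -> list[list[tuple[str, int] | None]]:
    """Place each distinct word, with its frequency, into one cell of a square grid."""
    counts = Counter(words)  # keeps order of first appearance
    cells: list[tuple[str, int] | None] = [(word, freq) for word, freq in counts.items()]
    # Smallest side length whose square holds every distinct word (at least 1).
    side = math.isqrt(len(cells) - 1) + 1 if cells else 1
    cells += [None] * (side * side - len(cells))  # pad unused pixels
    return [cells[row * side:(row + 1) * side] for row in range(side)]


table = build_word_set_table(
    ["domestic", "application", "support", "application", "rights", "support", "application"]
)
for row in table:
    print(row)
```

With four distinct words this yields a 2 × 2 table; a fifth distinct word would extend it to 3 × 3.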

Although the frequency may be displayed as a number, for more convenient visualization the word set generation module 140 may display each pixel in which a word is inserted in gray or in an RGB color, and may vary the gray scale or the intensity of the RGB color according to the frequency of the word.

Accordingly, a pixel displayed darker indicates a word with a higher frequency. The user can easily grasp the relative frequency of each word from the difference in color or gray scale even without knowing the exact frequency.
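The "darker means more frequent" rule can be illustrated with a simple linear mapping. The scaling relative to the most frequent word, the 0 to 255 brightness range, and the name `gray_levels` are assumptions made for this sketch; the description does not fix a particular formula here.

```python
def gray_levels(frequencies, gray_max=255):
    """Map each word to a brightness value where 0 is black and gray_max is white.

    The most frequent word gets brightness 0 (darkest); rarer words are lighter.
    """
    if not frequencies:
        return {}
    peak = max(frequencies.values())
    return {
        word: gray_max - (gray_max * freq) // peak
        for word, freq in frequencies.items()
    }


print(gray_levels({"application": 4, "support": 2, "rights": 1}))
# {'application': 0, 'support': 128, 'rights': 192}
```

For an RGB display, the same value could be applied to a single channel chosen by the user instead of all three.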

On the other hand, if a word extracted by the word extraction module 110 corresponds to a predetermined category or a predetermined word, the word set generation module 140 may display the corresponding pixel in a predetermined color. For example, pixels for a banned-word category can be shown in red so that the user can see how many banned words the text contains.

The word set generation module 140 may determine what proportion of all pixels is occupied by such pixels and classify the text as spam mail, abusive mail, unwholesome content, and the like.
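Classification by the share of flagged pixels might be sketched as below. The 30% threshold, the two labels, and the helper names are all hypothetical; the description only says that the proportion of such pixels is judged.

```python
def flagged_share(table_words, flagged_vocabulary):
    """Fraction of occupied pixels whose word belongs to the flagged vocabulary."""
    if not table_words:
        return 0.0
    flagged = sum(1 for word in table_words if word in flagged_vocabulary)
    return flagged / len(table_words)


def classify(table_words, flagged_vocabulary, threshold=0.3):
    """Label the text 'spam' when flagged pixels reach the (assumed) threshold."""
    share = flagged_share(table_words, flagged_vocabulary)
    return "spam" if share >= threshold else "normal"


spam_words = {"free", "prize", "winner"}
print(classify(["free", "prize", "meeting", "winner"], spam_words))  # spam
print(classify(["meeting", "report"], spam_words))                   # normal
```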

Meanwhile, the word set generation module 140 may use a histogram method, a term frequency / inverted term frequency (TF / ITF) method, or a dictionary-based cluster method to determine the colors of the word set table.

These approaches can also be useful for generating word set tables across multiple documents.

FIG. 2A shows an example of generating a word set table using the histogram method. A histogram of each word extracted from the entire document can be generated and processed.

In the TF / ITF method, the TF value is calculated by multiplying the frequency of a word by the reciprocal of the total number of words.

The TF value is a count of how many occurrences of a word occur in a document.

For example, if the total number of words in the text is 100 and the frequency of word A is 10, then 10 × (1/100) = 1/10. Here, the TF value becomes 0.1, and a color value corresponding to 0.1 is displayed on the corresponding pixel. In the case of gray scale, the value becomes 254 × 0.1, and the pixel is displayed with a brightness of 25.4.
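The arithmetic of this example can be reproduced in a few lines. The function name `term_frequency` is hypothetical, and the gray-scale maximum of 254 is taken as written in the example above rather than chosen independently.

```python
GRAY_SCALE_MAX = 254  # value used in the worked example of the description


def term_frequency(word_count, total_words):
    """TF as described: the word's frequency times the reciprocal of the total word count."""
    return word_count / total_words


tf = term_frequency(10, 100)
brightness = GRAY_SCALE_MAX * tf
print(f"TF = {tf:.1f}, brightness = {brightness:.1f}")  # TF = 0.1, brightness = 25.4
```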

In this case, if it becomes too dark gray, the contrast ratio can be adjusted globally as needed.

In the case of color display, a color value set to at least one of R, G, and B can be applied to select a corresponding color, and a user can arbitrarily select R, G, and B.

The ITF value reflects how often a word occurs across a set of documents (for example, 100 documents), and is calculated as log(number of occurrences / total number of documents).
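The ITF expression can be evaluated as in the sketch below, which follows the formula exactly as written in the description. The natural logarithm is an assumption, since the base is not stated, and `inverted_term_frequency` is a hypothetical name.

```python
import math


def inverted_term_frequency(occurrences, total_documents):
    """log(occurrences / total_documents), using the natural logarithm (assumed base)."""
    if occurrences <= 0 or total_documents <= 0:
        raise ValueError("both arguments must be positive")
    return math.log(occurrences / total_documents)


print(inverted_term_frequency(100, 100))      # 0.0
print(inverted_term_frequency(10, 100) < 0)   # True
```

Written this way the value is zero or negative whenever the count does not exceed the number of documents; the more widely used inverse document frequency takes the logarithm of the reciprocal ratio instead.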

The dictionary-based cluster method is a method of grouping similar words into a group.
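As a rough illustration of dictionary-based grouping, the sketch below merges the counts of words that a lookup dictionary assigns to the same group. The dictionary contents and the name `group_words` are invented for the example; the description does not say how the dictionary is built or how similarity is defined.

```python
# Hypothetical dictionary mapping individual words to a shared group label.
GROUP_DICTIONARY = {
    "patent": "intellectual property",
    "trademark": "intellectual property",
    "invoice": "finance",
    "payment": "finance",
}


def group_words(counts, dictionary=GROUP_DICTIONARY):
    """Sum word frequencies per group; words missing from the dictionary stay on their own."""
    grouped = {}
    for word, freq in counts.items():
        group = dictionary.get(word, word)
        grouped[group] = grouped.get(group, 0) + freq
    return grouped


print(group_words({"patent": 2, "trademark": 1, "payment": 3, "meeting": 1}))
# {'intellectual property': 3, 'finance': 3, 'meeting': 1}
```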

FIG. 3 is a flowchart illustrating a method of generating a word set for text analysis according to an embodiment of the present invention.

Referring to FIG. 3, first, the word extraction module 110 performs a morphological analysis on a text to extract a word having a meaning (S101).

Here, a meaningful word can be a noun or the stem of a predicate.

Next, the word frequency calculation module 120 calculates the frequency of the same word among the words extracted from the word extraction module 110 (S102).

Next, the word set generation module 140 inserts the words extracted by the word extraction module 110 into respective pixels of a word set table composed of a plurality of pixels, and identical words among the extracted words are inserted into the same pixel together with the frequency calculated by the word frequency calculation module 120 (S103).

Here, the word set generation module 140 may be configured to display each pixel in which a word is inserted in a gray color or an RGB color.

The word set generation module 140 may be configured to vary the gray scale or the intensity of the RGB color according to the frequency of the word inserted in each pixel.

Next, the word filtering module 130 determines whether the word extracted from the word extraction module 110 corresponds to a predetermined category or a predetermined word (S104).

If it is determined that the word corresponds to the predetermined category or predetermined word, the word set generation module 140 displays the corresponding pixel in a predetermined color (S106).
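The flow from S101 through S106 can be strung together in a compact sketch. Everything concrete here is an assumption for illustration: the regular-expression tokenizer with a stop-word list replaces real morphological analysis, the banned-word list is invented, red is used as the predetermined color, and the gray mapping is a simple linear scale.

```python
from __future__ import annotations

import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "for", "of", "to", "in", "on", "and", "or", "is", "are"}
BANNED_WORDS = {"free", "prize"}  # hypothetical predetermined words
HIGHLIGHT_COLOR = (255, 0, 0)     # assumed predetermined color (red)


def generate_word_set(text: str, banned: set[str] = BANNED_WORDS) -> list[dict]:
    """Return one pixel record (word, frequency, RGB color) per distinct word."""
    # S101: extract candidate content words (stand-in for morphological analysis).
    words = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    # S102: calculate the frequency of identical words.
    counts = Counter(words)
    peak = max(counts.values(), default=1)
    pixels = []
    for word, freq in counts.items():
        # S103: one pixel per distinct word, darker gray for higher frequency.
        gray = 255 - (255 * freq) // peak
        color = (gray, gray, gray)
        # S104 and S106: words matching the predetermined list get the predetermined color.
        if word in banned:
            color = HIGHLIGHT_COLOR
        pixels.append({"word": word, "frequency": freq, "color": color})
    return pixels


for pixel in generate_word_set("Free prize! Claim the free prize for the meeting"):
    print(pixel)
```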

It will be understood by those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit and scope of the invention as defined in the following claims.

110: word extraction module
120: word frequency calculation module
130: word filtering module
140: word set generation module

Claims

1. An apparatus for generating a word set for text analysis, comprising:
a word extraction module for performing morphological analysis on text to extract words having a meaning;
a word frequency calculation module for calculating the frequency of identical words among the words extracted by the word extraction module; and
a word set generation module configured to insert the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.

2. The apparatus according to claim 1,
wherein the word having a meaning is a noun or the stem of a predicate.

3. The apparatus according to claim 1,
wherein the word set generation module is configured to display each pixel in which a word is inserted in gray or in an RGB color, and to vary the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.

4. The apparatus according to claim 1,
further comprising a word filtering module for determining whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word.

5. The apparatus according to claim 4,
wherein, if the word filtering module determines that a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word, the word set generation module is configured to display the corresponding pixel in a predetermined color.

6. A method of generating a word set for text analysis, comprising:
extracting, by a word extraction module, words having a meaning by performing morphological analysis on text;
calculating, by a word frequency calculation module, the frequency of identical words among the words extracted by the word extraction module; and
inserting, by a word set generation module, the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.

7. The method according to claim 6,
wherein the word having a meaning is a noun or the stem of a predicate.

8. The method according to claim 6, wherein the step in which the word set generation module inserts the words extracted by the word extraction module into respective pixels of the word set table, with identical words inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module, comprises:
displaying, by the word set generation module, each pixel in which a word is inserted in gray or in an RGB color, and varying the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.

9. The method according to claim 6, further comprising: determining, by a word filtering module, whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word; and
displaying, by the word set generation module, the corresponding pixel in a predetermined color if the word is determined to correspond to a predetermined category or a predetermined word.