KR20170142526A - Apparatus and method of generating word set for analyzing text - Google Patents

Apparatus and method of generating word set for analyzing text

Info

Publication number
KR20170142526A
Authority
KR
South Korea
Prior art keywords
word
module
frequency
extracted
pixel
Prior art date
Application number
KR1020160076096A
Other languages
Korean (ko)
Inventor
천세욱
Original Assignee
천세욱
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 천세욱 filed Critical 천세욱
Priority to KR1020160076096A priority Critical patent/KR20170142526A/en
Publication of KR20170142526A publication Critical patent/KR20170142526A/en

Classifications

    • G06F17/2715
    • G06F17/211
    • G06F17/2755
    • G06F17/277

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

Disclosed are a device and a method to generate a word group for text analysis. The present invention comprises: a word extracting module extracting a word with a meaning by conducting morphological analysis on text; a word frequency calculating module calculating the frequency of the same word as the word extracted by the word extracting module; and a word group generating module inserting the word, extracted by the word extracting module, into each pixel of a word group table comprising a plurality of pixels, and inserting the same ones among the extracted words into the same pixel by displaying the frequency calculated by the frequency calculating module. According to the present invention, as a word group table is generated by extracting words from an email or long text file and the frequency and categories of the words are visually displayed, the present invention is capable of increasing work efficiency by enabling a user to recognize categories and summaries beforehand without checking the email or text file. Additionally, the user is able to get information without opening a spam email, thereby easily preventing the spread of computer viruses.

Description

APPARATUS AND METHOD OF GENERATING WORD SET FOR ANALYZING TEXT

The present invention relates to an apparatus and method for text analysis, and more particularly, to an apparatus and method for generating a word set for text analysis.

Patent Publication No. 10-1315734 discloses an arrangement for extracting and analyzing words in text.

Analysis techniques based on extracting words from text are implemented and used in various ways. They are often used to analyze big data from social network services (SNS).

However, no method has yet been disclosed for analyzing text such as a long document file or an e-mail so that its contents, or whether it is spam mail, can be confirmed visually.

If the content of a document file or e-mail can be checked visually in advance, its type, content, and importance can be grasped quickly and accurately.

From the standpoint of business or document organization, this can improve work efficiency when a large volume of documents must be handled.

Prior art document: Patent Publication No. 10-1315734

An object of the present invention is to provide a word set generation apparatus for text analysis.

Another object of the present invention is to provide a method of generating a word set for text analysis.

According to an aspect of the present invention, there is provided an apparatus for generating a word set for text analysis, the apparatus comprising: a word extraction module for extracting meaningful words by performing morphological analysis on text; a word frequency calculation module for calculating the frequency of each identical word among the words extracted by the word extraction module; and a word set generation module configured to insert the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with the frequency calculated by the word frequency calculation module.

Here, the meaningful word may be a noun or the stem of a predicate.

The word set generation module may be configured to display each pixel in which a word is inserted in gray or an RGB color, and to vary the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.

The apparatus may further include a word filtering module for determining whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word.

If the word filtering module determines that a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word, the word set generation module may be configured to display the corresponding pixel in a predetermined color.

According to another aspect of the present invention, there is provided a method of generating a word set for text analysis, the method comprising: extracting, by a word extraction module, meaningful words by performing morphological analysis on text; calculating, by a word frequency calculation module, the frequency of each identical word among the words extracted by the word extraction module; and inserting, by a word set generation module, the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with the frequency calculated by the word frequency calculation module.

Here, the meaningful word may be a noun or the stem of a predicate.

In the step of inserting the extracted words into the pixels of the word set table, the word set generation module may display each pixel in which a word is inserted in gray or an RGB color, and may vary the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.

The method may further include: determining, by the word filtering module, whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word; and, if so, displaying, by the word set generation module, the corresponding pixel in a predetermined color.

According to the apparatus and method for generating a word set for text analysis described above, words are extracted from text such as a long document file or e-mail to generate a word set table, and the frequency and category of the words are visualized in an easily readable form. The content, category, subject, and the like of an e-mail can thus be recognized in advance without directly checking its contents, which improves work efficiency.

In addition, because a spam e-mail can be identified in advance without being opened, the spread of computer viruses can be easily prevented.

FIG. 1 is a block diagram of a word set generation apparatus for text analysis according to an embodiment of the present invention.
FIGS. 2A and 2B are illustrations of word set tables according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of generating a word set for text analysis according to an embodiment of the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. Like reference numerals are used for like elements in describing each drawing.

The terms first, second, A, B, etc. may be used to describe various elements, but the elements should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or other elements may be present in between. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no other elements in between.

The terminology used in this application is used only to describe a specific embodiment and is not intended to limit the invention. The singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, the terms "comprises" or "having" and the like are used to specify that there is a feature, a number, a step, an operation, an element, a component or a combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a word set generation apparatus for text analysis according to an embodiment of the present invention, and FIGS. 2A and 2B are illustrations of word set tables according to an embodiment of the present invention.

Referring to FIG. 1, a word set generation apparatus 100 for text analysis according to an embodiment of the present invention includes a word extraction module 110, a word frequency calculation module 120, a word filtering module 130, and a word set generation module 140.

The word-set generation apparatus 100 for text analysis is configured to automatically extract words from text and visualize them in a word-set table based on the frequency or the informality.

Through the visually displayed word set table, the user can learn about the text in advance without reading it, and this visualization can be used to improve work efficiency.

Hereinafter, the detailed configuration will be described.

The word extraction module 110 may be configured to perform morphological analysis on the text to extract a word having a meaning. The word extraction module 110 may utilize analytic techniques such as text mining.

Here, in the case of Korean, a meaningful word means a word other than a particle (postposition) or the ending of a predicate. Basically, it is a noun; if the scope of word extraction is extended, the stem of a predicate corresponding to a verb or adjective can also qualify.

For example, given the text 'supporting application for domestic industrial property rights', the words 'domestic', 'industrial property rights', 'application', and 'support' can be extracted.

In the case of English, nouns are basically extracted, and verbs and adjectives can be further extracted according to extraction conditions.

The word frequency calculation module 120 may be configured to calculate the frequency of each identical word among the words extracted by the word extraction module 110.

It may sequentially accumulate the number of occurrences of each identical word.
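The accumulation step can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation; the function name `count_word_frequencies` and the sample word list are assumptions made for the example.

```python
from collections import Counter


def count_word_frequencies(words):
    """Accumulate an occurrence count for each identical word, one word at a time."""
    counts = Counter()
    for word in words:
        counts[word] += 1  # sequential accumulation per identical word
    return counts


# Hypothetical output of the word extraction step.
extracted = ["application", "support", "application", "domestic", "application"]
frequencies = count_word_frequencies(extracted)
```

Because `Counter` is a dictionary, the distinct words keep their order of first appearance, which is convenient when they are later placed into the table one after another.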

The word filtering module 130 may be configured to determine whether a word extracted from the word extraction module 110 corresponds to a predetermined category or a predetermined filtering word.

A particular category can be a business-related category, a category for a specific field, and the category can include jargon or idioms in the field.

By the above filtering, the nature and kind of the text can be known in advance.

The categories or predetermined filtering words may be abusive language, slang, or words that often appear in spam, and such categories or filtering words can be used to filter out text of that kind.
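A simple way to picture the filtering decision is a membership test against a predefined word list. The sketch below assumes case-insensitive matching and uses a hypothetical `SPAM_WORDS` list; neither detail comes from the source.

```python
def filter_words(words, category_words):
    """Map each distinct extracted word to True if it belongs to the category."""
    normalized = {w.lower() for w in category_words}
    # dict.fromkeys keeps one entry per distinct word, in first-appearance order.
    return {word: word.lower() in normalized for word in dict.fromkeys(words)}


# Hypothetical filtering words for a "spam" category.
SPAM_WORDS = {"lottery", "winner", "free"}
flags = filter_words(["Free", "meeting", "lottery", "free"], SPAM_WORDS)
```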

The word set generation module 140 may be configured to generate a word set table by inserting the word extracted from the word extraction module 110 into each pixel of a word set table composed of a plurality of pixels.

Here, the word set table can be extended according to the number of words in the text, for example to 3 × 3 pixels or 5 × 5 pixels.

Words extracted by the word extraction module 110 may be sequentially inserted into the pixels. At this time, identical words among the extracted words may be inserted into the same pixel together with the frequency calculated by the word frequency calculation module 120. That is, each distinct word is inserted only once, into one pixel, and the frequency of the word can be displayed there.
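The table construction described above might look like the following sketch. It assumes the table is the smallest square grid that holds all distinct words and that unused pixels are left empty (`None`); the source only says the table can grow with the number of words, so both choices are illustrative.

```python
import math
from collections import Counter


def build_word_set_table(words):
    """Place each distinct word, with its frequency, into one pixel of a square grid."""
    counts = Counter(words)
    cells = list(counts.items())  # (word, frequency), in first-appearance order
    # Smallest side length whose square holds every distinct word (assumption).
    side = math.isqrt(len(cells) - 1) + 1 if cells else 1
    cells += [None] * (side * side - len(cells))  # pad unused pixels
    return [cells[row * side:(row + 1) * side] for row in range(side)]


table = build_word_set_table(["a", "b", "a", "c", "d", "e"])
```

Here five distinct words yield a 3 × 3 table; the repeated word "a" occupies a single pixel that carries its count.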

Although the frequency may be displayed as a number, for more convenient visualization the word set generation module 140 may be configured to display each pixel in which a word is inserted in gray or an RGB color, and to vary the gray scale or the intensity of the RGB color according to the frequency of the word.

Thus, a pixel displayed darker indicates a word with a higher frequency. The user can easily grasp the relative frequency of each word from the difference in color or gray level even without knowing the exact frequency.
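One possible mapping from frequency to gray level, following the convention that darker means more frequent, is sketched below. The linear scaling relative to the most frequent word and the 0 to 255 range are assumptions; the source does not fix a particular formula at this point.

```python
def frequency_to_gray(counts):
    """Map each word to a gray level in 0..255, darker (lower) for higher frequency."""
    if not counts:
        return {}
    max_count = max(counts.values())
    # The most frequent word maps to 0 (black); rarer words approach 255 (white).
    return {word: round(255 * (1 - count / max_count)) for word, count in counts.items()}


grays = frequency_to_gray({"application": 4, "support": 2, "domestic": 1})
```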

On the other hand, if a word extracted by the word extraction module 110 corresponds to a predetermined category or a predetermined word, the word set generation module 140 may display the corresponding pixel in a predetermined color. For example, for a banned-word category, the pixels of banned words can be displayed in red so that the user can see how many banned words the text contains.

The word set generation module 140 may determine what proportion of all pixels such pixels occupy and classify the text as spam mail, abusive mail, unwholesome content, and the like.
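As a rough sketch of this classification, the share of flagged pixels can be compared against a cutoff. The `threshold` value of 0.3 and the labels "spam" and "normal" are hypothetical; the source does not state how large the proportion must be.

```python
def flagged_ratio(table_words, category_words):
    """Fraction of occupied pixels whose word belongs to the category."""
    if not table_words:
        return 0.0
    flagged = sum(1 for word in table_words if word in category_words)
    return flagged / len(table_words)


def classify(table_words, category_words, threshold=0.3):
    """Label the text by comparing the flagged share with a hypothetical threshold."""
    return "spam" if flagged_ratio(table_words, category_words) >= threshold else "normal"


SPAM_WORDS = {"free", "lottery", "winner"}  # hypothetical category words
label_a = classify(["free", "lottery", "meeting", "report"], SPAM_WORDS)
label_b = classify(["meeting", "report", "budget", "free"], SPAM_WORDS)
```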

Meanwhile, the word set generation module 140 may use a histogram method, a term frequency / inverted term frequency (TF / ITF) method, or a dictionary-based cluster method to specify the colors of the word set table.

These methods can also be useful for generating a word set table from multiple documents.

FIG. 2A shows an example of generating a word set table using the histogram method. A histogram of each word extracted from the entire document can be generated and processed.

The TF / ITF method calculates the TF / ITF value by multiplying the reciprocal of the total number of words by the frequency of the word.

The TF value is a count of how many occurrences of a word occur in a document.

For example, if the total number of words in the text is 100 and the frequency of the word A is 10, the TF value is 10 × (1/100) = 0.1, and a color value corresponding to 0.1 is displayed on the corresponding pixel. In the case of gray scale, this gives 254 × 0.1 = 25.4, so the pixel is displayed with a brightness of 25.4.
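The arithmetic in this example can be reproduced with a short sketch. The function name `tf_gray` and the `max_level` parameter are illustrative; the factor 254 is taken from the example above.

```python
def tf_gray(word_count, total_words, max_level=254):
    """Return the TF value and the gray brightness obtained by scaling it by max_level."""
    tf = word_count * (1 / total_words)  # frequency times the reciprocal of the word total
    return tf, max_level * tf


# Word A occurs 10 times in a text of 100 words.
tf_value, brightness = tf_gray(10, 100)
```

With these inputs the TF value is about 0.1 and the brightness about 25.4, up to floating-point rounding.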

In this case, if it becomes too dark gray, the contrast ratio can be adjusted globally as needed.

In the case of color display, a color value set to at least one of R, G, and B can be applied to select a corresponding color, and a user can arbitrarily select R, G, and B.

The ITF value counts how many occurrences of a word occur in 100 documents, and is calculated as log (number of occurrences / total number of documents).

The dictionary-based cluster method is a method of grouping similar words into a group.
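A dictionary-based grouping could be sketched as a lookup from each word to a group label. The `GROUP_DICTIONARY` contents and the fallback label "other" are hypothetical; the source does not describe the dictionary or how unknown words are handled.

```python
from collections import defaultdict

# Hypothetical dictionary mapping words to the group they belong to.
GROUP_DICTIONARY = {
    "patent": "ip",
    "trademark": "ip",
    "invoice": "finance",
    "payment": "finance",
}


def cluster_by_dictionary(words, dictionary, default="other"):
    """Group words that the dictionary assigns to the same label."""
    groups = defaultdict(list)
    for word in words:
        groups[dictionary.get(word, default)].append(word)
    return dict(groups)


clusters = cluster_by_dictionary(["patent", "invoice", "trademark", "hello"], GROUP_DICTIONARY)
```

Words in the same group could then share a color in the word set table, though that coloring choice is also an assumption.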

FIG. 3 is a flowchart illustrating a method of generating a word set for text analysis according to an embodiment of the present invention.

Referring to FIG. 3, first, the word extraction module 110 performs a morphological analysis on a text to extract a word having a meaning (S101).

At this time, a meaningful word can be a noun or the stem of a predicate.

Next, the word frequency calculation module 120 calculates the frequency of the same word among the words extracted from the word extraction module 110 (S102).

Next, the word set generation module 140 inserts the words extracted by the word extraction module 110 into respective pixels of a word set table composed of a plurality of pixels, with identical words among the extracted words inserted into the same pixel together with the frequency calculated by the word frequency calculation module 120 (S103).

Here, the word set generation module 140 may be configured to display each pixel in which a word is inserted in a gray color or an RGB color.

The word set generation module 140 may be configured to vary the gray scale or the intensity of the RGB color according to the frequency of the word inserted in each pixel.

Next, the word filtering module 130 determines whether the word extracted from the word extraction module 110 corresponds to a predetermined category or a predetermined word (S104).

If it is determined that the predetermined category or predetermined word is included, the word set generation module 140 displays the corresponding pixel in a predetermined color (S106).
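The flow of steps S101 to S106 can be sketched end to end as below. This is an illustration under stated assumptions: real morphological analysis is replaced by a naive regular-expression tokenizer with a hypothetical `STOPWORDS` list, the gray mapping is the linear one where darker means more frequent, and red is used as the predetermined color.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "for", "to", "and", "is", "in"}  # hypothetical


def extract_words(text):
    """S101 stand-in: naive tokenization instead of true morphological analysis."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def generate_word_set(text, category_words):
    """Return one cell per distinct word with its count and display color."""
    words = extract_words(text)                   # S101: extract meaningful words
    counts = Counter(words)                       # S102: frequency per identical word
    max_count = max(counts.values(), default=1)
    cells = []
    for word, count in counts.items():            # S103: one pixel per distinct word
        if word in category_words:                # S104: category / filtering check
            color = (255, 0, 0)                   # S106: predetermined color (red, assumed)
        else:
            level = round(255 * (1 - count / max_count))  # darker = more frequent
            color = (level, level, level)
        cells.append({"word": word, "count": count, "color": color})
    return cells


cells = generate_word_set("Free lottery for the winner and free prize", {"lottery", "winner"})
```

The resulting list of cells could then be laid out row by row into a square table for display.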

It will be understood by those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit and scope of the invention as defined in the following claims.

110: word extraction module
120: word frequency calculation module
130: word filtering module
140: word set generation module

Claims (9)

1. An apparatus for generating a word set for text analysis, comprising:
a word extraction module for performing morphological analysis on text to extract meaningful words;
a word frequency calculation module for calculating the frequency of each identical word among the words extracted by the word extraction module; and
a word set generation module configured to insert the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with the frequency calculated by the word frequency calculation module.
2. The apparatus according to claim 1, wherein the meaningful word is a noun or the stem of a predicate.
3. The apparatus according to claim 1, wherein the word set generation module is configured to display each pixel in which a word is inserted in gray or an RGB color, and to vary the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.
4. The apparatus according to claim 1, further comprising a word filtering module for determining whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word.
5. The apparatus according to claim 4, wherein, if the word filtering module determines that a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word, the word set generation module is configured to display the corresponding pixel in a predetermined color.
6. A method of generating a word set for text analysis, comprising:
extracting, by a word extraction module, meaningful words by performing morphological analysis on text;
calculating, by a word frequency calculation module, the frequency of each identical word among the words extracted by the word extraction module; and
inserting, by a word set generation module, the words extracted by the word extraction module into respective pixels of a word set table composed of a plurality of pixels, wherein identical words among the extracted words are inserted into the same pixel together with the frequency calculated by the word frequency calculation module.
7. The method according to claim 6, wherein the meaningful word is a noun or the stem of a predicate.
8. The method according to claim 6, wherein the inserting step comprises displaying, by the word set generation module, each pixel in which a word is inserted in gray or an RGB color, and varying the gray scale or the intensity of the RGB color according to the frequency of the word inserted into the pixel.
9. The method according to claim 6, further comprising:
determining, by a word filtering module, whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word; and
displaying, by the word set generation module, the corresponding pixel in a predetermined color if the determination result corresponds to a predetermined category or a predetermined word.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160076096A KR20170142526A (en) 2016-06-17 2016-06-17 Apparatus and method of generating word set for analyzing text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160076096A KR20170142526A (en) 2016-06-17 2016-06-17 Apparatus and method of generating word set for analyzing text

Publications (1)

Publication Number Publication Date
KR20170142526A true KR20170142526A (en) 2017-12-28

Family

ID=60939536

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160076096A KR20170142526A (en) 2016-06-17 2016-06-17 Apparatus and method of generating word set for analyzing text

Country Status (1)

Country Link
KR (1) KR20170142526A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102085216B1 (en) 2019-10-02 2020-03-04 (주)디앤아이파비스 Method, apparatus and program for calculating for weight score of word
KR20210039909A (en) 2019-10-02 2021-04-12 (주)디앤아이파비스 Method for calculating for weight score of word ussing sub-importance
KR20210039908A (en) 2019-10-02 2021-04-12 (주)디앤아이파비스 Method for calculating for weight score of word based reference information of patent document
KR20210039907A (en) 2019-10-02 2021-04-12 (주)디앤아이파비스 Method for calculating for weight score using appearance rate of word
KR20210039910A (en) 2020-02-21 2021-04-12 (주)디앤아이파비스 Method, apparatus and program for calculating for weight score of word
KR102348239B1 (en) * 2021-03-16 2022-01-07 주식회사 샌즈랩 Method for Analyzing Keywords in Email

Similar Documents

Publication Publication Date Title
KR20170142526A (en) Apparatus and method of generating word set for analyzing text
US7433548B2 (en) Efficient processing of non-reflow content in a digital image
US8023738B1 (en) Generating reflow files from digital images for rendering on various sized displays
US7788580B1 (en) Processing digital images including headers and footers into reflow content
US20160285810A1 (en) Analyzing email threads
US20160364497A1 (en) Method and device for increasing the speed of online browsing and loading of pdf document
US9064009B2 (en) Attribute cloud
US11321524B1 (en) Systems and methods for testing content developed for access via a network
CN107391684B (en) Method and system for generating threat information
US11093540B2 (en) Unstructured response extraction
CN106354731A (en) Document inspection method and device
US20210157928A1 (en) Information processing apparatus, information processing method, and program
US10191955B2 (en) Detection and visualization of schema-less data
US10019412B2 (en) Dissociative view of content types to improve user experience
KR20090108943A (en) Apparatus and Method for text extracting attached file in internet mail
JP2010102564A (en) Emotion specifying device, emotion specification method, program, and recording medium
US20120016890A1 (en) Assigning visual characteristics to records
US10606904B2 (en) System and method for providing contextual information in a document
CN105630928B (en) The identification method and device of text
US11776176B2 (en) Visual representation of directional correlation of service health
JP2018073191A (en) Project management item evaluation system and project management item evaluation method
WO2012073376A1 (en) Electronic document processing device, electronic document processing method, and computer-readable recording medium
CN109992751A (en) Amplification display method, device, electronic equipment and the storage medium of table objects
JP2022085668A (en) Character break check device, character break check method, and character break check program
JP5854957B2 (en) Information processing apparatus and feature word evaluation method

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application