KR20170142526A - Apparatus and method of generating word set for analyzing text - Google Patents
- Publication number
- KR20170142526A (application KR1020160076096A)
- Authority
- KR
- South Korea
- Prior art keywords
- word
- module
- frequency
- extracted
- pixel
- Prior art date
Classifications
- G06F17/2715
- G06F17/211
- G06F17/2755
- G06F17/277
Landscapes
- Document Processing Apparatus (AREA)
Abstract
Description
The present invention relates to an apparatus and method for text analysis, and more particularly, to an apparatus and method for generating a word set for text analysis.
Patent Publication No. 10-1315734 discloses an arrangement for extracting and analyzing words in text.
Analysis techniques based on extracting words from text are implemented and used in various ways. They are often used to analyze big data from social network services (SNS).
However, no method has yet been disclosed for analyzing the contents of a text such as a long document file or an e-mail, or for analyzing and visually confirming whether such a text is spam.
If the content of a document file or e-mail can be checked visually in advance, the type, content, and importance of the document or e-mail can be grasped quickly and accurately.
From the standpoint of business or document organization, this can improve work efficiency when a large volume of documents must be handled.
An object of the present invention is to provide a word set generation apparatus for text analysis.
Another object of the present invention is to provide a method of generating a word set for text analysis.
According to an aspect of the present invention, there is provided an apparatus for generating a word set for text analysis, the apparatus comprising: a word extraction module that extracts meaningful words by performing morphological analysis on text; a word frequency calculation module that calculates the frequency of each identical word among the words extracted by the word extraction module; and a word set generation module configured to insert the extracted words into the pixels of a word set table composed of a plurality of pixels, such that identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.
Here, the meaningful word may be a noun or the stem of a predicate.
The word set generation module may be configured to display each pixel in which a word is inserted in gray or in an RGB color, and to vary the gray level (gray scale) or the intensity of the RGB color according to the frequency of the word inserted into the pixel.
The apparatus may further comprise a word filtering module that determines whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word.
If the word filtering module determines that a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word, the word set generation module may be configured to display the corresponding pixel in a predetermined color.
According to another aspect of the present invention, there is provided a method of generating a word set for text analysis, the method comprising: extracting, by a word extraction module, meaningful words by performing morphological analysis on text; calculating, by a word frequency calculation module, the frequency of each identical word among the words extracted by the word extraction module; and inserting, by a word set generation module, the words extracted by the word extraction module into the pixels of a word set table composed of a plurality of pixels, such that identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.
Here, the meaningful word may be a noun or the stem of a predicate.
The inserting step may comprise displaying each pixel in which a word is inserted in gray or in an RGB color, and varying the gray level (gray scale) or the intensity of the RGB color according to the frequency of the word inserted into the pixel.
The method may further comprise: determining, by a word filtering module, whether a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word; and, if so, displaying, by the word set generation module, the corresponding pixel in a predetermined color.
According to the apparatus and method described above, words are extracted from a text such as a long document file or an e-mail to generate a word set table, and their frequencies and categories are displayed in an easily visualized form. The content, category, subject, and the like of a document or e-mail can therefore be recognized in advance without reading it directly, which improves work efficiency.
In addition, because spam e-mail can be identified in advance without being opened, the spread of computer viruses can be more easily prevented.
FIG. 1 is a block diagram of a word set generation apparatus for text analysis according to an embodiment of the present invention.
FIGS. 2A and 2B are illustrations of word set tables according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of generating a word set for text analysis according to an embodiment of the present invention.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. Like reference numerals are used for like elements in describing each drawing.
The terms first, second, A, B, etc. may be used to describe various elements, but the elements should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.
It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no intervening elements.
The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, terms such as "comprises" or "having" specify the presence of a feature, number, step, operation, element, component, or combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.
Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is a block diagram of a word set generation apparatus for text analysis according to an embodiment of the present invention, and FIGS. 2A and 2B are illustrations of word set tables according to an embodiment of the present invention.
Referring to FIG. 1, a word
The word-
The user can learn about the text in advance from the visually displayed word set table without reading the text itself, and the visualized text can be used to improve work efficiency.
Hereinafter, the detailed configuration will be described.
The
Here, in the case of Korean, a meaningful word means a word that remains after postpositional particles and predicate endings are excluded. Basically it is a noun, and if the scope of word extraction is extended, the stem of a predicate, that is, of a verb or an adjective, can also qualify.
For example, given the text 'supporting application for domestic industrial property rights', the words 'domestic', 'industrial property rights', 'application', and 'support' can be extracted.
In the case of English, nouns are extracted by default, and verbs and adjectives can additionally be extracted depending on the extraction conditions.
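As a rough illustration of the extraction step, the following Python sketch keeps content words and drops function words. It is not a morphological analyzer: the `FUNCTION_WORDS` set and the `extract_meaningful_words` name are hypothetical, and a real system, especially for Korean, would need a proper analyzer to separate particles and endings and to recover stems and compound nouns such as 'industrial property rights'.

```python
import re
from typing import List

# Hypothetical stand-in for function words (particles, endings, prepositions).
FUNCTION_WORDS = {"a", "an", "the", "for", "of", "to", "in", "on", "and", "or"}

def extract_meaningful_words(text: str) -> List[str]:
    """Return lowercase alphabetic tokens with function words removed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in FUNCTION_WORDS]

words = extract_meaningful_words(
    "Supporting application for domestic industrial property rights"
)
```

Because this sketch neither stems words nor joins compounds, its output differs from the extraction described in the text; it only shows where function words are filtered out.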
The word
The appearance frequencies of the same word are accumulated sequentially.
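The sequential accumulation of frequencies can be sketched in Python as follows. The function name `count_word_frequencies` is an assumption made for illustration; the text does not specify a data structure.

```python
from collections import Counter
from typing import Iterable

def count_word_frequencies(words: Iterable[str]) -> Counter:
    """Accumulate the appearance count of each identical word, one word at a time."""
    frequencies = Counter()
    for word in words:
        frequencies[word] += 1  # same word: add to the running count
    return frequencies

frequencies = count_word_frequencies(
    ["support", "application", "support", "domestic", "support"]
)
```

A `Counter` is used here because it returns zero for unseen words, which keeps the accumulation loop free of special cases.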
The
A particular category may be a business-related category or a category for a specific field, and a category may include the jargon or idioms of that field.
Through this filtering, the nature and kind of the text can be known in advance.
Categories or predefined filtering words may also be vulgar language, slang, or words that often appear in spam, and such categories or filtering words can be used to filter out texts of that kind.
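One way to picture the filtering step is a lookup of each extracted word against per-category word lists. In the Python sketch below, the `CATEGORIES` contents and the function names are hypothetical examples, not lists taken from the text.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical category word lists; real lists would be curated per field.
CATEGORIES: Dict[str, Set[str]] = {
    "patent": {"application", "claim", "priority"},
    "spam": {"lottery", "winner", "free"},
}

def match_category(word: str, categories: Dict[str, Set[str]]) -> Optional[str]:
    """Return the name of the first category containing the word, else None."""
    for name, vocabulary in categories.items():
        if word in vocabulary:
            return name
    return None

def filter_words(words: Iterable[str], categories: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each matching word to its category; non-matching words are omitted."""
    hits = {}
    for word in words:
        category = match_category(word, categories)
        if category is not None:
            hits[word] = category
    return hits

hits = filter_words(["domestic", "application", "free"], CATEGORIES)
```

The resulting mapping could then drive a highlight color per category when the word set table is drawn.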
The word
Here, the word set table can be extended according to the number of words in the text, for example to 3 × 3 pixels or 5 × 5 pixels.
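A minimal Python sketch of sizing and filling such a table is shown below. It assumes a square table whose side is the smallest integer that fits all distinct words, with a minimum of 3, and a row-by-row fill order; the text does not specify either rule, and `build_word_set_table` is a hypothetical name.

```python
import math
from typing import List, Optional, Sequence

def build_word_set_table(
    distinct_words: Sequence[str], min_side: int = 3
) -> List[List[Optional[str]]]:
    """Place each distinct word in its own cell ("pixel") of a square table."""
    count = len(distinct_words)
    # Smallest side with side * side >= count, i.e. ceil(sqrt(count)).
    side = max(min_side, math.isqrt(max(count - 1, 0)) + 1)
    table: List[List[Optional[str]]] = [[None] * side for _ in range(side)]
    for index, word in enumerate(distinct_words):
        table[index // side][index % side] = word  # fill row by row
    return table

table = build_word_set_table([f"word{i}" for i in range(10)])
```

With ten distinct words the sketch grows the table from 3 × 3 to 4 × 4 and leaves the unused cells empty.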
Words extracted from the
At this time, although the frequency may be displayed as a number, for the sake of more convenient visualization, the word set
Accordingly, a pixel displayed darker represents a word with a higher frequency. Even without knowing the exact frequency, the user can easily grasp the relative frequency of each word from the difference in color or gray level.
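The darker-means-more-frequent rule can be sketched in Python as follows. The sketch assumes 8-bit gray levels (0 is black, 255 is white), linear scaling against the most frequent word, and positive counts; none of these details is fixed by the text.

```python
from typing import Dict

def frequency_to_gray(frequencies: Dict[str, int]) -> Dict[str, int]:
    """Map each word to an 8-bit gray level; higher frequency gives a darker value."""
    if not frequencies:
        return {}
    max_frequency = max(frequencies.values())  # assumed to be at least 1
    # 255 is white and 0 is black, so subtract the scaled frequency from 255.
    return {
        word: 255 - (255 * count) // max_frequency
        for word, count in frequencies.items()
    }

shades = frequency_to_gray({"application": 10, "support": 5, "domestic": 1})
```

Scaling relative to the most frequent word keeps the full gray range in use regardless of document length, which is one possible design choice among several.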
On the other hand, if the word extracted by the
The word set
Meanwhile, the word set
This approach can also be useful for generating word set tables from multiple documents.
FIG. 2A shows an example of generating a word set table using the histogram method. A histogram of the words extracted from the entire document can be generated and processed.
The TF/ITF method calculates a TF/ITF value by multiplying the frequency of a word by the reciprocal of the total number of words.
The TF value reflects how many times a word occurs in a document.
For example, if the total number of words in the text is 100 and the frequency of word A is 10, then 10 × (1/100) = 0.1. The TF value is thus 0.1, and a color value corresponding to 0.1 is displayed in the corresponding pixel. In the case of gray scale, the value is 254 × 0.1, so the pixel is displayed with a brightness of 25.4.
If the resulting gray is too dark, the contrast ratio can be adjusted globally as needed.
In the case of color display, the color value can be applied to at least one of R, G, and B to select the corresponding color, and the user can choose among R, G, and B arbitrarily.
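The worked example above can be reproduced with a short Python sketch. The scale factor 254 is taken from the text; the function names, the rounding of color levels to integers, and the convention that unselected channels are set to 0 are assumptions made for illustration.

```python
from typing import Tuple

GRAY_SCALE_MAX = 254  # scale factor used in the worked example in the text

def tf_value(word_count: int, total_words: int) -> float:
    """Frequency of a word multiplied by the reciprocal of the total word count."""
    return word_count * (1 / total_words)

def gray_brightness(tf: float) -> float:
    """Brightness of the pixel on the gray scale."""
    return GRAY_SCALE_MAX * tf

def rgb_color(tf: float, channels: str = "R") -> Tuple[int, ...]:
    """Apply the scaled value to the user-selected channels; others stay 0 (assumed)."""
    level = round(GRAY_SCALE_MAX * tf)
    return tuple(level if channel in channels else 0 for channel in "RGB")

tf = tf_value(10, 100)
brightness = gray_brightness(tf)
color = rgb_color(tf, channels="RB")
```

For a word that appears 10 times in a 100-word text, the sketch yields a TF value of about 0.1 and a gray brightness of about 25.4, matching the figures in the example.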
The ITF value counts how many times a word occurs across a set of documents (for example, 100 documents), and is calculated as log(number of occurrences / total number of documents).
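The ITF formula as stated can be sketched in Python as follows. Two points are assumptions: the logarithm base (base 10 is used here because the text does not give one) and the reading of "number of occurrences" as the number of documents in which the word appears. Note that the conventional inverse document frequency uses the reciprocal ratio, log(total documents / documents containing the word); the sketch follows the formula as written in the text instead.

```python
import math

def itf_value(documents_with_word: int, total_documents: int) -> float:
    """log(number of occurrences / total number of documents), as stated in the text."""
    if documents_with_word <= 0 or total_documents <= 0:
        raise ValueError("counts must be positive")
    return math.log10(documents_with_word / total_documents)  # base 10 is an assumption

itf = itf_value(10, 100)
```

With this formula a word found in every document scores 0 and rarer words score increasingly negative values, so any mapping to a color value would need to account for the sign.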
The dictionary-based cluster method is a method of grouping similar words into a group.
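A dictionary-based grouping of this kind might look like the following Python sketch, in which similar words share one group label and their frequencies are summed. The `GROUP_DICTIONARY` entries and the function name are hypothetical; the text does not describe the dictionary's contents or how group frequencies are combined.

```python
from collections import Counter
from typing import Dict

# Hypothetical dictionary: each word maps to a representative group label.
GROUP_DICTIONARY: Dict[str, str] = {
    "e-mail": "mail",
    "email": "mail",
    "mail": "mail",
    "patent": "ip",
    "trademark": "ip",
}

def cluster_frequencies(frequencies: Dict[str, int], dictionary: Dict[str, str]) -> Counter:
    """Sum the frequencies of words that the dictionary assigns to the same group."""
    grouped = Counter()
    for word, count in frequencies.items():
        # Words missing from the dictionary form their own single-word group.
        grouped[dictionary.get(word, word)] += count
    return grouped

grouped = cluster_frequencies(
    {"email": 3, "mail": 2, "patent": 1, "support": 4}, GROUP_DICTIONARY
)
```

Each group could then occupy a single pixel of the word set table, which reduces the table size for texts with many near-synonyms.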
FIG. 3 is a flowchart illustrating a method of generating a word set for text analysis according to an embodiment of the present invention.
Referring to FIG. 3, first, the
Here, a meaningful word may be a noun or the stem of a predicate.
Next, the word
Next, the word set
Here, the word set
The word set
Next, the
If it is determined that the predetermined category or predetermined word is included, the word set
It will be understood by those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit and scope of the invention as defined in the following claims.
110: word extraction module
120: word frequency calculation module
130: word filtering module
140: word set generation module
Claims (9)
A word frequency calculation module for calculating a frequency of the same word among the words extracted by the word extraction module;
A word set generation module configured to insert the words extracted by the word extraction module into the pixels of a word set table composed of a plurality of pixels, such that identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.
Wherein the meaningful word is a noun or the stem of a predicate.
Wherein each pixel in which a word is inserted is displayed in gray or in an RGB color, and the gray level (gray scale) or the intensity of the RGB color is varied according to the frequency of the word inserted into the pixel.
Further comprising a word filtering module for determining whether the word extracted by the word extraction module corresponds to a predetermined category or a predetermined word.
Wherein, if the word filtering module determines that a word extracted by the word extraction module corresponds to a predetermined category or a predetermined word, the word set generation module is configured to display the corresponding pixel in a predetermined color.
Calculating a frequency of the same word among the words extracted from the word extraction module by the word frequency calculation module;
Inserting, by the word set generation module, the words extracted by the word extraction module into the pixels of a word set table composed of a plurality of pixels, such that identical words among the extracted words are inserted into the same pixel together with an indication of the frequency calculated by the word frequency calculation module.
Wherein the meaningful word is a noun or the stem of a predicate.
Wherein the word set generation module displays each pixel in which a word is inserted in gray or in an RGB color, and varies the gray level (gray scale) or the intensity of the RGB color according to the frequency of the word inserted into the pixel.
Further comprising the step of the word set generation module displaying the corresponding pixel in a predetermined color if the determination result corresponds to a predetermined category or a predetermined word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160076096A KR20170142526A (en) | 2016-06-17 | 2016-06-17 | Apparatus and method of generating word set for analyzing text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160076096A KR20170142526A (en) | 2016-06-17 | 2016-06-17 | Apparatus and method of generating word set for analyzing text |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20170142526A (en) | 2017-12-28 |
Family
ID=60939536
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020160076096A KR20170142526A (en) | 2016-06-17 | 2016-06-17 | Apparatus and method of generating word set for analyzing text |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20170142526A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102085216B1 (en) | 2019-10-02 | 2020-03-04 | (주)디앤아이파비스 | Method, apparatus and program for calculating for weight score of word |
KR20210039909A (en) | 2019-10-02 | 2021-04-12 | (주)디앤아이파비스 | Method for calculating for weight score of word ussing sub-importance |
KR20210039908A (en) | 2019-10-02 | 2021-04-12 | (주)디앤아이파비스 | Method for calculating for weight score of word based reference information of patent document |
KR20210039907A (en) | 2019-10-02 | 2021-04-12 | (주)디앤아이파비스 | Method for calculating for weight score using appearance rate of word |
KR20210039910A (en) | 2020-02-21 | 2021-04-12 | (주)디앤아이파비스 | Method, apparatus and program for calculating for weight score of word |
KR102348239B1 (en) * | 2021-03-16 | 2022-01-07 | 주식회사 샌즈랩 | Method for Analyzing Keywords in Email |
-
2016
- 2016-06-17 KR KR1020160076096A patent/KR20170142526A/en not_active Application Discontinuation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR20170142526A (en) | Apparatus and method of generating word set for analyzing text | |
US7433548B2 (en) | Efficient processing of non-reflow content in a digital image | |
US8023738B1 (en) | Generating reflow files from digital images for rendering on various sized displays | |
US7788580B1 (en) | Processing digital images including headers and footers into reflow content | |
US20160285810A1 (en) | Analyzing email threads | |
US20160364497A1 (en) | Method and device for increasing the speed of online browsing and loading of pdf document | |
US9064009B2 (en) | Attribute cloud | |
US11321524B1 (en) | Systems and methods for testing content developed for access via a network | |
CN107391684B (en) | Method and system for generating threat information | |
US11093540B2 (en) | Unstructured response extraction | |
CN106354731A (en) | Document inspection method and device | |
US20210157928A1 (en) | Information processing apparatus, information processing method, and program | |
US10191955B2 (en) | Detection and visualization of schema-less data | |
US10019412B2 (en) | Dissociative view of content types to improve user experience | |
KR20090108943A (en) | Apparatus and Method for text extracting attached file in internet mail | |
JP2010102564A (en) | Emotion specifying device, emotion specification method, program, and recording medium | |
US20120016890A1 (en) | Assigning visual characteristics to records | |
US10606904B2 (en) | System and method for providing contextual information in a document | |
CN105630928B (en) | The identification method and device of text | |
US11776176B2 (en) | Visual representation of directional correlation of service health | |
JP2018073191A (en) | Project management item evaluation system and project management item evaluation method | |
WO2012073376A1 (en) | Electronic document processing device, electronic document processing method, and computer-readable recording medium | |
CN109992751A (en) | Amplification display method, device, electronic equipment and the storage medium of table objects | |
JP2022085668A (en) | Character break check device, character break check method, and character break check program | |
JP5854957B2 (en) | Information processing apparatus and feature word evaluation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E601 | Decision to refuse application |