KR102042991B1 - Apparatus for tokenizing based on korean affix and method thereof - Google Patents
Apparatus for tokenizing based on korean affix and method thereof
- Publication number
- KR102042991B1
- Authority
- KR
- South Korea
- Prior art keywords
- root
- candidate
- word
- macro
- affix
- Prior art date
Classifications
- G06F17/277—
Landscapes
- Machine Translation (AREA)
Abstract
The present invention relates to a Korean affix based tokenizing device and method thereof.
A tokenizing method using the Korean affix-based tokenizing device according to the present invention comprises the steps of: receiving a Korean sentence to be tokenized; separating the input Korean sentence into word form based on a spacing criterion; outputting, among the separated words, a word corresponding to a preset condition; generating root-affix candidates, corresponding in number to the syllables of each word, for the remaining words; tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and outputting a final result using the output word and the tokenized root-affix candidate.
As described above, according to the present invention, by using the affix list and the statistical information of the document, a word is tokenized at its affix boundary, so that information of a word whose meaning would be lost when analyzed in morpheme units can be preserved.
Description
The present invention relates to a Korean affix-based tokenizing method, and more specifically, to a Korean affix-based tokenizing apparatus and method that tokenize a word at its affix boundary using an affix list and document statistical information in order to preserve word information.
Since 2013, word embedding methods and deep learning methods have been proposed, and dense vector representation-based neural networks show excellent results in various natural language processing tasks. This trend can also be seen in the fact that the share of deep learning articles published at major natural language processing conferences such as ACL, EMNLP, EACL, and NAACL has grown by more than 20% compared to 2012.
In Korean natural language processing as well, the use of word embeddings and neural networks is increasing and performance is improving. In order to use a neural network model, preprocessing that converts Korean into a vector representation is essential, and word embedding models such as Skip-gram, CBOW, and GloVe are used.
However, word embedding models such as Skip-gram, CBOW, and GloVe were developed to analyze English data. Korean and English differ in structure and grammar and cannot be analyzed using the same rules. Therefore, there is a problem in that these known word embedding models do not reflect the characteristics of Korean.
Word embedding also differs greatly between Korean and English. The words used by existing word embedding models are tokenized based on spacing; here, tokenizing means dividing a sentence into semantic units. Since English words are separated from each other by spaces, a word tokenized on spacing alone may itself be a minimal unit of meaning.
In Korean, however, function morphemes such as particles (josa) and endings (eomi) are frequently attached to words. Therefore, if such word embedding models are applied to Korean as-is, compound noun or predicate information may be split apart, resulting in loss of information.
The background technology of the present invention is disclosed in Republic of Korea Patent Publication No. 10-2006-0064447 (August 22, 2006).
SUMMARY OF THE INVENTION The present invention has been made in an effort to provide a Korean affix-based tokenizing apparatus and method that tokenize words at affix boundaries using an affix list and document statistical information in order to preserve word information.
According to an embodiment of the present invention for achieving the above technical problem, a method of tokenizing using the Korean affix-based tokenizing apparatus includes: receiving a Korean sentence to be tokenized; separating the input Korean sentence into word form based on a spacing criterion; outputting a word corresponding to a preset condition among the separated words; generating root-affix candidates, corresponding in number to the syllables of each word, for the words that are not output; tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and outputting a final result using the output word and the tokenized root-affix candidate.
The outputting of the word may include outputting the word as it is if the separated word belongs to any one of adverbs, determiners, or interjections, or if the last syllable of the word does not belong to a particle (josa) or an ending (eomi).
The generating of the root-affix candidates may include generating a root-affix candidate for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
In addition, the tokenizing may be performed by comparing the average of the probability that a root candidate is a root and the probability that an affix candidate is an affix with a threshold value, using a statistics-based score for each root-affix candidate of each word.
In addition, the tokenizing may calculate the statistics-based score using the following equation.
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
Further, in the tokenizing, when there is one root-affix candidate, the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix may be calculated separately using the following equations.
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
In addition, in the tokenizing, when there are two or more root-affix candidates, the root and the affix may be separated based on a value calculated from the following equation.
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
The outputting of the final result may be performed in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when the root-affix candidate is output.
In addition, the Korean affix-based tokenizing apparatus according to an embodiment of the present invention includes: an input unit for receiving a Korean sentence to be tokenized; a separation unit for separating the input Korean sentence into word form based on a spacing criterion; a candidate generation unit configured to generate root-affix candidates, corresponding in number to the syllables of each word, for the words that do not correspond to a preset condition among the separated words; a tokenizing unit for tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and an output unit for outputting a final result using a word corresponding to the preset condition among the separated words and the tokenized root-affix candidate.
According to the present invention, by using the affix list and the statistical information of the document, words are tokenized at affix boundaries, preserving the information of words whose meaning would be lost when analyzed in morpheme units.
In addition, according to the present invention, since only the affix list and statistical information of the document to be analyzed are used, the word boundary can be accurately determined without defining a word dictionary and a tokenizing rule for all words.
In addition, the present invention can be used for a variety of purposes, such as word sense determination and sentiment analysis, and thus has a wide range of applications.
1 is a block diagram illustrating a Korean affix based tokenizing apparatus according to an embodiment of the present invention.
2 is a flowchart illustrating an operation flow of a Korean affix based tokenizing method according to an embodiment of the present invention.
3 is an exemplary diagram for explaining a Korean affix based tokenizing method according to an embodiment of the present invention.
FIG. 4 is a diagram for describing root-affix candidates according to tokenizing position in an embodiment of the present invention.
FIG. 5 is a diagram for describing a process of obtaining the affix-variety value in the Korean affix-based tokenizing method according to an embodiment of the present invention.
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In this process, the thickness of the lines or the size of the components shown in the drawings may be exaggerated for clarity and convenience of description.
In addition, terms to be described later are terms defined in consideration of functions in the present invention, which may vary according to a user's or operator's intention or custom. Therefore, the definitions of these terms should be made based on the contents throughout the specification.
Hereinafter, the Korean affix-based tokenizing apparatus according to an embodiment of the present invention will be described in detail with reference to FIG. 1.
1 is a block diagram illustrating a Korean affix based tokenizing apparatus according to an embodiment of the present invention.
According to an embodiment of the present invention, the tokenizing apparatus 100 includes an input unit 110, a separation unit 120, a candidate generation unit 130, a tokenizing unit 140, and an output unit 150.
First, the input unit 110 receives a Korean sentence to be tokenized.
The separation unit 120 separates the input Korean sentence into word form based on a spacing criterion.
The candidate generation unit 130 generates root-affix candidates, corresponding in number to the syllables of each word, for the words that do not correspond to a preset condition among the separated words.
In this case, a word corresponds to the preset condition if it belongs to any one of adverbs, determiners, or interjections, or if its last syllable does not belong to a particle (josa) or an ending (eomi).
In addition, the candidate generation unit 130 generates the candidates by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
In addition, the tokenizing unit 140 tokenizes by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates.
At this time, the tokenizing unit 140 compares the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each root-affix candidate, and outputs the word as it is when the average value is less than or equal to the threshold.
Then, the root-affix candidate of a word having one root-affix candidate whose average value is larger than the threshold is output, and for a word having two or more root-affix candidates whose average value is larger than the threshold, the root-affix candidate whose root combines with the greatest variety of affix types is output.
Finally, the output unit 150 outputs a final result using the words output as-is and the tokenized root-affix candidates.
At this time, the output unit 150 outputs in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when a root-affix candidate is output.
Hereinafter, the Korean affix-based tokenizing method according to an embodiment of the present invention will be described with reference to FIGS. 2 to 5.
FIG. 2 is a flowchart illustrating the operational flow of the Korean affix-based tokenizing method according to an embodiment of the present invention, and FIG. 3 is an exemplary diagram for describing the method. With reference to these figures, the specific operation of the present invention will be described.
According to the exemplary embodiment of the present invention, first, the input unit 110 receives a Korean sentence to be tokenized (S210).
In the embodiment of the present invention, as illustrated in FIG. 3, the tokenizing method will be described using the sentence "Suddenly, the shopping mall shopping cart has stopped" as an example.
When the sentence "Suddenly, the shopping mall shopping cart has stopped" is input in step S210, the separation unit 120 separates the input sentence into word form based on the spacing criterion (S220).
That is, the separation unit 120 separates the sentence into the words 'suddenly', 'shopping mall', 'shopping cart', 'payment', and 'stopped'.
Next, among the separated words, a word corresponding to the preset condition is output as it is (S230).
That is, if a word separated in step S220 belongs to any one of adverbs, determiners, or interjections, or if the last syllable of the word does not belong to a particle (josa) or an ending (eomi), the word is output as it is.
In FIG. 3, for example, the last syllable of 'suddenly' belongs to the particle list, but since 'suddenly' is an adverb, the word is output without being separated. 'Shopping mall' does not belong to any of adverbs, determiners, or interjections, but the last syllable of the word, 'mall', does not belong to the particle or ending lists, so this word is also output without being separated.
However, since the last syllables of 'shopping cart', 'payment', and 'stopped' belong to a particle or an ending, these words are not output in step S230.
When a word is output as it is, it is output in the form of (word, P), that is, (suddenly, P) and (shopping mall, P). Here, P is a padding token used to fill the affix position when a word without an affix, such as an adverb or a proper noun, is expressed in the (root, affix) structure.
Therefore, (suddenly, P) and (shopping mall, P) are respectively output in step S230.
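The as-is output rule of step S230 can be sketched as follows. The category sets and word strings below are hypothetical stand-ins for the patent's adverb/determiner/interjection and particle/ending (josa/eomi) lists; they are not the actual Korean vocabulary used by the invention.

```python
# Hypothetical stand-ins for the patent's POS and particle/ending lists.
ADVERBS = {"suddenly"}             # words passed through regardless of last syllable
PARTICLE_OR_ENDING = {"ga", "da"}  # last syllables that trigger root-affix separation

def passes_preset_condition(word: str, last_syllable: str) -> bool:
    """True if the word should be output unchanged (no root-affix split)."""
    return word in ADVERBS or last_syllable not in PARTICLE_OR_ENDING

def output_as_is(word: str) -> tuple:
    # "P" is the padding token filling the empty affix slot of (root, affix).
    return (word, "P")
```

Under these toy lists, 'suddenly' passes because it is an adverb, while a word ending in a particle syllable would fall through to candidate generation.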
Subsequently, the candidate generation unit 130 generates root-affix candidates, corresponding in number to the syllables of each word, for the words not output in step S230 (S240).
In more detail, for 'shopping cart', 'payment', and 'stopped', which were not output in step S230, a root-affix candidate set corresponding to the number of syllables is generated for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
FIG. 4 is a diagram for describing root-affix candidates according to tokenizing position in an embodiment of the present invention.
For example, taking the three-syllable word 'payment value' in FIG. 4, when the number (length) of syllables constituting a word is defined as T and the tokenizing position as t, T is 3, and the candidate generation unit 130 generates three root-affix candidates for the word.
Likewise, four root-affix candidates are generated for 'shopping cart' and five for 'stopped', corresponding to their respective numbers of syllables.
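The boundary-shifting candidate generation can be sketched as below, with each character standing in for one Korean syllable. Whether the unsplit word itself counts as a candidate is an assumption made here so that a T-syllable word yields exactly T candidates, matching the counts quoted above (3, 4, and 5).

```python
def root_affix_candidates(word: str):
    """Return (root, affix) splits, shifting the boundary from the last
    syllable toward the first; each character stands in for a syllable.

    A word of T syllables yields T candidates, including the whole-word
    split with an empty affix (an assumption, see the lead-in).
    """
    return [(word[:t], word[t:]) for t in range(len(word), 0, -1)]
```

For a hypothetical three-syllable word "abc" this produces ("abc", ""), ("ab", "c"), and ("a", "bc").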
Next, the tokenizing unit 140 tokenizes by applying a statistics-based root-affix separation algorithm to each of the root-affix candidates generated in step S240 (S250).
The statistics-based root-affix separation algorithm according to an embodiment of the present invention is performed in the following steps.
First, the tokenizing unit 140 compares the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each root-affix candidate of each word.
At this time, the statistics-based score is calculated using Equation 1 below, which compares the average value s = (word(x_t) + affix(z_t)) / 2 with the threshold.
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
That is, if the average value s is greater than the threshold and there is one root-affix candidate, s itself becomes the statistics-based score; if s is greater than the threshold and there are two or more root-affix candidates, the affix-variety value of Equation 5 becomes the statistics-based score. On the other hand, when the average value s is very small, the split is regarded as an incorrect tokenization, and a word whose score is at or below the threshold is output as it is without root-affix separation.
In addition, the average value s in the embodiment of the present invention is based on an association (character n-gram) probability method, and the association probability can be expressed as Equation 2 below.
That is, the association probability is obtained as the conditional probability that letters appear in succession, and when the association value is large, the letters are judged to form one contiguous word.
Where t is the tokenizing position and k is the syllable length.
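The image of Equation 2 is not reproduced in this text, so the sketch below encodes one common reading of a character n-gram association probability: the conditional probability of the last syllable given the preceding syllables, estimated from substring counts. The corpus, words, and the `association_probability` helper are hypothetical, with each character standing in for a syllable.

```python
def association_probability(corpus_words, ngram: str) -> float:
    """P(last syllable of `ngram` | its preceding syllables), estimated
    from substring frequencies over a toy corpus of words."""
    def count(sub: str) -> int:
        # Non-overlapping substring occurrences summed over all words.
        return sum(w.count(sub) for w in corpus_words)
    prefix = ngram[:-1]
    denom = count(prefix)
    return count(ngram) / denom if denom else 0.0
```

A high value means the syllables co-occur almost whenever the prefix appears, suggesting they belong to one contiguous word.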
However, Equation 2 cannot be used when the syllable length is 1. According to the embodiment of the present invention, Equation 3 below is used to calculate the association probability when the syllable length is 1, and the case of t = T - 1 was added in Equation 4.
Hereinafter, the calculation of the probabilities using Equations 2 to 4 will be described.
First, the tokenizing unit 140 calculates the probability word(x_t) that the root candidate is a root using the following equation.
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
In addition, the tokenizing unit 140 calculates the probability affix(z_t) that the affix candidate is an affix using the following equation.
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
As a result of the comparison, if the average value s of the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix is equal to or less than the threshold, the word is output as it is.
For example, assuming that the threshold is 0.15, a split whose average value s of word(x_t) and affix(z_t) is 0.15 or less is defined as an incorrect tokenization, and the word is output as it is without being separated into a root and an affix.
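The thresholding step above can be sketched as follows. The probability values and the `statistics_score` helper name are hypothetical; the 0.15 threshold is taken from the example in the text.

```python
def statistics_score(word_prob: float, affix_prob: float, threshold: float = 0.15):
    """Average the root and affix probabilities and decide whether the
    candidate split is accepted (s must exceed the threshold)."""
    s = (word_prob + affix_prob) / 2.0
    return s, s > threshold
```

A split with probabilities (0.3, 0.2) averages to 0.25 and is accepted; one with (0.1, 0.1) averages to 0.1 and the word is kept whole.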
That is, the tokenizing unit 140 outputs a word whose score is at or below the threshold as it is, without separating it into a root and an affix.
The tokenizing unit 140 then outputs the root-affix candidate of a word having one root-affix candidate whose average value is greater than the threshold.
As a result of outputting the root-affix candidate satisfying this condition, (payment, value) is output as the tokenization result of 'payment value'.
Finally, for a word having two or more root-affix candidates whose average value is greater than the threshold, the tokenizing unit 140 outputs the root-affix candidate whose root combines with the greatest variety of affix types.
FIG. 5 is a diagram for describing the process of obtaining the affix-variety value in the Korean affix-based tokenizing method according to an embodiment of the present invention.
For example, taking 'stopped' in FIG. 5, the statistics-based score is calculated using Equation 5 below.
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
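The variable definitions above suggest reading Equation 5, whose image is not reproduced, as min-max normalization of the affix-type counts; that reading is an assumption, and the counts below are hypothetical.

```python
def affix_variety(num: int, variety_min: int, variety_max: int) -> float:
    """Min-max normalize a root candidate's affix-type count into [0, 1].

    num         -- affixVarietyNum(x_t): affix types combining with this root
    variety_min -- affixVarietyMin: smallest count over all roots
    variety_max -- affixVarietyMax: largest count over all roots
    """
    if variety_max == variety_min:
        return 0.0  # degenerate corpus: no variety spread to normalize
    return (num - variety_min) / (variety_max - variety_min)
```

The candidate whose root combines with the most affix types scores 1.0 and is therefore selected.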
In other words, among the root-affix candidates of 'stopped', the candidate whose root combines with the greatest variety of affix types, as computed with Equation 5, is selected, and the tokenizing unit 140 outputs that root-affix candidate as the tokenization result.
Finally, the output unit 150 outputs a final result using the words output as-is and the tokenized root-affix candidates (S260).
That is, the output unit 150 outputs (suddenly, P) and (shopping mall, P) in the (word, P) form, together with the separated (root, affix) pairs, as the final result.
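Under the same toy conventions as the sketches above (characters for syllables, hypothetical particle/ending set and probability table), the overall S210–S260 flow can be sketched end to end; the affix-variety tie-break of Equation 5 is omitted for brevity.

```python
# Hypothetical stand-ins: a particle/ending syllable set and a table of
# (root prob, affix prob) pairs that would come from corpus statistics.
PARTICLE_OR_ENDING = {"x"}
SPLIT_PROBS = {("paymen", "x"): (0.4, 0.3)}

def tokenize(sentence: str, threshold: float = 0.15):
    result = []
    for word in sentence.split():              # S220: split on spacing
        if word[-1] not in PARTICLE_OR_ENDING:
            result.append((word, "P"))         # S230: output as-is with padding
            continue
        best, best_s = None, threshold         # S240/S250: score candidate splits
        for t in range(1, len(word)):
            root, affix = word[:t], word[t:]
            wp, ap = SPLIT_PROBS.get((root, affix), (0.0, 0.0))
            s = (wp + ap) / 2.0
            if s > best_s:                     # keep best split exceeding threshold
                best, best_s = (root, affix), s
        result.append(best if best else (word, "P"))
    return result                              # S260: final (word/root, affix) list
```

A word ending in a non-particle syllable passes through as (word, P); a word whose best split scores above the threshold is emitted as (root, affix).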
As described above, the Korean affix-based tokenizing apparatus according to an embodiment of the present invention tokenizes words at affix boundaries using the affix list and document statistical information, and can thereby preserve the information of words whose meaning would be lost when analyzed in morpheme units.
In addition, according to an embodiment of the present invention, since only the affix list and statistical information of the document to be analyzed are used, the word boundary can be accurately determined without defining a word dictionary and a tokenizing rule for all words.
In addition, according to an embodiment of the present invention, the method can be used for a variety of purposes, such as word sense determination and sentiment analysis, and thus has a wide range of applications.
Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and it will be understood by those skilled in the art that various modifications and equivalent other embodiments are possible. Therefore, the true technical protection scope of the present invention shall be defined by the technical spirit of the following claims.
100: tokenizing device 110: input unit
120: separation unit 130: candidate generation unit
140: tokenizing unit 150: output unit
Claims (16)
Receiving a Korean sentence to be tokenized;
Separating the input Korean sentence into word form based on a spacing criterion;
Outputting a word corresponding to a preset condition among the separated words;
Generating root-affix candidates, corresponding in number to the syllables of each word, for the words that are not output;
Tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and
Outputting a final result using the output word and the tokenized root-affix candidate;
The tokenizing step,
A first step of comparing an average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each generated root-affix candidate of each word;
A second step of outputting the word as it is when the average value is less than or equal to the threshold as a result of the comparison;
A third step of outputting the root-affix candidate of a word having one root-affix candidate among the words whose average value is larger than the threshold; and
A fourth step of outputting the root-affix candidate whose root combines with the greatest variety of affix types, among the root-affix candidates of a word having two or more root-affix candidates whose average value is larger than the threshold.
The step of outputting the word,
And outputting the word as it is if the separated word belongs to any one of adverbs, determiners, or interjections, or if its last syllable does not belong to a particle (josa) or an ending (eomi).
Generating the root-affix candidates,
A tokenizing method that generates a root-affix candidate for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
The tokenizing step,
A tokenizing method for calculating the statistics-based score using the following equation:
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
The tokenizing step,
When there is one root-affix candidate, a tokenizing method for calculating the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix using the following equations, respectively:
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
The tokenizing step,
A tokenizing method for separating the root and the affix based on the value calculated from the following equation when there are two or more root-affix candidates:
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
Outputting the final result,
And outputting in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when the root-affix candidate is output.
A separation unit for separating the input Korean sentence into word form based on a spacing criterion;
A candidate generation unit configured to generate root-affix candidates, corresponding in number to the syllables of each word, for the words that do not correspond to a preset condition among the separated words;
A tokenizing unit for tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and
An output unit configured to output a final result using a word corresponding to the preset condition among the separated words and the tokenized root-affix candidate;
The tokenizing unit,
A Korean affix-based tokenizing device that compares the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each root-affix candidate of each word; outputs the word as it is when the average value is less than or equal to the threshold; outputs the root-affix candidate of a word having one root-affix candidate among the words whose average value is larger than the threshold; and outputs, for a word having two or more root-affix candidates among the words whose average value is larger than the threshold, the root-affix candidate whose root combines with the greatest variety of affix types.
The output unit,
A Korean affix-based tokenizing device that determines that a word corresponds to the preset condition and outputs it as is when the separated word belongs to any one of adverbs, determiners, or interjections, or when its last syllable does not belong to a particle (josa) or an ending (eomi).
The candidate generation unit,
A Korean affix-based tokenizing device that generates a root-affix candidate for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
The tokenizing unit,
A Korean affix-based tokenizing device for calculating the statistics-based score using the following equation:
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
The tokenizing unit,
When there is one root-affix candidate, a Korean affix-based tokenizing device for calculating the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix using the following equations:
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
The tokenizing unit,
A Korean affix-based tokenizing device that separates the root and the affix based on the value calculated from the following equation when there are two or more root-affix candidates:
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
The output unit,
A Korean affix-based tokenizing device that outputs in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when the root-affix candidate is output.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170157461 | 2017-11-23 | ||
KR20170157461 | 2017-11-23 |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20190059826A KR20190059826A (en) | 2019-05-31 |
KR102042991B1 true KR102042991B1 (en) | 2019-11-11 |
Family
ID=66657125
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020180061920A KR102042991B1 (en) | 2017-11-23 | 2018-05-30 | Apparatus for tokenizing based on korean affix and method thereof |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR102042991B1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102352163B1 (en) | 2019-11-26 | 2022-01-19 | 고려대학교 산학협력단 | Method for diagnosing language proficiency using eeg technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100835706B1 (en) * | 2007-07-09 | 2008-06-05 | 한국과학기술정보연구원 | System and method for korean morphological analysis for automatic indexing |
KR100876319B1 (en) | 2007-08-13 | 2008-12-31 | 인하대학교 산학협력단 | Apparatus for providing document clustering using re-weighted term |
WO2017090051A1 (en) | 2015-11-27 | 2017-06-01 | Giridhari Devanathan | A method for text classification and feature selection using class vectors and the system thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100910275B1 (en) * | 2007-10-25 | 2009-08-03 | 방정민 | Method and apparatus for automatic extraction of transliteration pairs in dual language documents |
KR20130098081A (en) * | 2012-02-27 | 2013-09-04 | 한국전자통신연구원 | Apparatus and method for korean morphological analysis based self learning |
-
2018
- 2018-05-30 KR KR1020180061920A patent/KR102042991B1/en active IP Right Grant
Non-Patent Citations (1)
Title |
---|
Shim, Kwangseob. 'Performance Improvement of a Statistics-Based Korean Morphological Analyzer'. The Journal of Humanities Studies, Institute of Humanities, Sungshin Women's University, Vol. 34, Feb. 2016, pp. 285-316. |
Also Published As
Publication number | Publication date |
---|---|
KR20190059826A (en) | 2019-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222349B (en) | Method and computer for deep dynamic context word expression | |
CN107291693B (en) | Semantic calculation method for improved word vector model | |
CN107480143B (en) | Method and system for segmenting conversation topics based on context correlation | |
Boudin et al. | Keyphrase extraction for n-best reranking in multi-sentence compression | |
CN108446271B (en) | Text emotion analysis method of convolutional neural network based on Chinese character component characteristics | |
Levitan et al. | Automatic identification of gender from speech | |
KR101715118B1 (en) | Deep Learning Encoding Device and Method for Sentiment Classification of Document | |
CN109325229B (en) | Method for calculating text similarity by utilizing semantic information | |
JP2004355483A (en) | Morpheme analysis device, morpheme analysis method and morpheme analysis program | |
Yildiz et al. | A morphology-aware network for morphological disambiguation | |
CN106844348B (en) | Method for analyzing functional components of Chinese sentences | |
Tan et al. | phi-LSTM: a phrase-based hierarchical LSTM model for image captioning | |
Silfverberg et al. | FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish | |
Chordia | PunKtuator: A multilingual punctuation restoration system for spoken and written text | |
Lee, Dongjun | Morpheme-based efficient Korean word embedding | |
Yuwana et al. | On part of speech tagger for Indonesian language | |
CN113065350A (en) | Biomedical text word sense disambiguation method based on attention neural network | |
KR102042991B1 (en) | Apparatus for tokenizing based on korean affix and method thereof | |
Nambiar et al. | Attention based abstractive summarization of malayalam document | |
Ananth et al. | Grammatical tagging for the Kannada text documents using hybrid bidirectional long-short term memory model | |
Sakti et al. | Incremental sentence compression using LSTM recurrent networks | |
CN107729509A (en) | The chapter similarity decision method represented based on recessive higher-dimension distributed nature | |
Saini et al. | Disfluency correction using unsupervised and semi-supervised learning | |
Nambiar et al. | Abstractive summarization of Malayalam document using sequence to sequence model | |
JP6586055B2 (en) | Deep case analysis device, deep case learning device, deep case estimation device, method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |