KR102042991B1 - Apparatus for tokenizing based on korean affix and method thereof - Google Patents

Apparatus for tokenizing based on korean affix and method thereof

Info

Publication number
KR102042991B1
Authority
KR
South Korea
Prior art keywords
root
candidate
word
affix
Prior art date
Application number
KR1020180061920A
Other languages
Korean (ko)
Other versions
KR20190059826A (en)
Inventor
박영호
이지혜
임선영
Original Assignee
숙명여자대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 숙명여자대학교산학협력단
Publication of KR20190059826A
Application granted
Publication of KR102042991B1

Classifications

    • G06F17/277

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to a Korean affix-based tokenizing apparatus and method.
A tokenizing method using the Korean affix-based tokenizing apparatus according to the present invention comprises the steps of: receiving a Korean sentence to be tokenized; separating the input Korean sentence into word units based on spacing; outputting, as they are, separated words that meet a preset condition; generating, for each remaining word, root-affix candidates whose number corresponds to the number of syllables of the word; tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and outputting a final result using the output words and the tokenized root-affix candidates.
As described above, according to the present invention, by using an affix list and statistical information from the document, words are tokenized at affix boundaries, so that the information of words whose meaning would be lost under morpheme-level analysis can be preserved.

Description

Korean Affix-Based Tokenizing Apparatus and Method {APPARATUS FOR TOKENIZING BASED ON KOREAN AFFIX AND METHOD THEREOF}

The present invention relates to a Korean affix-based tokenizing method and, more specifically, to a tokenizing apparatus and method that tokenize words at affix boundaries using an affix list and document statistical information in order to preserve word information.

Since 2013, word embedding and deep learning methods have been proposed, and dense-vector-representation-based neural networks have shown excellent results in various natural language processing tasks. This trend can also be seen in the fact that the percentage of deep learning papers published at major natural language processing conferences such as ACL, EMNLP, EACL, and NAACL has grown by more than 20% compared to 2012.

In Korean natural language processing, the use of word embedding and neural networks is increasing and performance is improving. To use a neural network model, preprocessing that converts Korean into a vector representation is essential, and word embedding models such as Skip-gram, CBOW, and GloVe are used.

However, word embedding models such as Skip-gram, CBOW, and GloVe were developed to analyze English data. Korean and English differ in structure and grammar and cannot be analyzed with the same rules. Therefore, the known word embedding models do not reflect the characteristics of Korean.

Word embedding also differs greatly between Korean and English. The words used by existing word embedding models are tokenized based on spacing. Here, tokenizing means dividing a sentence into semantic units. Since English separates meaning units with spaces, a token obtained by spacing-based tokenization is already a minimal unit of meaning.

In Korean, however, function morphemes such as particles and endings are frequently attached to words. Therefore, if these word embedding models are applied to Korean, compound nouns or inflectional information may be split apart, resulting in a loss of information.

The background art of the present invention is disclosed in Korean Patent Publication No. 10-2006-0064447 (published August 22, 2006).

SUMMARY OF THE INVENTION The present invention has been made in an effort to provide a Korean affix-based tokenizing apparatus and method that tokenize words at affix boundaries by using an affix list and document statistical information, so as to preserve word information.

According to an embodiment of the present invention for achieving the above technical object, a tokenizing method using the Korean affix-based tokenizing apparatus includes: receiving a Korean sentence to be tokenized; separating the input Korean sentence into word units based on spacing; outputting, as they are, words that meet a preset condition among the separated words; generating, for each word not output, root-affix candidates whose number corresponds to the number of syllables of the word; tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and outputting a final result using the output words and the tokenized root-affix candidates.

The outputting of the word may include outputting the word as it is if the separated word is an adverb, a determiner, or an interjection, or if the last syllable of the word is neither a particle nor an ending.

The generating of the root-affix candidates may include generating root-affix candidates for each word, one per syllable, by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.

In addition, the tokenizing may include: a first step of comparing, with a threshold, the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, using a statistics-based score for each root-affix candidate of each word; a second step of outputting the word as it is when, as a result of the comparison, the average value is less than or equal to the threshold; a third step of outputting the root-affix candidate of a word having one root-affix candidate among the words whose average value is greater than the threshold; and a fourth step of outputting, among the root-affix candidates of a word having two or more root-affix candidates among the words whose average value is greater than the threshold, the candidate whose root combines with the greatest variety of affixes.

In addition, the tokenizing may calculate the statistics-based score using the following Equation 1.

(Equation 1)
s = (word(x_t) + affix(z_t)) / 2
score = s, if s > threshold and candidate = 1
score = affixVariety(x_t), if s > threshold and candidate ≥ 2

Here, s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold for s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.

Further, in the tokenizing, when there is one root-affix candidate, the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix may each be calculated using the following Equations 3 and 4.

(Equation 3)

(Equation 4)

Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.

In addition, in the tokenizing, when there are two or more root-affix candidates, the root and the affix may be separated based on the value calculated from the following Equation 5.

(Equation 5)
affixVariety(x_t) = (affixVarietyNum(x_t) - affixVarietyMin) / (affixVarietyMax - affixVarietyMin)

Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.

The outputting of the final result may output in the form (word, P) when the word is output as it is, and in the form (root, affix) when a root-affix candidate is output.

In addition, a Korean affix-based tokenizing apparatus according to an embodiment of the present invention includes: an input unit for receiving a Korean sentence to be tokenized; a separating unit for separating the input Korean sentence into word units based on spacing; a candidate generating unit for generating, for each separated word that does not meet a preset condition, root-affix candidates whose number corresponds to the number of syllables of the word; a tokenizing unit for tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and an output unit for outputting a final result using the words meeting the preset condition among the separated words and the tokenized root-affix candidates.

According to the present invention, by using an affix list and document statistical information, words are tokenized at affix boundaries, preserving the information of words whose meaning would be lost under morpheme-level analysis.

In addition, according to the present invention, since only an affix list and statistical information of the document being analyzed are used, word boundaries can be accurately determined without defining a word dictionary or tokenizing rules for every word.

In addition, the present invention can be used for a variety of purposes, such as word sense determination and sentiment analysis, and has a wide range of applications.

FIG. 1 is a block diagram illustrating a Korean affix-based tokenizing apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating the operation flow of a Korean affix-based tokenizing method according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram for explaining a Korean affix-based tokenizing method according to an embodiment of the present invention.
FIG. 4 is a diagram for describing root-affix candidates according to tokenizing position in an embodiment of the present invention.
FIG. 5 is a diagram for describing the process of obtaining the affix variety value in the Korean affix-based tokenizing method according to an embodiment of the present invention.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In this process, the thickness of the lines or the size of the components shown in the drawings may be exaggerated for clarity and convenience of description.

In addition, terms to be described later are terms defined in consideration of functions in the present invention, which may vary according to a user's or operator's intention or custom. Therefore, the definitions of these terms should be made based on the contents throughout the specification.

Hereinafter, a Korean affix-based tokenizing apparatus according to an embodiment of the present invention will be described in detail with reference to FIG. 1.

FIG. 1 is a block diagram illustrating a Korean affix-based tokenizing apparatus according to an embodiment of the present invention.

According to an embodiment of the present invention, the tokenizing apparatus 100 includes an input unit 110, a separating unit 120, a candidate generating unit 130, a tokenizing unit 140, and an output unit 150.

First, the input unit 110 receives a Korean sentence to be tokenized.

The separating unit 120 separates the Korean sentence received by the input unit 110 into word units based on spacing.

The candidate generating unit 130 generates, for each word that does not meet the preset condition among the words separated by the separating unit 120, root-affix candidates whose number corresponds to the number of syllables of the word.

Here, if a word separated by the separating unit 120 is an adverb, a determiner, or an interjection, or if its last syllable is neither a particle nor an ending, the candidate generating unit 130 regards the word as meeting the preset condition and outputs it as it is.

In addition, the candidate generating unit 130 generates root-affix candidates, one per syllable of each word, by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.

The tokenizing unit 140 tokenizes by applying a statistics-based root-affix separation algorithm to each of the root-affix candidates generated by the candidate generating unit 130.

Specifically, the tokenizing unit 140 uses a statistics-based score for each root-affix candidate of each word generated by the candidate generating unit 130 to compare, with a threshold, the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix; if the average value is less than or equal to the threshold, the word is output as it is.

Then, for a word whose average value is greater than the threshold and that has exactly one root-affix candidate, that candidate is output; for a word whose average value is greater than the threshold and that has two or more root-affix candidates, the candidate whose root combines with the greatest variety of affixes is output.

Finally, the output unit 150 outputs the final result using the words meeting the preset condition among the words separated by the separating unit 120 and the root-affix candidates tokenized by the tokenizing unit 140.

Here, the output unit 150 outputs in the form (word, P) when a word is output as it is, and in the form (root, affix) when a root-affix candidate is output.
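For illustration only, the overall flow of the apparatus in FIG. 1 can be sketched in Python as follows; the helper names passes_preset_condition, generate_candidates, and separate_root_affix are hypothetical stand-ins for the operations of the respective units, not identifiers from the patent.

```python
# A minimal sketch of the pipeline of FIG. 1, assuming hypothetical
# helpers for each unit; tokens are (root, affix) pairs and "P" is
# the padding token for words output as-is.
def tokenize_sentence(sentence, passes_preset_condition,
                      generate_candidates, separate_root_affix):
    results = []
    for word in sentence.split():              # separating unit 120
        if passes_preset_condition(word):      # preset condition check
            results.append((word, "P"))        # no affix: pad with P
        else:
            candidates = generate_candidates(word)           # unit 130
            results.append(separate_root_affix(word, candidates))  # unit 140
    return results                             # output unit 150
```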

Hereinafter, the Korean affix-based tokenizing method according to an embodiment of the present invention will be described with reference to FIGS. 2 to 5.

FIG. 2 is a flowchart illustrating the operation flow of a Korean affix-based tokenizing method according to an embodiment of the present invention, and FIG. 3 is an exemplary diagram for describing the method. The specific operation of the present invention will be described with reference to these figures.

According to the exemplary embodiment of the present invention, first, the input unit 110 of the tokenizing apparatus 100 receives a Korean sentence to be tokenized (S210).

In the embodiment of the present invention, as illustrated in FIG. 3, the tokenizing method will be described using the Korean sentence meaning "Suddenly, the shopping mall shopping cart payment has stopped" as an example.

When the sentence "Suddenly, the shopping mall shopping cart payment has stopped" is input in step S210, the separating unit 120 separates the input Korean sentence into word units based on spacing (S220).

That is, the separating unit 120 separates the sentence input in step S210 into the word units 'suddenly', 'shopping mall', 'shopping cart', 'payment', and 'stopped'.

Next, the tokenizing apparatus 100 outputs, as they are, the words that meet the preset condition among the words separated in step S220 (S230).

That is, if a word separated in step S220 is an adverb, a determiner, or an interjection, or if its last syllable is neither a particle nor an ending, the word is output as it is.

That is, in FIG. 3, the last syllable of 'suddenly' coincides with a particle, but since 'suddenly' is an adverb, the word is output without being separated; 'shopping mall' is not an adverb, a determiner, or an interjection, but since its last syllable, 'mall', is neither a particle nor an ending, this word is also output without being separated.

However, since the last syllables of 'shopping cart', 'payment', and 'stopped' correspond to a particle or an ending, these words are not output in step S230.
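A minimal sketch of this preset condition, assuming externally supplied part-of-speech and particle/ending lexicons (both hypothetical placeholders, not resources named in the patent):

```python
# A sketch of the preset condition of step S230, assuming simple
# lexicon lookups; the lexicon arguments are hypothetical.
NO_AFFIX_POS = {"adverb", "determiner", "interjection"}

def passes_preset_condition(word, pos_lexicon, particle_or_ending_syllables):
    # Output the word as-is if it is an adverb, determiner, or
    # interjection, or if its last syllable is neither a particle
    # nor an ending syllable.
    if pos_lexicon.get(word) in NO_AFFIX_POS:
        return True
    return word[-1] not in particle_or_ending_syllables
```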

When a word is output as it is, it is output in the form (word, P), that is, (suddenly, P) and (shopping mall, P). Here, P is a padding token that fills the affix slot when a word without an affix, such as an adverb or a proper noun, is expressed in the (root, affix) structure.

Therefore, (suddenly, P) and (shopping mall, P) are output in step S230.

Subsequently, the candidate generating unit 130 generates, for each word not output in step S230, root-affix candidates whose number corresponds to the number of syllables of the word (S240).

More specifically, for 'shopping cart', 'payment', and 'stopped', which were not output in step S230, the split boundary is shifted one syllable at a time from the last syllable toward the first syllable, generating for each word as many root-affix candidates as it has syllables.

FIG. 4 is a diagram for describing root-macro candidates according to torqueing positions in an embodiment of the present invention.

For example, referring to FIG. 4 and taking the three-syllable word consisting of 'payment' plus its attached particle as an example, if the number (length) of syllables constituting the word is defined as T and the tokenizing position as t, then T is 3, and the candidate generating unit 130 generates three root-affix candidates for this word: one at t = 0, one at t = 1, and one at t = T - 1, each splitting the word into a root and an affix at a different syllable boundary.

Likewise, 'shopping cart' and 'stopped' generate four and five root-affix candidates, respectively, corresponding to their numbers of syllables.
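A minimal sketch of step S240, under the assumption that each Korean syllable block is one character of the string; the function name generate_candidates is a hypothetical stand-in:

```python
# Root-affix candidate generation (step S240): the boundary is
# shifted one syllable at a time from the last syllable toward the
# first, so a T-syllable word yields T candidates (at t = T - 1 the
# affix is empty).
def generate_candidates(word):
    T = len(word)  # each Korean syllable block is one character
    return [(word[:t + 1], word[t + 1:]) for t in range(T - 1, -1, -1)]
```

For a three-syllable word this yields three (root, affix) pairs, including the whole word with an empty affix at t = T - 1, matching the example of FIG. 4.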

Next, the tokenizing unit 140 tokenizes by applying a statistics-based root-affix separation algorithm to each of the root-affix candidates generated in step S240 (S250).

The statistics-based root-affix separation algorithm according to an embodiment of the present invention is performed in the following steps.

First, the tokenizing unit 140 uses the statistics-based score for each root-affix candidate of each word generated in step S240 to compare, with a threshold, the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix.

Here, the statistics-based score is calculated using the following Equation 1.

(Equation 1)
s = (word(x_t) + affix(z_t)) / 2
score = s, if s > threshold and candidate = 1
score = affixVariety(x_t), if s > threshold and candidate ≥ 2

Here, s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold for s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.

That is, when the average value s is greater than the threshold and there is one root-affix candidate, s itself is used as the statistics-based score; when the average value s is greater than the threshold and there are two or more root-affix candidates, the value of affixVariety(x_t) becomes the statistics-based score.

On the other hand, a very small average value s is taken to indicate a wrong tokenization, and a word whose score is at or below the threshold is output as it is, without being separated into a root and an affix.

In addition, the average value s in the embodiment of the present invention is based on a character n-gram association probability method, and the association probability can be expressed as Equation 2 below.

(Equation 2)
P(c_t | c_(t-k), ..., c_(t-1)) = freq(c_(t-k) ... c_t) / freq(c_(t-k) ... c_(t-1))

That is, the association probability method obtains the conditional probability that characters appear in succession and judges that the characters form one continuous word when the association value is large.

Here, t is the tokenizing position and k is the syllable length.

However, Equation 2 cannot be used when the syllable length is 1. Accordingly, in the embodiment of the present invention, Equation 3 below is used to calculate the association probability when the syllable length is 1, and the t = T - 1 case is added in Equation 4.
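As an illustrative sketch of Equation 2 only, the association probability can be estimated from character n-gram frequency counts; the counting scheme below is an assumption, not the patent's exact procedure.

```python
from collections import Counter

def build_ngram_counts(words, max_n):
    # Count every character n-gram (up to max_n syllables) over the
    # word list of the document being analyzed.
    counts = Counter()
    for w in words:
        for n in range(1, max_n + 1):
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    return counts

def association_probability(counts, chars, t, k):
    # P(c_t | c_(t-k) ... c_(t-1)) as a frequency ratio (Equation 2);
    # assumes 0 <= t - k and t < len(chars).
    numerator = counts[chars[t - k:t + 1]]
    denominator = counts[chars[t - k:t]]
    return numerator / denominator if denominator else 0.0
```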

Hereinafter, the process of calculating word(x_t) and affix(z_t) of Equation 1 using Equation 2 will be described.

First, the tokenizing unit 140 calculates the probability word(x_t) that the root candidate is a root using Equation 3 below.

(Equation 3)

Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.

In addition, the tokenizing unit 140 calculates the probability affix(z_t) that the affix candidate is an affix using Equation 4 below.

(Equation 4)

Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.

As a result of the comparison, if the average value s of the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix is less than or equal to the threshold, the word is output as it is.

For example, assuming the threshold is 0.15, a candidate whose average value s of word(x_t) and affix(z_t) is 0.15 or less is regarded as a wrong tokenization, and the word is output as it is without being separated into a root and an affix.

That is, among the root-affix candidates generated in step S240 for 'shopping cart', 'payment', and 'stopped', every candidate for 'shopping cart' yields, when its score is calculated with Equation 1, an average value s at or below the threshold, so 'shopping cart' is output as (shopping cart, P).

The tokenizing unit 140 then outputs the root-affix candidate of any word that has exactly one root-affix candidate whose average value s of word(x_t) and affix(z_t) is greater than the threshold.

When the root-affix candidate satisfying this condition is output, the candidate separating the root 'payment' from its particle is output as the tokenization result for that word.

Finally, for a word that has two or more root-affix candidates whose average value s of word(x_t) and affix(z_t) is greater than the threshold, the tokenizing unit 140 outputs, among those candidates, the one whose root combines with the greatest variety of affixes.
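Putting the three cases together, the selection logic of step S250 might be sketched as follows, where word_prob, affix_prob, and affix_variety are hypothetical stand-ins for Equations 3, 4, and 5, not the patent's own code:

```python
# A sketch of the statistics-based root-affix separation (step S250),
# assuming word_prob, affix_prob, and affix_variety implement
# Equations 3, 4, and 5. Returns a (root, affix) or (word, "P") pair.
def separate_root_affix(word, candidates, word_prob, affix_prob,
                        affix_variety, threshold=0.15):
    # Keep candidates whose average score s exceeds the threshold.
    passing = [(root, affix) for root, affix in candidates
               if (word_prob(root) + affix_prob(affix)) / 2 > threshold]
    if not passing:                  # wrong tokenization: keep the word
        return (word, "P")
    if len(passing) == 1:            # a single surviving candidate
        return passing[0]
    # Two or more candidates: pick the root combining with the
    # greatest variety of affixes (Equation 5).
    return max(passing, key=lambda c: affix_variety(c[0]))
```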

FIG. 5 is a diagram for describing the process of obtaining the affix variety value in the Korean affix-based tokenizing method according to an embodiment of the present invention.

For example, describing 'stopped' with reference to FIG. 5, a statistics-based score is calculated with Equation 1 for each of the root-affix candidates of 'stopped' generated in step S240; for the candidates whose average value s is greater than the threshold, the root and the affix are then separated based on the value calculated with Equation 5 below.

(Equation 5)
affixVariety(x_t) = (affixVarietyNum(x_t) - affixVarietyMin) / (affixVarietyMax - affixVarietyMin)

Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
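Read this way, Equation 5 is a min-max normalization of the affix-type count; a minimal sketch under that assumption:

```python
# Equation 5 as min-max normalization of the number of distinct affix
# types observed with a root candidate across the document.
def affix_variety(num_affix_types, variety_min, variety_max):
    if variety_max == variety_min:   # degenerate corpus: avoid divide-by-zero
        return 0.0
    return (num_affix_types - variety_min) / (variety_max - variety_min)
```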

In other words, among the root-affix candidates of the word, the candidate whose root combines with the greatest variety of affixes as calculated with Equation 5 is the correct split of 'stopped', so the tokenizing unit 140 outputs the corresponding (root, affix) pair as the final tokenizing result.

Finally, the output unit 150 outputs the final result using the words output in step S230 and the root-affix candidates tokenized in step S250 (S260).

That is, as illustrated in FIG. 3, for the Korean sentence meaning "Suddenly, the shopping mall shopping cart payment has stopped" input in step S210, the output unit 150 outputs, according to the embodiment of the present invention, (suddenly, P), (shopping mall, P), (shopping cart, P), and the (root, affix) pairs for 'payment' and 'stopped'.

As described above, the Korean affix-based tokenizing apparatus according to an embodiment of the present invention tokenizes words at affix boundaries using an affix list and document statistical information, and can thereby preserve the information of words whose meaning would be lost under morpheme-level analysis.

In addition, according to an embodiment of the present invention, since only an affix list and statistical information of the document being analyzed are used, word boundaries can be accurately determined without defining a word dictionary or tokenizing rules for every word.

In addition, embodiments of the present invention can be used for a variety of purposes, such as word sense determination and sentiment analysis, and have a wide range of applications.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and it will be understood by those skilled in the art that various modifications and equivalent other embodiments are possible. Therefore, the true technical protection scope of the present invention shall be defined by the technical spirit of the following claims.

100: tokenizing apparatus 110: input unit
120: separating unit 130: candidate generating unit
140: tokenizing unit 150: output unit

Claims (16)

A tokenizing method using a Korean affix-based tokenizing apparatus, the method comprising:
receiving a Korean sentence to be tokenized;
separating the input Korean sentence into word units based on spacing;
outputting, as they are, words meeting a preset condition among the separated words;
generating, for each word not output, root-affix candidates whose number corresponds to the number of syllables of the word;
tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and
outputting a final result using the output words and the tokenized root-affix candidates,
wherein the tokenizing comprises:
a first step of comparing, with a threshold, the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, using a statistics-based score for each generated root-affix candidate of each word;
a second step of outputting the word as it is when, as a result of the comparison, the average value is less than or equal to the threshold;
a third step of outputting the root-affix candidate of a word having one root-affix candidate among the words whose average value is greater than the threshold; and
a fourth step of outputting, among the root-affix candidates of a word having two or more root-affix candidates among the words whose average value is greater than the threshold, the candidate whose root combines with the greatest variety of affixes.
The method of claim 1,
wherein the outputting of the word comprises outputting the word as it is if the separated word is an adverb, a determiner, or an interjection, or if its last syllable is neither a particle nor an ending.
The method of claim 1,
wherein the generating of the root-affix candidates comprises generating root-affix candidates for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
delete
The method of claim 1,
wherein the tokenizing comprises calculating the statistics-based score based on the following Equation 1:
(Equation 1)
s = (word(x_t) + affix(z_t)) / 2
score = s, if s > threshold and candidate = 1
score = affixVariety(x_t), if s > threshold and candidate ≥ 2

Here, s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold for s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
The method of claim 5,
wherein the tokenizing comprises, when there is one root-affix candidate, calculating the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix using the following Equations 3 and 4, respectively:
(Equation 3)

(Equation 4)

Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
The method of claim 5,
wherein the tokenizing comprises separating the root and the affix based on the value calculated from the following Equation 5 when there are two or more root-affix candidates:
(Equation 5)
affixVariety(x_t) = (affixVarietyNum(x_t) - affixVarietyMin) / (affixVarietyMax - affixVarietyMin)

Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
The method of claim 1,
wherein the outputting of the final result comprises outputting in the form (word, P) when the word is output as it is, and in the form (root, affix) when a root-affix candidate is output.
A Korean affix-based tokenizing apparatus comprising:
an input unit for receiving a Korean sentence to be tokenized;
a separating unit for separating the input Korean sentence into word units based on spacing;
a candidate generating unit for generating, for each separated word that does not meet a preset condition, root-affix candidates whose number corresponds to the number of syllables of the word;
a tokenizing unit for tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and
an output unit for outputting a final result using the words meeting the preset condition among the separated words and the tokenized root-affix candidates,
wherein the tokenizing unit compares, with a threshold, the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, using a statistics-based score for each root-affix candidate of each word; outputs the word as it is when the average value is less than or equal to the threshold; outputs the root-affix candidate of a word having one root-affix candidate among the words whose average value is greater than the threshold; and outputs, among the root-affix candidates of a word having two or more root-affix candidates among the words whose average value is greater than the threshold, the candidate whose root combines with the greatest variety of affixes.
The apparatus of claim 9,
wherein the output unit determines that a separated word meets the preset condition and outputs the word as it is when the word is an adverb, a determiner, or an interjection, or when its last syllable is neither a particle nor an ending.
The apparatus of claim 9,
wherein the candidate generating unit generates root-affix candidates for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
delete
The apparatus of claim 9,
wherein the tokenizing unit calculates the statistics-based score based on the following Equation 1:
(Equation 1)
s = (word(x_t) + affix(z_t)) / 2
score = s, if s > threshold and candidate = 1
score = affixVariety(x_t), if s > threshold and candidate ≥ 2

Here, s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold for s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
The apparatus of claim 13,
wherein the tokenizing unit, when there is one root-affix candidate, calculates the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix using the following Equations 3 and 4, respectively:
(Equation 3)

(Equation 4)

Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
The apparatus of claim 13,
wherein the tokenizing unit separates the root and the affix based on the value calculated from the following Equation 5 when there are two or more root-affix candidates:
(Equation 5)
affixVariety(x_t) = (affixVarietyNum(x_t) - affixVarietyMin) / (affixVarietyMax - affixVarietyMin)

Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
The apparatus of claim 9,
wherein the output unit outputs in the form (word, P) when the word is output as it is, and in the form (root, affix) when a root-affix candidate is output.
KR1020180061920A 2017-11-23 2018-05-30 Apparatus for tokenizing based on korean affix and method thereof KR102042991B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020170157461 2017-11-23
KR20170157461 2017-11-23

Publications (2)

Publication Number Publication Date
KR20190059826A KR20190059826A (en) 2019-05-31
KR102042991B1 (en) 2019-11-11

Family

ID=66657125

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020180061920A KR102042991B1 (en) 2017-11-23 2018-05-30 Apparatus for tokenizing based on korean affix and method thereof

Country Status (1)

Country Link
KR (1) KR102042991B1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102352163B1 (en) 2019-11-26 2022-01-19 고려대학교 산학협력단 Method for diagnosing language proficiency using eeg technology

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100835706B1 (en) * 2007-07-09 2008-06-05 한국과학기술정보연구원 System and method for korean morphological analysis for automatic indexing
KR100876319B1 (en) 2007-08-13 2008-12-31 인하대학교 산학협력단 Apparatus for providing document clustering using re-weighted term
WO2017090051A1 (en) 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100910275B1 (en) * 2007-10-25 2009-08-03 방정민 Method and apparatus for automatic extraction of transliteration pairs in dual language documents
KR20130098081A (en) * 2012-02-27 2013-09-04 한국전자통신연구원 Apparatus and method for korean morphological analysis based self learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100835706B1 (en) * 2007-07-09 2008-06-05 한국과학기술정보연구원 System and method for korean morphological analysis for automatic indexing
KR100876319B1 (en) 2007-08-13 2008-12-31 인하대학교 산학협력단 Apparatus for providing document clustering using re-weighted term
WO2017090051A1 (en) 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
심광섭, "Performance Improvement of a Statistics-Based Korean Morphological Analyzer", Journal of Humanities, Institute for Humanities, Sungshin Women's University, Vol. 34, Feb. 2016, pp. 285-316.

Also Published As

Publication number Publication date
KR20190059826A (en) 2019-05-31


Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant