KR102042991B1 - Apparatus for tokenizing based on korean affix and method thereof - Google Patents
Apparatus for tokenizing based on korean affix and method thereof
- Publication number
- KR102042991B1
- Authority
- KR
- South Korea
- Prior art keywords
- root
- candidate
- word
- macro
- affix
- Prior art date
Classifications
- G06F17/277—
Landscapes
- Machine Translation (AREA)
Abstract
The present invention relates to a Korean affix based tokenizing device and method thereof.
A tokenizing method using the Korean affix-based tokenizing device according to the present invention comprises the steps of: receiving a Korean sentence to be tokenized; separating the input Korean sentence into word form based on a spacing criterion; outputting, among the separated words, a word corresponding to a preset condition; generating root-affix candidates, corresponding in number to the syllables of each word, for the remaining words; tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and outputting a final result using the output word and the tokenized root-affix candidate.
As described above, according to the present invention, by using the affix list and the statistical information of the document, a word is tokenized at its affix boundary, so that information of a word whose meaning would be lost when analyzed in morpheme units can be preserved.
Description
The present invention relates to a Korean affix-based tokenizing method, and more specifically, to a Korean affix-based tokenizing apparatus and method that tokenize a word at its affix boundary using an affix list and document statistical information in order to preserve word information.
Since 2013, word embedding methods and deep learning methods have been proposed, and dense vector representation-based neural networks show excellent results in various natural language processing tasks. This trend can also be seen in the fact that the share of deep learning articles published at major natural language processing conferences such as ACL, EMNLP, EACL, and NAACL has grown by more than 20% compared to 2012.
In Korean natural language processing as well, the use of word embeddings and neural networks is increasing and performance is improving. In order to use a neural network model, preprocessing that converts Korean into a vector representation is essential, and word embedding models such as Skip-gram, CBOW, and GloVe are used.
However, word embedding models such as Skip-gram, CBOW, and GloVe were developed to analyze English data. Korean and English differ in structure and grammar and cannot be analyzed using the same rules. Therefore, there is a problem in that these known word embedding models do not reflect the characteristics of Korean.
Word embedding also differs greatly between Korean and English. The words used by existing word embedding models are tokenized based on spacing; here, tokenizing means dividing a sentence into semantic units. Since English words are separated from each other by spaces, a word tokenized on spacing alone may itself be a minimal unit of meaning.
In Korean, however, function morphemes such as particles (josa) and endings (eomi) are frequently attached to words. Therefore, if such word embedding models are applied to Korean as-is, compound noun or predicate information may be split apart, resulting in loss of information.
The background technology of the present invention is disclosed in Republic of Korea Patent Publication No. 10-2006-0064447 (August 22, 2006).
SUMMARY OF THE INVENTION The present invention has been made in an effort to provide a Korean affix-based tokenizing apparatus and method that tokenize words at affix boundaries using an affix list and document statistical information in order to preserve word information.
According to an embodiment of the present invention for achieving the above technical problem, a method of tokenizing using the Korean affix-based tokenizing apparatus includes: receiving a Korean sentence to be tokenized; separating the input Korean sentence into word form based on a spacing criterion; outputting a word corresponding to a preset condition among the separated words; generating root-affix candidates, corresponding in number to the syllables of each word, for the words that are not output; tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and outputting a final result using the output word and the tokenized root-affix candidate.
The outputting of the word may include outputting the word as it is if the separated word belongs to any one of adverbs, determiners, or interjections, or if the last syllable of the word does not belong to a particle (josa) or an ending (eomi).
The generating of the root-affix candidates may include generating a root-affix candidate for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
In addition, the tokenizing may be performed by comparing the average of the probability that a root candidate is a root and the probability that an affix candidate is an affix with a threshold value, using a statistics-based score for each root-affix candidate of each word.
In addition, the tokenizing may calculate the statistics-based score using the following equation.
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
Further, in the tokenizing, when there is one root-affix candidate, the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix may be calculated separately using the following equations.
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
In addition, in the tokenizing, when there are two or more root-affix candidates, the root and the affix may be separated based on a value calculated from the following equation.
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
The outputting of the final result may be performed in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when the root-affix candidate is output.
In addition, the Korean affix-based tokenizing apparatus according to an embodiment of the present invention includes: an input unit for receiving a Korean sentence to be tokenized; a separation unit for separating the input Korean sentence into word form based on a spacing criterion; a candidate generation unit configured to generate root-affix candidates, corresponding in number to the syllables of each word, for the words that do not correspond to a preset condition among the separated words; a tokenizing unit for tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and an output unit for outputting a final result using a word corresponding to the preset condition among the separated words and the tokenized root-affix candidate.
According to the present invention, by using the affix list and the statistical information of the document, words are tokenized at affix boundaries, preserving the information of words whose meaning would be lost when analyzed in morpheme units.
In addition, according to the present invention, since only the affix list and statistical information of the document to be analyzed are used, the word boundary can be accurately determined without defining a word dictionary and a tokenizing rule for all words.
In addition, the present invention can be used for a variety of purposes, such as word sense determination and sentiment analysis, and thus has a wide range of applications.
1 is a block diagram illustrating a Korean affix based tokenizing apparatus according to an embodiment of the present invention.
2 is a flowchart illustrating an operation flow of a Korean affix based tokenizing method according to an embodiment of the present invention.
3 is an exemplary diagram for explaining a Korean affix based tokenizing method according to an embodiment of the present invention.
FIG. 4 is a diagram for describing root-affix candidates according to tokenizing position in an embodiment of the present invention.
FIG. 5 is a diagram for describing a process of obtaining the affix-variety value in the Korean affix-based tokenizing method according to an embodiment of the present invention.
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In this process, the thickness of the lines or the size of the components shown in the drawings may be exaggerated for clarity and convenience of description.
In addition, terms to be described later are terms defined in consideration of functions in the present invention, which may vary according to a user's or operator's intention or custom. Therefore, the definitions of these terms should be made based on the contents throughout the specification.
Hereinafter, the Korean affix-based tokenizing apparatus according to an embodiment of the present invention will be described in detail with reference to FIG. 1.
1 is a block diagram illustrating a Korean affix based tokenizing apparatus according to an embodiment of the present invention.
According to an embodiment of the present invention, the tokenizing apparatus 100 includes an input unit 110, a separation unit 120, a candidate generation unit 130, a tokenizing unit 140, and an output unit 150.
First, the input unit 110 receives a Korean sentence to be tokenized.
The separation unit 120 separates the input Korean sentence into word form based on a spacing criterion.
The candidate generation unit 130 generates root-affix candidates, corresponding in number to the syllables of each word, for the words that do not correspond to a preset condition among the separated words.
In this case, a word corresponds to the preset condition if it belongs to any one of adverbs, determiners, or interjections, or if its last syllable does not belong to a particle (josa) or an ending (eomi).
In addition, the candidate generation unit 130 generates the candidates by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
In addition, the tokenizing unit 140 tokenizes by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates.
At this time, the tokenizing unit 140 compares the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each root-affix candidate, and outputs the word as it is when the average value is less than or equal to the threshold.
Then, the root-affix candidate of a word having one root-affix candidate whose average value is larger than the threshold is output, and for a word having two or more root-affix candidates whose average value is larger than the threshold, the root-affix candidate whose root combines with the greatest variety of affix types is output.
Finally, the output unit 150 outputs a final result using the words output as-is and the tokenized root-affix candidates.
At this time, the output unit 150 outputs in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when a root-affix candidate is output.
Hereinafter, the Korean affix-based tokenizing method according to an embodiment of the present invention will be described with reference to FIGS. 2 to 5.
FIG. 2 is a flowchart illustrating the operational flow of the Korean affix-based tokenizing method according to an embodiment of the present invention, and FIG. 3 is an exemplary diagram for describing the method. With reference to these figures, the specific operation of the present invention will be described.
According to the exemplary embodiment of the present invention, first, the input unit 110 receives a Korean sentence to be tokenized (S210).
In the embodiment of the present invention, as illustrated in FIG. 3, the tokenizing method will be described using the sentence "Suddenly, the shopping mall shopping cart has stopped" as an example.
When the sentence "Suddenly, the shopping mall shopping cart has stopped" is input in step S210, the separation unit 120 separates the input sentence into word form based on the spacing criterion (S220).
That is, the separation unit 120 separates the sentence into the words 'suddenly', 'shopping mall', 'shopping cart', 'payment', and 'stopped'.
Next, among the separated words, a word corresponding to the preset condition is output as it is (S230).
That is, if a word separated in step S220 belongs to any one of adverbs, determiners, or interjections, or if the last syllable of the word does not belong to a particle (josa) or an ending (eomi), the word is output as it is.
In FIG. 3, for example, the last syllable of 'suddenly' belongs to the particle list, but since 'suddenly' is an adverb, the word is output without being separated. 'Shopping mall' does not belong to any of adverbs, determiners, or interjections, but the last syllable of the word, 'mall', does not belong to the particle or ending lists, so this word is also output without being separated.
However, since the last syllables of 'shopping cart', 'payment', and 'stopped' belong to a particle or an ending, these words are not output in step S230.
When a word is output as it is, it is output in the form of (word, P), that is, (suddenly, P) and (shopping mall, P). Here, P is a padding token used to fill the affix position when a word without an affix, such as an adverb or a proper noun, is expressed in the (root, affix) structure.
Therefore, (suddenly, P) and (shopping mall, P) are respectively output in step S230.
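The as-is output rule of step S230 can be sketched as follows. The category sets and word strings below are hypothetical stand-ins for the patent's adverb/determiner/interjection and particle/ending (josa/eomi) lists; they are not the actual Korean vocabulary used by the invention.

```python
# Hypothetical stand-ins for the patent's POS and particle/ending lists.
ADVERBS = {"suddenly"}             # words passed through regardless of last syllable
PARTICLE_OR_ENDING = {"ga", "da"}  # last syllables that trigger root-affix separation

def passes_preset_condition(word: str, last_syllable: str) -> bool:
    """True if the word should be output unchanged (no root-affix split)."""
    return word in ADVERBS or last_syllable not in PARTICLE_OR_ENDING

def output_as_is(word: str) -> tuple:
    # "P" is the padding token filling the empty affix slot of (root, affix).
    return (word, "P")
```

Under these toy lists, 'suddenly' passes because it is an adverb, while a word ending in a particle syllable would fall through to candidate generation.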
Subsequently, the candidate generation unit 130 generates root-affix candidates, corresponding in number to the syllables of each word, for the words not output in step S230 (S240).
In more detail, for 'shopping cart', 'payment', and 'stopped', which were not output in step S230, a root-affix candidate set corresponding to the number of syllables is generated for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
FIG. 4 is a diagram for describing root-affix candidates according to tokenizing position in an embodiment of the present invention.
For example, taking the three-syllable word 'payment value' in FIG. 4, when the number (length) of syllables constituting a word is defined as T and the tokenizing position as t, T is 3, and the candidate generation unit 130 generates three root-affix candidates for the word.
Likewise, four root-affix candidates are generated for 'shopping cart' and five for 'stopped', corresponding to their respective numbers of syllables.
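The boundary-shifting candidate generation can be sketched as below, with each character standing in for one Korean syllable. Whether the unsplit word itself counts as a candidate is an assumption made here so that a T-syllable word yields exactly T candidates, matching the counts quoted above (3, 4, and 5).

```python
def root_affix_candidates(word: str):
    """Return (root, affix) splits, shifting the boundary from the last
    syllable toward the first; each character stands in for a syllable.

    A word of T syllables yields T candidates, including the whole-word
    split with an empty affix (an assumption, see the lead-in).
    """
    return [(word[:t], word[t:]) for t in range(len(word), 0, -1)]
```

For a hypothetical three-syllable word "abc" this produces ("abc", ""), ("ab", "c"), and ("a", "bc").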
Next, the tokenizing unit 140 tokenizes by applying a statistics-based root-affix separation algorithm to each of the root-affix candidates generated in step S240 (S250).
The statistics-based root-affix separation algorithm according to an embodiment of the present invention is performed in the following steps.
First, the tokenizing unit 140 compares the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each root-affix candidate of each word.
At this time, the statistics-based score is calculated using Equation 1 below, which compares the average value s = (word(x_t) + affix(z_t)) / 2 with the threshold.
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
That is, if the average value s is greater than the threshold and there is one root-affix candidate, s itself becomes the statistics-based score; if s is greater than the threshold and there are two or more root-affix candidates, the affix-variety value of Equation 5 becomes the statistics-based score. On the other hand, when the average value s is very small, the split is regarded as an incorrect tokenization, and a word whose score is at or below the threshold is output as it is without root-affix separation.
In addition, the average value s in the embodiment of the present invention is based on an association (character n-gram) probability method, and the association probability can be expressed as Equation 2 below.
That is, the association probability is obtained as the conditional probability that letters appear in succession, and when the association value is large, the letters are judged to form one contiguous word.
Where t is the tokenizing position and k is the syllable length.
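The image of Equation 2 is not reproduced in this text, so the sketch below encodes one common reading of a character n-gram association probability: the conditional probability of the last syllable given the preceding syllables, estimated from substring counts. The corpus, words, and the `association_probability` helper are hypothetical, with each character standing in for a syllable.

```python
def association_probability(corpus_words, ngram: str) -> float:
    """P(last syllable of `ngram` | its preceding syllables), estimated
    from substring frequencies over a toy corpus of words."""
    def count(sub: str) -> int:
        # Non-overlapping substring occurrences summed over all words.
        return sum(w.count(sub) for w in corpus_words)
    prefix = ngram[:-1]
    denom = count(prefix)
    return count(ngram) / denom if denom else 0.0
```

A high value means the syllables co-occur almost whenever the prefix appears, suggesting they belong to one contiguous word.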
However, Equation 2 cannot be used when the syllable length is 1. According to the embodiment of the present invention, Equation 3 below is used to calculate the association probability when the syllable length is 1, and the case of t = T - 1 was added in Equation 4.
Hereinafter, the calculation of the probabilities using Equations 2 to 4 will be described.
First, the tokenizing unit 140 calculates the probability word(x_t) that the root candidate is a root using the following equation.
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
In addition, the tokenizing unit 140 calculates the probability affix(z_t) that the affix candidate is an affix using the following equation.
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
As a result of the comparison, if the average value s of the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix is equal to or less than the threshold, the word is output as it is.
For example, assuming that the threshold is 0.15, a split whose average value s of word(x_t) and affix(z_t) is 0.15 or less is defined as an incorrect tokenization, and the word is output as it is without being separated into a root and an affix.
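The thresholding step above can be sketched as follows. The probability values and the `statistics_score` helper name are hypothetical; the 0.15 threshold is taken from the example in the text.

```python
def statistics_score(word_prob: float, affix_prob: float, threshold: float = 0.15):
    """Average the root and affix probabilities and decide whether the
    candidate split is accepted (s must exceed the threshold)."""
    s = (word_prob + affix_prob) / 2.0
    return s, s > threshold
```

A split with probabilities (0.3, 0.2) averages to 0.25 and is accepted; one with (0.1, 0.1) averages to 0.1 and the word is kept whole.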
That is, the tokenizing unit 140 outputs a word whose score is at or below the threshold as it is, without separating it into a root and an affix.
The tokenizing unit 140 then outputs the root-affix candidate of a word having one root-affix candidate whose average value is greater than the threshold.
As a result of outputting the root-affix candidate satisfying this condition, (payment, value) is output as the tokenization result of 'payment value'.
Finally, for a word having two or more root-affix candidates whose average value is greater than the threshold, the tokenizing unit 140 outputs the root-affix candidate whose root combines with the greatest variety of affix types.
FIG. 5 is a diagram for describing the process of obtaining the affix-variety value in the Korean affix-based tokenizing method according to an embodiment of the present invention.
For example, taking 'stopped' in FIG. 5, the statistics-based score is calculated using Equation 5 below.
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
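The variable definitions above suggest reading Equation 5, whose image is not reproduced, as min-max normalization of the affix-type counts; that reading is an assumption, and the counts below are hypothetical.

```python
def affix_variety(num: int, variety_min: int, variety_max: int) -> float:
    """Min-max normalize a root candidate's affix-type count into [0, 1].

    num         -- affixVarietyNum(x_t): affix types combining with this root
    variety_min -- affixVarietyMin: smallest count over all roots
    variety_max -- affixVarietyMax: largest count over all roots
    """
    if variety_max == variety_min:
        return 0.0  # degenerate corpus: no variety spread to normalize
    return (num - variety_min) / (variety_max - variety_min)
```

The candidate whose root combines with the most affix types scores 1.0 and is therefore selected.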
In other words, among the root-affix candidates of 'stopped', the candidate whose root combines with the greatest variety of affix types, as computed with Equation 5, is selected, and the tokenizing unit 140 outputs that root-affix candidate as the tokenization result.
Finally, the output unit 150 outputs a final result using the words output as-is and the tokenized root-affix candidates (S260).
That is, the output unit 150 outputs (suddenly, P) and (shopping mall, P) in the (word, P) form, together with the separated (root, affix) pairs, as the final result.
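Under the same toy conventions as the sketches above (characters for syllables, hypothetical particle/ending set and probability table), the overall S210–S260 flow can be sketched end to end; the affix-variety tie-break of Equation 5 is omitted for brevity.

```python
# Hypothetical stand-ins: a particle/ending syllable set and a table of
# (root prob, affix prob) pairs that would come from corpus statistics.
PARTICLE_OR_ENDING = {"x"}
SPLIT_PROBS = {("paymen", "x"): (0.4, 0.3)}

def tokenize(sentence: str, threshold: float = 0.15):
    result = []
    for word in sentence.split():              # S220: split on spacing
        if word[-1] not in PARTICLE_OR_ENDING:
            result.append((word, "P"))         # S230: output as-is with padding
            continue
        best, best_s = None, threshold         # S240/S250: score candidate splits
        for t in range(1, len(word)):
            root, affix = word[:t], word[t:]
            wp, ap = SPLIT_PROBS.get((root, affix), (0.0, 0.0))
            s = (wp + ap) / 2.0
            if s > best_s:                     # keep best split exceeding threshold
                best, best_s = (root, affix), s
        result.append(best if best else (word, "P"))
    return result                              # S260: final (word/root, affix) list
```

A word ending in a non-particle syllable passes through as (word, P); a word whose best split scores above the threshold is emitted as (root, affix).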
As described above, the Korean affix-based tokenizing apparatus according to an embodiment of the present invention tokenizes words at affix boundaries using the affix list and document statistical information, and can thereby preserve the information of words whose meaning would be lost when analyzed in morpheme units.
In addition, according to an embodiment of the present invention, since only the affix list and statistical information of the document to be analyzed are used, the word boundary can be accurately determined without defining a word dictionary and a tokenizing rule for all words.
In addition, according to an embodiment of the present invention, the method can be used for a variety of purposes, such as word sense determination and sentiment analysis, and thus has a wide range of applications.
Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and it will be understood by those skilled in the art that various modifications and equivalent other embodiments are possible. Therefore, the true technical protection scope of the present invention shall be defined by the technical spirit of the following claims.
100: tokenizing device 110: input unit
120: separation unit 130: candidate generation unit
140: tokenizing unit 150: output unit
Claims (16)
Receiving a Korean sentence to be tokenized;
Separating the input Korean sentence into word form based on a spacing criterion;
Outputting a word corresponding to a preset condition among the separated words;
Generating root-affix candidates, corresponding in number to the syllables of each word, for the words that are not output;
Tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and
Outputting a final result using the output word and the tokenized root-affix candidate;
The tokenizing step,
A first step of comparing an average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each generated root-affix candidate of each word;
A second step of outputting the word as it is when the average value is less than or equal to the threshold as a result of the comparison;
A third step of outputting the root-affix candidate of a word having one root-affix candidate among the words whose average value is larger than the threshold; and
A fourth step of outputting the root-affix candidate whose root combines with the greatest variety of affix types, among the root-affix candidates of a word having two or more root-affix candidates whose average value is larger than the threshold.
The step of outputting the word,
And outputting the word as it is if the separated word belongs to any one of adverbs, determiners, or interjections, or if its last syllable does not belong to a particle (josa) or an ending (eomi).
Generating the root-affix candidates,
A tokenizing method that generates a root-affix candidate for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
The tokenizing step,
A tokenizing method for calculating the statistics-based score using the following equation:
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
The tokenizing step,
When there is one root-affix candidate, a tokenizing method for calculating the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix using the following equations, respectively:
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
The tokenizing step,
A tokenizing method for separating the root and the affix based on the value calculated from the following equation when there are two or more root-affix candidates:
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
Outputting the final result,
And outputting in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when the root-affix candidate is output.
A separation unit for separating the input Korean sentence into word form based on a spacing criterion;
A candidate generation unit configured to generate root-affix candidates, corresponding in number to the syllables of each word, for the words that do not correspond to a preset condition among the separated words;
A tokenizing unit for tokenizing by applying a statistics-based root-affix separation algorithm to each of the generated root-affix candidates; and
An output unit configured to output a final result using a word corresponding to the preset condition among the separated words and the tokenized root-affix candidate;
The tokenizing unit,
A Korean affix-based tokenizing device that compares the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix with a threshold, using a statistics-based score for each root-affix candidate of each word; outputs the word as it is when the average value is less than or equal to the threshold; outputs the root-affix candidate of a word having one root-affix candidate among the words whose average value is larger than the threshold; and outputs, for a word having two or more root-affix candidates among the words whose average value is larger than the threshold, the root-affix candidate whose root combines with the greatest variety of affix types.
The output unit,
A Korean affix-based tokenizing device that determines that a word corresponds to the preset condition and outputs it as is when the separated word belongs to any one of adverbs, determiners, or interjections, or when its last syllable does not belong to a particle (josa) or an ending (eomi).
The candidate generation unit,
A Korean affix-based tokenizing device that generates a root-affix candidate for each word by shifting the split boundary one syllable at a time from the last syllable toward the first syllable.
The tokenizing unit,
A Korean affix-based tokenizing device for calculating the statistics-based score using the following equation:
Where s is the average of the probability that the root candidate is a root and the probability that the affix candidate is an affix, threshold is the threshold of s, candidate is the number of root-affix candidates, word(x_t) is the probability that the root candidate is a root, and affix(z_t) is the probability that the affix candidate is an affix.
The tokenizing unit,
When there is one root-affix candidate, a Korean affix-based tokenizing device for calculating the probability word(x_t) that the root candidate is a root and the probability affix(z_t) that the affix candidate is an affix using the following equations:
Here, W is the total number of words, t is the tokenizing position, and k is the syllable length of the root or affix candidate, respectively.
The tokenizing unit,
A Korean affix-based tokenizing device that separates the root and the affix based on the value calculated from the following equation when there are two or more root-affix candidates:
Here, affixVariety(x_t) is the number of affix types combining with the root candidate x_t, normalized to a value in [0, 1]; affixVarietyNum(x_t) is the number of affix types combining with the root candidate; affixVarietyMin is the smallest number of affix types combining with any root; and affixVarietyMax is the largest number of affix types combining with any root.
The output unit,
A Korean affix-based tokenizing device that outputs in the form of (word, P) when the word is output as it is, and in the form of (root, affix) when the root-affix candidate is output.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170157461 | 2017-11-23 | ||
KR20170157461 | 2017-11-23 |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20190059826A KR20190059826A (en) | 2019-05-31 |
KR102042991B1 true KR102042991B1 (en) | 2019-11-11 |
Family
ID=66657125
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020180061920A KR102042991B1 (en) | 2017-11-23 | 2018-05-30 | Apparatus for tokenizing based on korean affix and method thereof |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR102042991B1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102352163B1 (en) | 2019-11-26 | 2022-01-19 | 고려대학교 산학협력단 | Method for diagnosing language proficiency using eeg technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100835706B1 (en) * | 2007-07-09 | 2008-06-05 | 한국과학기술정보연구원 | System and method for korean morphological analysis for automatic indexing |
KR100876319B1 (en) | 2007-08-13 | 2008-12-31 | 인하대학교 산학협력단 | Apparatus for providing document clustering using re-weighted term |
WO2017090051A1 (en) | 2015-11-27 | 2017-06-01 | Giridhari Devanathan | A method for text classification and feature selection using class vectors and the system thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100910275B1 (en) * | 2007-10-25 | 2009-08-03 | 방정민 | Method and apparatus for automatic extraction of transliteration pairs in dual language documents |
KR20130098081A (en) * | 2012-02-27 | 2013-09-04 | 한국전자통신연구원 | Apparatus and method for korean morphological analysis based self learning |
-
2018
- 2018-05-30 KR KR1020180061920A patent/KR102042991B1/en active IP Right Grant
Non-Patent Citations (1)
Title |
---|
Shim, Kwangseob. 'Performance Improvement of a Statistics-Based Korean Morphological Analyzer'. The Journal of Humanities Studies, Institute of Humanities, Sungshin Women's University, Vol. 34, Feb. 2016, pp. 285-316. |
Also Published As
Publication number | Publication date |
---|---|
KR20190059826A (en) | 2019-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222349B (en) | Method and computer for deep dynamic context word expression | |
CN107291693B (en) | Semantic calculation method for improved word vector model | |
CN107480143B (en) | Method and system for segmenting conversation topics based on context correlation | |
Boudin et al. | Keyphrase extraction for n-best reranking in multi-sentence compression | |
CN108446271B (en) | Text emotion analysis method of convolutional neural network based on Chinese character component characteristics | |
Levitan et al. | Automatic identification of gender from speech | |
KR101715118B1 (en) | Deep Learning Encoding Device and Method for Sentiment Classification of Document | |
CN109325229B (en) | Method for calculating text similarity by utilizing semantic information | |
JP2004355483A (en) | Morpheme analysis device, morpheme analysis method and morpheme analysis program | |
Yildiz et al. | A morphology-aware network for morphological disambiguation | |
CN106844348B (en) | Method for analyzing functional components of Chinese sentences | |
Tan et al. | phi-LSTM: a phrase-based hierarchical LSTM model for image captioning | |
Silfverberg et al. | FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish | |
Chordia | PunKtuator: A multilingual punctuation restoration system for spoken and written text | |
Lee, Dongjun | Morpheme-based efficient Korean word embedding | |
Yuwana et al. | On part of speech tagger for Indonesian language | |
CN113065350A (en) | Biomedical text word sense disambiguation method based on attention neural network | |
KR102042991B1 (en) | Apparatus for tokenizing based on korean affix and method thereof | |
Nambiar et al. | Attention based abstractive summarization of malayalam document | |
Ananth et al. | Grammatical tagging for the Kannada text documents using hybrid bidirectional long-short term memory model | |
Sakti et al. | Incremental sentence compression using LSTM recurrent networks | |
CN107729509A (en) | The chapter similarity decision method represented based on recessive higher-dimension distributed nature | |
Saini et al. | Disfluency correction using unsupervised and semi-supervised learning | |
Nambiar et al. | Abstractive summarization of Malayalam document using sequence to sequence model | |
JP6586055B2 (en) | Deep case analysis device, deep case learning device, deep case estimation device, method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |