KR101630436B1 - Method for extracting independent feature of language - Google Patents

Method for extracting independent feature of language

Info

Publication number
KR101630436B1
KR101630436B1 KR1020150046082A KR20150046082A
Authority
KR
South Korea
Prior art keywords
character
language
text
candidate
subdividing
Prior art date
Application number
KR1020150046082A
Other languages
Korean (ko)
Inventor
최호진
정영섭
Original Assignee
한국과학기술원 (Korea Advanced Institute of Science and Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국과학기술원 (Korea Advanced Institute of Science and Technology)
Priority to KR1020150046082A priority Critical patent/KR101630436B1/en
Application granted granted Critical
Publication of KR101630436B1 publication Critical patent/KR101630436B1/en

Classifications

    • G06F17/277
    • G06F17/271
    • G06F17/2745

Abstract

According to an embodiment of the present invention, a method for extracting language-independent features includes: a text subdividing step of subdividing a text into character units to generate candidate characters; a candidate character set generation step of generating a candidate character set including the candidate characters; a target text subdividing step of subdividing a target text based on the candidate character set to generate consecutive characters; and a language-independent feature extraction step of extracting language-independent features from the consecutive characters. Accordingly, the present invention can extract useful features when analyzing a document in any language and enables a well-performing document analysis model to be learned.

Description

{METHOD FOR EXTRACTING INDEPENDENT FEATURE OF LANGUAGE}

The present invention relates to a method for extracting language-independent features.

Typical techniques for extracting features from a document include Part-Of-Speech (POS) tagging and dependency parsing. Such feature extraction techniques are widely used in research and industry for document analysis. These features are very useful for analyzing documents, but the techniques that extract them automatically are language-dependent, so the range of values the features can take varies with the language.

These techniques rely heavily on external information and resources, or on labeled data, to extract features from a document. For example, to make a technique developed for Korean applicable to English, a vocabulary list constructed for Korean must be matched against a vocabulary list constructed for English, and to learn a model for extracting POS tags, data labeled with POS tags must be used.

However, existing techniques for extracting POS tags from Korean documents are not immediately applicable to other languages, and existing techniques for extracting POS tags from other languages are not immediately applicable to Korean.

It is therefore necessary to study techniques that are not tied to language-dependent information and data, labeled data, or any other language-dependent condition.

The present invention provides a method for extracting language-independent features that can be applied regardless of the language of the document.

A method for extracting language-independent features according to an embodiment of the present invention includes: a text subdividing step of generating candidate characters by dividing text into character units; a candidate character set generation step of generating a candidate character set including the candidate characters; a target text subdividing step of subdividing a target text based on the candidate character set to generate consecutive characters; and a language-independent feature extraction step of extracting language-independent features from the consecutive characters.

According to the present invention, the method extracts language-independent features from documents in various languages, so that useful features can be extracted when analyzing any document regardless of its language, and a well-performing document analysis model can be learned.

FIG. 1 is a flow chart illustrating a method for extracting language-independent features according to an embodiment of the present invention.
FIG. 2 is a detailed flow chart of subdividing text according to an embodiment of the present invention.
FIG. 3 and FIG. 4 are diagrams for explaining a method of subdividing text based on division conditions when English text is input, according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating a candidate character set according to an embodiment of the present invention.
FIG. 6 and FIG. 7 are diagrams for explaining a method of subdividing text based on division conditions when Korean text is input, according to an embodiment of the present invention.
FIG. 8 is an exemplary diagram illustrating a candidate character set according to an embodiment of the present invention.
FIG. 9 and FIG. 10 are diagrams showing results of extracting language-independent features from a subdivided target text according to an embodiment of the present invention.

The following detailed description of the invention refers to the accompanying drawings, which illustrate, by way of example, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the present invention are different from one another but need not be mutually exclusive. For example, certain features, structures, and characteristics described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It is also to be understood that the position or arrangement of individual components within each disclosed embodiment may be varied without departing from the spirit and scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is limited only by the appended claims, appropriately interpreted, along with the full scope of equivalents to which such claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several views.

Hereinafter, a method for extracting language-independent features according to an embodiment of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a flow chart illustrating a method for extracting language-independent features according to an embodiment of the present invention.

Referring to FIG. 1, the method for extracting language-independent features comprises subdividing text into character units upon text input (S100), generating a candidate character set using the subdivided text (S110), subdividing a target text based on the candidate character set (S120), and extracting language-independent features from the subdivided target text (S130).

In the step of subdividing the text into character units upon text input, when text is input, character tokens are created by dividing the text, a trie is generated for the character tokens, and the character tokens can be divided into candidate characters based on the trie. Here, the text may be an unlabeled document, that is, text in which POS tags are not marked, and the division condition may be word spacing, a blank space, or the like. For example, the Korean text rendered as "Young-hee likes Chul-soo" can be divided on spacing into units such as "Young-hee", "Chul-soo", and "likes". The operation of dividing the text on spacing can be performed selectively for languages in which spacing exists and may be omitted for languages without spaces.

The step of generating a candidate character set using the subdivided text may generate a candidate character set including the generated candidate character.

In the step of subdividing the target text based on the candidate character set, when a target text from which language-independent features are to be extracted is input, the candidate characters included in the candidate character set are compared with the target text, and the target text is divided by the candidate characters to generate one or more consecutive characters.
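The patent text does not pin down how the comparison and division are performed. As one possible illustration (not the patented implementation), a greedy longest-match pass over the target text against the candidate character set could look like the following sketch, where the function name and the longest-match strategy are assumptions:

```python
# Illustrative sketch only: segment a target text into "consecutive characters"
# by greedily matching the longest candidate character at each position.
# The longest-match strategy is an assumption; the text only states that
# candidate characters are compared with the target text to divide it.

def segment_target_text(target_text: str, candidate_set: set) -> list:
    longest = max((len(c) for c in candidate_set), default=1)
    result, i = [], 0
    while i < len(target_text):
        for length in range(min(longest, len(target_text) - i), 0, -1):
            piece = target_text[i:i + length]
            if piece in candidate_set:          # longest matching candidate wins
                result.append(piece)
                i += length
                break
        else:
            result.append(target_text[i])       # no candidate: keep the raw character
            i += 1
    return result
```

Characters that match no candidate are kept as single-character pieces, so the generated consecutive characters always cover the entire target text.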

The step of extracting language-independent features from the subdivided target text may extract classes and topics, which correspond to the language-independent features, from the generated consecutive characters.

FIG. 2 is a detailed flow chart of subdividing text according to an embodiment of the present invention.

Referring to FIG. 2, the step of subdividing the text includes dividing the input text into character tokens (S200), creating a trie for the character tokens (S210), dividing the character tokens based on the generated trie (S220), and generating candidate characters according to the division result (S230).

The step of dividing the input text into character tokens can divide the input text into one or more character tokens based on division conditions such as word spacing or blank spaces. Here, a character token is the minimum unit of characters that can be obtained by dividing on the division condition. For example, dividing the text "I go home" on spacing yields the character tokens "I", "go", and "home".
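As a minimal sketch of this tokenizing step (assuming whitespace as the only division condition; other conditions would simply extend the pattern), one might write:

```python
import re

# Minimal sketch: divide input text into character tokens on a division
# condition. Only whitespace is handled here; this is an assumption, not the
# full set of division conditions contemplated by the patent.
def tokenize(text: str, division_pattern: str = r"\s+") -> list:
    return [token for token in re.split(division_pattern, text.strip()) if token]

print(tokenize("I go home"))   # ['I', 'go', 'home']
```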

In the step of creating a trie for the character tokens, a trie consisting of one or more character nodes may be generated. Here, every node of the trie except the root contains exactly one character, and a node containing a specific character can indicate, according to its position and depth in the trie, the frequency with which that character appears in the text.

Also, the trie may include a forward trie generated by inserting the characters of each token in left-to-right order, and a backward trie generated by inserting them in right-to-left order. At this time, punctuation marks other than spaces (a period, a question mark, and the like) are separated out and included in their own nodes.
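A rough sketch of the two tries, with a per-node count of how many tokens pass through each node, could look like the following; the node layout and counting scheme are assumptions made for illustration, and punctuation handling is omitted for brevity:

```python
# Illustrative sketch: forward and backward tries over character tokens, with
# each node counting how many tokens pass through it.

class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.count = 0       # number of tokens passing through this node

def build_trie(tokens, reverse=False):
    root = TrieNode()
    for token in tokens:
        node = root
        for ch in (reversed(token) if reverse else token):
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root

tokens = ["I", "go", "to", "go", "home"]
forward_trie = build_trie(tokens)                 # left-to-right insertion
backward_trie = build_trie(tokens, reverse=True)  # right-to-left insertion
```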

The step of dividing the character tokens based on the created trie can split a token between nodes whose frequency difference is equal to or greater than a preset threshold value. For example, for a character token consisting of the name "Chul-soo" followed by another character, if the frequency difference between the node for "Chul-soo" and the node for the following character is greater than the threshold value, the token can be split between them. Here, the preset threshold value means a reference frequency difference set in advance for deciding whether to split between characters.

In the step of generating candidate characters according to the division result, the character tokens generated by the division condition and the pieces obtained by dividing those tokens with the trie can be set as candidate characters. For example, the character token "I" and the divided pieces "g" and "o" can be set as candidate characters. A candidate character is a character that appears repeatedly in the input text, and one or more candidate characters are collected into a candidate character set that is used to extract the language-independent features. Here, a character token can be divided by the trie as far as possible.
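As a simplified illustration of the splitting and candidate-set steps, the sketch below uses prefix frequencies in place of forward-trie node counts and handles only the forward direction; the backward trie would be treated symmetrically with suffix frequencies. The function names and the exact split rule are assumptions:

```python
from collections import Counter

# Simplified sketch: prefix counts stand in for forward-trie node frequencies.
def prefix_counts(tokens):
    counts = Counter()
    for token in tokens:
        for i in range(1, len(token) + 1):
            counts[token[:i]] += 1
    return counts

# Split a token wherever the frequency drops by at least `threshold` from one
# prefix (node) to the next.
def split_token(token, counts, threshold):
    pieces, start = [], 0
    for i in range(2, len(token) + 1):
        if counts[token[:i - 1]] - counts[token[:i]] >= threshold:
            pieces.append(token[start:i - 1])
            start = i - 1
    pieces.append(token[start:])
    return pieces

# Candidate character set: whole tokens plus the pieces obtained by splitting.
def candidate_character_set(tokens, threshold=1):
    counts = prefix_counts(tokens)
    candidates = set(tokens)
    for token in tokens:
        candidates.update(split_token(token, counts, threshold))
    return candidates

print(candidate_character_set(["go", "goes", "going", "to"], threshold=2))
# e.g. {'go', 'goes', 'going', 'to', 'es', 'ing'}
```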

FIG. 3 and FIG. 4 are diagrams for explaining a method of subdividing text based on division conditions when English text is input, according to an embodiment of the present invention, and FIG. 5 is an exemplary diagram illustrating a candidate character set according to an embodiment of the present invention.

Referring to FIG. 3, when English text 300 is input, it is divided on spacing into character tokens 310 such as "I", "go", "to", and "home ...". Two tries, a forward trie 320 and a backward trie 330, can then be generated for the generated character tokens.

Referring to FIG. 4, each of the two generated tries may be composed of one or more nodes for "I", "go", "to", "home ...", and so on. The frequency with which each character repeatedly appears in the text is counted for each node, and a split can be made between nodes whose frequency difference is equal to or greater than the threshold value. For example, if the frequency differences between "g" and "o", between "t" and "o", and between "hom" and "e ..." are equal to or greater than the threshold value, splits can be made between them. By setting the character tokens generated by the division condition and the pieces divided by the two tries as candidate characters, the candidate character set 400 shown in FIG. 5 can be generated.

FIGS. 6 and 7 are diagrams for explaining a method of subdividing text based on division conditions when Korean text is input, according to an embodiment of the present invention, and FIG. 8 is an exemplary diagram illustrating a candidate character set according to an embodiment of the present invention.

Referring to FIG. 6, when three Korean texts 500, such as the sentence rendered as "Chul-soo likes Young-hee", are input, each text is divided on spacing into character tokens 510 such as "Chul-soo", "Young-hee", and "likes". A forward trie 520 and a backward trie 530 for the generated character tokens may then be generated.

Referring to FIG. 7, each of the two generated tries may be composed of one or more nodes for the character tokens obtained from the three texts. The frequency with which each character repeatedly appears in the three texts is counted for each node, and a split can be made between nodes whose frequency difference is equal to or greater than the threshold value.

For example, if the threshold for splitting in the forward trie 520 is set to 1, splits are made between adjacent nodes whose frequency difference is 1 or greater, such as between the node for a frequently occurring stem (for instance, the token rendered as "Chul-soo" or "likes") and the nodes for the less frequent characters that follow it.

Similarly, in the case of the backward trie 530, splits can be made between node pairs whose frequency difference is equal to or greater than the threshold value, for example between a stem and the sentence-final characters such as "." that follow it. The character tokens generated by the division condition and the pieces divided by the two tries are then set as candidate characters to generate the candidate character set 600 shown in FIG. 8.

FIGS. 9 and 10 are diagrams showing results of extracting language-independent features from a subdivided target text according to an embodiment of the present invention.

If a target text from which language-independent features are to be extracted is input, the target text is divided into candidate characters using the candidate character set to generate one or more consecutive characters, and the generated consecutive characters are analyzed and classified into classes and topics. Here, the classes consist of a plurality of fixed classes for classifying consecutive characters representing sentence ends (a period, a question mark, and the like), spaces, numbers, and special characters, together with at least one additional class for classifying the remaining consecutive characters. The number of classes and the number of topics can be specified and set in advance. A topic is a criterion for classifying consecutive characters into related categories within a class.

To classify the target text into classes and topics, the text can be analyzed in units of consecutive characters using an analysis model, and features can be extracted in character units. The analysis model assigns fixed classes to specific consecutive characters; five fixed classes can be set, namely a sentence-end character (a character such as a period indicating the end of a sentence), a space, a number, a special character, and a character of another language. In addition to the five fixed classes, at least one class is set up to represent topics, and the model basically receives the number of classes and the number of topics as parameters. Here, the number of classes is at least six, and the number of topics can be one or more. Besides these two parameters, there are parameters for the initial values required to run the model, but these have the same structure as in existing models and can basically be set by initializing them uniformly to arbitrary values.
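The learned part of the model is not specified in detail. Purely for illustration, the deterministic assignment of the five fixed classes to a consecutive character might be sketched as follows; the class identifiers, the sentence-end character list, and the use of Unicode character names to detect "other language" characters are all assumptions:

```python
import unicodedata

# Sketch of the five fixed classes only; consecutive characters that fall into
# none of them would be assigned to one of the learned classes/topics by the
# analysis model, which is not reproduced here.
FIXED_CLASSES = {"sentence_end": 0, "space": 1, "number": 2,
                 "special": 3, "other_language": 4}

def fixed_class(piece: str, main_script: str = "HANGUL"):
    if piece in {".", "?", "!"}:
        return FIXED_CLASSES["sentence_end"]
    if piece.isspace():
        return FIXED_CLASSES["space"]
    if piece.isdigit():
        return FIXED_CLASSES["number"]
    if all(not ch.isalnum() for ch in piece):
        return FIXED_CLASSES["special"]
    letters = [ch for ch in piece if ch.isalpha()]
    if letters and all(main_script not in unicodedata.name(ch, "") for ch in letters):
        return FIXED_CLASSES["other_language"]
    return None   # not a fixed class: left to the learned classes and topics

print(fixed_class("."), fixed_class(" "), fixed_class("2015"), fixed_class("We"))
# 0 1 2 4   (with Hangul assumed as the document's main script)
```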

The analysis model can be trained on the order in which consecutive characters appear and on their frequency patterns, so that consecutive characters can be classified into a plurality of classes. For example, if characters such as "to" frequently appear after the word "school", such characters can be modeled as belonging to the same functional class. Likewise, if similar characters frequently follow both "school" and "house", then "school" and "house" can be modeled as belonging to the same functional class.

In addition, the analysis model can group characters that frequently appear within a class into one topic, on the assumption that characters belonging to the same topic appear together frequently. For example, characters such as "deposit", "withdrawal", and "money" can be grouped into a "bank" topic because they frequently appear together with the word "bank".

Thus, the analysis model outputs a class value and a topic value for each consecutive character and extracts features in character units, so it is language-independent and applicable to documents in any language. In addition, since it can be trained simply on a large amount of plain text rather than labeled data, it is convenient to use.

For example, if the Korean text corresponding to "I arrived at lunchtime." is input, it is divided into consecutive characters rendered here as "lunch", "when", "to", "arrived", "ji", "yo", and ".", and each divided consecutive character can be classified into a class and a topic and output as shown in FIG. 9. Here, each grouped consecutive character may be given numbers corresponding to its class and topic. Since the classes and topics are extracted in units of consecutive characters, the method of extracting language-independent features according to the present invention is not tied to any particular language and can be applied to documents in any language.

When the English text "We arrived at noon." is input, it is divided into consecutive characters such as "We", "arriv", "ed", "at", "noon", and ".", and each divided consecutive character can be classified into a class and a topic and output as shown in FIG. 10.
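For instance, using the `segment_target_text` sketch shown earlier with a hand-written candidate set (in the actual method the set is learned from unlabeled text), the English example would segment roughly as follows:

```python
# Toy usage of the segment_target_text sketch above; the candidate set here is
# written by hand purely to mirror the example, not learned as in the patent.
candidate_set = {"We", "arriv", "ed", "at", "noon", "."}
print(segment_target_text("We arrived at noon.", candidate_set))
# ['We', ' ', 'arriv', 'ed', ' ', 'at', ' ', 'noon', '.']
```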

As described above, the present invention extracts language-independent features from documents in various languages, so that useful features can be extracted regardless of language and a well-performing document analysis model can be learned.

The features, structures, effects, and the like described in the embodiments above are included in at least one embodiment of the present invention and are not necessarily limited to a single embodiment. Furthermore, the features, structures, effects, and the like illustrated in each embodiment can be combined or modified for other embodiments by persons having ordinary skill in the art to which the embodiments belong. Therefore, it should be understood that the present invention is not limited to the specific combinations and modifications described, and that contents relating to such combinations and modifications fall within the scope of the present invention.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, these embodiments are provided by way of illustration and example only and are not to be taken as limiting; it will be understood by those of ordinary skill in the art that various modifications and applications not illustrated above are possible. For example, each component specifically shown in the embodiments can be modified and implemented. All changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

300, 500: text 310, 510: character token
320, 520: Forward trie 330, 530: Backward trie
400, 600: Candidate character set

Claims (7)

1. A method for extracting language-independent features of a language, performed by an extraction device that extracts language-independent features from a document, the method comprising:
a text subdividing step of subdividing input text into character units to generate candidate characters;
a candidate character set generation step of generating a candidate character set including the candidate characters;
a target text subdividing step of subdividing a target text based on the candidate character set to generate consecutive characters; and
a language-independent feature extraction step of extracting language-independent features from the consecutive characters,
wherein the text subdividing step comprises:
a tokenizing step of generating character tokens by dividing the text into character units on the basis of a division condition;
a trie generation step of generating a trie for the character tokens; and
a step of dividing the character tokens based on the generated trie, according to frequencies with which they repeatedly appear in the text, to generate the candidate characters.

2. (Deleted)

3. The method according to claim 1, wherein the division condition includes at least one of word spacing and a blank space.

4. The method according to claim 1, wherein the trie generation step comprises generating a plurality of tries composed of character nodes, each character node including a character of the character tokens.

5. The method according to claim 4, wherein the step of dividing the character tokens comprises:
determining whether a frequency difference between the character nodes is equal to or greater than a preset threshold value; and
dividing between character nodes, among the plurality of character nodes, whose frequency difference is equal to or greater than the threshold value.

6. A method for extracting language-independent features of a language, performed by an extraction device that extracts language-independent features from a document, the method comprising:
a text subdividing step of subdividing input text into character units to generate candidate characters;
a candidate character set generation step of generating a candidate character set including the candidate characters;
a target text subdividing step of subdividing a target text based on the candidate character set to generate consecutive characters; and
a language-independent feature extraction step of extracting language-independent features from the consecutive characters,
wherein the target text subdividing step comprises:
comparing the candidate characters included in the candidate character set with the target text; and
dividing the target text by the candidate characters to generate the consecutive characters.

7. A method for extracting language-independent features of a language, performed by an extraction device that extracts language-independent features from a document, the method comprising:
a text subdividing step of subdividing input text into character units to generate candidate characters;
a candidate character set generation step of generating a candidate character set including the candidate characters;
a target text subdividing step of subdividing a target text based on the candidate character set to generate consecutive characters; and
a language-independent feature extraction step of extracting language-independent features from the consecutive characters,
wherein the language-independent feature extraction step comprises:
classifying the consecutive characters into classes having the same function;
classifying characters repeatedly appearing within a class into one topic; and
outputting a result of the classification.
KR1020150046082A 2015-04-01 2015-04-01 Method for extracting independent feature of language KR101630436B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150046082A KR101630436B1 (en) 2015-04-01 2015-04-01 Method for extracting independent feature of language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150046082A KR101630436B1 (en) 2015-04-01 2015-04-01 Method for extracting independent feature of language

Publications (1)

Publication Number Publication Date
KR101630436B1 true KR101630436B1 (en) 2016-06-15

Family

ID=56135347

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150046082A KR101630436B1 (en) 2015-04-01 2015-04-01 Method for extracting independent feature of language

Country Status (1)

Country Link
KR (1) KR101630436B1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000099085A (en) * 1998-09-18 2000-04-07 Atr Interpreting Telecommunications Res Lab Statistical language model generating device and voice recognition device
JP2008084064A (en) * 2006-09-28 2008-04-10 National Institute Of Advanced Industrial & Technology Text classification processing method, text classification processing device and text classification processing program
JP2009015795A (en) * 2007-07-09 2009-01-22 Nippon Telegr & Teleph Corp <Ntt> Text segmentation apparatus, text segmentation method, program, and recording medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bidgoli, AM et al., A language independent text segmentation technique based on naive bayes classifier, ICSIP 2010, pp.11-16, 2010.12. *


Legal Events

Date Code Title Description
E701 Decision to grant or registration of patent right
GRNT Written decision to grant
FPAY Annual fee payment

Payment date: 20190603

Year of fee payment: 4