KR102037453B1

KR102037453B1 - Apparatus and Method for Numeral Classifier Disambiguation using Word Embedding based on Subword Information

Info

Publication number: KR102037453B1
Application number: KR1020190080206A
Authority: KR
Inventors: 권혁철; 김민호
Original assignee: 부산대학교 산학협력단
Priority date: 2018-11-29
Filing date: 2019-07-03
Publication date: 2019-10-29

Abstract

The present invention relates to an apparatus and method for numeral classifier disambiguation using word embedding based on subword information, which resolve the ambiguity of a numeral classifier by measuring the semantic relatedness between the co-occurrence nouns of each sense of the classifier and the nouns that appear in an actual sentence. The apparatus comprises: a morpheme analysis unit that analyzes an input sentence in morpheme units; a word extraction unit that extracts a numeral classifier and a co-occurring noun from the morpheme analysis result of the morpheme analysis unit; and a disambiguation unit that resolves the ambiguity of the numeral classifier extracted by the word extraction unit.

Description

Apparatus and Method for Numeral Classifier Disambiguation using Word Embedding based on Subword Information

The present invention relates to numeral classifier disambiguation, and more particularly, to an apparatus and method for resolving the ambiguity of numeral classifiers using word embedding based on subword information.

Classifiers exist in all languages of the world, in particular in all languages with a grammatical system, but the so-called 'classifier languages' such as Korean, Japanese, Southeast Asian languages, and African languages show a more elaborate classifier system than English or French.

A classifier combines with a quantifier to form a quantifier phrase, which limits the quantity of a noun and categorizes its meaning.

Therefore, expressing the quantity of a noun with a classifier requires a noun, a quantifier, and a classifier to appear, so the classifier is a syntactically obligatory element. In such an expression, the noun is the object whose quantity is specified, and the quantifier specifically limits the quantity of the noun.

The classifier also individuates the noun, which makes the quantity expression possible, while specifying the class (類) to which the noun belongs. The classifier is selected according to the nature of the object, or collection of objects, to which the noun refers. A classifier presupposes that objects are individuated and typed so as to form groups. Classifiers individuate not only concrete things but also abstract things and situations such as events and actions.

Classifiers can be broadly divided into four types: nominal classifiers, numeral classifiers, concordial classifiers, and predicate classifiers. In Korean, numeral classifiers are realized in a complex manner.

The two main functions of numeral classifiers are the classification, or categorization, of things or events, and their quantification.

Numeral classifiers co-occur with quantifiers or quantity expressions and have relatively clear co-occurrence restrictions with respect to particular noun categories.

Thus, a numeral classifier (數分類詞) is a word that marks the unit of a noun's quantity expression while classifying and limiting its meaning.

For example, '잔' ('cup') in the Korean phrase '커피 한 잔' ('a cup of coffee') and 'cups' in the English phrase 'three cups of coffee' are numeral classifiers.

Numeral classifiers provide a clue as to how the Arabic numerals in a text should be read.

For example, the Arabic numeral '1' is read as '한' in '버스 1대' ('one bus') but must be read as '일' in '1대 대통령' ('the first president'). This is because the numeral classifier determines what kind of numeral may precede it.

In Korean, most numeral classifiers are homographs and are highly ambiguous.

The way the accompanying Arabic numeral is read depends on the sense in which the ambiguous numeral classifier is used in the sentence.

In '1대 대통령' ('the first president'), '대' is 'a unit indicating the order of succession in a family line or position' and causes '1' to be read as '일', whereas in '자동차 1대' ('one car'), '대' is 'a unit for counting vehicles, machines, musical instruments, and the like' and causes '1' to be read as '한'.
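The dependence of the numeral reading on the classifier sense can be illustrated with a minimal Python sketch. The sense labels, the `READING_BY_SENSE` table, and the `read_numeral` function are hypothetical names introduced only for illustration, and only the numerals 1 to 5 are covered; the sketch is not part of the disclosed apparatus.

```python
# Illustrative only: the two Korean numeral series for 1 to 5.
NATIVE_KOREAN = {1: "한", 2: "두", 3: "세", 4: "네", 5: "다섯"}
SINO_KOREAN = {1: "일", 2: "이", 3: "삼", 4: "사", 5: "오"}

# Hypothetical sense labels for the classifier '대'; the description only
# gives the dictionary-style glosses, not these identifiers.
READING_BY_SENSE = {
    ("대", "succession_order"): SINO_KOREAN,               # '1대 대통령'
    ("대", "counting_vehicles_machines"): NATIVE_KOREAN,   # '자동차 1대'
}


def read_numeral(number, classifier, sense):
    """Return the spoken form of `number` for a classifier used in `sense`."""
    series = READING_BY_SENSE[(classifier, sense)]
    return series[number]


print(read_numeral(1, "대", "succession_order"))
print(read_numeral(1, "대", "counting_vehicles_machines"))
```

Once the sense of '대' is known, selecting the reading reduces to a table lookup, which is why the disambiguation step matters for TTS.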

Thus, numeral classifier disambiguation is one of the key technologies that determine the performance of text-to-speech (TTS) systems.

Nevertheless, prior-art techniques for numeral classifier disambiguation are limited in how well they resolve the ambiguity, which makes them difficult to apply efficiently in the field of text-to-speech conversion.

Therefore, there is a demand for a new technology for effective numeral classifier disambiguation.

Republic of Korea Patent No. 10-0886688
Republic of Korea Patent Publication No. 10-2016-0124742
Republic of Korea Patent Publication No. 10-2012-0023387

The present invention is intended to solve the problems of prior-art numeral classifier disambiguation techniques, and an object thereof is to provide an apparatus and method for resolving the ambiguity of numeral classifiers using word embedding based on subword information.

Another object of the present invention is to provide an apparatus and method for numeral classifier disambiguation using word embedding based on subword information that resolve the ambiguity of a numeral classifier by measuring the semantic relatedness between the co-occurrence nouns of each sense of the numeral classifier and the nouns that appear in an actual sentence.

Another object of the present invention is to provide an apparatus and method for numeral classifier disambiguation using word embedding based on subword information in which, when the semantic relatedness between the co-occurrence nouns of each sense of the numeral classifier and the nouns that appear in an actual sentence is measured, the relatedness is measured using word vectors obtained through word embedding based on subword information.

Another object of the present invention is to provide an apparatus and method for numeral classifier disambiguation using word embedding based on subword information that determine how an Arabic numeral used with an ambiguous numeral classifier should be read, so that a system such as TTS can generate natural speech output.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, an apparatus for numeral classifier disambiguation using word embedding based on subword information according to the present invention comprises: a morpheme analysis unit that analyzes an input sentence in morpheme units; a word extraction unit that extracts a numeral classifier and a co-occurrence noun from the morpheme analysis result of the morpheme analysis unit; and a disambiguation unit that resolves the ambiguity of the numeral classifier extracted by the word extraction unit.

Here, the disambiguation unit resolves the ambiguity of the numeral classifier using a word vector list and a co-occurrence noun list.

For a numeral classifier used with a noun that is not in the co-occurrence noun list, the disambiguation unit resolves the ambiguity by additionally using the word vector list, in which words are vectorized through word embedding, exploiting the fact that words with similar meanings lie close together in the vector space.

The word extraction unit extracts, from the morpheme analysis result, an ambiguous numeral classifier and the noun that appears with it in context, in order to select the disambiguation target.

The apparatus further comprises a word vector generation unit that generates the word vector list used by the disambiguation unit to resolve the ambiguity of the numeral classifier, and the word vector generation unit generates the word vector list through word embedding using a large-scale Korean morphologically analyzed corpus.

The apparatus further comprises a co-occurrence noun generation unit that generates the co-occurrence noun list used by the disambiguation unit to resolve the ambiguity of the numeral classifier, and the co-occurrence noun generation unit generates a co-occurrence noun list for each sense of the numeral classifier using a large-scale Korean morpho-semantically analyzed corpus.

To achieve another object, an apparatus for numeral classifier disambiguation using word embedding based on subword information according to the present invention comprises: a word vector generation unit that generates a word vector list through word embedding using a large-scale Korean morphologically analyzed corpus; a co-occurrence noun generation unit that generates a co-occurrence noun list for each sense of a numeral classifier using a large-scale Korean morpho-semantically analyzed corpus; a morpheme analysis unit that analyzes an input sentence in morpheme units; a word extraction unit that extracts, from the morpheme analysis result of the morpheme analysis unit, an ambiguous numeral classifier and the noun that appears with it in context, in order to select the disambiguation target; and a disambiguation unit that resolves the ambiguity of the numeral classifier extracted by the word extraction unit using the word vector list and the co-occurrence noun list.

To achieve another object, a method for numeral classifier disambiguation using word embedding based on subword information according to the present invention comprises: generating a word vector list through word embedding in a word vector generation unit, and generating a co-occurrence noun list for each sense of a numeral classifier in a co-occurrence noun generation unit; when a sentence is input, performing morpheme analysis in a morpheme analysis unit in order to extract a numeral classifier and a co-occurrence noun; extracting, in a word extraction unit, an ambiguous numeral classifier and the noun that appears with it in context from the morpheme analysis result, in order to select the disambiguation target; and resolving, in a disambiguation unit, the ambiguity of the numeral classifier using the word vector list and the co-occurrence noun list.

Here, in the step of resolving the ambiguity of the numeral classifier, for a numeral classifier used with a noun that is not in the co-occurrence noun list, the ambiguity is resolved by additionally using the word vector list, in which words are vectorized through word embedding, exploiting the fact that words with similar meanings lie close together in the vector space.

In the step of extracting an ambiguous numeral classifier and the noun that appears with it in context, numeral classifiers without ambiguity are not extracted, and for an ambiguous numeral classifier, the numeral classifier and the noun co-occurring with it are extracted.

As described above, the apparatus and method for numeral classifier disambiguation using word embedding based on subword information according to the present invention have the following effects.

First, an apparatus and method can be provided that efficiently resolve the ambiguity of numeral classifiers using word embedding based on subword information.

Second, the ambiguity of a numeral classifier can be resolved efficiently by measuring the semantic relatedness between the co-occurrence nouns of each sense of the numeral classifier and the nouns that appear in an actual sentence.

Third, when the semantic relatedness between the co-occurrence nouns of each sense of the numeral classifier and the nouns that appear in an actual sentence is measured, it is measured using word vectors obtained through word embedding based on subword information, so that the ambiguity of the numeral classifier can be resolved efficiently.

Fourth, by determining how an Arabic numeral used with an ambiguous numeral classifier should be read, a system such as TTS can generate natural speech output.

FIG. 1 is a block diagram of an apparatus for numeral classifier disambiguation using word embedding based on subword information according to the present invention.
FIG. 2 is a flow chart showing a method for numeral classifier disambiguation using word embedding based on subword information according to the present invention.
FIG. 3 is an exemplary diagram showing an example of the process by which the disambiguation unit resolves the ambiguity of a numeral classifier.

Hereinafter, preferred embodiments of the apparatus and method for numeral classifier disambiguation using word embedding based on subword information according to the present invention will be described in detail.

The features and advantages of the apparatus and method for numeral classifier disambiguation using word embedding based on subword information according to the present invention will become apparent from the detailed description of each embodiment below.

FIG. 1 is a block diagram of an apparatus for numeral classifier disambiguation using word embedding based on subword information according to the present invention.

The apparatus and method for numeral classifier disambiguation using word embedding based on subword information according to the present invention efficiently resolve the ambiguity of a numeral classifier by measuring the semantic relatedness between the co-occurrence nouns of each sense of the numeral classifier and the nouns that appear in an actual sentence.

To this end, the present invention includes a configuration that, when measuring the semantic relatedness between the co-occurrence nouns of each sense of the numeral classifier and the nouns that appear in an actual sentence, measures the relatedness using word vectors obtained through word embedding based on subword information and thereby resolves the ambiguity of the numeral classifier.

The present invention also includes a configuration that determines how an Arabic numeral used with an ambiguous numeral classifier should be read, so that a system such as TTS can generate natural speech output.

As shown in FIG. 1, the apparatus for numeral classifier disambiguation using word embedding based on subword information according to the present invention includes a morpheme analysis unit (10) that analyzes an input sentence in morpheme units, a word extraction unit (20) that extracts a numeral classifier and a co-occurrence noun from the morpheme analysis result of the morpheme analysis unit (10), and a disambiguation unit (30) that resolves the ambiguity of the numeral classifier extracted by the word extraction unit (20).

Here, morphological analysis analyzes an input character string and splits it into morphemes, the smallest units of meaning; any commonly used morphological analysis method may be used.

The apparatus for numeral classifier disambiguation using word embedding based on subword information according to the present invention further includes a word vector generation unit (40) that generates a word vector list (41) through word embedding using a large-scale Korean morphologically analyzed corpus.

The apparatus also further includes a co-occurrence noun generation unit (50) that generates a co-occurrence noun list (51) for each sense of the numeral classifier using a large-scale Korean morpho-semantically analyzed corpus.

The disambiguation unit (30) resolves the ambiguity of the numeral classifier using the word vector list (41) built in advance by the word vector generation unit (40) and the co-occurrence noun list (51) built in advance by the co-occurrence noun generation unit (50).

The method for numeral classifier disambiguation using word embedding based on subword information according to the present invention is described in detail below.

FIG. 2 is a flow chart showing a method for numeral classifier disambiguation using word embedding based on subword information according to the present invention.

First, the word vector generation unit (40) generates the word vector list (41) through word embedding using a large-scale Korean morphologically analyzed corpus (S201).

Next, the co-occurrence noun generation unit (50) generates the co-occurrence noun list (51) for each sense of the numeral classifier using a large-scale Korean morpho-semantically analyzed corpus (S202).

When a sentence is input, the morpheme analysis unit (10) performs morpheme analysis in order to extract a numeral classifier and a co-occurrence noun (S203).

Next, the word extraction unit (20) extracts, from the morpheme analysis result, an ambiguous numeral classifier and the noun that appears with it in context, in order to select the disambiguation target (S204).

The disambiguation unit (30) then resolves the ambiguity of the numeral classifier using the word vector list and the co-occurrence noun list (S205).

The process of resolving the ambiguity of a numeral classifier in the apparatus and method for numeral classifier disambiguation using word embedding based on subword information according to the present invention is described in detail below.

FIG. 3 is an exemplary diagram showing an example of the process by which the disambiguation unit resolves the ambiguity of a numeral classifier.

The morpheme analysis unit (10) performs morpheme analysis in order to extract a numeral classifier and a co-occurrence noun.

The word extraction unit (20) extracts, from the morpheme analysis result, an ambiguous numeral classifier and the noun that appears with it in context, in order to select the disambiguation target.

For example, in '꽃 1송이' ('one flower'), '송이' is a numeral classifier without ambiguity, so it is not extracted.

In contrast, in '버스 1대' ('one bus'), '대' is an ambiguous numeral classifier, so the numeral classifier '대' and the co-occurring noun '버스' ('bus') are extracted.
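The extraction rule in the two examples above can be sketched in Python as follows. The sketch assumes the morpheme analysis result is a list of (surface, tag) pairs with Sejong-style tags (`SN` for numerals, `NNB` for dependent nouns such as classifiers, `NNG` for common nouns). The `AMBIGUOUS_CLASSIFIERS` set, the `extract_targets` function, and the choice of the nearest common noun as the co-occurring noun are assumptions made for illustration; the description does not specify how the co-occurring noun is located.

```python
# Hypothetical set of ambiguous classifiers; in the apparatus this would be
# derived from the co-occurrence noun list (51).
AMBIGUOUS_CLASSIFIERS = {"대"}


def extract_targets(morphemes):
    """Return (classifier, noun) pairs for ambiguous numeral classifiers.

    `morphemes` is a list of (surface, tag) pairs in sentence order.
    """
    targets = []
    noun_positions = [j for j, (_, tag) in enumerate(morphemes) if tag == "NNG"]
    for i in range(1, len(morphemes)):
        surface, tag = morphemes[i]
        previous_tag = morphemes[i - 1][1]
        # A classifier candidate is a dependent noun right after a numeral.
        if tag != "NNB" or previous_tag != "SN":
            continue
        # Unambiguous classifiers such as '송이' are skipped.
        if surface not in AMBIGUOUS_CLASSIFIERS:
            continue
        if not noun_positions:
            continue
        # Assumption: take the common noun closest to the classifier.
        nearest = min(noun_positions, key=lambda j: abs(j - i))
        targets.append((surface, morphemes[nearest][0]))
    return targets


print(extract_targets([("꽃", "NNG"), ("1", "SN"), ("송이", "NNB")]))
print(extract_targets([("버스", "NNG"), ("1", "SN"), ("대", "NNB")]))
```

With these inputs, '꽃 1송이' yields no target, while '버스 1대' yields the pair of '대' and '버스'.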

Whether a numeral classifier is ambiguous is determined using the co-occurrence noun list (51) built in advance by the co-occurrence noun generation unit (50).

The co-occurrence noun list (51) is built by distinguishing the senses of each ambiguous numeral classifier in a large-scale Korean morpho-semantically analyzed corpus and then extracting and listing the co-occurrence nouns for each sense of the numeral classifier.

For example, the list records that when the numeral classifier '대' is interpreted as 'a unit indicating the order of succession in a family line or position', it is used with words such as '남자, 여자, 노인' ('man, woman, elderly person'), and that when it is interpreted as 'a unit for counting vehicles, machines, musical instruments, and the like', it is used with words such as '자동차, 컴퓨터' ('car, computer').

The disambiguation unit (30) resolves the ambiguity of an ambiguous numeral classifier using the noun used with it.

The disambiguation unit (30) resolves the ambiguity of the numeral classifier using the word vector list (41) generated by the word vector generation unit (40) and the co-occurrence noun list (51) generated by the co-occurrence noun generation unit (50).

In FIG. 3, the co-occurrence noun list (51a) contains the co-occurrence nouns for each sense of the ambiguous numeral classifiers.

For example, the numeral classifier '대' is divided into four senses: with sense number 01 it is used with words such as '남자, 여자, 노인' ('man, woman, elderly person'); with sense number 02, with words such as '자동차, 컴퓨터' ('car, computer'); with sense number 03, with words such as '주사, 담배, 회초리' ('injection, cigarette, switch'); and with sense number 04, with words such as '불가사의, 기업' ('wonder, company').

In example sentence 1 (51b), '70대 노인' ('an elderly person in their seventies'), the numeral classifier '대' is used with '노인', which appears in the co-occurrence noun list (51a), so it can be identified as sense number 01.
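The direct lookup used for example sentence 1 can be sketched in Python as follows. The dictionary holds only the example nouns named in the description for '대'; the names `COOCCURRENCE_NOUNS` and `lookup_sense` are hypothetical, and a real list built from a corpus would be far larger.

```python
# Co-occurrence noun list for the classifier '대', limited to the example
# nouns given for sense numbers 01 to 04.
COOCCURRENCE_NOUNS = {
    "대": {
        "01": {"남자", "여자", "노인"},
        "02": {"자동차", "컴퓨터"},
        "03": {"주사", "담배", "회초리"},
        "04": {"불가사의", "기업"},
    }
}


def lookup_sense(classifier, noun):
    """Return the sense number whose co-occurrence nouns contain `noun`,
    or None when the noun is not listed for this classifier."""
    for sense, nouns in COOCCURRENCE_NOUNS.get(classifier, {}).items():
        if noun in nouns:
            return sense
    return None


print(lookup_sense("대", "노인"))  # listed noun, resolved by lookup alone
print(lookup_sense("대", "버스"))  # unlisted noun, lookup alone cannot decide
```

A `None` result marks the case in which the list alone cannot decide and the word vector list has to be consulted.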

In example sentence 2 (51c), '버스 3대' ('three buses'), the numeral classifier '대' is used with the noun '버스' ('bus'), but '버스' is not in the co-occurrence noun list (51a).

For a numeral classifier used with a noun that is not in the co-occurrence noun list (51a), the ambiguity is resolved by additionally using the word vector list (41).

The word vector list (41) contains words vectorized through word embedding, in which words with similar meanings lie close together in the vector space.
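The description does not state how subword information enters the embedding. One widely used scheme, assumed here purely for illustration, splits each word into character n-grams with boundary markers and builds the word vector from the vectors of those n-grams, so that words sharing n-grams also share part of their representation. The sketch below shows only the n-gram step; the function names and the n-gram range of 2 to 3 are hypothetical choices.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Return the character n-grams of `word`, with '<' and '>' marking
    the word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def shared_ngrams(word_a, word_b):
    """Return the n-grams that two words have in common."""
    return set(char_ngrams(word_a)) & set(char_ngrams(word_b))


print(char_ngrams("자동차"))
print(shared_ngrams("자동차", "전동차"))
```

In this sketch '자동차' ('car') and '전동차' ('electric train') share the n-grams containing '동차', which is the kind of overlap a subword-based embedding can exploit for words that are rare in the training corpus.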

For example, when the semantic distance d between the word '버스' ('bus') and every word in the co-occurrence noun list is measured, '자동차' ('car') turns out to be the word semantically most similar to '버스', and the sense of the numeral classifier '대' can therefore be identified as '02'.
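The fallback for unlisted nouns can be sketched in Python as follows. The description speaks only of a semantic distance d; cosine similarity is used here as an assumed stand-in, and the three-dimensional vectors in `WORD_VECTORS` are invented toy values, not output of any trained model. The function and variable names are likewise hypothetical.

```python
import math

# Example co-occurrence nouns for '대' (a subset, in a fixed order).
COOCCURRENCE_NOUNS = {
    "대": {
        "01": ["남자", "노인"],
        "02": ["자동차", "컴퓨터"],
        "03": ["주사"],
        "04": ["기업"],
    }
}

# Toy vectors invented for this sketch only.
WORD_VECTORS = {
    "버스": [0.9, 0.1, 0.0],
    "자동차": [0.8, 0.2, 0.1],
    "컴퓨터": [0.5, 0.5, 0.2],
    "노인": [0.0, 0.9, 0.1],
    "남자": [0.1, 0.8, 0.2],
    "주사": [0.1, 0.1, 0.9],
    "기업": [0.3, 0.4, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def disambiguate(classifier, noun):
    """Return (sense, matched_noun), or None if no decision is possible."""
    senses = COOCCURRENCE_NOUNS[classifier]
    # Step 1: direct lookup in the co-occurrence noun list.
    for sense, nouns in senses.items():
        if noun in nouns:
            return sense, noun
    # Step 2: fall back to the most similar listed noun in vector space.
    if noun not in WORD_VECTORS:
        return None
    best = None
    for sense, nouns in senses.items():
        for candidate in nouns:
            if candidate not in WORD_VECTORS:
                continue
            score = cosine(WORD_VECTORS[noun], WORD_VECTORS[candidate])
            if best is None or score > best[0]:
                best = (score, sense, candidate)
    return (best[1], best[2]) if best else None


print(disambiguate("대", "버스"))
print(disambiguate("대", "노인"))
```

With these toy vectors, '버스' is closest to '자동차' and is therefore assigned sense 02, mirroring example sentence 2; with real embeddings the outcome depends on the trained vectors.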

The apparatus and method for numeral classifier disambiguation using word embedding based on subword information according to the present invention described above resolve the ambiguity of a numeral classifier by measuring the semantic relatedness between the co-occurrence nouns of each sense of the numeral classifier and the nouns that appear in an actual sentence, using word vectors obtained through word embedding based on subword information.

In particular, the present invention determines how an Arabic numeral used with an ambiguous numeral classifier should be read, so that a system such as TTS can generate natural speech output.

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense; the scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

10. Morpheme analysis unit
20. Word extraction unit
30. Disambiguation unit
40. Word vector generation unit
50. Co-occurrence noun generation unit

Claims

An apparatus for numeral classifier disambiguation using word embedding based on subword information, comprising:
a morpheme analysis unit that analyzes an input sentence in morpheme units;
a word extraction unit that extracts a numeral classifier and a co-occurrence noun from the morpheme analysis result of the morpheme analysis unit; and
a disambiguation unit that resolves the ambiguity of the numeral classifier extracted by the word extraction unit,
wherein the disambiguation unit resolves the ambiguity of the numeral classifier using a word vector list and a co-occurrence noun list.

(Deleted)

The apparatus of claim 1, wherein, for a numeral classifier used with a noun that is not in the co-occurrence noun list, the disambiguation unit resolves the ambiguity by additionally using the word vector list, in which words are vectorized through word embedding, exploiting the fact that words with similar meanings lie close together in the vector space.

The apparatus of claim 1, wherein the word extraction unit extracts, from the morpheme analysis result, an ambiguous numeral classifier and the noun that appears with it in context, in order to select the disambiguation target.

The apparatus of claim 1, further comprising a word vector generation unit that generates the word vector list used by the disambiguation unit to resolve the ambiguity of the numeral classifier,
wherein the word vector generation unit generates the word vector list through word embedding using a large-scale Korean morphologically analyzed corpus.

The apparatus of claim 1, further comprising a co-occurrence noun generation unit that generates the co-occurrence noun list used by the disambiguation unit to resolve the ambiguity of the numeral classifier,
wherein the co-occurrence noun generation unit generates a co-occurrence noun list for each sense of the numeral classifier using a large-scale Korean morpho-semantically analyzed corpus.

An apparatus for numeral classifier disambiguation using word embedding based on subword information, comprising:
a word vector generation unit that generates a word vector list through word embedding using a large-scale Korean morphologically analyzed corpus;
a co-occurrence noun generation unit that generates a co-occurrence noun list for each sense of a numeral classifier using a large-scale Korean morpho-semantically analyzed corpus;
a morpheme analysis unit that analyzes an input sentence in morpheme units;
a word extraction unit that extracts, from the morpheme analysis result of the morpheme analysis unit, an ambiguous numeral classifier and the noun that appears with it in context, in order to select the disambiguation target; and
a disambiguation unit that resolves the ambiguity of the numeral classifier extracted by the word extraction unit using the word vector list and the co-occurrence noun list.

A method for numeral classifier disambiguation using word embedding based on subword information, comprising:
generating a word vector list through word embedding in a word vector generation unit, and generating a co-occurrence noun list for each sense of a numeral classifier in a co-occurrence noun generation unit;
when a sentence is input, performing morpheme analysis in a morpheme analysis unit in order to extract a numeral classifier and a co-occurrence noun;
extracting, in a word extraction unit, an ambiguous numeral classifier and the noun that appears with it in context from the morpheme analysis result, in order to select the disambiguation target; and
resolving, in a disambiguation unit, the ambiguity of the numeral classifier using the word vector list and the co-occurrence noun list.

The method of claim 8, wherein, in the step of resolving the ambiguity of the numeral classifier,
for a numeral classifier used with a noun that is not in the co-occurrence noun list, the ambiguity is resolved by additionally using the word vector list, in which words are vectorized through word embedding, exploiting the fact that words with similar meanings lie close together in the vector space.

The method of claim 8, wherein, in the step of extracting an ambiguous numeral classifier and the noun that appears with it in context,
numeral classifiers without ambiguity are not extracted, and for an ambiguous numeral classifier, the numeral classifier and the noun co-occurring with it are extracted.