KR101255957B1

KR101255957B1 - Method and apparatus for tagging named entity

Info

Publication number: KR101255957B1
Application number: KR1020110131504A
Authority: KR
Inventors: 이근배; 김석환; 김경덕; 이동현; 최준휘
Original assignee: 포항공과대학교 산학협력단
Priority date: 2011-12-09
Filing date: 2011-12-09
Publication date: 2013-04-24

Abstract

PURPOSE: An entity name tagging method and apparatus are provided to improve the performance of a conversation system by tagging accurate entity names to the vocabulary included in a corpus. CONSTITUTION: An acquisition unit (21) acquires an entity name candidate group from the vocabulary included in a corpus, based on a vocabulary dictionary for a predetermined domain. A tagging unit (22) applies an unsupervised learning method with constraints to the entity name candidate group and tags entity names to the candidate group according to the result. The acquisition unit may acquire the entity name candidate group according to how the vocabulary included in the corpus is repeated. One such constraint is the number of times the same entity name appears in the sentence to which the vocabulary belongs. [Reference numerals] (21) Acquisition unit; (22) Tagging unit

Description

METHOD AND APPARATUS FOR TAGGING NAMED ENTITY

The present invention relates to a method and apparatus for tagging entity names, and more particularly, to a method and apparatus for tagging entity names to the vocabulary included in a conversation corpus.

In general, a named entity is a word or a sequence of words that can be classified into a category such as the name of a person or organization, a song title, a broadcast program title, or a place name.

Conventional entity name tagging methods include the rule-based approach, the statistics-based approach, and the hybrid approach, which combines the two.

The rule-based approach manually constructs rules for entity name tagging and tags entity names using various dictionaries, such as a proper noun dictionary, a dictionary of words that serve as clues for entity name tagging, and a dictionary of words that appear in the context of entity names. Because this approach depends on human intuition, and because the rules and dictionaries must be changed whenever it is applied to a new domain, it requires considerable time and cost.

The statistics-based approach automatically learns the knowledge needed for entity name tagging from training data, mainly learning tagging rules from information obtained from spelling, parts of speech, and morphemes. It can be divided into supervised learning, which uses training data already tagged with entity names, and unsupervised learning, which uses unprocessed general documents as training data. Supervised learning is costly because tagged training data must be produced, and the amount of data that can be built is inevitably limited. Unsupervised learning makes training data easy to obtain, but it is difficult to derive tagging rules from the simple features of entity names alone.

The hybrid approach combines a statistics-based model with various kinds of knowledge such as rules, vocabulary, and dictionaries. It therefore has the problems of both the rule-based approach and the statistics-based approach.

An object of the present invention, devised to solve the above problems, is to provide an entity name tagging method for tagging accurate entity names to the vocabulary included in a corpus.

Another object of the present invention, devised to solve the above problems, is to provide an entity name tagging apparatus for tagging accurate entity names to the vocabulary included in a corpus.

To achieve the above object, one embodiment of the present invention includes: obtaining an entity name candidate group from the vocabulary included in the corpus, based on a vocabulary dictionary for a predetermined domain; and applying an unsupervised learning method with constraints to the entity name candidate group and tagging entity names to the candidate group according to the result.

Here, the entity name tagging method may further include adding the entity name candidate group that has been tagged with entity names to the vocabulary dictionary for the predetermined domain.

Here, the step of obtaining the entity name candidate group may obtain the candidate group according to the similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary.

Here, the step of obtaining the entity name candidate group may include: obtaining a semantic similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary, based on a vector for the corpus vocabulary and a vector for the dictionary vocabulary; obtaining a lexical similarity between the two, based on the string distance required to convert the corpus vocabulary into the dictionary vocabulary; and obtaining the entity name candidate group based on the semantic similarity and the lexical similarity.
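The lexical similarity described here is a string distance between a corpus term and a dictionary term. The text does not name a specific distance, so the following is only a minimal sketch that assumes the Levenshtein edit distance; the function names and the normalization to the range 0 to 1 are hypothetical choices made for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def lexical_similarity(corpus_term: str, dictionary_term: str) -> float:
    """Hypothetical normalization (an assumption, not taken from the text):
    1.0 for identical strings, approaching 0.0 as more edits are needed."""
    longest = max(len(corpus_term), len(dictionary_term))
    if longest == 0:
        return 1.0
    distance = edit_distance(corpus_term.lower(), dictionary_term.lower())
    return 1.0 - distance / longest
```

With this sketch, a corpus spelling such as 'Times square' compares as identical to the dictionary entry 'Times Square' because the comparison is case-insensitive.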

Here, the step of obtaining the semantic similarity may obtain the semantic similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary based on vectors whose weights are the number of times a vocabulary item appears in a specific document.

Here, the step of obtaining the entity name candidate group based on the semantic similarity and the lexical similarity may calculate an entity name score from the two similarities and obtain, as the entity name candidate group, the vocabulary whose entity name score is at or above a predetermined threshold.

Here, the step of obtaining the entity name candidate group may obtain the candidate group according to how the vocabulary included in the corpus is repeated.

Here, the step of obtaining the entity name candidate group may obtain, as the entity name candidate group, vocabulary that appears in the corpus at least a predetermined number of times, based on N-gram counts.
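As an illustration of the N-gram count criterion, the following minimal sketch counts every word N-gram up to a maximum length and keeps those that reach a minimum count. Whitespace tokenization and the parameter names `max_n` and `min_count` are assumptions made for the example; the text specifies only "a predetermined number of times".

```python
from collections import Counter


def ngram_candidates(sentences, max_n=3, min_count=2):
    """Return {ngram: count} for every word N-gram (1 <= N <= max_n)
    that occurs at least min_count times across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: simple whitespace tokenization
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return {ngram: count for ngram, count in counts.items() if count >= min_count}
```

A real corpus would need punctuation handling and, for Korean, morphological analysis before counting; those steps are left out of this sketch.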

Here, the step of obtaining the entity name candidate group may obtain, as the entity name candidate group, vocabulary that appears at least a predetermined number of times in a cluster classified by Latent Dirichlet Allocation (LDA).

Here, the step of applying the unsupervised learning method with constraints and tagging entity names to the candidate group may use, as the constraints, at least one of: the probability that the entity name candidate group is tagged with a specific entity name; the probability that the candidate group is tagged with a specific entity name given the intent of the sentence to which the vocabulary belongs; and the number of times the same entity name appears in the sentence to which the vocabulary belongs.

To achieve the above object, another embodiment of the present invention includes: an acquisition unit that obtains an entity name candidate group from the vocabulary included in the corpus, based on a vocabulary dictionary for a predetermined domain; and a tagging unit that applies an unsupervised learning method with constraints to the entity name candidate group and tags entity names to the candidate group according to the result.

Here, the acquisition unit may obtain the entity name candidate group according to the similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary, or according to how the vocabulary included in the corpus is repeated.

Here, the tagging unit may apply an unsupervised learning method whose constraints are at least one of: the probability that the entity name candidate group is tagged with a specific entity name; the probability that the candidate group is tagged with a specific entity name given the intent of the sentence to which the vocabulary belongs; and the number of times the same entity name appears in the sentence to which the vocabulary belongs.

According to the present invention, an entity name candidate group is obtained from the vocabulary included in a corpus based on the corpus and a vocabulary dictionary for a predetermined domain, an unsupervised learning method with predetermined constraints is applied to the obtained candidate group, and entity names are tagged to the corpus according to the result. More accurate entity names can thereby be tagged to the vocabulary included in the corpus.

Because accurate entity names can be tagged to the vocabulary included in the corpus in this way, the overall performance of the conversation system can be improved.

FIG. 1 is a block diagram illustrating a conversation system according to the present invention.
FIG. 2 is a flowchart illustrating an entity name tagging method according to the present invention.
FIG. 3 is a block diagram illustrating an entity name tagging apparatus according to the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail below.

This is not intended to limit the present invention to particular embodiments, however, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

The terminology used in this application is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 1 is a block diagram illustrating a conversation system according to the present invention.

Referring to FIG. 1, the conversation system (1) includes a speech recognition unit (10), a semantic analysis unit (20), a response determination unit (30), a response generation unit (40), and a speech conversion unit (50).

In the present invention, the speech recognition unit (10), the semantic analysis unit (20), the response determination unit (30), the response generation unit (40), and the speech conversion unit (50) are described as mutually independent parts, but they may be implemented in a single form, as one physical device, or as one module. Furthermore, each of these units may be implemented not as a single physical device or group but as a plurality of physical devices or groups.

The conversation system (1) analyzes the meaning of a user utterance and generates a response according to the analyzed meaning. The speech recognition unit (10) receives an utterance from the user and converts it into text. Based on the converted text, the semantic analysis unit (20) analyzes the speech intent (or sentence intent) of the user utterance and the entity names of the vocabulary included in the utterance, and tags the speech intent and entity names according to the result. The response determination unit (30) determines the most appropriate system response according to the result of the semantic analysis unit (20), the response generation unit (40) generates the utterance text for the system response determined by the response determination unit (30), and the speech conversion unit (50) converts the utterance text generated by the response generation unit (40) into speech.

The conversation system (1) has a vocabulary dictionary for the relevant domain and uses it to tag the speech intent of a user utterance and the entity names of the vocabulary included in the utterance. Table 1 below is a vocabulary dictionary for the travel domain; the following describes in detail how it is used to tag the speech intent of a user utterance and the entity names of the vocabulary included in the utterance.

[Table 1]
| Tour | Place | Time | Route |
| --- | --- | --- | --- |
| Downtown Tour | NY City Bus Tour Center in Times Square (where the conversation is taking place) | every 15 minutes from 8:00 am to 10:00 pm; two-day | Times Square; Empire State Building; Chinatown; Site of the World Trade Center; Statue of Liberty; Rockefeller Center |
| All Around Town Tour | NY City Bus Tour Center in Times Square (where the conversation is taking place) | every 20 minutes from 8:00 am to 10:00 pm; two-day | Times Square; Empire State Building; Chinatown; Site of the World Trade Center; Statue of Liberty; Rockefeller Center; Central Park; Metropolitan Museum of Art; Harlem Market |

In Table 1, 'Tour', 'Place', 'Time', and 'Route' in the top row are entity names. 'Downtown Tour' and 'All Around Town Tour', in the 'Tour' column, are vocabulary for the entity name 'Tour'. 'NY City Bus Tour Center in Times Square', in the 'Place' column, is vocabulary for the entity name 'Place'. 'every 15 minutes from 8:00 am to 10:00 pm' and 'every 20 minutes from 8:00 am to 10:00 pm', in the 'Time' column, are vocabulary for the entity name 'Time'. 'Times Square', 'Empire State Building', 'Chinatown', 'Site of the World Trade Center', 'Statue of Liberty', 'Rockefeller Center', 'Central Park', 'Metropolitan Museum of Art', and 'Harlem Market', in the 'Route' column, are vocabulary for the entity name 'Route'.

The conversation corpus is composed as follows, where U denotes a user utterance and S denotes a system utterance.

U: Do either of them give two-day tickets?

S: Yes, the All Around Town Tour provides a two-day ticket.

U: Can I see the Statue of Liberty on All Around Town Tour?

Using the vocabulary dictionary for the travel domain (Table 1), the semantic analysis unit (20) of the conversation system (1) tags the speech intent and entity names of the conversation corpus as follows. Here, what follows '+' is the speech intent, and what is enclosed in <> is an entity name.

U: Do either of them give <time>two-day</time> tickets? + confirm()

S: Yes, the <tour>All Around Town Tour</tour> provides a <time>two-day</time> ticket. + inform()

U: Can I see the <place>Statue of Liberty</place> on <tour>All Around Town Tour</tour>? + confirm()
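To make the annotation format above concrete, the following minimal sketch reads one tagged line and separates the entity names from the speech intent. It assumes the exact surface format shown in the examples (entity spans written as `<label>…</label>` and the intent after the last '+'); the function name and the returned structure are hypothetical.

```python
import re

# Matches <label>phrase</label>, where the closing label repeats the opening one.
ENTITY_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")


def parse_annotation(line):
    """Split a tagged utterance into ([(entity_name, phrase), ...], intent)."""
    body, separator, intent = line.rpartition("+")
    if not separator:  # the line carries no '+ intent' suffix
        body, intent = line, ""
    entities = [(label, phrase.strip()) for label, phrase in ENTITY_PATTERN.findall(body)]
    return entities, intent.strip()
```

Applied to the third tagged utterance, this yields the pairs ('place', 'Statue of Liberty') and ('tour', 'All Around Town Tour') together with the intent 'confirm()'.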

The following describes how an entity name is tagged when two or more entity names correspond to the vocabulary included in the corpus. U below is a user utterance, and Table 2 is the vocabulary dictionary for the domain of that utterance.

U: Welcome to the New York City Bus Tour Center.

[Table 2]
| Place | Address | Trans |
| --- | --- | --- |
| New York City Bus Tour Center | New York | Bus |

In Table 2, 'Place', 'Address', and 'Trans' are entity names. 'New York City Bus Tour Center' is vocabulary for the entity name 'Place', 'New York' is vocabulary for the entity name 'Address', and 'Bus' is vocabulary for the entity name 'Trans'.

Here, 'New York City Bus Tour Center' corresponds to the entity name 'Place', while within it 'New York' corresponds to the entity name 'Address' and 'Bus' corresponds to the entity name 'Trans'. That is, 'New York' is part of a 'Place' entity and at the same time an 'Address' entity, and 'Bus' is part of a 'Place' entity and at the same time a 'Trans' entity. In such a case, an entity name score is calculated using Equation 1, and the entity name with the highest score is tagged.

For example, when the entity name is tagged as '<place>New York City Bus Tour Center</place>', the number of words in the tagged vocabulary is 4 (New York, City, Bus, Tour Center) and the number of entity names tagged to those words is 1 (place), so the score of the entity name 'place' is 2.

Likewise, when the entity names are tagged as '<address>New York</address> City <trans>Bus</trans> Tour Center', the number of words in the tagged vocabulary is 2 (New York, Bus) and the number of entity names tagged to those words is 2 (address, trans), so the score of the entity names 'address' and 'trans' is 0.5.

Therefore, because tagging 'New York City Bus Tour Center' with the entity name 'place' yields a higher score than tagging it with the entity names 'address' and 'trans', the entity name 'place' is tagged to 'New York City Bus Tour Center'.
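Equation 1 itself is not reproduced in this text. The two worked values (4 words and 1 entity name giving 2; 2 words and 2 entity names giving 0.5) are both consistent with dividing the base-2 logarithm of the word count by the number of entity names, so the sketch below assumes that form. This is an inference from the examples, not the patent's stated formula, and the function names are hypothetical.

```python
import math


def entity_score(word_count, entity_count):
    """Assumed form of the score: log2(words covered) / number of entity names.
    Inferred from the two worked examples; Equation 1 is not shown in the text."""
    return math.log2(word_count) / entity_count


def best_tagging(candidates):
    """Pick the candidate tagging with the highest assumed score.
    Each candidate is a (description, word_count, entity_count) tuple."""
    return max(candidates, key=lambda c: entity_score(c[1], c[2]))


candidates = [
    ("<place>New York City Bus Tour Center</place>", 4, 1),
    ("<address>New York</address> City <trans>Bus</trans> Tour Center", 2, 2),
]
```

Under this assumption, `best_tagging(candidates)` selects the single 'place' tagging, which matches the conclusion reached in the text.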

The above describes how entity names are tagged when the vocabulary included in the conversation corpus is contained in the vocabulary dictionary for the relevant domain. The following describes in detail how entity names are tagged when the vocabulary included in the conversation corpus is not contained in that dictionary.

FIG. 2 is a flowchart illustrating an entity name tagging method according to the present invention.

Referring to FIG. 2, the entity name tagging method may include obtaining an entity name candidate group (S100) and tagging entity names to the corpus based on the obtained candidate group (S200), and may further include updating the vocabulary dictionary based on the candidate group tagged with entity names (S300).

Step S100 may include obtaining a semantic similarity (S110), obtaining a lexical similarity (S111), obtaining an entity name candidate group based on the semantic similarity and the lexical similarity (S112), obtaining an entity name candidate group based on N-gram counts (S120), and obtaining an entity name candidate group based on Latent Dirichlet Allocation (LDA) (S121). Step S200 may include tagging entity names according to the probability of being tagged (S210), tagging entity names according to sentence intent (S220), and tagging entity names according to the number of occurrences of an entity name (S230).

Step S100 may obtain the entity name candidate group using at least one of two methods: one based on the similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary for a predetermined domain, and one based on how the vocabulary included in the corpus is repeated. Here, the predetermined domain is the domain related to the topic of the corpus. For example, for a corpus about travel the predetermined domain is the travel domain, and for a corpus about movies it is the movie domain.

The method that obtains the entity name candidate group according to the similarity between vocabulary included in the corpus and vocabulary included in the vocabulary dictionary for the predetermined domain may obtain the candidate group through steps S110, S111, and S112.

Step S110 may obtain the semantic similarity between vocabulary included in the corpus and vocabulary included in the vocabulary dictionary, based on a vector for the corpus vocabulary and a vector for the dictionary vocabulary. Here, the vectors are represented in a vector space model, which is an algebraic model that represents text documents or other objects as vectors of identifiers.

First, nouns (or words) are extracted through morphological analysis from the vocabulary included in the corpus and in the vocabulary dictionary, and a web search is performed with the extracted nouns to calculate a vector for each vocabulary item. The search may be limited to web pages related to the topic of the corpus. For example, for a corpus about travel, travel-related web pages may be searched, and for a corpus about movies, movie-related web pages may be searched. One way to calculate the vector for a vocabulary item is to base it on whether the extracted nouns appear in the web pages.

In addition, a weight for each extracted noun may be calculated using Equations 2, 3, and 4 below, and the vector for the vocabulary item may be calculated from the weighted nouns.

The weight 'tfidf_ij' for a noun may be calculated as the product of the term frequency (tf_i,j) and the inverse document frequency (idf_i). Here, the term frequency (tf_i,j) is obtained by dividing the number of times (n_ij) that a specific word (or noun) appears in document (d_j) by the total number of words in document (d_j) (Σ_k n_kj), and the inverse document frequency (idf_i) is obtained by taking the logarithm of the total number of documents (│D│) divided by the number of documents in which the specific word appears (│{d_j:t_i∈d_j}│).
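The weighting described above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `tfidf_vectors`, the token-list input format, and the sample documents are assumptions made for the example, and the natural logarithm is assumed because the text does not state the base.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, with weight = tf * idf.

    documents: list of token lists (hypothetical input format).
    """
    counts = [Counter(doc) for doc in documents]
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())  # total number of words in this document
        vectors.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        })
    return vectors

# Hypothetical sample data: two tiny "documents" of extracted nouns.
docs = [["hotel", "seoul", "hotel"], ["movie", "seoul"]]
vectors = tfidf_vectors(docs)
```

A term that appears in every document, such as "seoul" here, receives a weight of zero because its inverse document frequency is log(1).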

After the vector for the vocabulary included in the corpus and the vector for the vocabulary included in the vocabulary dictionary are calculated in this way, the vectors are substituted into Equation 5 below to obtain the semantic similarity between the corpus vocabulary and the dictionary vocabulary.

Here, 'V₁' is the vector for the vocabulary included in the corpus, and 'V₂' is the vector for the vocabulary included in the vocabulary dictionary. The closer the value of 'cosθ' is to '1', the more similar the corpus vocabulary and the dictionary vocabulary are. A 'cosθ' value of '0' means that the two vectors are orthogonal, indicating no similarity between them at all.
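As an illustration of the cosine measure described here, the following Python sketch computes cosθ for two sparse vectors. The dict-based vector representation and the function name `cosine_similarity` are assumptions for the example, and returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm1 * norm2)
```

Vectors that share no terms give 0 (orthogonal), and vectors that point in the same direction give a value of 1 up to floating-point rounding.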

In addition to the method described above, the semantic similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary may be obtained using the simple matching coefficient, the Jaccard coefficient, Pearson's correlation coefficient, and the like.
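Of the alternatives listed, the Jaccard coefficient is the simplest to show. The sketch below assumes each vocabulary item is represented by the set of nouns found for it, which is an assumption made for the example. The function name `jaccard` and the value returned for two empty sets are also choices made here.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)
```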

Step S111 may obtain the lexical similarity between vocabulary included in the corpus and vocabulary included in the vocabulary dictionary, based on the string distance required to convert the corpus vocabulary into the dictionary vocabulary. Here, the lexical similarity may be obtained using the Hamming distance, the edit distance, and the like.
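A minimal Python sketch of the edit-distance option follows. The Levenshtein recurrence is standard, but converting the distance into a similarity by dividing by the longer string's length is an assumption made here; the text only says that the similarity is based on the distance. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def lexical_similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

The Hamming distance could be substituted when the two strings always have equal length, since it counts only substitutions.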

Step S112 may obtain the entity name candidate group based on the semantic similarity obtained in step S110 and the lexical similarity obtained in step S111. As shown in Equation 6 below, different weights are given to the semantic similarity and the lexical similarity, an entity name score is calculated from them, and the vocabulary whose entity name score is equal to or greater than a predetermined threshold is selected as the entity name candidate group. Here, 'A' is the weight for the semantic similarity, and 'B' is the weight for the lexical similarity.
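The selection in step S112 can be sketched as below. Equation 6 itself is not reproduced in this text, so the weighted sum `a * semantic + b * lexical` is an assumed form that is merely consistent with the description of 'A' and 'B' as weights. The function name, the default weights, and the threshold value are all hypothetical.

```python
def select_candidates(similarities, a=0.7, b=0.3, threshold=0.5):
    """Keep words whose weighted similarity score reaches the threshold.

    similarities: {word: (semantic_similarity, lexical_similarity)}.
    The weighted-sum form of the score is an assumption, not Equation 6 itself.
    """
    candidates = []
    for word, (semantic, lexical) in similarities.items():
        score = a * semantic + b * lexical
        if score >= threshold:
            candidates.append(word)
    return candidates
```

For example, a word with similarities (0.9, 0.8) scores 0.87 under the default weights and is kept, while one with (0.1, 0.2) scores 0.13 and is dropped.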

Step S120 may obtain, as the entity name candidate group, vocabulary that appears at least a predetermined number of times in the corpus, based on an N-gram count. That is, vocabulary that does not appear in the vocabulary dictionary for the domain but appears at least a predetermined number of times in a corpus of alternating user utterances (U) and system utterances (S) may be obtained as the entity name candidate group.
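The counting in step S120 might look like the following Python sketch. The tokenized-utterance input, the maximum N-gram length `n`, the `min_count` parameter, and the function name are assumptions for illustration; the text only requires that a word sequence appear at least a predetermined number of times and be absent from the dictionary.

```python
from collections import Counter

def ngram_candidates(utterances, dictionary, n=2, min_count=2):
    """Return N-grams (length 1..n) seen at least min_count times and not in the dictionary.

    utterances: list of token lists covering both user and system turns.
    dictionary: set of strings already in the domain vocabulary dictionary.
    """
    counts = Counter()
    for tokens in utterances:
        for size in range(1, n + 1):
            for i in range(len(tokens) - size + 1):
                counts[" ".join(tokens[i:i + size])] += 1
    return {gram for gram, c in counts.items()
            if c >= min_count and gram not in dictionary}

# Hypothetical dialog fragment.
dialog = [["go", "to", "namsan", "tower"], ["namsan", "tower", "please"]]
found = ngram_candidates(dialog, dictionary={"go", "to", "please"})
```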

Step S121 may obtain, as the entity name candidate group, vocabulary that appears at least a predetermined number of times in a cluster classified based on LDA. That is, the corpus of alternating user utterances (U) and system utterances (S) is clustered by similar topic, and vocabulary that does not appear in the vocabulary dictionary for the domain but appears at least a predetermined number of times within a cluster may be obtained as the entity name candidate group.
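The per-cluster counting in step S121 can be sketched as follows. The LDA model itself is not shown: the sketch assumes each utterance has already been assigned a topic id by some LDA implementation, and the `topic_ids` input, the function name, and `min_count` are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_candidates(utterances, topic_ids, dictionary, min_count=2):
    """Return words seen at least min_count times within one topic cluster.

    utterances: list of token lists.
    topic_ids: one topic id per utterance, assumed to come from an LDA model.
    """
    per_topic = defaultdict(Counter)
    for tokens, topic in zip(utterances, topic_ids):
        per_topic[topic].update(tokens)
    candidates = set()
    for counts in per_topic.values():
        candidates.update(word for word, c in counts.items()
                          if c >= min_count and word not in dictionary)
    return candidates
```

Counting within a cluster rather than over the whole corpus lets a word qualify when it is frequent for one topic even if it is rare overall, provided the threshold is set per cluster.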

In the block for step S100 in FIG. 2, the steps that obtain the entity name candidate group according to similarity (S110, S111, S112) and the steps that obtain it according to repetition (S120, S121) are shown in parallel. However, the candidate group may be obtained by performing only steps S110, S111, and S112, by performing only steps S120 and S121, or by performing steps S110, S111, and S112 and then steps S120 and S121 sequentially.

In addition, although FIG. 2 shows the step of obtaining the entity name candidate group based on the N-gram count (S120) and the step of obtaining it based on LDA (S121) in sequence, the candidate group may be obtained by performing steps S120 and S121 simultaneously, by performing only step S120, or by performing only step S121.

Step S200 may tag entity names in the corpus by applying an unsupervised learning method whose constraint is at least one of the following: the probability that an entity name candidate is tagged with a specific entity name, the probability that an entity name candidate is tagged with a specific entity name according to the intent of the sentence to which the vocabulary belongs, and the number of times the same entity name appears in the sentence to which the vocabulary belongs. Here, methods such as Constraint Driven Learning, Generalized Expectation Constraints, and Posterior Regularization may be used as the unsupervised learning method.

Step S210 tags entity names in the corpus by applying an unsupervised learning method whose constraint is the probability that an entity name candidate is tagged with a specific entity name. Step S210 may tag entity names in the corpus using Equations 7 and 8 below.

Here, 'φ_wl(x,y)' is the constraint function for the probability that an entity name candidate is tagged with a specific entity name, 'X' is a word included in the corpus, 'W' is a vocabulary item containing the word (that is, an entity name candidate), 'Y' is the entity name function, and 'l' is a specific entity name. In Equation 7, the value of 'φ_wl(x,y)' is '1' if the vocabulary item 'W' is tagged with the entity name 'l', and '0' otherwise.

In addition, 'E_pθ[φ_wl(x,y)]' is the function for unsupervised learning whose constraint is the probability that an entity name candidate is tagged with a specific entity name, 'C_w' is the number of times the vocabulary item 'W' appears in the corpus, 'P_θ(y│x)' is the probability that entity name 'Y' appears given word 'X', and 'b' is a predetermined probability. In Equation 8, if the probability that the vocabulary item 'W' is tagged with the entity name 'l' is 'b', the vocabulary item 'W' is tagged with the entity name 'l'.
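One way to read the constraint of step S210 is as an average, over the 'C_w' occurrences of a vocabulary item, of the model's probability of assigning the label 'l'. The Python sketch below follows that reading. Equations 7 and 8 are not reproduced in this text, so the averaging form, the `>= b` comparison, the function names, and the posterior-dict input format are all assumptions made for illustration.

```python
def expected_tag_rate(occurrences, word, label):
    """Average model probability that `word` is tagged with `label`.

    occurrences: list of (word, {label: probability}) pairs, one per corpus
    occurrence, where the dict is a hypothetical model posterior P(y | x).
    """
    matching = [posterior for w, posterior in occurrences if w == word]
    if not matching:
        return 0.0
    # len(matching) plays the role of C_w; each term is the expectation of phi_wl.
    return sum(p.get(label, 0.0) for p in matching) / len(matching)

def satisfies_constraint(occurrences, word, label, b):
    """Assumed reading of the constraint: expected rate reaches the preset b."""
    return expected_tag_rate(occurrences, word, label) >= b
```

In a full Generalized Expectation or Posterior Regularization setup this quantity would enter the training objective; here it is only evaluated for a fixed posterior.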

Step S220 tags entity names in the corpus by applying an unsupervised learning method whose constraint is the probability that an entity name candidate is tagged with a specific entity name according to the intent of the sentence to which the vocabulary belongs. Step S220 may tag entity names in the corpus using Equations 9 and 10 below.

Here, 'φ_αl(x,y)' is the constraint function for the probability that an entity name candidate is tagged with a specific entity name according to the intent of the sentence, 'X' is a vocabulary item included in the corpus (that is, an entity name candidate), 'act(X)' is the sentence containing 'X', 'α' is the intent of the sentence, 'Y' is the entity name function, and 'l' is a specific entity name. In Equation 9, when the intent of the sentence containing the vocabulary item 'X' is 'α', the value of 'φ_αl(x,y)' is '1' if 'X' is tagged with 'l', and '0' otherwise.

In addition, 'E_pθ[φ_αl(x,y)]' is the function for unsupervised learning whose constraint is the probability that an entity name candidate is tagged with a specific entity name according to the intent of the sentence, 'C_α' is the number of sentences in the corpus whose intent is 'α', 'P_θ(y│x)' is the probability that entity name 'Y' appears given vocabulary item 'X', and 'b' is a predetermined probability. In Equation 10, if the probability that the vocabulary item 'X' is tagged with the entity name 'l' is 'b', the vocabulary item 'X' is tagged with the entity name 'l'.
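The intent-conditioned constraint of step S220 can be illustrated in the same spirit. Equations 9 and 10 are not reproduced in this text, so the sketch assumes the expectation is the total model probability of label 'l' over candidates in sentences with intent 'α', divided by 'C_α'. The input format, the intent labels, and the function name are hypothetical.

```python
def expected_label_per_intent(sentences, intent, label):
    """Expected number of `label` tags per sentence that has the given intent.

    sentences: list of (intent, [posterior dict per candidate in the sentence]),
    where each posterior dict is a hypothetical model output P(y | x).
    """
    selected = [posteriors for act, posteriors in sentences if act == intent]
    if not selected:
        return 0.0
    expected = sum(p.get(label, 0.0) for posteriors in selected for p in posteriors)
    return expected / len(selected)  # len(selected) plays the role of C_alpha
```

A constraint of this kind would let the learner encode, for instance, that sentences asking for directions usually contain a location name, without labeling any individual sentence.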

Step S230 tags entity names in the corpus by applying an unsupervised learning method whose constraint is the number of times the same entity name appears in the sentence to which the vocabulary belongs. Step S230 may tag entity names in the corpus using Equations 11 and 12 below.

Here, 'φ_l(x,y)' is the constraint function for the number of times the same entity name appears in a sentence, 'X' is a sentence included in the corpus, 'N' is the number of words in sentence 'X', 'Y_i' is the entity name tagged to the 'i'-th word of sentence 'X', and 'l' is a specific entity name. In Equation 11, 'φ_l(x,y)' represents the number of times the entity name 'l' is tagged in sentence 'X'.

In addition, 'E_q[φ_l(x,y)]' is the function for unsupervised learning whose constraint is the number of times the same entity name appears in a sentence. Equation 12 tags entity names so that a specific entity name 'l' does not appear more than once in a single sentence.
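The count feature and the at-most-once condition can be sketched as below. The per-word tag list follows the description of 'Y_i'. The function names are hypothetical, as is the expected count under a per-word distribution 'q', which is an assumed reading of 'E_q' since Equation 12 is not reproduced in this text.

```python
def label_count(tags, label):
    """phi_l: number of words in one sentence tagged with `label`."""
    return sum(1 for tag in tags if tag == label)

def violates_once_per_sentence(tags, label):
    """True when `label` is tagged more than once in the sentence."""
    return label_count(tags, label) > 1

def expected_label_count(word_distributions, label):
    """Assumed E_q[phi_l]: sum over words of q(Y_i = label)."""
    return sum(q.get(label, 0.0) for q in word_distributions)
```

With one tag per word, a multi-word entity would need to be merged into a single unit before counting, otherwise it would trip the constraint on its own.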

In step S200 of FIG. 2, the step of tagging entity names according to the probability of being tagged (S210), the step of tagging entity names according to sentence intent (S220), and the step of tagging entity names according to the number of entity name occurrences (S230) are shown in parallel. However, entity names may be tagged in the corpus by performing only step S210, only step S220, or only step S230, or by performing steps S210, S220, and S230 sequentially.

Step S300 updates the vocabulary dictionary for the predetermined domain. Of the entity name candidates obtained in step S100, those that were tagged with an entity name in step S200 may be added to the vocabulary dictionary for the predetermined domain.
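The dictionary update of step S300 amounts to adding the tagged candidates to the domain dictionary. The sketch below assumes the dictionary maps each entity name to a set of vocabulary items, which is a representation chosen for the example. The function name and the sample labels are hypothetical.

```python
def update_dictionary(dictionary, candidates, tagged):
    """Return a copy of the dictionary extended with tagged candidates.

    dictionary: {entity name: set of vocabulary items} (assumed layout).
    candidates: vocabulary items obtained in step S100.
    tagged: {vocabulary item: entity name} produced by step S200.
    """
    updated = {label: set(words) for label, words in dictionary.items()}
    for candidate in candidates:
        label = tagged.get(candidate)
        if label is not None:  # untagged candidates are left out
            updated.setdefault(label, set()).add(candidate)
    return updated
```

Because the updated dictionary feeds back into step S100, later passes can match new vocabulary against entries learned in earlier ones.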

The entity name tagging method according to the present invention has been described in detail above. The entity name tagging apparatus according to the present invention is described in detail below.

FIG. 3 is a block diagram illustrating an entity name tagging apparatus according to the present invention.

Referring to FIG. 3, the entity name tagging apparatus includes an acquisition unit (21) that obtains an entity name candidate group from vocabulary included in the corpus, based on the corpus and a vocabulary dictionary for a predetermined domain, and a tagging unit (22) that applies an unsupervised learning method having a predetermined constraint to the candidate group and tags entity names in the corpus according to the result. The entity name tagging apparatus may be located within the semantic analysis unit (20).

Although the acquisition unit (21) and the tagging unit (22) are described in the present invention as separate parts, they may be implemented in a single form, as one physical device, or as one module. Moreover, the acquisition unit (21) and the tagging unit (22) may each be implemented as a plurality of physical devices or groups rather than as a single physical device or group.

The acquisition unit (21) may obtain the entity name candidate group according to the similarity between vocabulary included in the corpus and vocabulary included in the vocabulary dictionary, or according to how vocabulary included in the corpus is repeated.

When the entity name candidate group is obtained according to the similarity between vocabulary included in the corpus and vocabulary included in the vocabulary dictionary, the acquisition unit may perform step S110 described above to obtain the semantic similarity between the corpus vocabulary and the dictionary vocabulary, perform step S111 to obtain their lexical similarity, and perform step S112 to obtain the entity name candidate group based on the semantic similarity and the lexical similarity.

When the entity name candidate group is obtained according to how vocabulary included in the corpus is repeated, the acquisition unit may perform step S120 described above to obtain the candidate group based on the N-gram count, and may perform step S121 to obtain the candidate group based on LDA.

The tagging unit (22) may tag entity names in the corpus by applying an unsupervised learning method whose constraint is at least one of the following: the probability that an entity name candidate is tagged with a specific entity name, the probability that an entity name candidate is tagged with a specific entity name according to the intent of the sentence to which the vocabulary belongs, and the number of times the same entity name appears in the sentence to which the vocabulary belongs. Here, methods such as Constraint Driven Learning, Generalized Expectation Constraints, and Posterior Regularization may be used as the unsupervised learning method.

When entity names are tagged in the corpus by applying an unsupervised learning method whose constraint is the probability that an entity name candidate is tagged with a specific entity name, the tagging unit may perform step S210 described above, using Equations 7 and 8.

When entity names are tagged in the corpus by applying an unsupervised learning method whose constraint is the probability that an entity name candidate is tagged with a specific entity name according to the intent of the sentence to which the vocabulary belongs, the tagging unit may perform step S220 described above, using Equations 9 and 10.

When entity names are tagged in the corpus by applying an unsupervised learning method whose constraint is the number of times the same entity name appears in the sentence to which the vocabulary belongs, the tagging unit may perform step S230 described above, using Equations 11 and 12.

Although the invention has been described with reference to the embodiments above, those skilled in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

1: conversation system
10: speech recognition unit
20: semantic analysis unit
21: acquisition unit
22: tagging unit
30: response determination unit
40: response generation unit
50: voice conversion unit

Claims

A method of tagging entity names in vocabulary included in a conversation corpus, performed by an entity name tagging apparatus, the method comprising:
obtaining an entity name candidate group from the vocabulary included in the corpus based on a vocabulary dictionary for a predetermined domain; and
applying an unsupervised learning method having a constraint to the entity name candidate group, and tagging entity names to the entity name candidate group according to the result.

The method of claim 1, further comprising:
including the entity name candidate group tagged with entity names in the vocabulary dictionary for the predetermined domain.

The method of claim 1, wherein the obtaining of the entity name candidate group from the vocabulary included in the corpus based on the vocabulary dictionary for the predetermined domain comprises:
obtaining the entity name candidate group according to the similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary.

The method of claim 3, wherein the obtaining of the entity name candidate group from the vocabulary included in the corpus based on the vocabulary dictionary for the predetermined domain comprises:
obtaining a semantic similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary based on a vector for the vocabulary included in the corpus and a vector for the vocabulary included in the vocabulary dictionary;
obtaining a lexical similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary based on the string distance required to convert the vocabulary included in the corpus into the vocabulary included in the vocabulary dictionary; and
obtaining the entity name candidate group based on the semantic similarity and the lexical similarity.

The method of claim 4, wherein the obtaining of the semantic similarity comprises:
obtaining the semantic similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary based on a vector weighted by the number of times the vocabulary appears in a specific document.

The method of claim 4, wherein the obtaining of the entity name candidate group based on the semantic similarity and the lexical similarity comprises:
calculating entity name scores based on the semantic similarity and the lexical similarity, and obtaining, as the entity name candidate group, the vocabulary corresponding to entity name scores that are equal to or greater than a predetermined threshold among the calculated scores.

The method of claim 1, wherein the obtaining of the entity name candidate group from the vocabulary included in the corpus based on the vocabulary dictionary for the predetermined domain comprises:
obtaining the entity name candidate group according to how the vocabulary included in the corpus is repeated.

The method of claim 7, wherein the obtaining of the entity name candidate group from the vocabulary included in the corpus based on the vocabulary dictionary for the predetermined domain comprises:
obtaining, as the entity name candidate group, vocabulary that appears at least a predetermined number of times in the corpus based on an N-gram count.

The method of claim 7, wherein the obtaining of the entity name candidate group from the vocabulary included in the corpus based on the vocabulary dictionary for the predetermined domain comprises:
obtaining, as the entity name candidate group, vocabulary that appears at least a predetermined number of times in a cluster classified based on Latent Dirichlet Allocation (LDA).

The method of claim 1, wherein the applying of the unsupervised learning method having a constraint to the entity name candidate group and the tagging of entity names to the entity name candidate group according to the result comprises:
applying an unsupervised learning method whose constraint is at least one of the probability that the entity name candidate group is tagged with a specific entity name, the probability that the entity name candidate group is tagged with a specific entity name according to the intent of the sentence to which the vocabulary included in the corpus belongs, and the number of times the same entity name appears in the sentence to which the vocabulary included in the corpus belongs.

An apparatus for tagging entity names in vocabulary included in a conversation corpus, the apparatus comprising:
an acquisition unit that obtains an entity name candidate group from the vocabulary included in the corpus based on a vocabulary dictionary for a predetermined domain; and
a tagging unit that applies an unsupervised learning method having a constraint to the entity name candidate group and tags entity names to the entity name candidate group according to the result.

The apparatus of claim 11, wherein the acquisition unit
obtains the entity name candidate group according to the similarity between the vocabulary included in the corpus and the vocabulary included in the vocabulary dictionary, or obtains the entity name candidate group according to how the vocabulary included in the corpus is repeated.

The apparatus of claim 11, wherein the tagging unit
applies an unsupervised learning method whose constraint is at least one of the probability that the entity name candidate group is tagged with a specific entity name, the probability that the entity name candidate group is tagged with a specific entity name according to the intent of the sentence to which the vocabulary included in the corpus belongs, and the number of times the same entity name appears in the sentence to which the vocabulary included in the corpus belongs.