KR20230082457A

KR20230082457A - Unsupervised domain ontology extraction apparatus and method

Info

Publication number: KR20230082457A
Application number: KR1020210170392A
Authority: KR
Inventors: 노윤형; 권오욱
Original assignee: 한국전자통신연구원
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2023-06-08

Abstract

The present invention relates to an unsupervised domain ontology extraction apparatus and extraction method for automatically extracting, from a domain corpus, ontology information consisting of terms and the relations between them.

Description

Unsupervised domain ontology extraction apparatus and extraction method {UNSUPERVISED DOMAIN ONTOLOGY EXTRACTION APPARATUS AND METHOD}

The present invention relates to an unsupervised domain ontology extraction apparatus and extraction method, and more particularly, to an unsupervised domain ontology extraction apparatus and extraction method for automatically extracting, from a domain corpus, ontology information consisting of terms and the relations between them.

Korean Patent Publication No. 10-2019-0147203

To solve the above problems, the present invention aims to provide an unsupervised domain ontology extraction apparatus and extraction method for automatically extracting, from a domain corpus, ontology information consisting of terms and the relations between them.

An unsupervised domain ontology extraction apparatus according to an embodiment of the present invention may include: a pre-processing unit that receives a domain corpus and performs morphological analysis, dependency parsing, and named entity recognition; a term extraction unit that extracts domain term candidates from a sentence; a syntactic triple extraction unit that extracts syntactic triples based on the dependency parsing result; a clustering unit that generates non-entity and named-entity syntactic patterns from the extracted syntactic triples; a related keyword generation unit that determines, for the generated named-entity syntactic patterns, the association for each named entity tag and generates the related keywords showing the highest association; and a class generation unit that generates a class consisting of a keyword and its subordinate terms.

In an embodiment, the named entity types recognized by the pre-processing unit may include date, time, place, person name, amount of money, telephone number, number of people, and quantity.

In an embodiment, when a named entity is recognized in the input domain corpus, the pre-processing unit may replace the recognized named entity with its named entity tag.

In an embodiment, the term extraction unit may use a domain term extraction method and a keyword extraction method.

In an embodiment, the clustering unit may filter the generated syntactic patterns to produce primary syntactic patterns, and may merge those primary syntactic patterns that have similar vocabulary to produce secondary syntactic patterns.

In an embodiment, the clustering unit may include: a primary syntactic pattern generation unit that generates primary syntactic patterns from the extracted syntactic triples; a filtering unit that removes primary syntactic patterns that do not satisfy a preset condition; and a secondary syntactic pattern generation unit that merges primary syntactic patterns that have the same keyword and similar sets of sub-instances.

In an embodiment, the related keyword generation unit may determine the association using the frequency of co-occurrence with a specific named entity in the named-entity syntactic patterns, or using pointwise mutual information (PMI).

An unsupervised domain ontology extraction apparatus according to another embodiment of the present invention may include: a pre-processing unit that receives a domain corpus, performs morphological analysis, dependency parsing, and recognition of named entities including date, time, place, person name, amount of money, telephone number, number of people, and quantity, and, when a named entity is recognized in the input domain corpus, replaces the recognized named entity with its named entity tag; a term extraction unit that extracts domain term candidates from a sentence using a domain term extraction method and a keyword extraction method; a syntactic triple extraction unit that extracts syntactic triples based on the dependency parsing result; a clustering unit that generates non-entity and named-entity syntactic patterns from the extracted syntactic triples; a related keyword generation unit that determines, for the generated named-entity syntactic patterns, the association for each named entity tag and generates the related keywords showing the highest association; and a class generation unit that generates a class consisting of a keyword and its subordinate terms.

In another embodiment, the clustering unit may include: a primary syntactic pattern generation unit that generates primary syntactic patterns from the extracted syntactic triples; a filtering unit that removes primary syntactic patterns that do not satisfy a preset condition; and a secondary syntactic pattern generation unit that merges primary syntactic patterns that have the same keyword and similar sets of sub-instances.

An unsupervised domain ontology extraction method according to an embodiment of the present invention may include: receiving, by a pre-processing unit, a domain corpus and performing morphological analysis, dependency parsing, and named entity recognition; extracting, by a term extraction unit, domain term candidates from a sentence; extracting, by a syntactic triple extraction unit, syntactic triples based on the dependency parsing result; generating, by a clustering unit, non-entity and named-entity syntactic patterns from the extracted syntactic triples; determining, by a related keyword generation unit, the association for each named entity tag with respect to the generated named-entity syntactic patterns and generating the related keywords showing the highest association; and generating, by a class generation unit, a class consisting of a keyword and its subordinate terms.

In an embodiment, the step in which the pre-processing unit receives the domain corpus and performs morphological analysis, dependency parsing, and named entity recognition may include, when a named entity is recognized in the input domain corpus, replacing, by the pre-processing unit, the recognized named entity with its named entity tag.

In an embodiment, the step in which the term extraction unit extracts domain term candidates from a sentence may include extracting the domain term candidates using a domain term extraction method and a keyword extraction method.

In an embodiment, the step in which the clustering unit generates non-entity and named-entity syntactic patterns from the extracted syntactic triples may include filtering, by the clustering unit, the generated syntactic patterns to produce primary syntactic patterns, and merging those primary syntactic patterns that have similar vocabulary to produce secondary syntactic patterns.

In an embodiment, the step in which the related keyword generation unit determines the association for each named entity tag and generates the related keywords showing the highest association may include determining, by the related keyword generation unit, the association using the frequency of co-occurrence with a specific named entity in the named-entity syntactic patterns, or using pointwise mutual information (PMI).

According to one aspect of the present invention, concepts and their related subordinate terms can be extracted from a domain corpus in an unsupervised manner, which has the advantage of making it easier for a person to build domain knowledge.

In addition, according to one aspect of the present invention, slot tagging can be performed on a dialogue corpus in an unsupervised manner using the constructed information, and this knowledge has the advantage that it can be used to build dialogue systems and the like.

FIG. 1 is a diagram schematically showing the configuration of an unsupervised domain ontology extraction apparatus 100 according to the present invention.
FIG. 2 is a diagram showing an example of a dependency parse tree.
FIG. 3 is a diagram showing the process by which the clustering unit clusters syntactic triples.
FIG. 4 is a diagram showing an example of a complete domain ontology extraction result.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that a person having ordinary skill in the art to which the present invention belongs is fully informed of the scope of the invention; the present invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the context specifically states otherwise. As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

FIG. 1 is a diagram schematically showing the configuration of an unsupervised domain ontology extraction apparatus 100 according to the present invention.

Referring to FIG. 1, the unsupervised domain ontology extraction apparatus 100 according to an embodiment of the present invention includes a pre-processing unit 110, a term extraction unit 120, a syntactic triple extraction unit 130, a clustering unit 140, a related keyword generation unit 150, and a class generation unit 160.

The pre-processing unit 110 takes a domain corpus as input and performs morphological analysis, dependency parsing, and named entity recognition. The named entity types recognized by the pre-processing unit 110 may include date, time, poi (place), person (person name), price (amount of money), phone.number (telephone number), person.count (number of people), qt.count (quantity), and the like. In addition, when a named entity is recognized in the input domain corpus, the pre-processing unit 110 may replace the recognized named entity with its named entity tag.

The term extraction unit 120 extracts domain term candidates from a sentence. Commonly used domain term extraction and keyword extraction methods may be used for this purpose.

The syntactic triple extraction unit 130 extracts syntactic triples based on the dependency parsing result. More specifically, the syntactic triple extraction unit 130 extracts syntactic triples of the form (a, rel, b) from the dependency parsing result. Here, a is a dependent in the dependency tree, rel is the dependency relation path, and b is a head or a dependent. A syntactic triple may cover not only the parent relation, which is a first-order dependency, but also the grandparent relation and the sibling relation, which are second-order dependencies. This is described in more detail with reference to FIG. 2.

FIG. 2 is a diagram showing an example of a dependency parse tree.

FIG. 2 shows the dependency parse tree for "창가에 자리 있나요?" ("Is there a seat by the window?"), where '창가' means window side, '자리' means seat, and '있' is the stem of the verb "to exist". Here, ('창가', 'vmod', '있') and ('자리', 'subj', '있') are first-order dependency triples, and ('창가', 'vmod:subj', '자리') is a second-order dependency triple.

The syntactic triple extraction unit 130 may subdivide the dependency relations further as needed. For example, for the adverbial relation (vmod), if a function word is present within the eojeol (space-delimited word unit), the function word may be shown in addition. In the example above, ('창가', 'vmod', '있') is then written as ('창가', 'vmod/에', '있'). Because syntactic triples are processed only for words that were extracted as terms, the only triple actually extracted in the example above is ('창가', 'vmod:subj', '자리').
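
By way of illustration only, the extraction of first-order (parent) and second-order (sibling) triples described above might be sketched as follows. The token representation, the function name extract_triples, and the 'root' label are assumptions introduced here and are not part of the specification. Grandparent triples are omitted because the specification does not show how their relation path is labeled.

```python
from itertools import combinations
from typing import List, Set, Tuple

Token = Tuple[str, int, str]   # (word, head index or -1 for the root, relation to head)
Triple = Tuple[str, str, str]  # (a, rel, b)


def extract_triples(tokens: List[Token], terms: Set[str]) -> List[Triple]:
    """Collect parent and sibling triples, keeping only those between terms."""
    triples: List[Triple] = []
    # First-order: each dependent paired with its head.
    for word, head, rel in tokens:
        if head >= 0:
            triples.append((word, rel, tokens[head][0]))
    # Second-order (sibling): two dependents of the same head, with the
    # relation path written as "rel_a:rel_b" as in the FIG. 2 example.
    for (w1, h1, r1), (w2, h2, r2) in combinations(tokens, 2):
        if h1 == h2 and h1 >= 0:
            triples.append((w1, f"{r1}:{r2}", w2))
    # Only words extracted as terms take part in a triple.
    return [(a, rel, b) for a, rel, b in triples if a in terms and b in terms]


# Parse of "창가에 자리 있나요?"; the 'root' label is a hypothetical placeholder.
PARSE = [("창가", 2, "vmod"), ("자리", 2, "subj"), ("있", -1, "root")]
print(extract_triples(PARSE, {"창가", "자리"}))
# prints [('창가', 'vmod:subj', '자리')]
```

With '있' added to the term set, the two first-order triples from FIG. 2 would also be returned.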

Returning to FIG. 1, the clustering unit 140 generates non-entity and named-entity syntactic patterns from the extracted syntactic triples. This is described in more detail with reference to FIG. 3.

FIG. 3 is a diagram showing the process by which the clustering unit 140 clusters syntactic triples.

Referring to FIG. 3, the clustering unit 140 may filter the generated syntactic patterns to produce primary syntactic patterns, and may merge those primary syntactic patterns that have similar vocabulary to produce secondary syntactic patterns. To this end, the clustering unit 140 may include a primary syntactic pattern generation unit 141 that generates primary syntactic patterns from the extracted syntactic triples, a filtering unit 142 that removes primary syntactic patterns that do not satisfy a preset condition, and a secondary syntactic pattern generation unit 143 that merges primary syntactic patterns that have the same keyword and similar sets of sub-instances.

The primary syntactic pattern generation unit 141 generates syntactic patterns from the syntactic triples. A syntactic pattern is generated by replacing a or b in a syntactic triple (a, rel, b) with "*". For ('시간', 'obj', '변경') and ('인원', 'obj', '변경'), where '시간' means time, '인원' means number of people, and '변경' means change, the following syntactic pattern can be generated.

(*, 'obj', '변경'): {"시간", "인원"}

In the pattern above, '변경', the term that was not replaced with "*", becomes the keyword, and the terms that were replaced become the sub-instances.
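
As a non-limiting sketch, the wildcard substitution described above could be written as follows. The function name build_primary_patterns and the dictionary layout (pattern mapped to its set of sub-instances) are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (a, rel, b)


def build_primary_patterns(triples: Iterable[Triple]) -> Dict[Triple, Set[str]]:
    """Map each wildcard pattern to the set of terms the wildcard replaced."""
    patterns: Dict[Triple, Set[str]] = defaultdict(set)
    for a, rel, b in triples:
        patterns[("*", rel, b)].add(a)  # b is the keyword, a is a sub-instance
        patterns[(a, rel, "*")].add(b)  # a is the keyword, b is a sub-instance
    return dict(patterns)


TRIPLES = [("시간", "obj", "변경"), ("인원", "obj", "변경")]
PATTERNS = build_primary_patterns(TRIPLES)
print(PATTERNS[("*", "obj", "변경")] == {"시간", "인원"})  # prints True
```

Patterns with a single sub-instance, such as ('시간', 'obj', '*'), are also produced by this sketch and would be candidates for removal in the subsequent filtering.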

The filtering unit 142 removes syntactic patterns that do not satisfy the following condition: within a syntactic pattern, a semantic hypernym-hyponym relation must hold between the keyword and the sub-instances, or the association between them must be high, or the sub-instances must be semantically similar to one another.

For the hypernym-hyponym relation, the average hypernym score must be at or above a threshold; for association, the average PMI between the keyword and the instances must be at or above a threshold; and for semantic similarity, the variance of the word embedding vectors of the pattern's instances must be within a threshold.

The hypernym score can be obtained using a hypernym classifier. The hypernym classifier can be built by training a classifier on training data consisting of (hyponym, hypernym) pairs and (hyponym, random word) pairs drawn from a hypernym dictionary. Because the final output node of the classifier gives a probability distribution over the binary classification, the output value for the hypernym case can be used as the score.

Filtering is applied not only to the syntactic pattern as a whole but also, by the same criteria, to the individual instances within a syntactic pattern.
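
The filtering criteria above might be sketched as follows, purely as an illustration. The threshold values, the Scores container, and the reading of "variance" as the mean squared distance to the centroid are all assumptions made here; the specification leaves these unspecified. The scores themselves are assumed to be computed elsewhere (for example by the hypernym classifier).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Scores:
    hyper: float                 # average hypernym score
    pmi: float                   # average PMI between keyword and instances
    var: Optional[float] = None  # embedding variance (whole patterns only)


def passes(s: Scores, hyper_min: float = 0.5, pmi_min: float = 1.0,
           var_max: float = 0.1) -> bool:
    """Keep an item if any one criterion is met (thresholds are hypothetical)."""
    if s.hyper >= hyper_min or s.pmi >= pmi_min:
        return True
    return s.var is not None and s.var <= var_max


def embedding_variance(vectors: Sequence[Sequence[float]]) -> float:
    """Mean squared distance of the vectors from their centroid."""
    n, dim = len(vectors), len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(
        sum((v[i] - centroid[i]) ** 2 for i in range(dim)) for v in vectors
    ) / n


print(passes(Scores(hyper=0.1, pmi=2.0)))  # prints True (kept on PMI alone)
```

The same passes function can be applied to an individual instance by leaving var unset, which mirrors the instance-level filtering described above.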

For named entities, the named entities that share the same named entity tag form one class, and each syntactic triple containing a named entity tag is itself one syntactic pattern.

The secondary syntactic pattern generation unit 143 merges primary syntactic patterns whose keywords are similar and whose sets of sub-instances are semantically similar. Keywords are judged similar when their semantic similarity is at or above a certain threshold. Two sets of sub-instances are judged similar when they have a non-empty intersection, or when there is a pair of words, one from each set, whose semantic similarity is at or above a certain threshold. When two syntactic patterns are merged, the rel part of (a, rel, "*") is in turn replaced with "*", giving the form (a, "*", "*").
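
One possible sketch of this merging step is given below. It is a greedy single pass, so the result can depend on the order of the patterns; the specification does not state which merging strategy is used. The function names, the similar callback (standing in for a thresholded semantic similarity), and the per-pattern instance sets in the example are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Pattern = Tuple[str, str, str]
Similar = Callable[[str, str], bool]


def sets_similar(s1: Set[str], s2: Set[str], similar: Similar) -> bool:
    """True if the sets intersect or contain a similar cross-set word pair."""
    if s1 & s2:
        return True
    return any(similar(x, y) for x in s1 for y in s2)


def merge_patterns(patterns: Dict[Pattern, Set[str]],
                   similar: Similar) -> List[Tuple[Pattern, Set[str]]]:
    """Greedily merge primary patterns into (keyword, '*', '*')-style groups."""
    groups: List[Tuple[Pattern, Set[str]]] = []
    for (a, rel, b), instances in patterns.items():
        wildcard_first = a == "*"
        keyword = b if wildcard_first else a
        key = ("*", "*", b) if wildcard_first else (a, "*", "*")
        for i, (gkey, ginst) in enumerate(groups):
            g_first = gkey[0] == "*"
            gkeyword = gkey[2] if g_first else gkey[0]
            if (g_first == wildcard_first
                    and (gkeyword == keyword or similar(gkeyword, keyword))
                    and sets_similar(ginst, instances, similar)):
                groups[i] = (gkey, ginst | instances)
                break
        else:
            groups.append((key, set(instances)))
    return groups


# Hypothetical primary patterns; exact match stands in for semantic similarity.
PRIMARY = {
    ("*", "vmod/에", "자리"): {"창가", "룸"},
    ("*", "mod", "자리"): {"창가", "홀", "일반"},
    ("*", "obj", "변경"): {"시간", "인원"},
    ("*", "vmod", "변경"): {"시간", "날짜"},
}
MERGED = merge_patterns(PRIMARY, lambda x, y: x == y)
```

Here MERGED holds two groups, one keyed by ('*', '*', '자리') and one by ('*', '*', '변경').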

Returning to FIG. 1, the related keyword generation unit 150 determines, for the generated named-entity syntactic patterns, the association for each named entity tag and generates the related keywords showing the highest association. The degree of association can be obtained from the frequency of co-occurrence with a specific named entity in the syntactic patterns, from pointwise mutual information (PMI), or the like.

These related keywords are generated because ambiguity can arise when named entities are treated as a single class, and the keywords help to resolve it. For example, in "인천에서 뉴욕으로 가는 비행기 있나요?" ("Is there a flight from Incheon to New York?"), both 'Incheon' and 'New York' carry the named entity tag for a city, but 'Incheon' is the departure city and 'New York' is the arrival city, and it may be necessary to distinguish the two. Accordingly, in the present invention, the related keyword generation unit 150 generates related keywords based on the association for each named entity tag with respect to the syntactic patterns, thereby providing information with which such a choice can be made.
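
As an illustrative sketch only, PMI-based ranking of keywords for a named entity tag could look like the following, using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function names and the co-occurrence pairs are hypothetical; in the apparatus the pairs would be read from the named-entity syntactic patterns.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (named entity tag, co-occurring keyword)


def pmi_table(pairs: List[Pair]) -> Dict[Pair, float]:
    """Compute PMI for every observed (tag, keyword) pair."""
    total = len(pairs)
    pair_counts = Counter(pairs)
    tag_counts = Counter(tag for tag, _ in pairs)
    kw_counts = Counter(kw for _, kw in pairs)
    return {
        (tag, kw): math.log2(
            (count / total)
            / ((tag_counts[tag] / total) * (kw_counts[kw] / total))
        )
        for (tag, kw), count in pair_counts.items()
    }


def top_keywords(pairs: List[Pair], tag: str, n: int = 3) -> List[str]:
    """Return up to n keywords with the highest PMI for the given tag."""
    scores = pmi_table(pairs)
    ranked = sorted(
        ((score, kw) for (t, kw), score in scores.items() if t == tag),
        reverse=True,
    )
    return [kw for _, kw in ranked[:n]]


# Hypothetical co-occurrence observations.
PAIRS = [
    ("person.count", "인원"), ("person.count", "인원"),
    ("person.count", "자리"), ("time", "변경"), ("time", "자리"),
]
print(top_keywords(PAIRS, "person.count", 1))  # prints ['인원']
```

Raw co-occurrence frequency, the other measure named above, would simply rank by pair count instead of by PMI.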

The class generation unit 160 generates a class consisting of a keyword and its subordinate terms when a non-entity is present in the process of generating related keywords through the related keyword generation unit. In this case, the keyword may become the class name, the subordinate terms may become the instances, and the relation may be 'type_of'. For a named entity, the named entity tag becomes the class name, the related keywords become the subordinate terms, and the relation is 'keyword_of'.
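
The two kinds of class described above might be represented as relation triples, as in the following sketch. The function name build_relations and the (term, relation, class name) layout are assumptions for illustration; only the relation names type_of and keyword_of come from the description.

```python
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str, str]  # (subordinate term, relation, class name)


def build_relations(term_classes: Dict[str, Set[str]],
                    entity_keywords: Dict[str, List[str]]) -> List[Relation]:
    """Emit type_of relations for term classes and keyword_of for entity tags."""
    relations: List[Relation] = []
    # Non-entity classes: the keyword is the class name.
    for keyword, instances in term_classes.items():
        for instance in sorted(instances):
            relations.append((instance, "type_of", keyword))
    # Named-entity classes: the tag is the class name.
    for tag, keywords in entity_keywords.items():
        for kw in keywords:
            relations.append((kw, "keyword_of", tag))
    return relations


RELATIONS = build_relations(
    {"변경": {"시간", "인원", "날짜"}},
    {"person.count": ["인원", "자리"]},
)
```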

Hereinafter, an example of the actual operation of the unsupervised domain ontology extraction apparatus 100 according to the present invention is described using example sentences.

<Example sentences>

For the example sentences above, the pre-processing unit 110 performs morphological analysis and named entity recognition as follows.

Here, "<poi=놀부/NNG 옛날/NNG 통/XPN 닭/NNG>" means that "놀부옛날통닭" has been tagged with the named entity tag poi.

The following is an example of a dependency parsing result.

In the example above, the information for each eojeol is written as "eojeol:n (head eojeol index, dependency relation)". That is, "이번:0 (1, 'mod') 주:1 (2, 'mod')" indicates that the head of the 0th eojeol "이번" is the 1st eojeol "주" and that the dependency relation between them is "mod".

Next, in the morphological analysis and dependency parsing results, named entities are replaced with named entity tags as follows.

- Example of morphological analysis (roughly, "Please change it to between 6:00 and 6:20."):

<time=6/SN 시/NNB> 부터/JX <time=6/SN 시/NNB 20/SN 분/NNB> 사이/NNG 로/JKB 바꾸/VV 어/EC 주/VX 세요/EC ./SF

>> time 부터/JX time 사이/NNG 로/JKB 바꾸/VV 어/EC 주/VX 세요/EC ./SF
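
The substitution shown in this morphological analysis example can be reproduced with a short sketch such as the following. The regular expression and the function name replace_entities_with_tags are assumptions made here; the specification does not disclose how the pre-processing unit performs the replacement.

```python
import re

# Matches a bracketed entity such as "<time=6/SN 시/NNB>" and captures the tag.
# Tag names may contain dots (e.g. person.count), hence the character class.
TAGGED_ENTITY = re.compile(r"<([A-Za-z_.]+)=[^>]*>")


def replace_entities_with_tags(tagged: str) -> str:
    """Replace each bracketed named entity with its bare tag name."""
    return TAGGED_ENTITY.sub(r"\1", tagged)


ANALYZED = ("<time=6/SN 시/NNB> 부터/JX <time=6/SN 시/NNB 20/SN 분/NNB> "
            "사이/NNG 로/JKB 바꾸/VV 어/EC 주/VX 세요/EC ./SF")
print(replace_entities_with_tags(ANALYZED))
# prints: time 부터/JX time 사이/NNG 로/JKB 바꾸/VV 어/EC 주/VX 세요/EC ./SF
```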

- Example of dependency parsing result (roughly, "This is Jeon Gil-kwon, who made a reservation for 7 o'clock this Saturday."):

이번:0 (1, 'mod') 주:1 (2, 'mod') 토요일:2 (3, 'mod') 7시+에:3 (4, 'vmod') 예약+하+ㄴ:4 (5, 'vp_mod') 전길권+이+ㄴ데+요:5 (6, 'vnp') .:6 (-1, 'mod')

>> date:0 (1, 'mod') date:1 (2, 'mod') date:2 (3, 'mod') time+에:3 (4, 'vmod') 예약+하+ㄴ:4 (5, 'vp_mod') person+이+ㄴ데+요:5 (6, 'vnp') .:6 (-1, 'mod')
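
For readers who wish to work with this notation, the "eojeol:n (head, 'relation')" format used in the dependency parsing example could be read into a structure as sketched below. The Eojeol type, the regular expression, and the function name are assumptions introduced for this illustration.

```python
import re
from typing import List, NamedTuple


class Eojeol(NamedTuple):
    form: str   # surface form, with '+' separating morphemes
    index: int  # position of this eojeol in the sentence
    head: int   # index of the head eojeol, or -1 if there is none
    rel: str    # dependency relation to the head


ENTRY = re.compile(r"(\S+):(\d+) \((-?\d+), '([^']+)'\)")


def parse_dependency_line(line: str) -> List[Eojeol]:
    """Parse every "form:n (head, 'rel')" entry found on the line."""
    return [
        Eojeol(form, int(index), int(head), rel)
        for form, index, head, rel in ENTRY.findall(line)
    ]


LINE = ("date:0 (1, 'mod') date:1 (2, 'mod') date:2 (3, 'mod') "
        "time+에:3 (4, 'vmod') 예약+하+ㄴ:4 (5, 'vp_mod') "
        "person+이+ㄴ데+요:5 (6, 'vnp') .:6 (-1, 'mod')")
PARSED = parse_dependency_line(LINE)
print(PARSED[3])
# prints Eojeol(form='time+에', index=3, head=4, rel='vmod')
```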

Next, the term extraction unit 120 extracts terms from among nominals and attributive adjectives using tf-idf. The nominals and attributive adjectives extracted with tf-idf, listed in order of frequency, are as follows.

예약 (reservation): 224, person.count: 98, date: 92, time: 83, 변경 (change): 70, 자리 (seat): 70, person: 61, 룸 (room): 43, 성함 (name, honorific): 35, 창가 (window side): 34, 번호 (number): 23, 시간 (time): 22, 취소 (cancellation): 19, 인원 (number of people): 13, 방 (room): 13, 홀 (hall): 11, 전화 (telephone): 11, 의자 (chair): 10, 날짜 (date): 9, 테이블 (table): 9, 이름 (name): 8, qt.count: 7, 일반 (regular): 7, 연락 (contact): 5, 완료 (completed): 4, 식당 (restaurant): 4, 휴무 (closed): 2
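
The specification does not state which tf-idf variant is used. As one hedged possibility, a corpus-level score could be computed as in the sketch below, which treats each sentence as a document and uses the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1. The function name, the smoothing choice, and the toy sentences are all assumptions; part-of-speech filtering to nominals and attributive adjectives is assumed to have been done beforehand.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(sentences: List[List[str]]) -> Dict[str, float]:
    """Score each token by corpus term frequency times smoothed idf."""
    n = len(sentences)
    tf = Counter(tok for sent in sentences for tok in sent)
    df = Counter(tok for sent in sentences for tok in set(sent))
    return {
        tok: tf[tok] * (math.log((1 + n) / (1 + df[tok])) + 1.0)
        for tok in tf
    }


# Hypothetical pre-filtered sentences (candidate terms only).
SENTENCES = [["예약", "변경"], ["예약", "자리"], ["예약", "창가", "자리"]]
SCORES = tfidf_scores(SENTENCES)
RANKED = sorted(SCORES, key=SCORES.get, reverse=True)
print(RANKED[:2])  # prints ['예약', '자리']
```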

Next is an example of the syntactic triples extracted by the syntactic triple extraction unit 130.

Next is an example of the primary syntactic patterns that the clustering unit 140 generated for the corpus, listed in order of average hypernym score.

In the example above, 'freq' is the frequency of the pattern, 'var' is the variance of the embedding vectors of the subordinate terms, 'sim' is the average semantic similarity between the subordinate terms and the keyword, 'pmi' is the average PMI between the subordinate terms and the keyword, and 'hyper' is the average hypernym score between the subordinate terms and the keyword.

When filtering is applied to these syntactic patterns, ('*', 'vmod/에', '자리'), ('*', 'vmod/에:subj', '자리'), and ('*', 'mod', '자리') are selected because their average hypernym scores are high, and ('*', 'vmod', '변경') and ('*', 'obj', '변경') are selected because, although their hypernym scores are low, their average PMI is high. ('예약', 'subj', '*') is removed because neither score is high.

In ('*', 'vmod/에:subj', '자리'), the instance '시간' has a hypernym score and a PMI value that are both too low, so only that instance is removed. In ('*', 'mod', '자리'), the instance '일반' has a low hypernym score but a high PMI value, so it is kept.

As a result, the following primary syntactic patterns are finally extracted.

Among these syntactic patterns, ('*', 'vmod/에', '자리'), ('*', 'vmod/에:subj', '자리'), and ('*', 'mod', '자리') all have the same keyword, '자리', and their sub-instances either share '창가' or include words with high semantic similarity, such as '룸' and '방' (both meaning room), so they are merged into one. ('*', 'vmod', '변경') and ('*', 'obj', '변경') are likewise merged into one.

The following two classes are then finally generated.

('*', '*', '자리'): {'홀', '창가', '룸', '일반'} (seat: hall, window side, room, regular)

('*', '*', '변경'): {'시간', '인원', '날짜'} (change: time, number of people, date)

Next is an example of the syntactic triples extracted for named entities.

From the syntactic pattern results, the related keywords that co-occur with each named entity tag, listed in order of frequency, are as follows.

By default, named entities form a single class, which can be split where necessary by combining it with related keywords. For example, in the two sentences below, person.count is used for different purposes, and the two uses need to be distinguished.

Sentence 1: 혹시 예약을 화요일 1시경 <person.count=8>명 창가로 변경할 수 있을까요? ("Could I change the reservation to Tuesday around 1 o'clock, for <person.count=8> people, by the window?")

Sentence 2: 저희가 창가는 <person.count=4>명 자리밖에 없어요. ("By the window we only have seating for <person.count=4> people.")

That is, the first person.count refers to the number of people, and the second person.count refers to the number of seats. In this case, therefore, they need to be distinguished as person.count_자리 (seats) and person.count_인원 (people).

In the sentence below as well, the two occurrences of person.count are used for different purposes.

Sentence: <person.count=6>명 예약했는데, <person.count=4>명으로 변경하려고요. ("I made a reservation for <person.count=6> people, but I would like to change it to <person.count=4> people.")

The first person.count is the number of people in the reservation, and the second person.count is the number of people to change to. For such cases it may therefore be necessary to distinguish person.count_예약 (reservation), person.count_변경 (change), and so on, and ultimately the following distinctions should be made.

person.count: person.count_예약_인원, person.count_변경_인원, person.count_자리

The keyword information according to the present invention provides the information a person needs to make these distinctions. To this end, the top N keywords are provided.

The complete domain ontology extraction result for the example discussed above may appear as shown in FIG. 4.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

100: unsupervised domain ontology extraction apparatus
110: pre-processing unit
120: term extraction unit
130: syntactic triple extraction unit
140: clustering unit
150: related keyword generation unit
160: class generation unit

Claims

An unsupervised domain ontology extraction apparatus comprising:
a pre-processing unit that receives a domain corpus and performs morphological analysis, dependency parsing, and named entity recognition;
a term extraction unit that extracts domain term candidates from a sentence;
a syntactic triple extraction unit that extracts syntactic triples based on the dependency parsing result;
a clustering unit that generates non-entity and named-entity syntactic patterns from the extracted syntactic triples;
a related keyword generation unit that determines, for the generated named-entity syntactic patterns, the association for each named entity tag and generates the related keywords showing the highest association; and
a class generation unit that generates a class consisting of a keyword and its subordinate terms.