KR100918842B1

KR100918842B1 - Apparatus for substitute restoration and method thereof

Info

Publication number: KR100918842B1
Application number: KR1020070130721A
Authority: KR
Inventors: 최미란; 이창기; 왕지현; 장명길
Original assignee: 한국전자통신연구원
Priority date: 2007-12-14
Filing date: 2007-12-14
Publication date: 2009-09-28
Also published as: KR20090063384A

Abstract

The anaphor resolution method according to the present invention comprises: an anaphor recognition step of analyzing an input string and recognizing the anaphors (substitute expressions) it contains; a semantic analysis step of analyzing the full meaning of each anaphor; a target candidate selection step of chunking the string, assigning one concept to each chunk, and comparing those concepts with the full meaning to select the target candidates the anaphor may refer to; and an anaphor restoration step of ranking the target candidates to select the referent used for restoration and restoring the anaphor with that referent. With this configuration, the method can be used effectively in applications that require understanding of natural-language sentences, and it can restore anaphors with high accuracy in long, complex sentences.

Description

Apparatus for substitute restoration and method

The present invention relates to an anaphor resolution apparatus and method, and more particularly to an apparatus and method that recognize anaphors in natural language and resolve them automatically.

Anaphors are widely used to avoid repetition in natural-language sentences and to keep the context coherent, and they occur in many forms. They have long been studied in computational linguistics, yet remain a hard problem. Anaphora is the phenomenon in which, when the same element recurs within a sentence or across several sentences, the recurring element is replaced by a pronoun, a demonstrative, a reflexive pronoun, an ellipsis, or a common noun.

As prior art for automatic anaphor resolution, Korean Patent Application No. 10-1998-0048384, "Interactive learning assistance apparatus and dialogue analysis method thereof," proposes a method of handling anaphors.

FIG. 1 is a diagram illustrating a conventional anaphor resolution method.

The conventional interactive learning assistance apparatus shown in FIG. 1 comprises a speech input device (1), a speech-to-text converter (2), a morphological analyzer (3), a syntactic parser (4), a semantic interpreter (5), a discourse analyzer (6), a dialogue manager (7), a response generator (8), an output device (9), a dictionary storage device (10), and a knowledge base storage device (11).

Because the conventional approach with this configuration simply maps keyword content words to anaphors by consulting the data in the dictionary storage device (10), it can hardly be said to use the results of detailed semantic analysis. Moreover, because it has no step for ranking the target candidates of an anaphor (that is, the items in the preceding context that the anaphor could refer to), it can be applied to simple dialogue analysis but cannot guarantee accuracy in long, complex text such as newspaper articles.

Earlier anaphor resolution techniques mainly addressed anaphors confined to a single sentence: they defined a limited number of meanings and then applied a simple mapping according to the type of anaphor. The present invention instead ranks target candidates drawn from the whole document to find the referent of an anaphor, and thereby achieves high accuracy in long, complex text such as newspaper articles.

The present invention is intended to solve the above problems. Its object is to provide an anaphor resolution apparatus and method that select target candidates through semantic analysis and named entity recognition, rank the selected candidates to choose a referent, and restore the anaphor with the chosen referent, thereby achieving high accuracy in long, complex sentences.

The anaphor resolution apparatus according to the present invention comprises: an anaphor recognition unit that receives a string and recognizes anaphors; a semantic analysis unit that analyzes the full meaning of each anaphor; a named entity recognition unit that recognizes named entities in the string; a chunking unit that receives the entity-tagged string, chunks it, and assigns one concept to each chunk; and an anaphor restoration unit that compares the full meaning of the anaphor with the concept assigned to each chunk to select the referent used for restoring the anaphor.

Preferably, the anaphor recognition unit recognizes 'pronoun' and 'demonstrative + noun' expressions in the string as anaphors.

The semantic analysis unit determines the full meaning of the anaphor on the basis of a plurality of semantic codes organized in a hierarchical semantic structure.

The chunking unit assigns one concept to each chunk on the basis of the meaning of the named entity or noun within the chunk, and the concept assigned to each chunk is one of the plurality of semantic codes having the hierarchical semantic structure.

Meanwhile, the anaphor resolution method according to the present invention comprises: an anaphor recognition step of analyzing an input string and recognizing the anaphors it contains; a semantic analysis step of analyzing the full meaning of each anaphor; a target candidate selection step of chunking the string, assigning one concept to each chunk, and comparing those concepts with the full meaning to select the target candidates the anaphor may refer to; and an anaphor restoration step of ranking the target candidates to select the referent used for restoration and restoring the anaphor with that referent.

Preferably, the anaphor recognition step performs morphological analysis on the string and recognizes 'pronoun' and 'demonstrative + noun' expressions as anaphors.

The semantic analysis step analyzes the full meaning of the anaphor using a plurality of semantic codes having a hierarchical semantic structure.

The target candidate selection step comprises: recognizing named entities in the string; and assigning one concept to each chunk on the basis of the meaning of the noun or the named entity within the chunk.

In the anaphor restoration step, a target candidate is given a higher weight in the ranking the lower (more specific) the level of the anaphor's full meaning to which its concept corresponds.

When target candidates receive the same weight, the most recently selected candidate takes priority; if they are still tied, candidates that are essential constituents of the sentence take priority.

As described above, the present invention recognizes and restores anaphors using the anaphor's full meaning, named entities, a hierarchical semantic structure, and chunking, so that natural-language analysis can be performed more accurately and efficiently. It can therefore be used effectively in applications that require understanding of natural-language sentences, and it can restore anaphors with high accuracy in long, complex sentences.

FIG. 2 is a block diagram illustrating the anaphor resolution apparatus according to the present invention.

FIG. 3 is a diagram illustrating the anaphor resolution method according to the present invention.

<Description of reference numerals for main parts of the drawings>

110: anaphor recognition unit    120: semantic analysis unit

130: named entity recognition unit    140: chunking unit

150: anaphor restoration unit

The present invention is described in detail below with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the invention, are omitted. The embodiments are provided to explain the invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

FIG. 2 is a structural diagram illustrating the anaphor resolution apparatus according to the present invention.

Referring to FIG. 2, the anaphor resolution apparatus according to the present invention comprises an anaphor recognition unit (110), a semantic analysis unit (120), a named entity recognition unit (130), a chunking unit (140), and an anaphor restoration unit (150).

The anaphor recognition unit (110) receives a string, performs morphological analysis, and tags each morpheme with its part-of-speech information. It then recognizes 'pronoun' and 'demonstrative + nominal' expressions in the string as anaphors. In the following example, "그 회사" ("that company") is recognized as an anaphor.

"Not long ago, Microsoft began selling Windows Vista, its new operating system. That company (그 회사) aims to improve on the previous operating system and offer users a more convenient product."

A demonstrative is a word used in place of previously mentioned content, to avoid unnecessary repetition when referring to it again; examples are '이' (this), '저' (that, over there), '그' (that), and '다른' (other). Demonstrative expressions occur as noun phrases, pronouns, and reflexive pronouns, and their relative frequency varies by document type. In IT documents, noun phrases are common, while pronouns and reflexive pronouns hardly occur. A morpheme is the smallest meaning-bearing unit at the morphological level of a language.

The semantic analysis unit (120) determines the full meaning of each anaphor recognized by the anaphor recognition unit (110) and tags the anaphor with that full meaning. More specifically, using semantic codes that classify all nouns into a hierarchical semantic structure of many codes (for example, about 400), it tags the anaphor with its meaning together with all of that meaning's higher-level (ancestor) meanings.

[Table 1] Examples of hierarchical semantic codes (Korean code path, with an English gloss in parentheses)

기준>규범>제도 (Criterion > Norm > Institution)
기준>규범>제도>보험 (Criterion > Norm > Institution > Insurance)
기준>규정 (Criterion > Regulation)
기준>규정>조약 (Criterion > Regulation > Treaty)
기준>조리 (Criterion > Logic)
기준>조리>이치>근거 (Criterion > Logic > Reason > Grounds)
기준>조리>이치>원리 (Criterion > Logic > Reason > Principle)
기준>지침 (Criterion > Guideline)
기준>표준 (Criterion > Standard)
기호>언어 (Sign > Language)
기호>이름 (Sign > Name)
기호>표 (Sign > Mark)
단위>세포 (Unit > Cell)
대상>목표 (Target > Goal)
대상>문제>과제 (Target > Problem > Task)
대상>문제>과제>문제 (Target > Problem > Task > Problem)
대상>물체 (Target > Physical object)
대상>물체>결정체 (Target > Physical object > Crystal)
대상>물체>입자 (Target > Physical object > Particle)
대상>물체>천체 (Target > Physical object > Celestial body)
대상>물체>투명체 (Target > Physical object > Transparent body)
대상>산물>예술 (Target > Product > Art)
대상>산물>창작물 (Target > Product > Creative work)
대상>산물>창작물>저작물>글 (Target > Product > Creative work > Authored work > Writing)
대상>상대 (Target > Counterpart)
모양>겉모양>모습>경치 (Shape > Outward appearance > Look > Scenery)
모양>겉모양>모습>광경>영상 (Shape > Outward appearance > Look > Scene > Image)
모양>글씨 (Shape > Handwriting)
모양>모양새>생김새>형세>지형 (Shape > Form > Features > Lay > Terrain)
모양>모양새>생김새>형체>상>영상>사진 (Shape > Form > Features > Figure > Likeness > Image > Photograph)
모양>모양새>짜임새>체계 (Shape > Form > Structure > System)
모양>무늬 (Shape > Pattern)
모양>서체 (Shape > Typeface)
모양>실상 (Shape > Actual state)
모양>양식 (Shape > Style)
모양>양식>문화>종교 (Shape > Style > Culture > Religion)
모양>증세 (Shape > Symptom)
모양>형식 (Shape > Format)
물건>고리 (Thing > Ring)
물건>관 (Thing > Tube)
물건>기 (Thing > 기; gloss unclear)
물건>기계>기구 (Thing > Machine > Apparatus)
물건>기계>기구>계기 (Thing > Machine > Apparatus > Gauge)
물건>기계>기구>공구 (Thing > Machine > Apparatus > Tool)
물건>기계>기구>그릇 (Thing > Machine > Apparatus > Vessel)
물건>기계>기구>도구 (Thing > Machine > Apparatus > Implement)
물건>기계>기구>도구>문방구 (Thing > Machine > Apparatus > Implement > Stationery)
물건>기계>기구>도구>어구 (Thing > Machine > Apparatus > Implement > Fishing gear)
물건>기계>기구>등 (Thing > Machine > Apparatus > Lamp)
물건>기계>기구>무기 (Thing > Machine > Apparatus > Weapon)
물건>기계>기구>세간>띠 (Thing > Machine > Apparatus > Household goods > Belt)
물건>기계>기구>세간>살림살이 (Thing > Machine > Apparatus > Household goods > Housewares)
물건>기계>기구>악기 (Thing > Machine > Apparatus > Musical instrument)
물건>기계>기구>운동기구 (Thing > Machine > Apparatus > Exercise equipment)
물건>기계>기구>의료기 (Thing > Machine > Apparatus > Medical device)
물건>기계>기구>장난감 (Thing > Machine > Apparatus > Toy)
물건>기계>기구>집물 (Thing > Machine > Apparatus > Household articles)
물건>꾸러미 (Thing > Bundle)
물건>끈 (Thing > String)
물건>노폐물 (Thing > Waste matter)
물건>덩어리 (Thing > Lump)
물건>먹을거리 (Thing > Food)
물건>먹을거리>양식>곡물 (Thing > Food > Provisions > Grain)
물건>먹을거리>음식물 (Thing > Food > Foodstuff)
물건>먹을거리>음식물>장 (Thing > Food > Foodstuff > Fermented paste)
물건>먹을거리>음식물>장>고추장 (Thing > Food > Foodstuff > Fermented paste > Red pepper paste)
물건>먹을거리>음식물>장>된장 (Thing > Food > Foodstuff > Fermented paste > Soybean paste)
물건>먹을거리>음식물>생선 (Thing > Food > Foodstuff > Fish)
물건>먹을거리>음식물>생선>고등어 (Thing > Food > Foodstuff > Fish > Mackerel)
물건>먹을거리>음식물>생선>삼치 (Thing > Food > Foodstuff > Fish > Spanish mackerel)

As Table 1 shows, in the hierarchical semantic codes used by the present invention, meanings become more abstract toward the left (up the hierarchy) and more specific toward the right (down the hierarchy). For example, in the anaphor '그 고객' ('that customer'), the noun '고객' ('customer') is assigned the meaning '사람' ('person'); the anaphor is then tagged with the whole path '생물>동물>사람' ('living thing > animal > person'), which includes the higher-level meanings of 'person'.
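The expansion from a noun's own meaning to its full path can be sketched as a walk up a parent table. This is only an illustration under stated assumptions: the `PARENT` and `NOUN_MEANING` tables are hypothetical excerpts built from the two paths quoted in the text, and the patent does not say how the hierarchy is actually stored.

```python
# Hypothetical excerpt of the hierarchy: each semantic code points to its parent
# (None marks a top-level code). Only the paths quoted in the text are included.
PARENT = {
    "생물": None, "동물": "생물", "사람": "동물",                    # living thing > animal > person
    "모양": None, "양식": "모양", "문화": "양식", "종교": "문화",    # shape > style > culture > religion
}

# Hypothetical noun lexicon: noun -> its most specific semantic code.
NOUN_MEANING = {"고객": "사람", "종교": "종교"}


def full_meaning(noun):
    """Return the noun's semantic path from the most abstract code to the most specific."""
    code = NOUN_MEANING[noun]
    path = []
    while code is not None:
        path.append(code)
        code = PARENT[code]
    path.reverse()  # collected leaf-to-root, so flip to root-to-leaf
    return path


print(">".join(full_meaning("고객")))  # 생물>동물>사람
```

A name-keyed parent table only works while every code name is unique; a path such as 대상>문제>과제>문제 in Table 1 repeats a name, so a fuller version would need to key codes by their whole path.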

The named entity recognition unit (130) receives the string from the semantic analysis unit (120) and attaches named entity (NE) tags to words, or sequences of words, that can be classified as names of people or organizations, song titles, broadcast program titles, place names, and the like. For example, in the string "불교는 아시아 문화에서 많이 믿는 종교이다." ("Buddhism is a religion widely practiced in Asian culture."), '불교' (Buddhism) is tagged 'OGG_RELIGION' and '아시아' (Asia) is tagged 'LCG_CONTINENT'. The named entity recognition unit (130) is a known component, and named entity processing is known as an important stage of language analysis.

The chunking unit (140) receives the entity-tagged string from the named entity recognition unit (130), chunks it (for example, recognizing compound nouns and modifier clauses as chunks), and assigns one concept to each chunk on the basis of the meaning of the noun or the named entity within the chunk. The concept assigned to a chunk is one of the hierarchical semantic codes illustrated in Table 1. When the meaning of a noun in the chunk determines the chunk's concept, the corresponding semantic code is assigned directly as the concept. When a named entity in the chunk determines the concept, a mapping table from named entity tags to semantic codes is used. For example, if a chunk contains the named entity tag "OGG_Business", that tag maps to the semantic code "회사" ("company"), and "company" is assigned as the chunk's concept.
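The two routes to a chunk's concept can be sketched as follows. The `Chunk` type, the table contents other than the "OGG_Business" to "회사" pair, and the order of precedence (named entity first, then noun) are assumptions made for illustration; the text does not state which route wins when both apply.

```python
from dataclasses import dataclass
from typing import Optional

# Mapping table from named entity tags to semantic codes.
# Only the OGG_Business -> 회사 (company) pair comes from the text.
NE_TO_CODE = {"OGG_Business": "회사"}

# Hypothetical noun lexicon: noun -> semantic code.
NOUN_TO_CODE = {"종교": "종교"}


@dataclass
class Chunk:
    text: str
    ne_tag: Optional[str] = None     # named entity tag found in the chunk, if any
    head_noun: Optional[str] = None  # noun found in the chunk, if any
    concept: Optional[str] = None    # semantic code assigned to the chunk


def assign_concept(chunk):
    """Assign one concept per chunk: via the NE mapping table, else via the noun's code."""
    if chunk.ne_tag in NE_TO_CODE:          # assumed precedence: named entity first
        chunk.concept = NE_TO_CODE[chunk.ne_tag]
    elif chunk.head_noun in NOUN_TO_CODE:   # otherwise use the noun's semantic code directly
        chunk.concept = NOUN_TO_CODE[chunk.head_noun]
    return chunk                            # chunks with neither keep concept=None


print(assign_concept(Chunk("마이크로소프트는", ne_tag="OGG_Business")).concept)  # 회사
```

Chunks such as adverbs that carry neither a known entity tag nor a known noun are left without a concept, so they can never become target candidates.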

The anaphor restoration unit (150) compares the full meaning tagged on the anaphor with the concepts assigned to the chunks and selects the chunk to be used for restoration (hereinafter, the referent). Chunks are used as referents because the referent of an anaphor may be a single named entity but may also be a compound noun or an entire complex clause. First, the unit selects as a target candidate every chunk whose concept is contained in the full meaning tagged on the anaphor. It then compares and ranks the candidate chunks to select the referent, as follows. The lower in the anaphor's full meaning a candidate's concept lies, the greater its weight and the higher its rank. For example, if the full meaning tagged on the anaphor is '모양>양식>문화>종교' (shape > style > culture > religion), the lowest element, 'religion', is the most specific, so candidates whose concept matches 'religion' receive a high weight. Conversely, the highest element, 'shape', is the most abstract, so candidates whose concept matches 'shape' receive a low weight. If candidates still have equal weights, the one in the most recently input string takes priority; if that is also equal, essential constituents of the sentence (for example, the subject or object) take priority.
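The selection and ranking rules above can be sketched as a single sort key. This is a minimal sketch, not the patented implementation: the `Candidate` fields are hypothetical, and using the position of the concept within the path as the weight is an assumption, since the text only says that lower (more specific) levels receive higher weights.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    text: str
    concept: Optional[str]  # semantic code assigned to the chunk, if any
    sentence_index: int     # position of the chunk's sentence; larger means more recent
    is_essential: bool      # True for essential constituents such as subject or object


def select_referent(full_meaning, chunks):
    """Pick the referent chunk for an anaphor whose full meaning is `full_meaning`."""
    # Candidate selection: keep chunks whose concept appears in the anaphor's path.
    candidates = [c for c in chunks if c.concept in full_meaning]
    if not candidates:
        return None

    def rank_key(c):
        weight = full_meaning.index(c.concept)  # assumed weight: depth in the path
        # Ties on weight fall back to recency, then to essential constituents.
        return (weight, c.sentence_index, c.is_essential)

    return max(candidates, key=rank_key)


path = ["모양", "양식", "문화", "종교"]  # shape > style > culture > religion
chunks = [
    Candidate("불교는", "종교", 0, True),
    Candidate("아시아 문화에서", "양식", 0, False),
    Candidate("많이", None, 0, False),
]
print(select_referent(path, chunks).text)  # 불교는
```

Packing the three criteria into one tuple keeps the tie-breaking order explicit: a later criterion is consulted only when all earlier ones are equal.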

An embodiment is described below in which the anaphor resolution apparatus according to the present invention resolves an anaphor using semantic codes with a hierarchical semantic structure.

Assume that the user inputs the string "불교는 아시아 문화에서 많이 믿는 종교이다. 그 종교는 살생을 금하는 규율을 가지고 있다." ("Buddhism is a religion widely practiced in Asian culture. That religion has a precept that forbids killing.").

The anaphor recognition unit (110) receives the string, performs morphological analysis, and tags each morpheme with its part-of-speech information (S10). The result is as follows.

"불교/nc는 아시아/nc 문화/nc에서 많이/mag 믿/pv는 종교/nc이다. 그/mm 종교/nc는 살생/nc을 금하/pv는 규율/nc을 가지/pv고 있다." (see Table 2 for the tag set)

[Table 2] Part-of-speech tag set (Tier 1, with Tier 2 and Tier 3 indented beneath)

1. s (symbol): 21
2. f (foreign word)
3. n (noun)
    3.1 nc (independent noun): 0
    3.2 nb (dependent noun): 2
4. np (pronoun): 1
5. nn (numeral): 3
6. pv (verb): 4
7. pa (adjective): 6
8. px (auxiliary predicate): 22
9. co (copula): 13
10. ma (adverb)
    10.1 mag (general adverb): 8
    10.2 maj (conjunctive adverb): 27
11. mm (determiner): 23
12. ii (interjection): 11
13. x (affix)
    13.1 xp (prefix): 16
    13.2 xs (suffix)
        13.2.1 xsn (noun-derivational suffix): 20, 17
        13.2.2 xsv (verb-derivational suffix): 18
        13.2.3 xsm (adjective-derivational suffix): 19
14. j (particle)
    14.1 jc (case particle): 12
    14.2 jx (auxiliary particle): 28
    14.3 jj (conjunctive particle): 25
    14.4 jm (genitive particle): 26
15. ep (pre-final ending): 15
16. e (final ending)
    16.1 ef (sentence-final ending): 14
    16.2 ec (connective ending): 24
    16.3 et (transformative ending)
        16.3.1 etn (nominalizing ending): 30
        16.3.2 etm (adnominal ending): 29
17. uk (unregistered word): 31
18. nk (noun registered in the user dictionary): 32
19. nr (proper noun): 33

Next, from the string tagged with part-of-speech information for each morpheme, "불교/nc는 아시아/nc 문화/nc에서 많이/mag 믿/pv는 종교/nc이다. 그/mm 종교/nc는 살생/nc을 금하/pv는 규율/nc을 가지/pv고 있다.", the anaphor recognition unit (110) recognizes '그/mm 종교/nc' ('that religion'), which has the form 'demonstrative + noun', as an anaphor (S20).
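Step S20 can be sketched as a scan over (morpheme, tag) pairs using the tags of Table 2. The sketch makes several assumptions: the demonstrative list is the one quoted earlier in the description, only the `nc` tag is treated as a noun, and the input holds only the morphemes that the source shows with tags. The morphological analyzer itself is outside the sketch.

```python
# Demonstratives listed in the description: this, that, that (over there), other.
DEMONSTRATIVES = {"이", "그", "저", "다른"}
NOUN_TAGS = {"nc"}  # assumption: only independent nouns count as the noun part


def find_anaphors(morphemes):
    """Return (start, end) index spans of anaphors in a list of (form, tag) pairs."""
    spans = []
    i = 0
    while i < len(morphemes):
        form, tag = morphemes[i]
        if tag == "np":
            # A pronoun is an anaphor on its own.
            spans.append((i, i + 1))
        elif (tag == "mm" and form in DEMONSTRATIVES
              and i + 1 < len(morphemes)
              and morphemes[i + 1][1] in NOUN_TAGS):
            # Demonstrative determiner followed by a noun: 'demonstrative + noun'.
            spans.append((i, i + 2))
            i += 1  # skip the noun that was just consumed
        i += 1
    return spans


# Tagged morphemes of the second sentence, as shown in the source.
second_sentence = [("그", "mm"), ("종교", "nc"), ("살생", "nc"),
                   ("금하", "pv"), ("규율", "nc"), ("가지", "pv")]
print(find_anaphors(second_sentence))  # [(0, 2)]
```

Checking the surface form as well as the `mm` tag matters because `mm` covers all determiners, not only demonstratives.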

On the basis of the hierarchical semantic codes illustrated in Table 1, the semantic analysis unit (120) determines that the full meaning of '그/mm 종교/nc' is '모양>양식>문화>종교' (shape > style > culture > religion) and tags the anaphor with it as follows (S30).

"불교/nc는 아시아/nc 문화/nc에서 많이/mag 믿/pv는 종교/nc이다. 그/mm 종교/nc[모양>양식>문화>종교]는 살생/nc을 금하/pv는 규율/nc을 가지/pv고 있다."

Next, the named entity recognition unit (130) tags named entities (NE) in the string (S40). In the input string, '불교' (Buddhism) and '아시아' (Asia) are tagged as follows.

"<불교:OGG_RELIGION>는 <아시아:LCG_CONTINENT> 문화에서 많이 믿는 종교이다. 그 종교는 살생을 금하는 규율을 가지고 있다."

Next, the chunking unit (140) chunks the string as "불교는 / 아시아 문화에서 / 많이 / 믿는 / 종교이다" and assigns one concept to each chunk as follows (S50). Here 종교 is the semantic code 'religion' and 양식 is the semantic code 'style'.

"[<불교:OGG_RELIGION>는:종교] [<아시아:LCG_CONTINENT>문화에서:양식] [많이] [믿는] [종교이다:종교]"

Next, the anaphor restoration unit (150) compares the full meaning tagged on the anaphor by the semantic analysis unit (120) (모양>양식>문화>종교) with the concept assigned to each chunk, and selects as a target candidate every chunk whose concept appears in that full meaning (S60). It then compares and ranks the candidate chunks to select the referent (S70).

The lower in the anaphor's full meaning a candidate's concept lies, the higher its weight and the higher its rank. The anaphor is then restored using the highest-ranked chunk (S80). In other words, 'religion', the lowest of the hierarchical meanings assigned to the anaphor, is the most specific, so a chunk whose concept matches 'religion' receives a high weight. Conversely, 'shape', the highest of those meanings, is the most abstract, so a chunk whose concept matches 'shape' receives a low weight. If chunks receive the same weight, the candidate contained in the most recently input string takes priority; if they are still tied, essential constituents of the sentence, such as the subject or object, take priority over modifier clauses.

Applying these rules to the example, "[<불교:OGG_RELIGION>는:종교]" receives a greater weight than "[<아시아:LCG_CONTINENT>문화에서:양식]" and is selected as the referent.
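Once a referent has been chosen, the restoration step (S80) amounts to substituting the referent for the anaphor's span in the text. The sketch below is an assumption about how that substitution might be done: it takes hypothetical character spans and applies them from right to left so that earlier offsets stay valid. It also assumes the referent is supplied without its particle (불교, not 불교는), so that the anaphor's own particle is kept.

```python
def restore(text, replacements):
    """Replace each (start, end, referent) character span in `text` with its referent."""
    # Apply from the rightmost span first so earlier offsets are not shifted.
    for start, end, referent in sorted(replacements, reverse=True):
        text = text[:start] + referent + text[end:]
    return text


source = "불교는 아시아 문화에서 많이 믿는 종교이다. 그 종교는 살생을 금하는 규율을 가지고 있다."
anaphor = "그 종교"
start = source.index(anaphor)
restored = restore(source, [(start, start + len(anaphor), "불교")])
print(restored)
```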

Accordingly, restoring the anaphor in the input string gives the following.

"불교는 아시아 문화에서 많이 믿는 종교이다. 불교는 살생을 금하는 규율을 가지고 있다." ("Buddhism is a religion widely practiced in Asian culture. Buddhism has a precept that forbids killing.")

Although preferred embodiments of the present invention have been shown and described, the invention is not limited to the specific embodiments above. Those of ordinary skill in the art may make various modifications without departing from the gist of the invention as claimed, and such modifications should not be understood separately from the technical idea or scope of the invention.

Claims

An anaphor resolution apparatus comprising:

an anaphor recognition unit that receives a string and recognizes an anaphor;

a semantic analysis unit that analyzes the full meaning of the anaphor;

a named entity recognition unit that recognizes named entities in the string;

a chunking unit that receives the entity-tagged string, chunks it, and assigns one concept to each chunk; and

an anaphor restoration unit that compares the full meaning of the anaphor with the concept assigned to each chunk to select a referent to be used for restoring the anaphor.

The apparatus according to claim 1,

wherein the anaphor recognition unit recognizes 'demonstrative + nominal' expressions in the input string as anaphors.

The apparatus according to claim 1,

wherein the semantic analysis unit determines the full meaning of the anaphor on the basis of a plurality of semantic codes having a hierarchical semantic structure.

The apparatus according to claim 1,

wherein the chunking unit assigns one concept to each chunk on the basis of the meaning of a named entity or noun within the chunk, and the concept assigned to each chunk is one of a plurality of semantic codes having a hierarchical semantic structure.

An anaphor resolution method comprising:

an anaphor recognition step of analyzing an input string and recognizing an anaphor contained in the string;

a semantic analysis step of analyzing the full meaning of the anaphor;

a target candidate selection step of chunking the string, assigning one concept to each chunk, and comparing the concepts with the full meaning to select target candidates to which the anaphor may refer; and

an anaphor restoration step of ranking the target candidates to select a referent to be used for restoring the anaphor, and restoring the anaphor using the referent.

The method according to claim 5,

wherein the anaphor recognition step

performs morphological analysis on the string and recognizes 'demonstrative + nominal' expressions as anaphors.

The method according to claim 5,

wherein the semantic analysis step

analyzes the full meaning of the anaphor using a plurality of semantic codes having a hierarchical semantic structure.

The method according to claim 5,

wherein the target candidate selection step comprises:

recognizing named entities in the string; and

assigning one concept to each chunk on the basis of the meaning of the noun or the named entity within the chunk.

The method according to claim 5,

wherein the anaphor restoration step

ranks the target candidates by giving a higher weight to a target candidate the lower the level of the anaphor's full meaning to which its concept corresponds.

The method according to claim 9,

wherein, when a plurality of target candidates are given the same weight, the target candidate contained in the more recently input string is given priority.