KR20050032937A

KR20050032937A - Method for automatically creating a question and indexing the question-answer by language-analysis and the question-answering method and system

Info

Publication number: KR20050032937A
Application number: KR1020030068931A
Authority: KR
Inventors: 허정; 황이규; 장명길
Original assignee: 한국전자통신연구원
Priority date: 2003-10-02
Filing date: 2003-10-02
Publication date: 2005-04-08
Also published as: KR100546743B1

Abstract

A method and system for automatically indexing questions and answers based on language analysis are provided. Key words and phrases are selected by analyzing the language of various documents, corresponding natural language questions are generated automatically, and each question is stored with its answer so that good answer candidates can be presented for a user's question. The question-answering system includes an indexing engine (100) and a user question/answer engine (200). The indexing engine (100) includes a language analyzing unit (10), a candidate sentence selecting unit (20), a natural language question generating unit (30), and a question/answer indexing unit (40), together with an index database (42) and a question/answer database (44). The user question/answer engine (200) searches the indexed question/answer pairs and answers the user's query; it includes a language analyzing unit (10), a question analyzing unit (50), a question searching unit (60), and an outputting unit (70). The language analyzing unit (10) analyzes the linguistic structure of the input documents.

Description

Method for automatically creating a question and indexing the question-answer by language-analysis and the question-answering method and system

The present invention relates to a language-analysis-based automatic question/answer indexing method and to a question-answering method and system using it. More specifically, it relates to a method and system that automatically generate natural language questions for answer candidate sentences through language analysis of a large number of documents, index and store the resulting question/answer pairs, and, when a user submits a query, compute its similarity to the indexed questions and present the closest questions and answers to the user in ranked order.

With the spread of the Internet and other networks, a wide range of search engines and information retrieval sites have become available on the web, and users rely on them to obtain information contained in web documents quickly and easily. More recently, question-answering systems have been developed and deployed that go a step beyond conventional search engines, which return results as documents, and return only the answer content that corresponds to the user's query.

However, conventional question-answering systems perform language analysis only on the top-ranked documents retrieved by an existing information retrieval engine and then present answer candidate sentences or phrases corresponding to the user's question. Consequently, if the document containing the desired information does not appear among the top results, it is difficult to present a good answer. In addition, because the retrieved documents are analyzed at query time, the response time is very long.

As one example of a conventional question-answering system, "Method for providing automatic answers and retrieval for natural language sentence questions based on artificial intelligence and natural language processing technology" {Korean Patent 10-2000-61428} has been proposed. However, because that method relies on a knowledge database, building the database requires considerable resources. To overcome this and obtain a practical question-answering system, the knowledge data needs to be built automatically from documents.

Accordingly, the present invention is intended to solve the above problems of the prior art. Its object is to provide a language-analysis-based automatic question/answer indexing method, and a question-answering method and system, that select answer candidate words and phrases through language analysis of various documents, automatically generate natural language questions related to them, and index and store the resulting question/answer pairs in advance. When a user submits a query, a good answer can then be presented quickly by comparing the similarity between the input question and the indexed questions, and more varied and accurate information can be provided by presenting related questions alongside the answer and by introducing a user feedback function.

To achieve this object, the language-analysis-based automatic question/answer indexing method comprises: analyzing the linguistic structure of a plurality of input documents to recognize answer types for important words or phrases and to identify the semantic structure of each sentence; selecting and extracting, as answer candidate sentences, the key sentences that contain a word or phrase with a recognized answer type; generating a natural language question for each extracted answer candidate sentence based on its answer type and semantic structure information; and indexing and storing each answer candidate sentence together with its natural language question as a pair.

Preferably, the linguistic structure analysis step comprises: performing morphological analysis on each sentence of the input documents and tagging the results with parts of speech; identifying the semantic category of each word in the morphological analysis result; recognizing named entities for proper nouns in the sentence by referring to the semantic categories of surrounding words and a named entity dictionary; recognizing the answer type of an important word or phrase based on the semantic category or named entity tag attached to it; performing phrase-level chunking on each sentence; recognizing the predicate of the sentence among the resulting phrase chunks and recognizing the remaining chunks as arguments of that predicate; and generating, from the recognized predicate and arguments, a sentence semantic structure that expresses the logical relationship between the predicate and its arguments.

The language-analysis-based automatic question-answering method for achieving the object comprises: analyzing the linguistic structure of a user query to recognize the answer type appropriate to the user's question and to identify the semantic structure of the query; analyzing the content of the user query using the recognized answer type and semantic structure information; searching, based on the question analysis result, the question/answer pair database generated according to claim 1 for natural language questions similar to the user's question; and extracting from the question/answer pair database the answer candidate sentences for each retrieved natural language question and presenting them to the user together with that question.

Preferably, the question search step expands the recognized answer type of the user query to its subtypes according to the answer type hierarchy and searches the question/answer pair database with the expanded types. It also expands the keywords contained in the user query to similar expressions by referring to an ontology and a thesaurus, and then searches the question/answer pair database with the keyword-expanded query.

Preferably, the search result output step also presents lower-ranked natural language questions and answer candidates and receives answer ranking feedback through the user's selection of an answer. When such feedback is received, the question search step treats the answer candidate sentence selected by the user as close to the desired answer, searches again using its natural language question, and presents the resulting answer candidates.

The language-analysis-based automatic question-answering system for achieving the object comprises a question/answer pair indexing engine and a user query/response engine. The question/answer pair indexing engine includes: a linguistic structure analysis unit that analyzes the linguistic structure of a plurality of documents to recognize answer types for important words or phrases in the documents and to identify the semantic structure of each sentence; an answer candidate sentence selection unit that selects key sentences tagged with an answer type as answer candidate sentences; a natural language question generation unit that generates a natural language question for each answer candidate sentence based on the identified answer type and semantic structure information; a question/answer indexing unit that indexes and stores each generated natural language question and its answer candidate sentence as a pair in an index database and a question/answer database; an index database in which the key words extracted from the natural language questions are stored as index data for user query search; and a question/answer database in which each generated natural language question and its answer candidate sentence are stored as a pair.

The user query/response engine includes: a linguistic structure analysis unit that analyzes the linguistic structure of the user's query to recognize the answer type appropriate to the question and to identify the semantic structure of the query; a question analysis unit that analyzes the content of the user query using the recognized answer type and semantic structure information; a question search unit that searches the index database for natural language questions similar to the user's question based on the question analysis result; and a search result output unit that ranks the retrieved natural language questions by similarity to the user's question, extracts each natural language question and its answer candidate sentences from the question/answer database, and presents them to the user in ranked order.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the question-answering system of the present invention.

Referring to FIG. 1, the question-answering system of the present invention consists broadly of a question/answer pair indexing engine (100) and a user query/response engine (200). The question/answer pair indexing engine (100) comprises a language analysis unit (10), a question-generation candidate sentence selection unit (20), a natural language question generation unit (30), and a question/answer indexing unit (40), and is provided with an index database (42) and a question/answer database (44) for indexing and storing the generated question/answer pairs.

The user query/response engine (200) is a search engine that searches the indexed question/answer pairs and responds to the user's query. It comprises the language analysis unit (10), a question analysis unit (50), a question search unit (60), and a search result output unit (70).

First, each component of the question/answer pair indexing engine (100) is described.

The language analysis unit (10) is shared by the question/answer pair indexing engine (100) and the user query/response engine (200). It analyzes the linguistic structure of the input documents (or of the user's query sentence).

FIG. 2 shows the blocks of the language analysis unit (10) and its processing flow in detail.

In FIG. 2, the morphological analysis unit (11) splits each sentence of the input document (or the user's question) into morphemes and tags each morpheme in the result with its part of speech.

The semantic analysis unit (12) uses the ontology (81), which is built on a noun concept network, to identify the semantic category of each morphologically analyzed word. In particular, for a semantically ambiguous word (for example, the Korean word "배" (bae), which can denote a means of transport (ship), a body part (belly), a fruit (pear), or a multiple of a number), it refers to the noun concepts in the ontology (81) to identify the semantic category that best fits the context.

The named entity recognition unit (13) refers to the named entity dictionary (82) and, based on the semantic categories attached above, identifies named entities for proper nouns. Because proper nouns are varied and new ones are coined constantly, they cannot be analyzed through the ontology (81); they are instead recognized using the named entity dictionary (82) and the meanings of the surrounding words that have already been analyzed. For example, if the proper noun "Seoul" appears in a sentence, it is recognized as a location (LOCATION) by referring to its context and the named entity dictionary. The named entity dictionary (82) is a proper noun dictionary organized around categories such as person (PERSON), location (LOCATION), and organization (ORGANIZATION).

Once the meaning and named entity of each word have been recognized as above, the answer type recognition unit (14) recognizes the answer type of each word and phrase based on the answer type DB (83). That is, it refines the semantic category or named entity tag attached to a word or phrase by semantic analysis and named entity recognition into an answer type such as those shown in FIG. 4. FIG. 4 is an example answer type table according to the present invention.

Because an answer type tag marks a word or phrase that can serve as an answer candidate in the question-answering system, it does not necessarily coincide with the semantic category or named entity tag. For example, named entity tagging of the phrase "1988 Seoul Olympics" yields "<1988:DATE> <Seoul:LOCATION> Olympics", whereas the answer type recognition unit recognizes answer types of the form "<1988:DATE> <Seoul Olympics:EVENT>" or "<1988 Seoul Olympics:EVENT>".

The answer type DB (83) consists of an answer type dictionary and an answer type pattern DB. The answer type dictionary holds a number of answer types together with their parent and child classification information, as in the answer type table of FIG. 4. The answer type pattern DB stores the rules (patterns) used to recognize each answer type. For example, as in "<1988:DATE> <Seoul:LOCATION> Olympics", when a location (LOCATION) or a year (DATE) precedes the word "Olympics", the phrase is recognized as an event (EVENT); and an expression of the form <number> + <kg>, such as "100 kg", is recognized as a weight (WEIGHT).
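The two example rules above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the token representation, the NUMBER tag, and the function name `apply_answer_type_rules` are assumptions made for the example, and a real pattern DB would hold many more rules.

```python
from typing import List, Tuple

# Assumed representation: (surface text, tag); an empty tag means "untagged".
Token = Tuple[str, str]

def apply_answer_type_rules(tokens: List[Token]) -> List[Token]:
    """Merge adjacent tokens according to the two example rules in the text."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        text, tag = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Rule 1: LOCATION or DATE directly before "Olympics" becomes an EVENT.
        if nxt and tag in ("LOCATION", "DATE") and nxt[0] == "Olympics":
            out.append((f"{text} {nxt[0]}", "EVENT"))
            i += 2
        # Rule 2: <number> followed by "kg" becomes a WEIGHT.
        # (The NUMBER tag is a hypothetical label for numeric tokens.)
        elif nxt and tag == "NUMBER" and nxt[0] == "kg":
            out.append((f"{text} {nxt[0]}", "WEIGHT"))
            i += 2
        else:
            out.append((text, tag))
            i += 1
    return out

phrase = [("1988", "DATE"), ("Seoul", "LOCATION"), ("Olympics", "")]
print(apply_answer_type_rules(phrase))
```

Applied to the tagged phrase above, the sketch keeps "1988" as DATE and merges "Seoul Olympics" into a single EVENT, which corresponds to the first of the two forms given in the description.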

The partial parsing unit (15) performs phrase-level chunking in preparation for case-frame-based sentence analysis.

The case-frame-based sentence structure analysis unit (16) refers to the case frame dictionary (84) and the usage dictionary for event predicates (85) to recognize the predicate of the sentence among the phrase chunks and to recognize the remaining chunks as arguments of that predicate (nominative, accusative, and so on). That is, it looks up the case frame information for the predicate phrase in the case frame dictionary (84) and determines the case of each argument using the postpositional particle of each chunked phrase and its semantic information. For a phrase whose particle is omitted, the case of the argument is determined from the identified semantic information alone.
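A rough Python sketch of this case decision is shown below. The particle table, the case frame contents, and the name `assign_case` are all hypothetical; the description does not give the contents of the case frame dictionary, so the frame here only mirrors the "held" example used elsewhere in the text.

```python
from typing import Dict, Optional, Set

# Hypothetical particle-to-case table; a real system would cover far more particles.
PARTICLE_CASE: Dict[str, str] = {
    "이": "nominative", "가": "nominative",
    "을": "accusative", "를": "accusative",
    "에": "adverbial",
}

# Hypothetical case frame for one predicate: case -> semantic categories it accepts.
HELD_FRAME: Dict[str, Set[str]] = {
    "nominative": {"EVENT"},
    "adverbial": {"DATE", "LOCATION"},
}

def assign_case(particle: Optional[str], category: str,
                case_frame: Dict[str, Set[str]]) -> Optional[str]:
    """Use the particle when present; otherwise fall back to the semantic category."""
    if particle in PARTICLE_CASE:
        return PARTICLE_CASE[particle]
    # Particle omitted: pick the first case slot whose accepted categories match.
    for case, accepted in case_frame.items():
        if category in accepted:
            return case
    return None

print(assign_case("에", "DATE", HELD_FRAME))    # particle present
print(assign_case(None, "EVENT", HELD_FRAME))   # particle omitted
```

The fallback branch corresponds to the last sentence of the paragraph above: when no particle is available, only the semantic category of the chunk is compared against the case frame.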

The semantic structure generation unit (17) uses the recognized predicate and arguments to generate a logical form (LF) of the predicate-argument structure for each sentence. That is, based on the recognition of the predicate and each argument, it generates a semantic structure that expresses the logical relationship between the predicate and its arguments in the sentence.

Once the language analysis unit (10) has identified the answer types and the semantic structure of the sentences in the input documents as described above, the question-generation candidate sentence selection unit (20) selects, as question-generation candidate sentences (that is, answer candidate sentences), the sentences in each document that contain an important word or phrase (an answer candidate word or phrase, that is, one tagged with an answer type).
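The selection criterion can be illustrated with a short Python sketch. The token representation and the name `select_candidate_sentences` are assumptions for the example; the only logic taken from the text is that a sentence qualifies when it contains at least one word or phrase tagged with an answer type.

```python
from typing import List, Optional, Tuple

# Assumed representation: (text, answer type or None if the token carries no answer type).
TaggedToken = Tuple[str, Optional[str]]

def select_candidate_sentences(
    sentences: List[List[TaggedToken]],
) -> List[List[TaggedToken]]:
    """Keep only sentences that contain at least one answer-type-tagged token."""
    return [s for s in sentences if any(atype is not None for _, atype in s)]

doc = [
    [("The Seoul Olympics", "EVENT"), ("were held", None), ("in 1988", "DATE")],
    [("It", None), ("was", None), ("memorable", None)],
]
print(len(select_candidate_sentences(doc)))
```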

The natural language question generation unit (30) generates a natural language question for each selected candidate sentence, choosing the most suitable question based on the answer type information contained in the sentence and the identified semantic structure of the sentence. For example, from the sentence "The Seoul Olympics were held in 1988.", using the answer type "Seoul Olympics<EVENT, nominative>" and the semantic structure information "in 1988<DATE, adverbial>" and "be held<predicate>", it generates a natural language question such as "When were the Seoul Olympics held?" or "In what year were the Seoul Olympics held?"
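One way to picture this step is a template lookup keyed on the answer type of a non-subject argument, sketched below in Python. The data classes, the `WH_WORDS` table, and the English auxiliary field are assumptions introduced for the example; the description does not specify how questions are actually realized.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Argument:
    text: str          # e.g. "the Seoul Olympics"
    case: str          # e.g. "nominative", "adverbial"
    answer_type: str   # e.g. "EVENT", "DATE"; "" if none

@dataclass
class LogicalForm:
    aux: str           # English auxiliary for the question (an assumption), e.g. "were"
    predicate: str     # e.g. "held"
    arguments: List[Argument]

# Hypothetical mapping from answer type to question word.
WH_WORDS: Dict[str, str] = {"DATE": "When", "LOCATION": "Where", "PERSON": "Who"}

def generate_questions(lf: LogicalForm) -> List[str]:
    """Ask about each non-subject argument whose answer type has a question word."""
    subject = next((a for a in lf.arguments if a.case == "nominative"), None)
    if subject is None:
        return []
    questions = []
    for arg in lf.arguments:
        wh = WH_WORDS.get(arg.answer_type)
        if arg is not subject and wh:
            questions.append(f"{wh} {lf.aux} {subject.text} {lf.predicate}?")
    return questions

lf = LogicalForm("were", "held", [
    Argument("the Seoul Olympics", "nominative", "EVENT"),
    Argument("in 1988", "adverbial", "DATE"),
])
print(generate_questions(lf))
```

The argument being asked about (here "in 1988") is the part of the candidate sentence that answers the generated question, which is why the sentence itself can be stored as the answer.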

Once a natural language question has been generated, the question/answer indexing unit (40) stores the selected question-generation candidate sentence and its natural language question in the index database (42) and the question/answer database (44). The index database (42) stores the key word information of the question and the index information pointing into the question/answer DB. The question/answer database (44) stores the natural language question and the selected answer sentence, which is the actual answer data for it.
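The split between the two stores can be sketched as an inverted index over question keywords plus a list of question/answer pairs. The class name `QAIndex`, the use of list positions as pair IDs, and the lowercasing of keywords are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

class QAIndex:
    """Toy stand-in for the index database (42) and question/answer database (44)."""

    def __init__(self) -> None:
        # Question/answer DB: position in the list serves as the pair ID.
        self.qa_db: List[Tuple[str, str]] = []
        # Index DB: keyword -> IDs of pairs whose question contains that keyword.
        self.index_db: DefaultDict[str, Set[int]] = defaultdict(set)

    def add(self, question: str, answer_sentence: str, keywords: List[str]) -> int:
        pair_id = len(self.qa_db)
        self.qa_db.append((question, answer_sentence))
        for kw in keywords:
            self.index_db[kw.lower()].add(pair_id)
        return pair_id

    def lookup(self, keyword: str) -> List[int]:
        return sorted(self.index_db.get(keyword.lower(), set()))

idx = QAIndex()
idx.add("When were the Seoul Olympics held?",
        "The Seoul Olympics were held in 1988.",
        ["Seoul Olympics", "held"])
print(idx.lookup("seoul olympics"))
```

Keeping the keyword index separate from the pair store means a query touches only the small index first and fetches full sentences only for the pairs that matched.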

The operation of the query/response engine (200) in response to a user's query is described below with reference to FIGS. 1 and 3.

As shown in FIG. 1, the language analysis unit (10) is common to the question/answer pair indexing engine (100) and the user query/response engine (200), and its detailed operation is as described with reference to FIG. 2. That is, the language analysis unit (10) performs morphological analysis on the user's query sentence, followed by semantic analysis and named entity recognition, and recognizes the answer type appropriate to the user's question. It also generates the semantic structure of the query sentence based on the case frame of the predicate.

The question analysis unit (50) then analyzes the user's question using the answer type and semantic structure information identified in this way. For example, for the user question "Where is Bulguksa?", the content of the question is determined from the named entity recognition of "Bulguksa<nominative>" and the answer type of "place<LOCATION>".

Based on this question analysis, the question search unit (60) searches the index database (42) for natural language questions similar to the user's question.

To retrieve answers more accurately, the question search unit (60) also needs to expand the answer type and the keywords entered by the user. FIG. 3 shows the blocks of the question search unit (60) and its processing flow.

In FIG. 3, the answer type expansion unit (61) refers to the hierarchical structure of answer types shown in FIG. 4 and expands the answer type identified for the user's question to its subtypes, improving search accuracy. The answer-type-based search unit (63) then applies the expanded answer types to retrieve similar questions from the index database (42).

For example, once the answer type for the user question "Where is Bulguksa?" is known to be <LOCATION>, the answer type needs to be expanded to <CITY>, a subtype of <LOCATION>, so that the actual answer, "Gyeongju", can be returned.
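Subtype expansion over a hierarchy can be sketched as a breadth-first walk, as in the Python example below. Only the LOCATION to CITY relation comes from the text; the COUNTRY entry, the dictionary layout, and the name `expand_answer_type` are hypothetical, since the contents of FIG. 4 are not reproduced here.

```python
from typing import Dict, List

# Hypothetical fragment of the answer type hierarchy; only LOCATION -> CITY is from the text.
ANSWER_TYPE_CHILDREN: Dict[str, List[str]] = {
    "LOCATION": ["COUNTRY", "CITY"],
}

def expand_answer_type(answer_type: str,
                       children: Dict[str, List[str]]) -> List[str]:
    """Return the answer type followed by all of its subtypes (breadth-first)."""
    expanded: List[str] = []
    queue = [answer_type]
    while queue:
        current = queue.pop(0)
        if current in expanded:
            continue  # guard against repeated entries in the hierarchy
        expanded.append(current)
        queue.extend(children.get(current, []))
    return expanded

print(expand_answer_type("LOCATION", ANSWER_TYPE_CHILDREN))
```

Searching with every type in the returned list lets a question typed as LOCATION match indexed answers that were tagged with the narrower CITY type.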

The keyword expansion unit (62) refers to the ontology (81), a set of concept words, and the thesaurus (86) to expand the main words (keywords) of the user query to similar expressions. For example, when analysis of the user query identifies "배<means of transport>" (ship), it is expanded to synonyms such as "선박" (vessel) and "함선" (warship). The keyword search unit (64) then searches the index database (42) for natural language questions similar to the keyword-expanded user question.
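Keyword expansion against a thesaurus can be sketched as a simple lookup that preserves order and avoids duplicates. The thesaurus contents below are English stand-ins for the Korean example, and the name `expand_keywords` is hypothetical.

```python
from typing import Dict, List

# Toy thesaurus; English stand-ins for the "배" -> "선박", "함선" example.
THESAURUS: Dict[str, List[str]] = {"ship": ["vessel", "warship"]}

def expand_keywords(keywords: List[str],
                    thesaurus: Dict[str, List[str]]) -> List[str]:
    """Return each keyword followed by its listed synonyms, without duplicates."""
    expanded: List[str] = []
    for kw in keywords:
        for term in [kw, *thesaurus.get(kw, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded

print(expand_keywords(["ship", "speed"], THESAURUS))
```

Note that the expansion here assumes the sense of the keyword has already been resolved (ship rather than pear), which in the described system is the job of the semantic analysis unit.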

The search result output unit (70) ranks the natural language questions retrieved through answer type expansion and keyword expansion by their similarity to the user's question, extracts the corresponding natural language questions and answer sentences from the question/answer database (44), and presents them to the user in ranked order.
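The description does not state which similarity measure is used. As a stand-in, the Python sketch below ranks candidates by Jaccard overlap between keyword sets; the measure, the tuple layout, and the function names are assumptions chosen only to make the ranking step concrete.

```python
from typing import Iterable, List, Tuple

def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Keyword-set overlap; a stand-in for the unspecified similarity measure."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

# Assumed candidate layout: (indexed question, answer sentence, question keywords).
Candidate = Tuple[str, str, List[str]]

def rank_candidates(query_keywords: List[str],
                    candidates: List[Candidate]) -> List[Tuple[float, str, str]]:
    """Score each indexed question against the query and sort best-first."""
    scored = [(jaccard(query_keywords, kws), q, ans) for q, ans, kws in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

candidates = [
    ("When were the Seoul Olympics held?",
     "The Seoul Olympics were held in 1988.", ["seoul olympics", "held"]),
    ("Where is Bulguksa located?",
     "Bulguksa is located in Gyeongju.", ["bulguksa", "located"]),
]
for score, question, answer in rank_candidates(["bulguksa", "located"], candidates):
    print(score, question, answer)
```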

The search result output unit (70) also presents lower-ranked natural language questions and answers, and the question search unit (60) receives feedback on the answer ranking from the user and searches again. That is, the answer sentence that the user selects (clicks) among those presented is treated as close to the answer the user wants; the index database (42) is searched again, and the new answer candidates are presented with the clicked item given top priority. Preferably, the user is asked whether to search again, and the re-search is performed according to the response.
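The feedback loop can be sketched as a second search seeded with the clicked candidate's question, with the clicked item placed first. The dictionary layout, the `search` callable, and the name `re_search` are hypothetical; the sketch only illustrates the ordering described above.

```python
from typing import Callable, Dict, List

# Assumed candidate layout: {"question": str, "answer": str, "keywords": list of str}.
Candidate = Dict[str, object]

def re_search(selected: Candidate,
              search: Callable[[List[str]], List[Candidate]]) -> List[Candidate]:
    """Search again with the clicked question's keywords; keep the clicked item on top."""
    results = search(list(selected["keywords"]))
    rest = [c for c in results if c["question"] != selected["question"]]
    return [selected, *rest]

# Toy index and search function used only to exercise the sketch.
INDEX: List[Candidate] = [
    {"question": "Where is Bulguksa located?",
     "answer": "Bulguksa is located in Gyeongju.",
     "keywords": ["bulguksa", "located"]},
    {"question": "When were the Seoul Olympics held?",
     "answer": "The Seoul Olympics were held in 1988.",
     "keywords": ["seoul olympics", "held"]},
]

def toy_search(keywords: List[str]) -> List[Candidate]:
    return [c for c in INDEX if set(keywords) & set(c["keywords"])]

print([c["question"] for c in re_search(INDEX[0], toy_search)])
```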

FIG. 4 is a table illustrating answer types according to the present invention. It classifies, level by level, the types of important words and phrases in documents that are likely to be asked about.

According to the language-analysis-based automatic question/answer indexing method and the question-answering method and system of the present invention described above, important words and phrases can be recognized in a large volume of documents, and natural language questions that call for those words and phrases as answers can be generated automatically. By indexing the generated questions and answers and computing similarity against the indexed pairs to find the answer that matches a user's question, the invention provides accurate and fast answers; by also presenting questions closely related to the user's question, it can cover the information the user needs more broadly.

In addition, because the invention provides a technique for automatically generating questions through language analysis of a large volume of documents, FAQs for the various documents provided by government offices and companies can be built automatically, saving the resources and time previously required to build FAQs.

In addition, open-ended (short-answer) questions can be generated automatically from the many educational documents available on the web. Question sheets for measuring how well learners have understood such documents can therefore be produced automatically, improving the effectiveness of web-based education.

What has been described above is merely one embodiment for carrying out the language-analysis-based automatic question/answer indexing method and the question-answering method and system according to the present invention. The present invention is not limited to this embodiment; its technical spirit extends to the full range of modifications that a person of ordinary skill in the art could make without departing from the gist of the invention as claimed in the following claims.

FIG. 1 is a block diagram of the question-answering system according to the present invention.

FIG. 2 shows the component blocks and flowchart of the language analysis process according to the present invention.

FIG. 3 shows the component blocks and flowchart of the question search process according to the present invention.

FIG. 4 is an example diagram of the answer type table according to the present invention.

<Description of reference numerals for the main parts of the drawings>

100: question/answer pair indexing engine    200: user query/response engine

10: language analysis unit    20: question-generation candidate sentence selection unit

30: natural language question generation unit    40: question/answer indexing unit

42: index database    44: question/answer database

50: question analysis unit    60: question search unit

61: answer type expansion unit    62: keyword expansion unit

70: search result output unit

Claims

A language-analysis-based automatic question/answer indexing method, comprising:

analyzing the language structure of a plurality of input documents to recognize answer types for important words or phrases and to identify the semantic structure of each sentence;

selecting and extracting, as an answer candidate sentence, a key sentence containing a word or phrase whose answer type has been recognized;

generating a natural language question for the extracted answer candidate sentence based on the answer type and the semantic structure information; and

indexing and storing the answer candidate sentence and the natural language question as a pair.
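
As an illustration only, the generation and pairing steps above might look like the following sketch. It assumes a much simpler generation rule than the one claimed: the recognized answer span is replaced in place by a question word chosen from its answer type, and the semantic structure is ignored. The `WH_WORDS` mapping, the function names, and the example sentence are hypothetical and not taken from the patent.

```python
# Hypothetical mapping from answer type to question word (not from the patent).
WH_WORDS = {"PERSON": "who", "LOCATION": "where", "DATE": "when"}

def generate_question(sentence, answer, answer_type):
    """Replace the answer span with a question word and end with '?'."""
    wh = WH_WORDS.get(answer_type, "what")
    body = sentence.rstrip(".").replace(answer, wh, 1)
    return body[0].upper() + body[1:] + "?"

def build_pairs(tagged_sentences):
    """tagged_sentences: iterable of (sentence, answer, answer_type) triples.

    Returns (question, answer candidate sentence) pairs ready for indexing.
    """
    return [(generate_question(s, a, t), s) for s, a, t in tagged_sentences]

pairs = build_pairs(
    [("King Sejong created Hangul in 1443.", "King Sejong", "PERSON")]
)
print(pairs)
```

In-place substitution reads naturally only when the answer is the subject of the sentence; a generator that uses the predicate-argument structure, as the claim describes, would be needed to reorder other cases.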

The method of claim 1, wherein the step of analyzing the language structure comprises:

performing morphological analysis on each sentence of the input documents and tagging parts of speech on the analysis result;

identifying a semantic category for each word in the morphological analysis result;

recognizing named entities for proper nouns in a sentence by referring to the semantic categories of surrounding words and a named entity dictionary;

recognizing an answer type for an important word or phrase based on the semantic category or named entity tag attached to it;

performing phrase chunking on each sentence;

recognizing the predicate of the sentence from the phrase chunking result, and recognizing the remaining chunks as arguments of the predicate; and

generating a sentence semantic structure, which is the logical relationship between the predicate and its arguments within the sentence, according to the recognition results for the predicate and each argument.
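
The last two steps above can be pictured with a small sketch, assuming each chunk already carries either a predicate tag or a case label. The `Chunk` class, the `"PRED"` tag, and the role names `AGENT` and `THEME` are hypothetical; the claim does not specify how the semantic structure is represented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    tag: str  # "PRED" for the predicate, otherwise a hypothetical case label

def semantic_structure(chunks):
    """Return (predicate, {case label: argument text}) for one sentence."""
    predicates = [c for c in chunks if c.tag == "PRED"]
    if len(predicates) != 1:
        raise ValueError("this sketch expects exactly one predicate per sentence")
    arguments = {c.tag: c.text for c in chunks if c.tag != "PRED"}
    return predicates[0].text, arguments

def render(predicate, arguments):
    """Write the structure as a logical form such as pred(ROLE=arg, ...)."""
    inner = ", ".join(f"{role}={text}" for role, text in arguments.items())
    return f"{predicate}({inner})"

chunks = [Chunk("King Sejong", "AGENT"), Chunk("Hangul", "THEME"), Chunk("create", "PRED")]
logical_form = render(*semantic_structure(chunks))
print(logical_form)
```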

The method of claim 2, wherein the step of identifying the semantic category for each word

identifies the contextual semantic category of each word by referring to an ontology built on a noun concept network.

The method of claim 2, wherein the answer type recognition step

refers to an answer type dictionary holding a plurality of answer types together with their superordinate and subordinate classification information, and to an answer type pattern DB storing recognition rules for each answer type, and

recognizes the answer type of an important word or phrase based on the semantic category or named entity recognition result of each word.

The method of claim 2, wherein the predicate and argument recognition step

looks up case frame information for the predicate phrase of the sentence in a predicate dictionary and an event usage dictionary, and determines the case of each argument using the postposition (particle) and the semantic information of each chunked phrase.

The method of claim 1, wherein the step of indexing and storing the question/answer pair

extracts the key words of the generated natural language question through language analysis and stores them in an index database as index data for searching, and stores the natural language question and its answer candidate data as a pair in a question/answer database.
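
A minimal sketch of the two stores described above, assuming the index database is an inverted index from key word to pair ID and the question/answer database maps each ID to its pair. Key word extraction here is a plain stop-word filter standing in for the language analysis; `QAStore`, `STOPWORDS`, and `keywords` are hypothetical names.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list standing in for real key word extraction.
STOPWORDS = {"who", "what", "when", "where", "the", "a", "an", "in", "of", "is", "did"}

def keywords(question):
    """Lower-case the question and keep the words that are not stop words."""
    return [w for w in re.findall(r"\w+", question.lower()) if w not in STOPWORDS]

class QAStore:
    def __init__(self):
        self.index_db = defaultdict(set)  # key word -> IDs of pairs containing it
        self.qa_db = {}                   # pair ID -> (question, answer candidate)

    def add(self, question, answer):
        pair_id = len(self.qa_db)
        self.qa_db[pair_id] = (question, answer)
        for word in keywords(question):
            self.index_db[word].add(pair_id)
        return pair_id

    def lookup(self, keyword):
        """Return the IDs of all stored pairs whose question has this key word."""
        return sorted(self.index_db.get(keyword.lower(), ()))

store = QAStore()
store.add("Who created Hangul?", "King Sejong created Hangul in 1443.")
print(store.lookup("Hangul"))
```

Keeping the key word index separate from the pair storage mirrors the split between the index database (42) and the question/answer database (44).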

A language-analysis-based automatic question-answering method, comprising:

analyzing the language structure of a user query to recognize the answer type appropriate to the query and to identify the semantic structure of the query;

analyzing the content of the user query using the recognized answer type and the semantic structure information;

searching, based on the query analysis result, the question/answer pair database generated according to claim 1 for natural language questions similar to the user query; and

extracting the answer candidate for each retrieved natural language question from the question/answer pair database and presenting it to the user together with that natural language question.

The method of claim 7, wherein the step of analyzing the language structure comprises:

performing morphological analysis and part-of-speech tagging on the user query;

identifying a semantic category for each word in the morphological analysis result using a concept network ontology, and recognizing named entities for proper nouns by referring to a named entity dictionary;

recognizing the answer type of the user query from the semantic category or named entity recognized for an important word or phrase; and

chunking the user query into phrases, recognizing the predicate of the query from the phrase chunking result, and generating the semantic structure of the user query based on the case frame of the predicate.

The method of claim 7, wherein the question search step

expands the recognized answer type of the user query to its subtypes according to the hierarchical structure of answer types, and then searches the question/answer pair database.
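
The subtype expansion above amounts to walking down the answer type hierarchy. The sketch below assumes the hierarchy is stored as a mapping from each type to its direct subtypes. The type names are invented for illustration and do not reproduce the table of FIG. 4.

```python
from collections import deque

# Hypothetical fragment of an answer type hierarchy (type -> direct subtypes).
SUBTYPES = {
    "LOCATION": ["COUNTRY", "CITY"],
    "CITY": ["CAPITAL"],
}

def expand_answer_type(answer_type, hierarchy=SUBTYPES):
    """Return the answer type followed by all of its subtypes, breadth first."""
    expanded, queue = [], deque([answer_type])
    while queue:
        current = queue.popleft()
        if current in expanded:
            continue  # guard against a type being listed under two parents
        expanded.append(current)
        queue.extend(hierarchy.get(current, []))
    return expanded

print(expand_answer_type("LOCATION"))
```

A query typed as `LOCATION` would then also match indexed questions whose answers were tagged with any of the narrower types.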

The method of claim 7, wherein the question search step

expands the keywords contained in the user query to similar expressions by referring to an ontology and a thesaurus, and then searches the question/answer pair database with the keyword-expanded user query.
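
Keyword expansion of this kind can be sketched as a lookup in a synonym table. The `THESAURUS` contents and the function name are hypothetical, and a real system would consult the ontology and thesaurus mentioned above instead of a hand-written dictionary.

```python
# Hypothetical thesaurus: keyword -> similar expressions.
THESAURUS = {
    "create": {"invent", "devise"},
    "country": {"nation", "state"},
}

def expand_keywords(query_keywords, thesaurus=THESAURUS):
    """Return the original keywords plus every similar expression found."""
    expanded = set(query_keywords)
    for keyword in query_keywords:
        expanded |= thesaurus.get(keyword, set())
    return expanded

print(sorted(expand_keywords(["create", "hangul"])))
```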

The method of claim 7, wherein the search result output step

ranks the retrieved natural language questions according to their similarity to the user query, extracts each natural language question and its answer candidate from the question/answer pair database, and presents them to the user in order of rank.
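
The ranking step can be sketched as follows. This passage does not say how similarity is computed, so the sketch substitutes Jaccard overlap between keyword sets as a stand-in measure; `jaccard`, `rank_questions`, and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_questions(query_keywords, indexed):
    """indexed: list of (question, answer, keyword set) triples.

    Returns (score, question, answer) tuples, most similar first.
    """
    query = set(query_keywords)
    scored = [(jaccard(query, set(kw)), q, a) for q, a, kw in indexed]
    scored.sort(key=lambda item: item[0], reverse=True)  # stable for ties
    return scored

indexed = [
    ("Where is Seoul?", "Seoul is in Korea.", {"seoul"}),
    ("When was Hangul created?", "Hangul was created in 1443.", {"create", "hangul", "date"}),
    ("Who created Hangul?", "King Sejong created Hangul.", {"create", "hangul"}),
]
ranked = rank_questions(["create", "hangul"], indexed)
for score, question, answer in ranked:
    print(f"{score:.2f}  {question}  ->  {answer}")
```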

The method of claim 7 or 11, wherein the search result output step also presents lower-ranked natural language questions and answer candidates, and then receives feedback on the answer ranking through the user's selection of an answer; and

when the answer ranking information is fed back, the question search step regards the answer candidate selected by the user as close to the desired answer, searches the natural language questions again, and presents the resulting answer candidates.

A language-analysis-based automatic question-answering system, comprising a question/answer pair indexing engine and a user query/response engine,

the question/answer pair indexing engine comprising:

a language structure analysis unit that analyzes the language structure of a plurality of documents to recognize answer types for important words or phrases and to identify the semantic structure of each sentence;

an answer candidate sentence selection unit that selects, as an answer candidate sentence, a key sentence tagged with an answer type;

a natural language question generation unit that generates a natural language question for the answer candidate sentence based on the identified answer type and semantic structure information;

a question/answer indexing unit that indexes the generated natural language question and its answer candidate sentence as a pair and stores them in an index database and a question/answer database; and

the index database, in which key words extracted from the natural language question are stored as index data for user query search, and the question/answer database, in which the generated natural language question and its answer candidate data are stored as a pair;

and the user query/response engine comprising:

a language structure analysis unit that analyzes the language structure of a user query to recognize the answer type appropriate to the query and to identify the semantic structure of the query;

a question analysis unit that analyzes the content of the user query using the recognized answer type and semantic structure information;

a question search unit that searches the index database for natural language questions similar to the user query based on the query analysis result; and

a search result output unit that ranks the retrieved natural language questions according to their similarity to the user query, extracts each natural language question and its answer candidate from the question/answer database, and presents the search results to the user in order of rank.

The system of claim 13, wherein the language structure analysis unit

is shared by the question/answer pair indexing engine and the user query/response engine, and comprises:

a morphological analysis unit that performs morphological analysis and part-of-speech tagging on an input document or user query;

a semantic analysis unit that identifies a semantic category for each word in the morphological analysis result and recognizes named entities for proper nouns;

an answer type recognition unit that recognizes an answer type for the document or user query from the semantic category or named entity recognized for an important word or phrase;

a partial parsing unit that chunks the input document or user query into phrases; and

a semantic structure generation unit that recognizes the predicate of the document or user query from the phrase chunking result and generates a semantic structure for the document or user query by expressing the logical relationship between the predicate and each argument based on the case frame information of the predicate.

The system of claim 13, wherein the question search unit

expands the user query through answer type expansion, which follows the hierarchical structure of answer types, and keyword expansion, which extends each keyword to similar expressions, and then searches the index database.

The system of claim 13, wherein the search result output unit

also presents lower-ranked natural language questions and answer candidates, and then feeds the answer ranking information obtained from the user's selection of an answer back to the question search unit, so that a re-search is performed that gives priority to the natural language question and answer candidate selected by the user.