KR100726176B1

KR100726176B1 - Method and apparatus for extracting correct answer in question answering system

Info

Publication number: KR100726176B1
Application number: KR1020060056194A
Authority: KR
Inventors: 오효정; 이충희; 황이규; 왕지현; 이창기; 장명길
Original assignee: 한국전자통신연구원
Priority date: 2005-12-09
Filing date: 2006-06-22
Publication date: 2007-06-11

Abstract

A method and an apparatus for extracting a correct answer in a question answering (Q/A) system are provided. They improve the performance of the Q/A system by constructing various heterogeneous distributed information sources and extracting the desired answer, through various correct answer extraction techniques, from the information source that stores the answer best matching the user's information need. A language analyzer (130) linguistically analyzes a sentence of a target document (110) or a question sentence (120) of the user. A heterogeneous correct answer indexer (150) constructs the heterogeneous distributed information sources (140) by indexing correct answers through various correct answer indexing techniques according to the sentence of the target document or the question sentence of the user. A multiple correct answer extractor (160) extracts, from the information sources, the candidate answers that best match the user's information need through the correct answer extraction techniques. A correct answer manager (170) performs inference on the extracted candidate answers according to the user's information need, integrates them, and presents the correct answers to the user.

Description

Method and apparatus for extracting correct answer in question answering system

FIG. 1 is a block diagram of a multiple correct answer extraction apparatus according to the present invention.

FIG. 2 is a diagram for explaining the heterogeneous correct answer indexer of FIG. 1.

FIG. 3 is a diagram for explaining the multiple correct answer extractor of FIG. 1.

FIG. 4 is a flowchart of a multiple correct answer extraction method according to the present invention.

FIG. 5 is a detailed flowchart of the language analysis step of FIG. 4.

FIG. 6 is a diagram showing an example of the answer types used in the present invention.

FIG. 7 is a detailed flowchart of the multiple correct answer extraction step of FIG. 4.

FIG. 8 is a detailed flowchart of the correct answer management step of FIG. 4.

* Explanation of reference numerals for the main parts of the drawings *

110: target document
120: question sentence
130: language analyzer
140: heterogeneous distributed information sources
150: heterogeneous correct answer indexer
151: pattern-based correct answer indexer
152: statistics-based correct answer indexer
153: short-answer indexer
154: descriptive-answer indexer
155: enumerated-answer indexer
160: multiple correct answer extractor
161: pattern-based correct answer extractor
162: statistics-based correct answer extractor
163: short-answer extractor
164: descriptive-answer extractor
165: enumerated-answer extractor
170: correct answer manager

The present invention relates to a method and apparatus for extracting multiple correct answers in a question answering system, and more particularly to a correct answer extraction method and apparatus that improve question answering performance by identifying the user's question intent (information need) and extracting the answer from the various information sources in which the answer the user wants is stored.

A question answering system searches a document collection for answers to a user's query and presents them to the user. Question answering retrieval methods, which extract only the information matching the user's query from the vast amount of information on websites and provide it to the user, have recently come into wide use.

As one such question answering retrieval method, Korean Patent Registration No. 434688 (2004.05.25) discloses a method that first builds a database containing only those candidate answers that can be formalized as a relational DB structure or an FAQ list, then uses an existing information retrieval system to retrieve relevant paragraphs, and estimates the closest data among them as the answer on the basis of pattern and language analysis.

However, because the candidate answers are limited to those that can be formalized as DB entries, this method can handle only very simple questions. Moreover, because it retrieves responses to user queries with an existing information retrieval system used as-is, the system responds slowly or requires considerable manual effort.

As another question answering retrieval method, the paper "In Question Answering, Two Heads Are Better Than One" (Proceedings of HLT-NAACL 2003, pp. 24-31, June 2003) discloses a method that, in order to find answers located in various sources, extracts answers from several answer extraction agents, assigns weights to the extracted answers, ranks them accordingly, and presents the top-ranked answer, thereby providing the user with a more accurate answer.

However, this method always extracts answers from every agent and presents the top-ranked answer regardless of the characteristics of the user's question. For a simple question it therefore performs unnecessary extraction, which slows retrieval; for a difficult, complex question whose answer cannot be found by a single agent, it cannot extract the answer the user wants.

In addition, these methods search for the answer to the user's question simply by referring to an indexed retrieval DB, so answer extraction takes a long time and the question answering time is very long.

The present invention has been devised to solve the above problems. Its object is to provide a multiple correct answer extraction method and apparatus that improve question answering performance by constructing various heterogeneous distributed information sources for question answering and, on that basis, accessing the information source that stores the answer best matching the user's question intent and extracting the answer the user wants through various correct answer extraction techniques.

To achieve the above object, the multiple correct answer extraction method according to the present invention comprises: (a) a language analysis step of linguistically analyzing a sentence of a target document or a question sentence of a user; (b) a step of constructing heterogeneous distributed information sources through various correct answer indexing techniques according to the characteristics of the sentence of the target document or of the user's question; (c) a multiple correct answer extraction step of extracting multiple candidate answers through various correct answer extraction techniques, based on the heterogeneous distributed information sources, according to the user's question intent; and (d) a correct answer management step of integrating the multiple candidate answers extracted in step (c) in accordance with the user's question intent.

Meanwhile, to achieve the above object, the multiple correct answer extraction apparatus according to the present invention comprises: a language analyzer that linguistically analyzes a sentence of a target document or a question sentence of a user; a heterogeneous correct answer indexer that constructs heterogeneous distributed information sources by indexing answers through various correct answer indexing techniques according to the characteristics of the sentence of the target document or of the user's question; a multiple correct answer extractor that extracts, based on the heterogeneous distributed information sources and through various correct answer extraction techniques, the multiple candidate answers that best match the user's question intent; and a correct answer manager that performs inference on the extracted candidate answers in accordance with the user's question intent, integrates them, and presents them to the user.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

As shown in FIG. 1, the multiple correct answer extraction apparatus (100) according to the present invention comprises: a language analyzer (130) that linguistically analyzes a sentence in a target document (110) or a question sentence (120) of a user; a heterogeneous correct answer indexer (150) that constructs the heterogeneous distributed information sources (140) by indexing answers through various correct answer indexing techniques according to the characteristics of the candidate answers; a multiple correct answer extractor (160) that identifies the user's question intent, accesses the heterogeneous distributed information sources (140) in which candidate answers are stored, and extracts multiple answers; and a correct answer manager (170) that performs inference on the extracted answers in accordance with the user's question intent, integrates them, and presents them to the user.

The language analyzer (130) receives a sentence in the target document (110) or the user's question sentence (120), linguistically analyzes the input sentence, and recognizes the answer type of the input sentence according to predefined classification categories. The language analysis method used here is described in detail below with reference to FIGS. 5 and 6.

Based on the analysis result from the language analyzer (130), the heterogeneous correct answer indexer (150) constructs the heterogeneous distributed information sources (140) using several kinds of indexers, chosen according to the characteristics of the candidate answers present in the sentence of the target document (110) or in the question sentence (120). The heterogeneous correct answer indexer (150) is described in more detail below with reference to FIG. 2.

FIG. 2 is a diagram for explaining the heterogeneous correct answer indexer (150) of FIG. 1.

As shown in FIG. 2, the heterogeneous correct answer indexer (150) consists of a pattern-based correct answer indexer (151), a statistics-based correct answer indexer (152), a short-answer indexer (153), a descriptive-answer indexer (154), and an enumerated-answer indexer (155).

The pattern-based correct answer indexer (151) indexes answers that are recognized by patterns or rules in the result analyzed by the language analyzer (130). The statistics-based correct answer indexer (152) learns the features of candidate answers through machine learning techniques and indexes answers on that basis.

The pattern-based correct answer indexer (151) and the statistics-based correct answer indexer (152) index only answers that have predetermined characteristics; candidate answers of any other type are preferably indexed by the short-answer indexer (153).

Meanwhile, some user questions, such as "Why is blood red?" or "Between which countries is Niagara Falls located?", call for a descriptive answer or an enumerated answer rather than a short answer. In such cases the answers are preferably handled by the descriptive-answer indexer (154) and the enumerated-answer indexer (155), respectively.

The following are examples of the answers handled by each indexer.

(1) Pattern-based correct answer indexer (151)

- Target sentence: "The longest river in the world is the Nile."
- Candidate answer: Nile

(2) Statistics-based correct answer indexer (152)

- Target sentence: Park Chung-hee was born in Seonsan, Gyeongbuk.
- Candidate answer: (Park Chung-hee, hometown) = (Seonsan, Gyeongbuk)

(3) Short-answer indexer (153)

- Target sentence: The International Red Cross has established the 'Nightingale Award' and each year selects and honors outstanding nurses from countries around the world.
- Candidate answers: International Red Cross (ORG), Nightingale Award (PRIZE), nurse (OCCUPATION)

(4) Descriptive-answer indexer (154)

- Target sentence: It appears red because of the hemoglobin in the blood.
- Candidate answer: (blood - cause of red color) = (because of the hemoglobin in the blood)

(5) Enumerated-answer indexer (155)

- Target sentence: Niagara Falls is on the border between the United States and Canada.
- Candidate answers: United States, Canada
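As a minimal sketch of how a pattern-based indexer such as (151) might turn example (1) into an index entry, the Python below applies one regular expression to the English rendering of the target sentence. The pattern, the `index_pattern_based` function, and the dictionary standing in for an information source are all illustrative assumptions; the patent does not disclose its actual patterns or storage format.

```python
import re
from collections import defaultdict

# Hypothetical superlative pattern (an assumption for illustration only):
# "The <superlative> <kind> in the world is (the) <answer>."
SUPERLATIVE = re.compile(
    r"^The (?P<attr>\w+est) (?P<kind>\w+) in the world is (?:the )?(?P<answer>[\w ]+?)\.?$"
)


def index_pattern_based(sentence, source):
    """Store the answer under (attribute, kind) if the sentence matches the pattern.

    Returns True when an entry was indexed, False otherwise.
    """
    match = SUPERLATIVE.match(sentence)
    if match is None:
        return False
    key = (match.group("attr"), match.group("kind"))
    source[key].append(match.group("answer"))
    return True


# A plain dict stands in for one of the heterogeneous information sources.
pattern_source = defaultdict(list)
indexed = index_pattern_based(
    "The longest river in the world is the Nile.", pattern_source
)
skipped = index_pattern_based(
    "Niagara Falls is on the border between the United States and Canada.",
    pattern_source,
)
```

A sentence that matches no pattern is simply not indexed here, which leaves it available to one of the other indexers.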

Referring again to FIG. 1, the multiple correct answer extractor (160) selects the answer extractors suited to the user's question intent and to the difficulty of the question, and extracts multiple candidate answers based on the heterogeneous distributed information sources (140). The multiple correct answer extractor (160) is described in more detail below with reference to FIG. 3.

FIG. 3 is a diagram for explaining the multiple correct answer extractor (160) of FIG. 1.

As shown in FIG. 3, the multiple correct answer extractor (160) consists of a pattern-based correct answer extractor (161), a statistics-based correct answer extractor (162), a short-answer extractor (163), a descriptive-answer extractor (164), and an enumerated-answer extractor (165). Each extractor accesses the heterogeneous distributed information source (140) that stores the answer best matching the user's question intent and extracts the answer the user wants through its own extraction technique.

The form of the candidate answers obtained through the multiple correct answer extractor (160) is nearly identical to that of the candidate answers obtained through the heterogeneous correct answer indexer (150), so a detailed description is omitted.

Referring again to FIG. 1, the correct answer manager (170) performs inference on the answers extracted by the multiple correct answer extractor (160) in accordance with the user's question intent, integrates them, and presents them to the user.

Hereinafter, the multiple correct answer extraction method according to the present invention is described in detail with reference to the accompanying drawings.

As shown in FIG. 4, the multiple correct answer extraction method according to the present invention is broadly divided into: a language analysis step (S410) of linguistically analyzing the user's question sentence; a heterogeneous correct answer indexing step (S420) of constructing the heterogeneous distributed information sources (140) by indexing answers through various correct answer indexing techniques according to the characteristics of the candidate answers; a multiple correct answer extraction step (S430) of identifying the user's question intent, accessing the heterogeneous distributed information sources (140) in which candidate answers are stored, and extracting multiple answers; and a correct answer management step (S440) of performing inference on the extracted answers in accordance with the user's question intent, integrating them, and presenting them to the user. Each step is described in more detail below.

[1] Language analysis step (S410)

The language analysis step (S410) linguistically analyzes a sentence in the target document (110) or the user's question sentence (120) and classifies the question. It is described in more detail below with reference to FIGS. 5 and 6.

FIG. 5 is a detailed flowchart of the language analysis step (S410) of FIG. 4, and FIG. 6 is a diagram showing an example of the answer types used in the present invention.

Referring to FIG. 5, the morphological analysis step (S411) decomposes the input sentence into morphemes and determines the part of speech of each morpheme. The answer type recognition step (S412) recognizes answer types, which are defined as about 160 predefined semantic categories. As shown in FIG. 6, the answer types are classified in detail by field: person name, academic field name, theory, artifact, organization name, place name, culture/civilization, date, time, quantity, event, animal, plant, material, and so on.

Next, the lexical semantic tagging step (S413) assigns a concept to each noun using a noun lexical concept network, and the lexical meaning determination step (S414) determines the meaning of each word on the basis of the information tagged in the lexical semantic tagging step (S413).

Next, the parsing step (S415) analyzes and outputs the syntactic structure of each sentence, and the LF (Logical Form) extraction step (S416) extracts the LF structure by reorganizing, around the predicate, the result of parsing performed using case frames.
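The answer type recognition step (S412) can be illustrated with a deliberately small sketch. The Python below maps clue phrases in an English question to an answer type tag. The clue table and the `recognize_answer_type` function are assumptions made for illustration: `QT_LENGTH` is taken from the worked example in this document, while `PERSON`, `LOCATION`, and `DATE` are placeholder tags, and the roughly 160 categories of FIG. 6 are not reproduced.

```python
import re

# Hypothetical clue table. QT_LENGTH comes from the worked example in this
# document; the other tags are placeholders invented for this sketch.
CLUES = [
    ("how long", "QT_LENGTH"),
    ("who", "PERSON"),
    ("where", "LOCATION"),
    ("when", "DATE"),
]


def recognize_answer_type(question):
    """Return the tag of the first clue phrase found as whole words, else None."""
    text = question.lower()
    for clue, answer_type in CLUES:
        # Word boundaries keep "who" from matching inside words such as "whole".
        if re.search(r"\b" + re.escape(clue) + r"\b", text):
            return answer_type
    return None  # no clue matched; a fuller system would need deeper analysis
```

For instance, `recognize_answer_type("How long is the longest river in the world?")` returns `"QT_LENGTH"`. A keyword table like this is far cruder than the morphological and syntactic analysis the steps above describe; it only shows where an answer type label comes from.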

[2] Heterogeneous correct answer indexing step (S420)

The heterogeneous correct answer indexing step (S420) constructs the heterogeneous distributed information sources (140) through the several kinds of indexers in the heterogeneous correct answer indexer (150), according to the characteristics of the candidate answers present in a sentence of the target document (110) or in the user's question sentence (120). Answer indexing by the heterogeneous correct answer indexer (150) has already been described in detail with reference to FIG. 2, so a detailed description is omitted here.

[3] Multiple correct answer extraction step (S430)

The multiple correct answer extraction step (S430) selects the answer extractors suited to the user's question intent and to the difficulty of the question, and extracts multiple candidate answers based on the heterogeneous distributed information sources (140). It is described in more detail below with reference to FIG. 7.

FIG. 7 is a detailed flowchart of the multiple correct answer extraction step (S430) of FIG. 4.

First, the user's question intent and the difficulty of the question are identified on the basis of the question analysis result produced by the language analysis step (S410) (S431). Then the answer extractors suited to that intent and difficulty are selected according to the multi-strategy configuration (S432).

Here, the multi-strategy configuration is a codification of the overall rules governing answer extraction: which kind of answer extractor should be selected to extract an answer matching the user's question, whether more than one extractor should be used together, and in what form the extracted candidate answers should be integrated. It contains extractor selection rules keyed on the sentence structure of the user's question, its keywords, its clue words, and its answer type, together with rules for integrating candidate answers.

That is, once the extractors suited to the user's question intent and difficulty have been selected according to the multi-strategy configuration, the selected extractors extract from the heterogeneous distributed information sources (140) the multiple candidate answers that best match the user's question (S433).
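One way to picture the multi-strategy configuration is as an ordered rule table that maps features of the analyzed question to a set of extractors and an integration rule. The Python below is a sketch under that assumption. The field names (`answer_type`, `clue`, `nested`), the extractor labels, and the integration labels are all hypothetical; the patent states only that such selection and integration rules exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    extractors: tuple  # extractor names to run, in order
    integration: str   # label of the rule used to merge their candidates


# Hypothetical rule table: (predicate over the question analysis, strategy).
RULES = [
    # A length question about an entity that must itself be looked up first.
    (lambda q: q.get("answer_type") == "QT_LENGTH" and q.get("nested", False),
     Strategy(("pattern", "statistics"), "chain")),
    (lambda q: q.get("clue") == "why",
     Strategy(("descriptive",), "single")),
    (lambda q: q.get("clue") == "which countries",
     Strategy(("enumerated",), "list")),
]

# Fallback when no rule applies (an assumption of this sketch).
DEFAULT = Strategy(("short_answer",), "single")


def select_strategy(analysis):
    """Return the first strategy whose predicate accepts the question analysis."""
    for predicate, strategy in RULES:
        if predicate(analysis):
            return strategy
    return DEFAULT
```

With this table, `select_strategy({"answer_type": "QT_LENGTH", "nested": True})` selects two extractors, while a question with no matching rule falls back to a single one, so a simple question does not trigger every extractor.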

[4] Correct answer management step (S440)

The correct answer management step (S440) performs inference on the answers extracted in the multiple correct answer extraction step (S430) in accordance with the user's question intent, integrates them, and presents them to the user. It is described in more detail below with reference to FIG. 8.

FIG. 8 is a detailed flowchart of the correct answer management step (S440) of FIG. 4.

As shown in FIG. 8, the correct answer management step (S440) is divided into: a step (S441) of integrating the candidate answers extracted in the multiple correct answer extraction step (S430) in accordance with the answer type of the user's question; a fitness comparison and answer re-ranking step (S442) of judging whether the integrated answers fit the user's question intent and re-ranking them; and a step (S443) of presenting the integrated candidate answers together with additional information.
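The three sub-steps S441 to S443 can be sketched as follows. This Python example assumes that every extractor reports a numeric score with each candidate, that duplicates are merged by summing scores, and that fitness is simply the merged score. None of these choices is stated in the patent. The `Candidate` class, the `LOCATION` tag, and the `1234 km` distractor are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    answer_type: str
    score: float  # hypothetical confidence reported by an extractor


def manage_answers(candidates, expected_type):
    """Integrate, re-rank, and return candidates for presentation."""
    # S441: keep candidates of the expected answer type and merge
    # duplicates by summing their scores (an assumed merge rule).
    merged = {}
    for c in candidates:
        if c.answer_type == expected_type:
            merged[c.text] = merged.get(c.text, 0.0) + c.score
    # S442: re-rank by merged score, highest first.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    # S443: hand the ranked list back for presentation.
    return ranked


ranked = manage_answers(
    [
        Candidate("6690 km", "QT_LENGTH", 0.5),
        Candidate("Nile", "LOCATION", 0.9),       # wrong type, filtered out
        Candidate("1234 km", "QT_LENGTH", 0.5),   # made-up distractor
        Candidate("6690 km", "QT_LENGTH", 0.25),  # duplicate from another extractor
    ],
    expected_type="QT_LENGTH",
)
```

Here the candidate reported by two extractors ends up ranked first, and the candidate of the wrong answer type is dropped before ranking.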

The following example illustrates the process, according to the present invention, of analyzing the user's question intent, extracting answers on that basis, and finally presenting the answer to the user.

Example)

User question: How long is the longest river in the world?

(1) Identifying the user's question intent

- Answer type: length (QUANTITY: QT_LENGTH)
- Required information: 'the longest river in the world' + 'the length of that river'

(2) Multiple answer extraction

- Pattern-based correct answer extractor: 'the longest river in the world'
- Statistics-based correct answer extractor: 'the length of that river'

(3) Answer integration

- Candidate answer from the pattern-based correct answer extractor: Nile (the longest river in the world)
- Candidate answer from the statistics-based correct answer extractor: 6690 km (the length of the Nile)

(4) Fitness comparison and answer re-ranking

- Comparison of the fitness between the candidate answers and the user's question

(5) Presentation of candidate answer and additional information

- Candidate answer: 6690 km
- Additional information: location of the Nile, photographs of the Nile, and other information related to the Nile
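The example above chains two lookups: one extractor resolves which river is meant, and a second supplies the requested attribute of that river. A minimal Python sketch of that chaining follows. The two dictionaries are toy stand-ins for information sources, holding only the values given in the example. The `answer_chained` function and the key layout are assumptions, not the patent's data model.

```python
# Toy stand-ins for two information sources; the contents come from the
# example above, the key layout is an assumption of this sketch.
pattern_source = {("longest", "river"): "Nile"}
attribute_source = {("Nile", "length"): "6690 km"}


def answer_chained(superlative_key, attribute):
    """Resolve the entity first, then look up the requested attribute of it."""
    entity = pattern_source.get(superlative_key)
    if entity is None:
        return None  # the first extractor found nothing to build on
    value = attribute_source.get((entity, attribute))
    if value is None:
        return None  # entity known, but the attribute is not indexed
    return {"answer": value, "entity": entity}


result = answer_chained(("longest", "river"), "length")
missing = answer_chained(("tallest", "mountain"), "height")
```

Returning the resolved entity alongside the answer gives the presentation step something to attach additional information to, such as the location of the Nile.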

As described above, heterogeneous distributed information sources are constructed through the heterogeneous correct answer indexer according to the various characteristics of the candidate answers; the user's question intent is then identified, suitable answer extractors are selected, and multiple candidate answers are extracted; and the extracted candidate answers are integrated in accordance with the answer type of the user's question and presented. The performance of information retrieval and question answering systems can thereby be improved.

Meanwhile, the embodiments of the present invention described above can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program from a computer-readable recording medium.

The computer-readable recording medium includes storage media such as magnetic storage media (e.g., ROM, floppy disks, hard disks), optically readable media (e.g., CD-ROM, DVD), and carrier waves (e.g., transmission over the Internet).

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

상기한 바와 같이, 본 발명에 따르면, 질의응답을 위해 다양한 이질 분산 정보원을 구축하고 이를 기반으로 사용자의 질문 의도에 가장 부합하는 정답이 저장되어 있는 정보원에 접근하여 다양한 정답 추출 기법을 통해 사용자가 원하는 정답 만을 추출할 수 있으므로, 이에 따라 정보검색 및 질의응답 시스템의 성능을 향상시킬 수 있는 효과가 있다.As described above, according to the present invention, various heterogeneous distributed information sources are constructed for question and answer, and the user accesses the information source that stores the correct answer that best fits the user's intention of the question, and the user wants the user through the various answer extraction techniques. Since only the correct answer can be extracted, there is an effect that can improve the performance of the information retrieval and question answering system.

In addition, according to the present invention, correct answers are presented based on a multi-strategy that selects either single correct answer extraction or parallel correct answer extraction according to the difficulty of the question, which improves the response speed for the user.

Claims

A method for extracting multiple correct answers in a question answering system, comprising:

(a) a language analysis step of linguistically analyzing a sentence of a target document or a question sentence of a user;

(b) a step of constructing heterogeneous distributed information sources through various correct answer indexing techniques according to the sentence of the target document or the characteristics of the user's question;

(c) a multiple correct answer extraction step of extracting multiple candidate correct answers from the heterogeneous distributed information sources according to the user's question intent; and

(d) a correct answer management step of integrating the multiple candidate correct answers extracted in step (c) according to the user's question intent.
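Steps (a) to (d) can be pictured as a small pipeline. The following is a minimal sketch only, not the patented implementation: the function names, the two toy information sources ("keyword" and "sentence"), and the scoring rules are all assumptions introduced for illustration.

```python
from collections import defaultdict

def analyze(text):
    """Step (a), toy version: lowercase tokens with edge punctuation removed."""
    tokens = (tok.strip(".,?!").lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def build_sources(documents):
    """Step (b), toy version: build two different (heterogeneous) sources."""
    sources = {"keyword": defaultdict(set), "sentence": []}
    for doc_id, text in enumerate(documents):
        sources["sentence"].append((doc_id, text))
        for tok in analyze(text):
            sources["keyword"][tok].add(doc_id)
    return sources

def extract_candidates(question, sources):
    """Step (c): gather (doc_id, score, source_name) candidates per source."""
    q_tokens = set(analyze(question))
    candidates = []
    # Keyword source: score is the number of question tokens found in the doc.
    hits = defaultdict(int)
    for tok in q_tokens:
        for doc_id in sources["keyword"].get(tok, ()):
            hits[doc_id] += 1
    candidates += [(doc_id, float(n), "keyword") for doc_id, n in hits.items()]
    # Sentence source: score is the Jaccard overlap with the question.
    for doc_id, text in sources["sentence"]:
        s_tokens = set(analyze(text))
        shared = q_tokens & s_tokens
        if shared:
            score = len(shared) / len(q_tokens | s_tokens)
            candidates.append((doc_id, score, "sentence"))
    return candidates

def manage_answers(candidates, documents):
    """Step (d): integrate by summing scores per document, then rank."""
    merged = defaultdict(float)
    for doc_id, score, _source in candidates:
        merged[doc_id] += score
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(documents[doc_id], score) for doc_id, score in ranked]
```

A caller would build the sources once with `build_sources(documents)` and then run `manage_answers(extract_candidates(question, sources), documents)` for each question.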

The method of claim 1, wherein step (a) comprises:

a morphological analysis step of decomposing the user's question sentence into morpheme units and determining the part of speech of each morpheme;

a correct answer type recognition step of recognizing the correct answer type of the user's question according to predefined classification items;

a vocabulary meaning tagging step of tagging the meaning of each vocabulary item using a noun vocabulary concept network;

a vocabulary meaning determining step of determining the meaning of each vocabulary item based on the information tagged in the vocabulary meaning tagging step;

a parsing step of analyzing the syntactic structure of each sentence; and

an LF extraction step of extracting a logical form (LF) structure by reconstructing the parsing result using a verb-centered frame.

The method of claim 1, wherein step (b) comprises:

selecting a correct answer indexing technique suited to the characteristics of a candidate correct answer included in the sentence of the target document or the question sentence of the user; and

constructing the heterogeneous distributed information source according to the characteristics of the candidate correct answer through the selected correct answer indexing technique.

The method of claim 3, wherein the correct answer indexing technique comprises:

at least one of a pattern-based correct answer indexing technique, a statistics-based correct answer indexing technique, a short-answer correct answer indexing technique, a descriptive correct answer indexing technique, and an enumerated correct answer indexing technique.
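The selection described in claims 3 and 4 amounts to routing each sentence to the index that suits the kind of answer it can supply. The sketch below is a hypothetical illustration: the keyword heuristics in `classify_candidate` are assumptions made for this example (the claims do not define how the characteristic is detected), and only four of the five named index kinds are shown.

```python
import re
from collections import defaultdict

def classify_candidate(sentence):
    """Hypothetical heuristic for the kind of answer a sentence can supply."""
    if re.search(r"\b(is defined as|refers to|means)\b", sentence):
        return "descriptive"      # definition-like sentences
    if sentence.count(",") >= 2 and re.search(r"\b(include|includes|such as)\b", sentence):
        return "enumerated"       # list-like sentences
    if re.search(r"\b\d{4}\b", sentence):
        return "short_answer"     # sentences carrying a four-digit number, e.g. a year
    return "statistical"          # fallback: generic statistics-based index

def build_heterogeneous_index(sentences):
    """Store each sentence in the index chosen for its characteristic."""
    index = defaultdict(list)
    for sentence in sentences:
        index[classify_candidate(sentence)].append(sentence)
    return dict(index)
```

Keeping the classifier separate from the index builder means a different selection rule can be swapped in without touching how the sources are stored.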

The method of claim 1, wherein step (c) comprises:

selecting, through a multi-strategy configuration, a correct answer extraction technique suited to the user's question intent and the difficulty of the question; and

extracting multiple candidate correct answers from the heterogeneous distributed information source through the selected correct answer extraction technique.

The method of claim 5, wherein the multi-strategy configuration comprises:

a rule for selecting a correct answer extraction technique suited to the user's question intent and the difficulty of the question, and a rule for integrating the multiple candidate correct answers.
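A multi-strategy configuration of this kind pairs a selection rule with an integration rule. The sketch below is one possible reading under stated assumptions: the two stub extractors, the word-count difficulty proxy in `estimate_difficulty`, and the best-score integration rule are all hypothetical and are not taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical extractors: each maps a question to (answer, score) candidates.
def pattern_extractor(question):
    return [("pattern answer", 0.9)]

def statistical_extractor(question):
    return [("statistical answer", 0.6), ("pattern answer", 0.3)]

EXTRACTORS = {"pattern": pattern_extractor, "statistical": statistical_extractor}

def estimate_difficulty(question):
    """Assumed proxy for difficulty: long questions count as hard."""
    return "hard" if len(question.split()) > 8 else "easy"

def select_extractors(intent, difficulty):
    """Selection rule: one extractor for easy questions, all of them otherwise."""
    if difficulty == "easy" and intent in EXTRACTORS:
        return [intent]
    return sorted(EXTRACTORS)

def integrate(candidate_lists):
    """Integration rule: keep each answer's best score, then sort by score."""
    best = {}
    for candidates in candidate_lists:
        for answer, score in candidates:
            best[answer] = max(score, best.get(answer, 0.0))
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))

def answer_question(question, intent):
    names = select_extractors(intent, estimate_difficulty(question))
    if len(names) == 1:
        # Single extraction: skip the parallel machinery for easy questions.
        results = [EXTRACTORS[names[0]](question)]
    else:
        # Parallel extraction: run every selected extractor concurrently.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda name: EXTRACTORS[name](question), names))
    return integrate(results)
```

Running a single extractor for easy questions is where a response-time gain would come from in this sketch; the parallel path trades extra work for broader coverage.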

The method of claim 5, wherein the correct answer extraction technique comprises:

at least one of a pattern-based correct answer extraction technique, a statistics-based correct answer extraction technique, a short-answer correct answer extraction technique, a descriptive correct answer extraction technique, and an enumerated correct answer extraction technique.

The method of claim 1, wherein step (d) comprises:

integrating the multiple candidate correct answers extracted in step (c) according to the correct answer type of the user's question;

comparing the integrated correct answers against the user's question intent and reranking them; and

presenting the integrated candidate correct answers together with additional information.
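The three sub-steps of the correct answer management step can be sketched as follows. This is an illustrative assumption rather than the patent's method: the candidate record layout (`text`, `type`, `score`, `source`, `evidence`), the score-summing merge, and the term-overlap boost used for reranking are all invented for the example.

```python
def integrate_by_type(candidates, answer_type):
    """Keep candidates of the expected type; merge duplicates by summing scores."""
    merged = {}
    for cand in candidates:
        if cand["type"] != answer_type:
            continue
        entry = merged.setdefault(
            cand["text"],
            {"text": cand["text"], "score": 0.0, "sources": [], "evidence": []},
        )
        entry["score"] += cand["score"]
        entry["sources"].append(cand["source"])
        entry["evidence"].append(cand["evidence"])
    return list(merged.values())

def rerank(answers, question_terms):
    """Boost answers whose evidence shares terms with the question, then sort."""
    def final_score(answer):
        evidence_terms = {t.lower() for e in answer["evidence"] for t in e.split()}
        fit = len(question_terms & evidence_terms)
        return answer["score"] * (1 + fit)
    return sorted(answers, key=lambda a: (-final_score(a), a["text"]))

def present(answers):
    """Render ranked answers with their sources as the additional information."""
    return [
        f"{rank}. {a['text']} (sources: {', '.join(a['sources'])})"
        for rank, a in enumerate(answers, start=1)
    ]
```

Listing the contributing sources next to each answer is one simple form the "additional information" could take; the evidence sentences themselves could be shown in the same way.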

An apparatus for extracting multiple correct answers in a question answering system, comprising:

a language analyzer for linguistically analyzing a sentence of a target document or a question sentence of a user;

a heterogeneous correct answer indexer for constructing a heterogeneous distributed information source by indexing correct answers through various correct answer indexing techniques according to the sentence of the target document or the characteristics of the user's question;

a multiple correct answer extractor for extracting, from the heterogeneous distributed information source, the multiple candidate correct answers that best match the user's question intent through various correct answer extraction techniques; and

a correct answer manager for inferring and integrating the extracted candidate correct answers according to the user's question intent and presenting them to the user.

The apparatus of claim 9, wherein the language analyzer:

recognizes the correct answer type of the sentence of the target document or the user's question sentence according to predefined classification items, analyzes the syntactic structure based on the meaning of each vocabulary item, and extracts a logical form (LF) structure from the parsing result.

The apparatus of claim 9, wherein the heterogeneous correct answer indexer comprises:

at least one of a pattern-based correct answer indexer, a statistics-based correct answer indexer, a short-answer correct answer indexer, a descriptive correct answer indexer, and an enumerated correct answer indexer,

wherein at least one of the correct answer indexers performs correct answer indexing according to the characteristics of the candidate correct answers included in the sentence of the target document or the user's question sentence.

The apparatus of claim 9, wherein the multiple correct answer extractor comprises:

at least one of a pattern-based correct answer extractor, a statistics-based correct answer extractor, a short-answer correct answer extractor, a descriptive correct answer extractor, and an enumerated correct answer extractor,

wherein at least one of the correct answer extractors, selected by a multi-strategy configuration according to the user's question intent and the difficulty of the question, extracts the candidate correct answers that best match the user's question intent.

The apparatus of claim 12, wherein the multi-strategy configuration comprises:

a rule for selecting a correct answer extractor suited to the user's question intent and the difficulty of the question, and a rule for integrating the multiple candidate correct answers.

The apparatus of claim 9 or 13, wherein the correct answer manager:

integrates the extracted candidate correct answers according to the correct answer type of the user's question under the multi-strategy configuration, then compares the integrated correct answers against the user's question intent, reranks them, and presents them to the user.