KR100597435B1

KR100597435B1 - System and method for classifying question based on hybrid of information search and question answer system

Info

Publication number: KR100597435B1
Application number: KR1020040102494A
Authority: KR
Inventors: 황이규; 장명길
Original assignee: Electronics and Telecommunications Research Institute (한국전자통신연구원)
Priority date: 2004-12-07
Filing date: 2004-12-07
Publication date: 2006-07-10
Also published as: KR20060063345A

Abstract

The present invention relates to a hybrid question classification system and method for information retrieval and question answering systems. The method comprises: a question work-name processing step that recognizes work names (titles) contained in the input question; a question language analysis step that converts each word of the question into a meaningful code through morphological analysis, named entity recognition, and lexical semantic tagging; a rule-based question classification step that classifies the answer type the question requires, using the question's LSP (Lexico-Semantic Pattern) codes and predefined question classification rules; a statistics-based question classification step that classifies the question's LSP codes using statistics built from training documents tagged with answer types; and an answer type determination step that uses the results of the rule-based and statistics-based classifications to make the final decision on the type of answer to the user's question. Because the system can then focus on only the answer the user wants within the large set of documents retrieved for the question, the performance of information retrieval and question answering systems can be improved.

Information retrieval, question answering, work name recognition, hybrid question classification

Description

SYSTEM AND METHOD FOR CLASSIFYING QUESTION BASED ON HYBRID OF INFORMATION SEARCH AND QUESTION ANSWER SYSTEM

FIG. 1 shows a conventional search process.

FIG. 2 shows a question classification process using a conventional query type classifier based on a support vector machine.

FIG. 3 shows a hybrid question classification system and its classification method according to an embodiment of the present invention.

FIG. 4 shows various answer types used in classifying user questions according to the present invention.

FIG. 5 shows the relationship between the prebuilt work-name dictionary and the unique codes of the present invention.

FIG. 6 and FIG. 7 show filtering that uses the prebuilt left/right context information of work names.

FIG. 8 shows the language analysis process applied to a question in the present invention.

FIG. 9 shows examples of LSP rules.

FIG. 10 shows an example of a question corpus.

<Explanation of reference numerals for the main parts of the drawings>

10: question work-name processing unit 11: work name and recognition context DB

20: question language analysis unit 21: language analysis knowledge DB

30: rule-based question classification unit 31: question classification rule DB

40: statistics-based question classification unit 41: question corpus statistics DB

50: answer type determination unit

The present invention relates to a hybrid question classification system and method for information retrieval and question answering systems, and more particularly to a system and method that classify a question by finding, from the parameters contained in the question, the type of answer the question requires.

In an information retrieval system, a user's query is treated as a list of keywords, and the relationships among the keywords are expressed with Boolean operations. A keyword-based retrieval system checks whether each keyword occurs in a document and, when it does, applies weights derived from information assigned during indexing, such as term frequency (TF) and inverse document frequency (IDF). The search result is therefore the whole set of documents related to the keywords, and locating the desired passage within a document is left to the user. Users, however, want the search system to find the answer contained in a document directly, and to serve that goal a question answering system needs to classify the type of answer the question requires.

A method has previously been proposed for a service in which an artificial intelligence system analyzes a user's natural-language question and effectively presents information available on the Internet. It analyzes the question morphologically, expands its vocabulary through a thesaurus, and classifies the question type into categories such as 'definition', 'law', 'medicine', 'education', and 'Internet'.

A document retrieval system and a question answering system have also been proposed. This approach extracts keywords from the question for use in question answering: main-type keywords concerning the central topic of the question, and non-main-type keywords concerning supplementary information. For example, from “2002년에 개최된 FIFA 월드컵의 우승국은 어디입니까” ("Which country won the FIFA World Cup held in 2002?") it obtains '2002년' ("2002"), '개최' ("held"), 'FIFA', '월드컵' ("World Cup"), '우승' ("win"), '국가' ("country"), '어디' ("where"), and so on. Each search keyword is then given a keyword type in the form of a semantic expression, such as a date expression for '2002년' and an organization name for 'FIFA'. Verbs in the question are also divided into main verbs and auxiliary verbs according to their syntactic properties, and the question is classified using the interrogative it contains. The semantic expression of each word is hierarchical: a date has a year/month/day/time structure, and a place has country / prefecture / municipality / street address levels.

FIG. 1 shows the processing of “2002년에 개최된 FIFA 월드컵의 우승국은 어디입니까?” ("Which country won the FIFA World Cup held in 2002?") and its result. As FIG. 1 shows, each word is given a keyword type based on its semantic attributes, and the question type is determined from the interrogative '어디' ("where") and the semantic types of individual keywords such as '국' ("country"), which identifies the question as one about a "place".

A support vector machine (SVM) based query type classifier for Korean question answering systems has also been proposed. It builds a support vector machine from a question corpus through feature extraction, feature selection, feature weighting, and training, and uses it to determine the type of a question. The approach is divided broadly into a query training part and a query classification part; FIG. 2 shows the detailed flow.

To classify the query type, each word goes through named entity recognition and semantic labeling, and is assigned to one of the following four categories.

- Proper nouns: names of persons, countries, cities, institutions, etc.

- Common nouns: job titles, hobbies, positions, etc.

- Unit nouns: km, m, kg, g, mg, etc.

- Other: special words needed by the question answering system

As shown in FIG. 2, each word of the questions in the training query set is encoded through named entity recognition and semantic labeling, vectors are formed from these codes, support vectors are learned for each query type, and the resulting model is used to classify questions.

However, this SVM-based query type classifier is a purely statistical model. It requires a large set of training questions, and the more diverse the answer types, the more severe the data sparseness problem becomes. In addition, because only a statistical model is used, it is difficult to add new answer types and difficult to tune the question classification system.

Meanwhile, in the previously proposed service that uses an artificial intelligence system to analyze natural-language questions and present information from the Internet, the classification scheme is very simple ('definition', 'law', 'medicine', 'education', 'Internet'), and the method finds the category the question itself belongs to rather than the type of answer the question requires. Here, semantic attributes are assigned to individual words and question types are separated by combining those attributes with interrogatives, so the question is classified with heuristic information rather than with a formal, systematic method such as explicit rules or a statistical model.

As described, the existing methods process a question without recognizing the work names it may contain, which harms the accuracy of question classification. A work name is the title of a work in an artistic field such as film, TV drama, fiction, plays, opera, or music; it consists of one or more words, and its individual words carry no meaning of their own in question analysis.

Accordingly, the present invention has been made to solve the above problems of the prior art. Its object is to provide a hybrid question classification system and method for information retrieval and question answering systems that classify a question by recognizing the work names it contains and finding the type of answer it requires from its vocabulary, the inclusion and exclusion relations of important words, parts of speech, interrogatives, the semantic classes of words, and the entity types of words.

To achieve the above object, the hybrid question classification system for information retrieval and question answering systems according to the present invention comprises: a question work-name processing unit that recognizes work names contained in the input question; a question language analysis unit that converts each word of the question into a meaningful code through morphological analysis, named entity recognition, and lexical semantic tagging; a rule-based question classification unit that classifies the answer type the question requires, using the question's LSP codes and predefined question classification rules; a statistics-based question classification unit that classifies the question's LSP codes using statistics built from training documents tagged with answer types; and an answer type determination unit that uses the results of the rule-based and statistics-based classifications to make the final decision on the type of answer to the user's question.

Likewise, the hybrid question classification method for information retrieval and question answering systems according to the present invention comprises: a question work-name processing step of recognizing work names contained in the input question; a question language analysis step of converting each word of the question into a meaningful code through morphological analysis, named entity recognition, and lexical semantic tagging; a rule-based question classification step of classifying the answer type the question requires, using the question's LSP codes and predefined question classification rules; a statistics-based question classification step of classifying the question's LSP codes using statistics built from training documents tagged with answer types; and an answer type determination step of using the results of the rule-based and statistics-based classifications to make the final decision on the type of answer to the user's question.

In short, the system and method proposed by the present invention determine the semantic class of the answer a question requires, using the semantic and statistical characteristics of the words that appear in the user's question.

That is, the present invention is a question classification method for information retrieval and question answering systems that defines in advance a semantic classification of the answers that natural-language questions may require, and automatically classifies, in real time, the type of answer a user's question requires. Its purpose is to help find the answer within a document rather than merely find documents relevant to the question. For example, it finds the answer type "company name" for the question “에버랜드를 운영하는 기업은 어디인가요?” ("Which company operates Everland?") and "work name" for “훈민정음의 보급책의 일환으로 만든 책은?” ("Which book was made as part of the effort to spread Hunminjeongeum?"). To this end, the invention concerns an automatic question classification method and apparatus that 1) predefine the answer types for questions users are likely to ask, 2) recognize work names contained in the question, whose individual words carry no meaning on their own but act as a meaningful unit when combined, 3) classify question types through precise yet flexible rules for each answer type, and 4) use a statistical method to classify the answer type of questions that the rules cannot decide.

Hereinafter, the hybrid question classification system and method for information retrieval and question answering systems according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 3 shows a hybrid question classification system and its classification method according to an embodiment of the present invention. As shown in FIG. 3, the system broadly consists of: a question work-name processing unit (10) that recognizes work names appearing in the question (titles of one or more words, such as book titles, play and film titles, and music titles); a question language analysis unit (20) that performs morphological analysis, named entity recognition, concept-network-based lexical sense tagging, and lexical sense disambiguation on the question; a rule-based question classification unit (30) that uses manually written question classification rules; a statistics-based question classification unit (40) that uses statistics learned from documents tagged with answer types; and an answer type determination unit (50) that makes the final decision.

In addition, the question work-name processing unit (10) is provided with a work name and recognition context DB (11), which stores a work-name table and left/right context rules for recognizing work names. The question language analysis unit (20) is provided with a language analysis knowledge DB (21) for question language analysis (named entity dictionary, named entity recognition contexts, lexical concept network, and mutual information for noun sense disambiguation). The rule-based question classification unit (30) is provided with an LSP (Lexico-Semantic Pattern) based question classification rule DB (31). The statistics-based question classification unit (40) is provided with a question corpus statistics DB (41), learned automatically from training documents for statistics-based classification with Naive Bayes.

The hybrid question classification method using this system proceeds broadly through a question work-name processing step (S10), a question language analysis step (S20), a rule-based question classification step (S30), a statistics-based question classification step (S40), and an answer type determination step (S50).

As a result of this process, a user's question is classified into one of various answer types, some of which are shown in FIG. 4. The processing method and the knowledge used in each step are described below.

- Work-name processing step (S10)

In a system for information retrieval and question answering, a user's question may contain words that form a work name. For example, in a question such as “영화 바람과 함께 사라지다의 여자 주인공은 누구였나요?” ("Who was the heroine of the movie Gone with the Wind?"), the user will ask without marking '바람과 함께 사라지다' ("Gone with the Wind") with quotation marks or brackets. Here, "work name" refers collectively to titles widely used in genres such as plays, films, novels, music, and TV dramas. If the search step treats '바람' ("wind"), '사라지다' ("disappear"), and so on as separate keywords, it retrieves unwanted answers. The work-name processing step (S10) therefore uses a prebuilt work-name dictionary with unique codes, as in FIG. 5, to replace work names in the question with their codes. For example, in “영화 ET의 장르는 무엇인가요?” ("What is the genre of the movie ET?"), both 'ET' and '장르' ("genre") are in the work-name dictionary, so the question is first converted automatically to “영화 햇햇24191의 햇햇49126는 무엇인가요?”. '장르' may be the title of an actual work, but in the context of this question it is not a work name, so it is filtered out using the prebuilt left/right context information for work names, as in FIG. 6 and FIG. 7. The final result is therefore “영화 햇햇24191의 장르는 무엇인가요?”. The work name and recognition context DB (11) can be built semi-automatically from the web using the work-name dictionary.
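The dictionary lookup and context filtering described for step S10 can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: it works on whitespace-separated English tokens (Korean input would need morphological analysis first, since particles attach to the title), the `WORK` marker stands in for the prefix used in the description's example, the left-context word list is hypothetical, and the code `10001` is made up. Only the codes 24191 and 49126 come from the description.

```python
# Hypothetical work-name dictionary: title -> unique code.
# 24191 and 49126 are the codes quoted in the description; 10001 is made up.
WORK_CODES = {
    "ET": "24191",
    "genre": "49126",
    "Gone with the Wind": "10001",
}

# Hypothetical left-context cue words: a dictionary hit is accepted as a
# work name only if the word immediately before it is one of these.
LEFT_CONTEXT = {"movie", "film", "novel", "drama", "song"}

MARKER = "WORK"  # stand-in for the prefix that marks a replaced title


def replace_work_names(question: str) -> str:
    """Replace dictionary titles with codes, keeping only hits whose
    left context suggests a work name (longest match first)."""
    tokens = question.rstrip("?").split()
    out, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):  # try the longest span first
            phrase = " ".join(tokens[i:j])
            if phrase in WORK_CODES:
                match = (j, phrase)
                break
        prev = tokens[i - 1].lower() if i > 0 else ""
        if match and prev in LEFT_CONTEXT:
            out.append(MARKER + WORK_CODES[match[1]])
            i = match[0]  # skip past the whole title
        else:
            out.append(tokens[i])  # not a title in this context
            i += 1
    return " ".join(out) + ("?" if question.endswith("?") else "")


print(replace_work_names("What is the genre of the movie ET?"))
print(replace_work_names("Who was the heroine of the movie Gone with the Wind?"))
```

In the first call, "genre" is in the dictionary but is preceded by "the", so the context filter rejects it and only "ET" is replaced. A real context DB would presumably also use right-context cues, which this sketch omits.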

- Question language analysis step (S20)

The question is analyzed linguistically, and the result is used for question classification. FIG. 8 shows the language analysis process applied to a question. As shown in FIG. 8, the morphological analysis step (S201) uses a morpheme dictionary (2011) to determine the part of speech of each morpheme and, when a morpheme is an interrogative, marks it as such. The named entity recognition step (S202) recognizes entities according to about 160 predefined semantic classes. As FIG. 4 shows, the entity classes are detailed subdivisions of person names, academic fields, theories, artifacts, organization names, place names, culture/civilization, dates, times, quantities, events, animals, plants, materials, and technical terms. The named entity dictionary (2021) is a dictionary built for these 160 or so entity classes. The lexical sense tagging step (S203) uses a noun concept network (2031) to assign concepts to each noun. The lexical sense disambiguation step (S204) selects a single sense from those attached in step S203, using mutual information (2041) between words that co-occur in a sentence, obtained from a large corpus. For example, '문' ("door, gate") can be used in senses such as '#모양' ("#shape"), '#시설물' ("#facility"), and '#단위' ("#unit"); in the question “사찰에서 대문 역할을 하는 문은?” ("Which gate serves as the main entrance of a temple?") it must be disambiguated to '#시설물'.
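One simple way to picture the sense disambiguation step (S204) is to score each candidate sense by its mutual information with the other words in the question and keep the best one. The sketch below assumes exactly that scoring rule, which the description does not spell out. The sense inventory uses English glosses of the tags in the example, and all mutual-information values are invented for illustration.

```python
# Hypothetical candidate senses for the ambiguous noun in the example
# (English glosses of the semantic tags mentioned in the description).
SENSES = {"door": ["#shape", "#facility", "#unit"]}

# Hypothetical mutual-information scores between a context word and a sense.
# In the described system these would come from a large corpus.
MUTUAL_INFO = {
    ("temple", "#facility"): 2.1,
    ("gate", "#facility"): 1.7,
    ("temple", "#shape"): 0.2,
    ("gate", "#unit"): 0.1,
}


def disambiguate(word: str, context: list[str]) -> str:
    """Pick the sense whose summed mutual information with the
    co-occurring context words is highest (an assumed scoring rule)."""
    def score(sense: str) -> float:
        return sum(MUTUAL_INFO.get((c, sense), 0.0) for c in context)
    return max(SENSES[word], key=score)


print(disambiguate("door", ["temple", "gate"]))
```

With these made-up scores the facility sense wins, matching the outcome the description expects for the temple question.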

- Rule-based question classification step (S30)

Rule-based question classification uses question classification rules written as LSPs (Lexico-Semantic Patterns). In the language analysis step, each word of the question is encoded in LSP form according to its characteristics; that is, each word is converted into a part of speech, a literal word, an entity name, an interrogative, a semantic code, and so on. The question is thus converted into an LSP and compared against predefined LSP rules to classify its type. The operators below are defined for this purpose; both the rules and the encoded questions are written with them.

- ^word: the literal word

* e.g. 식물 ("plant") --> ^식물, 직업 ("occupation") --> ^직업

* Matches the actual word that appears in the question.

- !POS: the morphological part of speech of the word

* e.g. 용어 --> !nc, 다르 --> !pa, 가 --> !jc, 았 --> !ep

* Matches the part of speech of a word in the question.

- @interrogative: whether the word is an interrogative

* e.g. 무슨 ("what kind of") --> @무슨, 언제 ("when") --> @언제

- &entity: the word belongs to one of the entity types

* e.g. 영국 ("United Kingdom") --> &LC_COUNTRY, 종달새 ("lark") --> &AM_BIRD

* Matches the entity type to which a word in the question belongs.

- #sense: the semantic expression of the word

* e.g. 문 ("door") --> #시설물 ("facility"), 긴축 ("austerity") --> #감소 ("decrease")

* Matches the semantic expression of a word in the question.

- %word: lexical inclusion

* e.g. 대학교 ("university") --> %학교 ("school"), 초등학교 ("elementary school") --> %학교

* Matches a word in the question that contains the given word as a part.

- ^$: special symbol marking the end of the question

- The operators can be combined with the following optional, emphasis, exclusion, and alternation operators.

* [@무엇] : information inside '[' and ']' may be omitted

* {@무엇} : information inside '{' and '}' is important for question classification

* <@무엇> : information inside '<' and '>' must not appear

* '|' : alternatives

FIG. 9 shows examples of LSP rules written in this notation. Through the language analysis step, the question is converted into LSP form and compared with the stored LSP rules to classify the type of answer it requires. For example, “헤밍웨이는 어떤 소설을 썼나?” ("Which novels did Hemingway write?") is converted by language analysis into the sequence “&PS_NAME !jx @어떤 #책 !jc !pv !ep !ef”; compared with the LSP rules, it partially matches “@어떤 #책” (the interrogative "which" followed by the sense "book"), so it is classified as a question asking for “AF_WORKS”. A question may partially match several rules. Each rule carries a weight according to the meaning of its LSP, and the results, sorted by the sum of the weights, are taken as the final classification of the question.
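The partial matching and weight summation described here can be illustrated with a small sketch. It assumes a rule is a contiguous token pattern with an answer type and a weight, and it ignores the `[ ]`, `{ }`, `< >`, and `|` operators entirely. The first rule and the encoded question come from the Hemingway example; the other rules, the use of `PS_NAME` and `LC_COUNTRY` as answer types, and all weights are hypothetical.

```python
# Each rule: (LSP pattern tokens, answer type, weight). Weights are made up.
RULES = [
    (["@어떤", "#책"], "AF_WORKS", 1.0),  # pattern from the description's example
    (["&PS_NAME"], "PS_NAME", 0.3),       # hypothetical low-weight rule
    (["@어디", "&LC_COUNTRY"], "LC_COUNTRY", 0.8),  # hypothetical rule
]


def contains(seq: list[str], pattern: list[str]) -> bool:
    """True if pattern occurs as a contiguous run inside seq."""
    n = len(pattern)
    return any(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))


def classify_by_rules(lsp_tokens: list[str]) -> list[tuple[str, float]]:
    """Sum the weights of all matching rules per answer type and
    return the candidates sorted by total weight, highest first."""
    scores: dict[str, float] = {}
    for pattern, answer_type, weight in RULES:
        if contains(lsp_tokens, pattern):
            scores[answer_type] = scores.get(answer_type, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


encoded = "&PS_NAME !jx @어떤 #책 !jc !pv !ep !ef".split()
print(classify_by_rules(encoded))
```

The top-ranked entry is the rule-based answer type, and its total weight is what a later step can compare against a threshold.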

- Statistics-based question classification step (S40)

Some questions cannot be classified with LSPs. For example, in “관상용 열대어 중 가장 작은 종은?” ("What is the smallest species of ornamental tropical fish?"), the pieces of information that matter for determining the answer type are separated by other words. Such cases are hard to capture with patterns, so statistics-based question classification is performed, treating the features of the question as a vector. Let q be the question, at the answer type of the question, t the pieces of information contained in the question (entity names, semantic tags, parts of speech, interrogatives, words, and so on), and n the size of the whole training question set. P(at|q) is then used to find the answer type at for a given question q. Using training documents in which each question is annotated with the answer type it requires (FIG. 8), the statistical relationship between each t in q and at is learned, and P(at|q) is computed from it with a Naive Bayes model.

--- Equation (1)

Because this cannot be computed directly, and to account for cases where a probability is 0, the above equation is transformed into the following equation to classify the answer type of the question.

--- Equation (2)
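The equations themselves are not reproduced in this text, so the sketch below does not claim to implement them. It shows one common way to realize a Naive Bayesian answer-type classifier that avoids zero probabilities: add-one (Laplace) smoothing with log-probabilities. The smoothing choice, the toy training questions, the feature strings, the `PS_NAME` label used as an answer type, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count answer types and feature/answer-type co-occurrences.

    examples: list of (features, answer_type) pairs, where features is
    the list of t values (tags, words, ...) extracted from a question.
    """
    type_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, at in examples:
        type_counts[at] += 1
        for t in feats:
            feat_counts[at][t] += 1
            vocab.add(t)
    return type_counts, feat_counts, vocab


def log_posteriors(feats, model):
    """Unnormalized log P(at|q) for every answer type, add-one smoothed."""
    type_counts, feat_counts, vocab = model
    n = sum(type_counts.values())  # number of training questions
    scores = {}
    for at, count in type_counts.items():
        total = sum(feat_counts[at].values())
        score = math.log(count / n)  # log prior of the answer type
        for t in feats:
            # +1 in the numerator keeps unseen features from yielding 0.
            score += math.log((feat_counts[at][t] + 1) / (total + len(vocab)))
        scores[at] = score
    return scores


def classify(feats, model):
    """Return the answer type with the highest smoothed posterior."""
    scores = log_posteriors(feats, model)
    return max(scores, key=scores.get)


# Toy, invented training data (not from the patent).
model = train([
    (["@which", "#book", "!pv"], "AF_WORKS"),
    (["@which", "#novel"], "AF_WORKS"),
    (["@who", "!pv"], "PS_NAME"),
    (["@who", "#person"], "PS_NAME"),
])
predicted = classify(["@which", "#fish"], model)
```

Working in log space also sidesteps numerical underflow when many features are multiplied together, which is a practical reason to rewrite the product form even apart from the zero-probability issue.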

- Determining the answer type of the question (S50)

If rule-based question classification yields a weight at or above a threshold, the rule-based classification result is used; if the weight is at or below the threshold, the statistics-based classification result is used. However, if the statistics-based result is also at or below a threshold obtained through training, classification of the question is abandoned and, starting from the last word of the question, the semantic representation of that word is returned as the classification result.
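The decision procedure above can be sketched as a simple cascade. The threshold values, the function and parameter names, and the handling of the boundary case are assumptions: the text says "at or above" for the rule-based branch and "at or below" for both fallbacks, so this sketch gives equality to the rule-based branch and treats a statistical score equal to its threshold as a failure. The fallback is read as scanning the question backwards for the first word that has a semantic representation.

```python
def decide_answer_type(rule_result, stat_result, tagged_words,
                       rule_threshold=1.0, stat_threshold=0.5):
    """Pick the final answer type from the two classifiers.

    rule_result:  (answer_type, summed_weight) from rule-based classification
    stat_result:  (answer_type, score) from statistics-based classification
    tagged_words: list of (word, meaning or None) in question order
    The threshold defaults are placeholders, not values from the patent.
    """
    rule_type, rule_weight = rule_result
    stat_type, stat_score = stat_result

    # 1. Trust the rules when their weight reaches the threshold.
    if rule_weight >= rule_threshold:
        return rule_type
    # 2. Otherwise fall back to the statistical classifier.
    if stat_score > stat_threshold:
        return stat_type
    # 3. Give up on classification: scan from the last word backwards
    #    and return the first semantic representation found.
    for _word, meaning in reversed(tagged_words):
        if meaning is not None:
            return meaning
    return None


words = [("tropical", None), ("fish", "#fish"), ("species", "#species"), ("?", None)]
from_rules = decide_answer_type(("AF_WORKS", 2.0), ("PS_NAME", 0.9), words)
from_stats = decide_answer_type(("AF_WORKS", 0.2), ("PS_NAME", 0.9), words)
fallback = decide_answer_type(("AF_WORKS", 0.2), ("PS_NAME", 0.1), words)
```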

Although the present invention has been described above in more detail with reference to several embodiments, it is not necessarily limited to these embodiments and may be modified in various ways without departing from its technical spirit.

As described above, the hybrid-based question classification system and method in an information retrieval and question answering system according to the present invention can narrow down the documents containing the answer the user wants, or help extract the answer directly from the actual documents through processes such as entity name recognition and passage retrieval, thereby improving the accuracy of information retrieval or question answering systems.

Claims

1. A hybrid-based question classification system in an information retrieval and question answering system, comprising:

a question work name processing unit that recognizes a work name included in an input question;

a question language analysis unit that converts the individual words of the question into meaningful codes through morphological analysis, entity name recognition, and lexical semantic tagging;

a rule-based question classification unit that classifies the answer type required by the question using the meaningful LSP-form codes of the question and predefined question classification rules;

a statistics-based question classification unit that classifies the LSP codes of the question using statistical information built from training documents tagged with the answer types of questions; and

a question answer type determination unit that finally determines the answer type for the user's question using the results of the rule-based question classification and the statistics-based question classification.

2. The hybrid-based question classification system of claim 1, wherein the question work name processing unit further comprises a work name recognition context DB that stores a work name table for recognizing work names and left/right context rules for work names.

3. The hybrid-based question classification system of claim 1, wherein the question language analysis unit further comprises a language analysis knowledge DB that includes, for question language analysis, an entity name dictionary, entity name recognition contexts, a lexical concept network, and mutual information for disambiguating noun senses.

4. The hybrid-based question classification system of claim 1, wherein the rule-based question classification unit further comprises a question classification rule DB (31) based on Lexico-Semantic Patterns (LSP) for rule-based question classification.

5. The hybrid-based question classification system of claim 1, wherein the statistics-based question classification unit further comprises a question corpus statistical information DB learned automatically from training documents for statistics-based question classification using Naive Bayesian.

6. A hybrid-based question classification method in an information retrieval and question answering system, comprising:

a question work name processing step of recognizing a work name included in an input question;

a question language analysis step of converting the individual words of the question into meaningful codes through morphological analysis, entity name recognition, and lexical semantic tagging;

a rule-based question classification step of classifying the answer type required by the question using the meaningful LSP-form codes of the question and predefined question classification rules;

a statistics-based question classification step of classifying the LSP codes of the question using statistical information built from training documents tagged with the answer types of questions; and

a question answer type determination step of finally determining the answer type for the user's question using the results of the rule-based question classification and the statistics-based question classification.

7. The method of claim 6, wherein the question work name processing step comprises:

converting the question, using a pre-built work name dictionary, into work names and unique codes;

filtering out, from the converted question, regions that are not work names by using pre-built left/right context information for work names; and

outputting the filtered question.

8. The method of claim 6, wherein the question language analysis step comprises:

determining the part of speech of each morpheme using a morpheme dictionary;

recognizing entity names using an entity name dictionary in which a plurality of predefined semantic categories are defined as entity names;

assigning a concept to each noun using a noun lexical concept network; and

determining a single lexical sense from among the information attached during lexical semantic tagging, by means of mutual information between the words of a sentence obtained from a large corpus.

9. The method of claim 8, wherein the entity names include person name, academic name, theory, artifact, organization name, place name, culture/civilization, date, time, quantity, event, animal, plant, substance, and terminology.

10. The method of claim 6, wherein, in the rule-based question classification step, the information coded in the question language analysis step is expressed in LSP form according to the characteristics of each word included in the question, the LSP form including operators, and the question type is classified accordingly.

11. The method of claim 6, wherein, in the statistics-based question classification step, statistical information on each variable is learned from training documents in which questions are annotated with the answer types they require, and the answer type for a given question is found using Equation (1) below.

--- Equation (1)

Here, q is the question, at is the answer type of the question, t is the information contained in the question, such as entity names, semantic tags, parts of speech, interrogatives, and vocabulary, and n is the size of the whole set of training questions.

12. The method of claim 6 or 11, wherein, in the statistics-based question classification step, the answer type for the question is classified using Equation (2) below, which accounts for the case where a probability is 0.

--- Equation (2)

13. The method of claim 6, wherein, in the question answer type determination step, only a classification that satisfies its threshold, among the rule-based question classification and the statistics-based question classification, is selectively determined as the answer type of the question.

14. The method of claim 6 or 13, wherein, in the question answer type determination step, if the results of both the rule-based and the statistics-based question classification are at or below the thresholds obtained through training, classification of the question is abandoned and, starting from the last word of the question, the semantic representation of that word is returned as the classification result.