KR20180125746A

KR20180125746A - System and Method for Sentence Embedding and Similar Question Retrieving

Info

Publication number: KR20180125746A
Application number: KR1020170060425A
Authority: KR
Inventors: 고영중; 배경만
Original assignee: 동아대학교 산학협력단
Priority date: 2017-05-16
Filing date: 2017-05-16
Publication date: 2018-11-26
Also published as: KR101923650B1

Abstract

The present invention relates to an apparatus and method for sentence embedding and similar query retrieval that can improve a query retrieval service by using word embedding and case frame-based sentence embedding technology. The apparatus comprises: a query language analysis processing unit that performs language analysis on a query inputted by a user and on a query subject to retrieval, maps the dependency relations between words in the query and each word's role as a sentence component, and extracts the sentence components; a case frame generating unit that generates a case frame using the sentence components extracted by the query language analysis processing unit; a similarity calculating unit that calculates the semantic similarity between the user's query and the query subject to retrieval by using a case frame-based vector generated by the case frame generating unit and a bag of words (BOW) model-based vector; a retrieval model building unit that uses the similarity calculated by the similarity calculating unit to build a retrieval model (WCFM) based on the relation between sentence components using the case frame and word embedding; and a re-ranking unit that performs re-ranking by reflecting both the similarity and ranking of each retrieval result obtained through a translation based language model (TRLM) and the similarity obtained through the WCFM built by the retrieval model building unit.

Description

{System and Method for Sentence Embedding and Similar Question Retrieving}

The present invention relates to a question retrieval service and, more specifically, to an apparatus and method for sentence embedding and similar question retrieval that improve a question retrieval service by using word embedding and case frame-based sentence embedding.

As the Internet has developed, people use search engines to find the wide variety of information available online. However, finding exactly the information a user wants takes considerable time and effort.

Recently, community-based question answering services, in which a user posts a question about the information they want and other users answer it, have grown in importance.

In similar question retrieval over the question-answer data used in such community-based question answering services, the word mismatch problem is a central issue.

The problem arises because questions share few words in common, and retrieval models based on term matching, such as the conventional vector space model, perform poorly because of it.

In a community-based question answering service that uses a similar question retrieval model, users ask questions in natural language, so it is important to retrieve questions that are semantically similar to the user's question.

To retrieve semantically similar questions, the word mismatch problem must be solved, and the semantic similarity between words must be computed effectively.

A representative prior-art question retrieval model is based on a language model (LM): it retrieves semantically similar questions by computing how the words in the user's question are distributed in each candidate question, that is, each question subject to retrieval.

However, this model cannot compute the association between words that have similar meanings but different forms. To address this, the translation-based language model (TRLM) was proposed, which solves the word mismatch problem by using translation probabilities between words.

Many question retrieval models have been studied as improvements on the translation-based language model. However, these prior-art models do not consider the role a word plays as a sentence component within the question. As a result, even when a candidate question is truly similar because its subject or object is strongly related in meaning to that of the user's question, a different candidate question whose predicate happens to be strongly related can be returned instead, producing incorrect retrieval results.

Therefore, there is a need for new technology that can improve a question retrieval service that retrieves the questions most similar to the question entered by a user of a community-based question answering service.

Korean Patent Publication No. 10-2004-0097814
Korean Patent Publication No. 10-2006-0063345

The present invention is intended to solve these problems of prior-art community-based question answering services. An object of the invention is to provide an apparatus and method for sentence embedding and similar question retrieval that improve a question retrieval service by using word embedding and case frame-based sentence embedding.

Another object of the present invention is to provide an apparatus and method for sentence embedding and similar question retrieval that generate a case frame composed of sentence components from the dependency relations of the user's question and solve the word mismatch problem between case frames by using word embedding and case frame-based sentence embedding, so that the questions most similar to the question entered by a user of a community-based question answering service can be retrieved effectively.

Another object of the present invention is to provide an apparatus and method for sentence embedding and similar question retrieval that generate case frames composed of the main sentence components, develop a sentence embedding technique that uses word embedding to compute the semantic similarity between case frames effectively, and improve retrieval performance by re-ranking the results of an existing retrieval model.

Another object of the present invention is to provide an apparatus and method for sentence embedding and similar question retrieval that extract the subject, object, predicate, and complement from each of the main clause, the modifier-type subordinate clause, and the complement-type subordinate clause, generate a case frame of up to 12 words, and, based on the generated case frame, effectively compute the semantic similarity between words that play the same role as sentence components, thereby improving the question retrieval model so that it reflects the role each word plays in the sentence.

Another object of the present invention is to provide an apparatus and method for sentence embedding and similar question retrieval that use word embedding-based feature vectors, which represent the semantic similarity between words well, to solve the word mismatch problem that arises when computing the semantic similarity between identical sentence components, while also computing the semantic similarity between words effectively.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve these objects, an apparatus for sentence embedding and similar question retrieval according to the present invention comprises: a question language analysis processing unit that performs language analysis on the question entered by a user and on the candidate questions, maps the dependency relations between words in each question and each word's role as a sentence component, and extracts the sentence components; a case frame generating unit that generates a case frame from the sentence components extracted by the question language analysis processing unit; a similarity calculating unit that computes the semantic similarity between the user's question and a candidate question using the case frame-based vector generated by the case frame generating unit and a bag of words (BOW) model-based vector; a retrieval model building unit that uses the similarity computed by the similarity calculating unit to build a retrieval model (WCFM) that takes into account the association between sentence components using the case frame and word embedding; and a re-ranking unit that performs re-ranking by reflecting both the similarity and rank of each retrieval result obtained through the TRLM (Translation based Language Model) and the similarity obtained through the WCFM built by the retrieval model building unit.

Here, the question language analysis processing unit comprises: a question language analysis unit that performs language analysis on the question entered by the user and on the candidate questions; a role mapping unit that performs morphological analysis and named entity recognition and then, through dependency parsing, maps the dependency relations between words in the question and each word's role as a sentence component; and a sentence component extraction unit that identifies the sentence component of each word through dependency parsing and extracts the main sentence components of the question, namely the subject, object, predicate, and complement.

The case frame generating unit uses the dependency relations of the question to extract the subject, predicate, object, and complement from each of the main clause, the modifier-based subordinate clause, and the complement-based subordinate clause, and generates a case frame of up to 12 words.

The similarity calculating unit generates a case frame vector using the word embedding-based feature vector matched to each word and computes the cosine similarity between the generated case frame vectors, thereby solving the word mismatch problem between case frame words, while also computing the semantic similarity between words using word embedding feature vectors generated from training data.

In the similarity calculating unit, each word constituting a case frame has a fixed position, and only the association between words in the same position is considered, so that the main-clause subject in the case frame generated from the user's question is compared for semantic association only with the main-clause subject in the case frame generated from the candidate question.

In the similarity calculating unit, binary values are used as the weights of each vector, the cosine similarity between the vectors is computed, and the final semantic similarity between the questions is computed through a linear combination.

The retrieval model building unit constructs a new vector by mapping each word constituting the case frame to its matching word embedding-based feature vector, and thereby builds the retrieval model (WCFM) that takes into account the association between sentence components using the case frame and word embedding.

To achieve another object, a method for sentence embedding and similar question retrieval according to the present invention comprises: a question language analysis processing step of performing language analysis on the question entered by a user and on the candidate questions, mapping the dependency relations between words in each question and each word's role as a sentence component, and extracting the sentence components; a case frame generating step of generating a case frame from the sentence components extracted in the question language analysis processing step; a similarity calculating step of computing the semantic similarity between the user's question and a candidate question using the case frame-based vector generated in the case frame generating step and a bag of words (BOW) model-based vector; a retrieval model building step of using the similarity computed in the similarity calculating step to build a retrieval model (WCFM) that takes into account the association between sentence components using the case frame and word embedding; and a re-ranking step of performing re-ranking by reflecting both the similarity and rank of each retrieval result obtained through the TRLM (Translation based Language Model) and the similarity obtained through the built WCFM.

Here, the question language analysis processing step comprises: a question language analysis step of performing language analysis on the question entered by the user and on the candidate questions; a role mapping step of performing morphological analysis and named entity recognition and then, through dependency parsing, mapping the dependency relations between words in the question and each word's role as a sentence component; and a sentence component extraction step of identifying the sentence component of each word through dependency parsing and extracting the main sentence components of the question, namely the subject, object, predicate, and complement.

The case frame generating step uses the dependency relations of the question to extract the subject, predicate, object, and complement from each of the main clause, the modifier-based subordinate clause, and the complement-based subordinate clause, and generates a case frame of up to 12 words.

The similarity calculating step generates a case frame vector using the word embedding-based feature vector matched to each word and computes the cosine similarity between the generated case frame vectors, thereby solving the word mismatch problem between case frame words, while also computing the semantic similarity between words using word embedding feature vectors generated from training data.

In the similarity calculating step, each word constituting a case frame has a fixed position, and only the association between words in the same position is considered, so that the main-clause subject in the case frame generated from the user's question is compared for semantic association only with the main-clause subject in the case frame generated from the candidate question.

In the similarity calculating step, binary values are used as the weights of each vector, the cosine similarity between the vectors is computed, and the final semantic similarity between the questions is computed through a linear combination.

The apparatus and method for sentence embedding and similar question retrieval according to the present invention have the following effects.

First, a community-based question answering service can be improved by using word embedding and case frame-based sentence embedding.

Second, by using word embedding and case frame-based sentence embedding to solve the word mismatch problem between case frames, the questions most similar to the question entered by a user of a community-based question answering service can be retrieved effectively.

Third, retrieval performance is improved by generating case frames composed of the main sentence components, developing a sentence embedding technique that uses word embedding to compute the semantic similarity between case frames effectively, and re-ranking the results of an existing retrieval model.

Fourth, the question retrieval model can be improved to reflect the role each word plays in the sentence by effectively computing the semantic similarity between words that play the same role as sentence components.

Fifth, by using word embedding-based feature vectors, which represent the semantic similarity between words well, the word mismatch problem that arises when computing the semantic similarity between identical sentence components can be solved, and the semantic similarity between words can be computed effectively.

FIG. 1 is a block diagram of an apparatus for sentence embedding and similar question retrieval according to the present invention.
FIG. 2 is a flowchart of a method for sentence embedding and similar question retrieval according to the present invention.
FIG. 3 is a diagram showing dependency parsing results for Korean and English sentences and the corresponding graph-based dependency relations.
FIG. 4 is a structural diagram of the case frame extracted from the main clause and the modifier-type and complement-type subordinate clauses.
FIG. 5 is a diagram of the model for computing the semantic similarity between a user's question and a candidate question using a case frame-based vector and a bag of words-based vector.
FIG. 6 is a diagram showing the process of generating a new case frame vector using word embedding-based feature vectors.
FIG. 7 is a diagram showing the formulation of the case frame-based question retrieval model using word embedding-based feature vectors.
FIG. 8 is an overall diagram of re-ranking similar question retrieval results using the similar question retrieval model based on case frames and word embedding.
FIG. 9 is a graph comparing similar question retrieval performance using the retrieval model developed through the present invention.

Hereinafter, preferred embodiments of the apparatus and method for sentence embedding and similar question retrieval according to the present invention are described in detail.

The features and advantages of the apparatus and method for sentence embedding and similar question retrieval according to the present invention will become apparent from the following detailed description of each embodiment.

FIG. 1 is a block diagram of an apparatus for sentence embedding and similar question retrieval according to the present invention, and FIG. 2 is a flowchart of a method for sentence embedding and similar question retrieval according to the present invention.

The apparatus and method for sentence embedding and similar question retrieval according to the present invention generate case frames composed of the main sentence components, develop a sentence embedding technique that uses word embedding to compute the semantic similarity between case frames effectively, and improve retrieval performance by re-ranking the results of an existing retrieval model.

To this end, the present invention extracts the subject, object, predicate, and complement from each of the main clause, the modifier-type subordinate clause, and the complement-type subordinate clause, generates a case frame of up to 12 words, and, based on the generated case frame, effectively computes the semantic similarity between words that play the same role as sentence components, thereby improving the question retrieval model so that it reflects the role each word plays in the sentence.

In addition, the present invention uses word embedding-based feature vectors, which represent the semantic similarity between words well, to solve the word mismatch problem that arises when computing the semantic similarity between identical sentence components, while also computing the semantic similarity between words effectively.

As shown in FIG. 1, the apparatus for sentence embedding and similar question retrieval according to the present invention comprises: a question language analysis unit (10) that performs language analysis on the question entered by the user and on the candidate questions; a role mapping unit (20) that performs morphological analysis and named entity recognition and then, through dependency parsing, maps the dependency relations between words in the question and each word's role as a sentence component; a sentence component extraction unit (30) that identifies the sentence component of each word through dependency parsing and extracts the main sentence components of the question, namely the subject, object, predicate, and complement; a case frame generating unit (40) that extracts the subject, predicate, object, and complement from each of the main clause, the modifier-based subordinate clause, and the complement-based subordinate clause and generates a case frame of up to 12 words; a similarity calculating unit (50) that computes the semantic similarity between the user's question and a candidate question using a case frame-based vector and a bag of words (BOW) model-based vector; a retrieval model building unit (60) that constructs a new vector by mapping each word constituting the case frame to its matching word embedding-based feature vector and thereby builds a retrieval model (Case Frame based Retrieval Model with Word embedding, WCFM) that takes into account the association between sentence components using the case frame and word embedding; and a re-ranking unit (70) that performs re-ranking by reflecting both the similarity and rank of each retrieval result obtained through the TRLM (Translation based Language Model) and the similarity obtained through the WCFM.

To improve question retrieval models based on translation probabilities and language models, the apparatus for sentence embedding and similar question retrieval according to the present invention reflects the role each word plays as a sentence component in the question.

To this end, the dependency relations obtained for the question are used to extract the subject, object, predicate, and complement contained in each of the main clause, the modifier-type subordinate clause, and the complement-type subordinate clause. A case frame of up to 12 words is generated, and a case frame vector is then generated using the word embedding-based feature vector matched to each word.

Computing the cosine similarity between the generated case frame vectors solves the word mismatch problem between case frame words, and using word embedding feature vectors generated from a large volume of training data also allows the semantic similarity between words to be computed effectively.
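A minimal sketch of the case frame vector comparison described above, under assumptions that this section does not state: the case frame vector is taken to be the concatenation of one embedding per slot, empty or out-of-vocabulary slots are filled with zero vectors, and the three-dimensional embeddings in `EMB` are made-up toy values rather than trained vectors. Under concatenation, the dot product decomposes into a sum of slot-wise dot products, so a word only interacts with the word in the same slot of the other frame.

```python
import math
from typing import Dict, List, Optional, Sequence

DIM = 3  # toy embedding size (assumption; real embeddings are much larger)

# Hypothetical toy embeddings; a real system would load trained vectors.
EMB: Dict[str, List[float]] = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.0],
    "repair": [0.0, 0.9, 0.1],
    "fix": [0.1, 0.8, 0.1],
}

def frame_vector(frame: Sequence[Optional[str]]) -> List[float]:
    """Concatenate one embedding per slot; empty or unknown slots become zeros."""
    vec: List[float] = []
    for word in frame:
        vec.extend(EMB.get(word, [0.0] * DIM) if word is not None else [0.0] * DIM)
    return vec

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

EMPTY = [None] * 9  # the nine slots left unfilled in this example
user_frame = [None, "laptop", "repair"] + EMPTY      # object=laptop, predicate=repair
same_roles = [None, "notebook", "fix"] + EMPTY       # synonyms in the same slots
swapped_roles = [None, "fix", "notebook"] + EMPTY    # same words, roles swapped

sim_same = cosine(frame_vector(user_frame), frame_vector(same_roles))
sim_swapped = cosine(frame_vector(user_frame), frame_vector(swapped_roles))
```

In this toy setup the two frames share no surface words, yet `sim_same` is high because the synonyms sit in matching slots, while `sim_swapped` is low because the related words occupy different roles.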

The retrieval model developed through the present invention (Case Frame based Retrieval Model with Word embedding, WCFM) improves retrieval results by re-ranking the top N results returned by the retrieval model based on translation probabilities and a language model (Translation based Language Model, TRLM).
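The re-ranking of the top N TRLM results can be sketched as follows. This is only an illustration: the section does not give the combining formula, so the min-max normalization of the TRLM score, the interpolation weight `lam`, and the function name `rerank_top_n` are all assumptions, and the sketch uses the TRLM score alone where the patent also mentions the rank.

```python
from typing import Callable, List, Tuple

def rerank_top_n(
    trlm_results: List[Tuple[str, float]],
    wcfm_sim: Callable[[str], float],
    n: int = 10,
    lam: float = 0.5,
) -> List[str]:
    """Re-rank the top-n TRLM results; results below the cutoff keep their order.

    trlm_results: (question_id, TRLM score) pairs, higher score = better.
    wcfm_sim:     returns the WCFM similarity in [0, 1] for a question id.
    lam:          hypothetical interpolation weight given to the WCFM similarity.
    """
    ranked = sorted(trlm_results, key=lambda r: r[1], reverse=True)
    head, tail = ranked[:n], ranked[n:]
    if not head:
        return []
    hi, lo = head[0][1], head[-1][1]
    span = hi - lo

    def combined(item: Tuple[str, float]) -> float:
        qid, score = item
        # Min-max normalize the TRLM score within the top-n window.
        norm = (score - lo) / span if span > 0 else 1.0
        return (1.0 - lam) * norm + lam * wcfm_sim(qid)

    head = sorted(head, key=combined, reverse=True)
    return [qid for qid, _ in head] + [qid for qid, _ in tail]

# Toy data: TRLM log-scores and hypothetical WCFM similarities.
trlm = [("q1", -3.0), ("q2", -3.5), ("q3", -4.0), ("q4", -9.0)]
wcfm = {"q1": 0.2, "q2": 0.9, "q3": 0.4, "q4": 0.99}
reranked = rerank_top_n(trlm, lambda qid: wcfm[qid], n=3, lam=0.5)
```

Here `q2` overtakes `q1` inside the top-3 window because of its higher WCFM similarity, while `q4` stays last even with a high WCFM similarity because it falls outside the re-ranked window.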

As shown in FIG. 2, the method for sentence embedding and similar question retrieval according to the present invention comprises: a question language analysis step (S201) of performing language analysis on the question entered by the user and on the candidate questions; a role mapping step (S202) of performing morphological analysis and named entity recognition and then, through dependency parsing, mapping the dependency relations between words in the question and each word's role as a sentence component; a sentence component extraction step (S203) of identifying the sentence component of each word through dependency parsing and extracting the main sentence components of the question, namely the subject, object, predicate, and complement; a case frame generating step (S204) of extracting the subject, predicate, object, and complement from each of the main clause, the modifier-based subordinate clause, and the complement-based subordinate clause and generating a case frame of up to 12 words; a similarity calculating step (S205) of computing the semantic similarity between the user's question and a candidate question using a case frame-based vector and a bag of words (BOW) model-based vector; a retrieval model building step (S206) of constructing a new vector by mapping each word constituting the case frame to its matching word embedding-based feature vector and thereby building a retrieval model (Case Frame based Retrieval Model with Word embedding, WCFM) that takes into account the association between sentence components using the case frame and word embedding; and a re-ranking step (S207) of performing re-ranking by reflecting both the similarity and rank of each retrieval result obtained through the TRLM (Translation based Language Model) and the similarity obtained through the WCFM.

The processing in each step of the method for sentence embedding and similar question retrieval according to the present invention is described in detail below.

FIG. 3 is a diagram showing dependency parsing results for Korean and English sentences and the corresponding graph-based dependency relations.

FIG. 4 is a structural diagram of the case frame extracted from the main clause and the modifier-type and complement-type subordinate clauses, and FIG. 5 is a diagram of the model for computing the semantic similarity between a user's question and a candidate question using a case frame-based vector and a bag of words-based vector.

In the question language analysis step (S201) of the method for sentence embedding and similar question retrieval according to the present invention, language analysis is first performed on the question entered by the user and on the candidate questions in order to extract the case frame, which is one of the main operating elements.

For question language analysis, morphological analysis and named entity recognition are performed, and dependency parsing is then used to map the dependency relations between the words of the question and their roles as sentence components.

The result obtained through dependency parsing can be represented as a simple graph, as shown in FIG. 3.

Dependency parsing identifies the sentence component of each word, and simple rules are then used to extract the main sentence components of the question: the subject, object, predicate, and complement.
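
The rule-based extraction described above can be sketched as follows. This is only an illustration: the source does not list its rules or its dependency label set, so the Universal Dependencies style labels (`nsubj`, `obj`, `root`, `xcomp`), the `ROLE_BY_RELATION` table, and the `extract_components` helper are all assumptions, and the parse is written by hand rather than produced by a real parser.

```python
# Hypothetical mapping from dependency relations to sentence components.
# The patent only says "simple rules"; these labels are an assumption.
ROLE_BY_RELATION = {
    "nsubj": "subject",
    "obj": "object",
    "root": "predicate",
    "xcomp": "complement",
}

def extract_components(tokens):
    """Return {role: word} from (word, dependency_relation) pairs.

    Only the first word found for each role is kept.
    """
    components = {}
    for word, relation in tokens:
        role = ROLE_BY_RELATION.get(relation)
        if role is not None and role not in components:
            components[role] = word
    return components

# Hand-written parse of "Who invented the telephone" (illustrative only).
parsed = [("Who", "nsubj"), ("invented", "root"),
          ("the", "det"), ("telephone", "obj")]
components = extract_components(parsed)
```

Words whose relation is not in the table, such as the determiner here, are simply ignored.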

Besides questions consisting of a simple sentence, there are also questions consisting of complex sentences with subordinate clauses. To cover all of these cases, as shown in FIG. 4, the subject, predicate, object, and complement are extracted from each of the main clause, the modifier-based subordinate clause, and the complement-based subordinate clause, generating a case frame of up to 12 words (three clauses with four components each).

As shown in FIG. 4, each word of the case frame has a fixed position, and only the association between words in the same position is considered.

That is, the subject of the main clause in the case frame generated from the user question is compared for semantic association only with the subject of the main clause in the case frame generated from the search target question.
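
A minimal sketch of the fixed-position case frame may make the position-wise comparison concrete. The slot order below is an assumption (FIG. 4 defines the actual layout), and `build_case_frame` and `matching_slots` are hypothetical helper names.

```python
# Three clause types x four components = 12 fixed slots.
# The ordering of the slots is assumed for illustration.
CLAUSES = ("main", "modifier", "complement")
ROLES = ("subject", "predicate", "object", "complement")
SLOTS = [(clause, role) for clause in CLAUSES for role in ROLES]

def build_case_frame(components):
    """components: {(clause, role): word}. Unfilled slots stay None."""
    return [components.get(slot) for slot in SLOTS]

def matching_slots(frame_a, frame_b):
    """Slots where both frames hold the same word in the same position."""
    return [slot for slot, a, b in zip(SLOTS, frame_a, frame_b)
            if a is not None and a == b]

user = build_case_frame({("main", "subject"): "phone",
                         ("main", "predicate"): "charge"})
cand = build_case_frame({("main", "subject"): "phone",
                         ("main", "object"): "charge"})
matches = matching_slots(user, cand)
```

Here "charge" occurs in both questions but in different slots, so only the main-clause subject counts as a match.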

In addition, a bag-of-words-based vector is used to calculate the semantic association between all words in the user's question and all words in the search target question.

FIG. 5 shows the model that calculates the semantic similarity between the user question and a search target question using the case frame-based vector and the bag-of-words-based vector.

The weights of each vector are binary values. The cosine similarity is computed between the two case frame-based vectors and between the two bag-of-words-based vectors, and the final semantic similarity between the questions is obtained as a linear combination of the two.
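
The calculation can be sketched as below, under stated assumptions: binary vectors are represented as sets of their non-zero features, case frame features are (position, word) pairs so that only same-slot words match, and `alpha` is a hypothetical interpolation weight whose value the source does not give.

```python
import math

def binary_cosine(features_a, features_b):
    """Cosine similarity of two binary vectors given as sets of non-zero features."""
    a, b = set(features_a), set(features_b)
    if not a or not b:
        return 0.0
    # For 0/1 vectors the dot product is the overlap size
    # and each norm is the square root of the set size.
    return len(a & b) / math.sqrt(len(a) * len(b))

def slot_features(frame):
    """Pair each case frame word with its position."""
    return {(i, w) for i, w in enumerate(frame) if w is not None}

def question_similarity(frame_a, frame_b, words_a, words_b, alpha=0.5):
    # alpha is an assumed weight, not a value from the patent.
    case_sim = binary_cosine(slot_features(frame_a), slot_features(frame_b))
    bow_sim = binary_cosine(words_a, words_b)
    return alpha * case_sim + (1 - alpha) * bow_sim

frame_user = ["phone", "charge"] + [None] * 10
frame_cand = ["phone", None, "charge"] + [None] * 9
score = question_similarity(frame_user, frame_cand,
                            ["phone", "charge"], ["charge", "phone"])
```

In this toy case the bag-of-words vectors are identical while the case frames agree on only one of two filled slots, so the combined score falls between the two component similarities.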

However, when similarity is calculated between case frame-based vectors, a word mismatch problem arises: words that have the same meaning but different surface forms contribute nothing to the similarity.

To solve this problem, the present invention uses word embedding-based feature vectors, which are well suited to calculating semantic similarity between words.

Word embedding feature vectors built from a large corpus represent the meaning that a word carries in that corpus as a vector of fixed size.

To take advantage of this, the present invention forms a new vector by mapping each word of the case frame to its matching word embedding-based feature vector, as shown in FIG. 6.

FIG. 6 is a diagram showing the process of generating a new case frame vector using word embedding-based feature vectors, and FIG. 7 is a diagram showing the formulation of the case frame-based question retrieval model using word embedding-based feature vectors.

The feature vector mapped to each word of the case frame is retrieved from a lookup table of 64-dimensional word embedding-based feature vectors, and the retrieved vectors are concatenated. The 12-dimensional case frame vector thus becomes a new 768-dimensional case frame vector (12 × 64 = 768) that contains the word embedding-based feature vectors.

If the lookup table contains no feature vector for a word, a zero vector, in which every weight is 0, is used.
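
The construction of the 768-dimensional vector can be sketched as follows. The toy lookup table and the `embed_case_frame` helper are hypothetical, and treating an empty slot the same way as an out-of-vocabulary word (a zero vector) is an assumption; the source states the zero-vector rule only for words missing from the lookup table.

```python
EMBEDDING_DIM = 64   # dimension of each word embedding feature vector
FRAME_SLOTS = 12     # number of positions in the case frame

def embed_case_frame(frame, lookup):
    """Concatenate the embedding of every slot into one flat vector.

    Words missing from the lookup table map to a zero vector.
    Empty slots (None) are also mapped to zeros, which is an assumption.
    """
    vector = []
    for word in frame:
        vector.extend(lookup.get(word, [0.0] * EMBEDDING_DIM))
    return vector

# Toy lookup table with made-up values, for illustration only.
lookup = {"phone": [0.1] * EMBEDDING_DIM, "charge": [0.2] * EMBEDDING_DIM}
frame = ["phone", "charge", "unseenword"] + [None] * 9
vector = embed_case_frame(frame, lookup)
```

Because every slot always contributes exactly 64 values, the same offset in two such vectors always refers to the same sentence component, which preserves the position-wise comparison.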

The retrieval model developed through the present invention, which considers the association between sentence components using the case frame and word embedding (Case Frame based Retrieval Model with Word embedding, WCFM), is then constructed, and search results are improved by re-ranking the results of the existing TRLM, as shown in FIG. 8.

FIG. 8 is an overall diagram of the re-ranking of similar question retrieval results using the similar question retrieval model based on the case frame and word embedding.

The top 400 results retrieved through TRLM are re-ranked through WCFM. The re-ranking reflects both the similarity and rank of each result obtained through TRLM and the similarity obtained through WCFM.
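
The source does not state how the three signals are combined, so the following is only a sketch of one plausible scheme: a weighted sum of the TRLM score, the reciprocal of the TRLM rank, and the WCFM similarity. The weights, the reciprocal-rank term, the assumption that TRLM scores are normalized to [0, 1], and the `rerank` helper itself are all hypothetical.

```python
TOP_K = 400  # number of TRLM results passed to re-ranking (from the source)

def rerank(trlm_results, wcfm_similarity, w_score=0.4, w_rank=0.2, w_wcfm=0.4):
    """Re-rank TRLM results using an assumed weighted combination.

    trlm_results: list of (question_id, trlm_score), best first.
    wcfm_similarity: {question_id: similarity from WCFM}.
    """
    rescored = []
    for rank, (qid, score) in enumerate(trlm_results[:TOP_K], start=1):
        combined = (w_score * score
                    + w_rank * (1.0 / rank)
                    + w_wcfm * wcfm_similarity[qid])
        rescored.append((combined, qid))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [qid for _, qid in rescored]

trlm = [("q1", 0.9), ("q2", 0.8), ("q3", 0.1)]
wcfm = {"q1": 0.1, "q2": 0.9, "q3": 0.2}
reranked = rerank(trlm, wcfm)
```

In this toy run the second TRLM result moves to the top because its WCFM similarity is much higher than that of the first result.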

Thus, search results are improved by using word embedding to compute a semantic association that accounts for each word's role as a sentence component and for the association between sentence components, and by reflecting it in the existing search results.

An embodiment applying the apparatus and method for sentence embedding and similar question retrieval according to the present invention is described below.

In the present invention, the embodiment was prepared using question-answer pairs from Naver Knowledge iN (지식인) and Yahoo! Answers, which are representative community-based question-answering services for Korean and English, respectively.

The corpora used for the embodiment are shown in Table 1.

The evaluation measure used for the embodiment is given by Equation 1.

Here, Q is the total number of questions and q is a single question.

AP(q) is the average precision for a single question q, and n is the number of question-answer pairs retrieved for question q.

k is the rank among the n retrieved question-answer pairs, and P(k) is the precision up to rank k. rel(k) indicates, as 0 or 1, whether the question-answer pair at rank k is relevant to question q. |r| is the number of question-answer pairs relevant to question q.
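
Equation 1 itself is not reproduced in this text. The symbol definitions above match the standard mean average precision, in which AP(q) is the sum over k = 1..n of P(k)·rel(k) divided by |r|, averaged over the Q questions. The sketch below computes that measure on the assumption that this is what Equation 1 states; the function names are hypothetical.

```python
def average_precision(rel, num_relevant):
    """AP(q) for one question.

    rel: list of 0/1 values ordered by rank, so rel[k - 1] is rel(k).
    num_relevant: |r|, the number of relevant question-answer pairs.
    """
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, r in enumerate(rel, start=1):
        hits += r
        total += (hits / k) * r  # P(k) * rel(k)
    return total / num_relevant

def mean_average_precision(runs):
    """runs: list of (rel, num_relevant) pairs, one per question."""
    return sum(average_precision(rel, n) for rel, n in runs) / len(runs)

# Two toy questions: relevant results at ranks 1 and 3, then at rank 2.
runs = [([1, 0, 1], 2), ([0, 1], 1)]
map_score = mean_average_precision(runs)
```

For the first question AP is (1/1 + 2/3) / 2 and for the second it is (1/2) / 1, so the mean over the two questions is 2/3.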

FIG. 9 is a graph comparing similar question retrieval performance using the retrieval model developed through the present invention.

For the Korean corpus, 300 question-answer pairs were sampled at random, and the Indri search engine was used to build a set of the top 20 similar questions for each question.

Three evaluators who majored in a related field judged the relevance of each question to its 20 similar questions, and the similar questions that at least two of the three evaluators judged to be highly relevant were designated as the correct search target questions.
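
The two-of-three labeling rule can be written as a short sketch. The question identifiers and votes below are made-up illustration data, not values from the experiment, and `relevant_by_majority` is a hypothetical helper.

```python
def relevant_by_majority(judgments, min_votes=2):
    """Keep candidates that at least min_votes evaluators marked relevant.

    judgments: {candidate_id: [vote, vote, vote]} with 1 for "highly relevant".
    """
    return [cid for cid, votes in judgments.items() if sum(votes) >= min_votes]

# Made-up votes from three evaluators for three candidate questions.
judgments = {"c1": [1, 1, 0], "c2": [1, 0, 0], "c3": [1, 1, 1]}
gold = relevant_by_majority(judgments)
```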

About 1,500 correct questions were selected, and a set of 14,401 search target questions that includes them was used in the experiment. For the English corpus, the corpus provided by Yahoo! Answers was used, which includes the average score assigned by evaluators.

Only questions with a score of 2 or higher were selected as correct questions for the experiment.

A comparison with the existing language model-based retrieval model (LM) and the translation probability and language model-based retrieval model (TRLM) shows that the re-ranking model developed in the present invention, which re-ranks similar question retrieval results using the case frame and word embedding, improves the retrieval performance of the existing models.

FIG. 9 shows the similar question retrieval performance of the existing models and of the model developed in the present invention.

Applying the re-ranking model developed through the present invention to the existing retrieval model TRLM improved performance on both the Korean and English corpora, with a larger improvement for English than for Korean.

The apparatus and method for sentence embedding and similar question retrieval according to the present invention described above improve the question retrieval model, and can be used not only in retrieval models that measure and exploit the similarity between sentences but also in a variety of areas that compute associations between sentences.

As confirmed in the embodiment, the invention improves the retrieval performance of community-based question-answering services. The case frame allows sentence components to be used in retrieval, and because the invention uses word embedding, which computes the semantic association between words effectively, it can be applied in various fields that need to compute the semantic association between words.

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered illustrative rather than restrictive. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

10. Question language analysis unit 20. Role mapping unit
30. Sentence component extraction unit 40. Case frame generation unit
50. Similarity calculation unit 60. Retrieval model construction unit
70. Re-ranking unit

Claims

1. An apparatus for sentence embedding and similar question retrieval, comprising:
a question language analysis processing unit that performs language analysis on a question input by a user and on a question to be searched, maps the dependency relations between the words of the question and their roles as sentence components, and extracts the sentence components;
a case frame generation unit that generates a case frame using the sentence components extracted by the question language analysis processing unit;
a similarity calculation unit that calculates the semantic similarity between the user question and the search target question using the case frame-based vector generated by the case frame generation unit and a bag-of-words (BOW) model-based vector;
a retrieval model building unit that uses the similarity calculation result of the similarity calculation unit to construct a retrieval model (WCFM) that considers the association between sentence components using the case frame and word embedding; and
a re-ranking unit that performs re-ranking by reflecting both the similarity and rank of each search result obtained through the TRLM (Translation based Language Model) and the similarity obtained through the WCFM constructed by the retrieval model building unit.

2. The apparatus according to claim 1, wherein the question language analysis processing unit comprises:
a question language analysis unit that performs language analysis on the question input by the user and on the question to be searched;
a role mapping unit that performs morphological analysis and named entity recognition and then, through dependency parsing, maps the dependency relations between the words of the question and their roles as sentence components; and
a sentence component extraction unit that identifies the sentence component of each word through dependency parsing and extracts the main sentence components of the question, namely the subject, object, predicate, and complement.

3. The apparatus according to claim 1,
wherein the case frame generation unit generates the case frame by using the dependency relations of the question to extract the subject, predicate, object, and complement from each of the main clause, the modifier-based subordinate clause, and the complement-based subordinate clause.

4. The apparatus according to claim 1,
wherein the similarity calculation unit generates a case frame vector using the word embedding-based feature vector matching each word and calculates the cosine similarity between the generated case frame vectors, and, in order to resolve the word mismatch problem between case frame words, calculates the similarity between words using word embedding feature vectors generated from training data.

5. The apparatus according to claim 4,
wherein each word constituting the case frame has a fixed position and only the association between words in the same position is considered,
such that the subject of the main clause in the case frame generated from the user question is compared for semantic association only with the subject of the main clause in the case frame generated from the search target question.

6. The apparatus according to claim 4,
wherein the weights of each vector are binary values, the cosine similarity between the vectors is calculated, and the final semantic similarity between the questions is calculated through a linear combination.

7. The apparatus according to claim 1,
wherein the retrieval model building unit forms a new vector by mapping each word constituting the case frame to its matching word embedding-based feature vector, thereby constructing the retrieval model (WCFM) that considers the association between sentence components using the case frame and word embedding.

8. A method for sentence embedding and similar question retrieval, comprising:
a question language analysis processing step of performing language analysis on a question input by a user and on a question to be searched, mapping the dependency relations between the words of the question and their roles as sentence components, and extracting the sentence components;
a case frame generation step of generating a case frame using the sentence components extracted in the question language analysis processing step;
a similarity calculation step of calculating the semantic similarity between the user question and the search target question using the case frame-based vector generated in the case frame generation step and a bag-of-words (BOW) model-based vector;
a retrieval model building step of using the similarity calculation result of the similarity calculation step to construct a retrieval model (WCFM) that considers the association between sentence components using the case frame and word embedding; and
a re-ranking step of performing re-ranking by reflecting both the similarity and rank of each search result obtained through the TRLM (Translation based Language Model) and the similarity obtained through the constructed WCFM.

9. The method according to claim 8, wherein the question language analysis processing step comprises:
a question language analysis step of performing language analysis on the question input by the user and on the question to be searched;
a role mapping step of performing morphological analysis and named entity recognition and then, through dependency parsing, mapping the dependency relations between the words of the question and their roles as sentence components; and
a sentence component extraction step of identifying the sentence component of each word through dependency parsing and extracting the main sentence components of the question, namely the subject, object, predicate, and complement.

10. The method according to claim 8,
wherein the case frame generation step generates the case frame by using the dependency relations of the question to extract the subject, predicate, object, and complement from each of the main clause, the modifier-based subordinate clause, and the complement-based subordinate clause.

11. The method according to claim 8,
wherein the similarity calculation step generates a case frame vector using the word embedding-based feature vector matching each word and calculates the cosine similarity between the generated case frame vectors, and, in order to resolve the word mismatch problem between case frame words, calculates the similarity between words using word embedding feature vectors generated from training data.

12. The method according to claim 11,
wherein each word constituting the case frame has a fixed position and only the association between words in the same position is considered,
such that the subject of the main clause in the case frame generated from the user question is compared for semantic association only with the subject of the main clause in the case frame generated from the search target question.

13. The method according to claim 11,
wherein the weights of each vector are binary values, the cosine similarity between the vectors is calculated, and the final semantic similarity between the questions is calculated through a linear combination.