KR20210069298A

KR20210069298A - Apparatus and Method for restoring Conversation Segment Sentences

Info

Publication number: KR20210069298A
Application number: KR1020190158937A
Authority: KR
Inventors: 신성현; 이종언
Original assignee: 주식회사 엘지유플러스
Priority date: 2019-12-03
Filing date: 2019-12-03
Publication date: 2021-06-11
Also published as: KR102357023B1

Abstract

An exemplary embodiment of the present invention relates to an apparatus and a method for restoring a follow-up sentence, which is a segment sentence with its predicate omitted in a continuous conversation, to a state in which intention analysis is possible, by using a preceding sentence for which intention analysis is possible. The apparatus for restoring conversation segment sentences comprises: a sentence separation unit that separates continuous conversation sentences into sentence units; a natural language processing unit that performs morphological analysis and entity name recognition on the separated sentences; a sentence type classification unit that classifies the separated sentences into preceding sentences, which include a predicate, and follow-up sentences, in which the predicate is omitted; a buffering unit that assigns position indexes to the classified preceding sentences according to word order and buffers them; a similarity analysis unit that analyzes the similarity between the entity names of the follow-up sentence and those of the preceding sentence, based on the frequency with which the entity names of the classified follow-up sentence appear in the classified preceding sentence and the probability that they appear in it; and a segment sentence restoration unit that replaces the corresponding word of a preceding sentence whose analyzed similarity is equal to or greater than a preset reference value with the corresponding word of the follow-up sentence.

Description

Apparatus and Method for restoring Conversation Segment Sentences

The present invention relates to an apparatus and method for restoring a follow-up sentence, which is a segment sentence with its predicate omitted in a dialogue, to a state in which intention analysis is possible, by using a preceding sentence for which intention analysis is possible.

Natural language processing (NLP) is a concept that encompasses technologies such as dialogue analysis, information retrieval, information extraction, and summarization, carried out through morphological, syntactic, and semantic analysis. It is a major technical field that mechanically analyzes natural language and converts it into a form a computer can understand. In particular, with recent advances in artificial intelligence, forward-looking research in which humans and robots interact, such as interactive question answering and chatbot systems that let a robot understand human natural language dialogue and generate responses, continues to develop.

In natural language understanding, it is important to grasp the user's intention in light of context and situation, given the varied expressions and forms that dialogue sentences take. For that reason, natural language processing needs to understand both the spoken and the written form of a sentence. Under dependency parsing, which analyzes the relationships between words and sentence constituents, an ordinary sentence has the basic composition of subject, verb, and object. Korean is a predicate-centered language, where predicates include verbs and adjectives, and sentences generally take a predicate; in spoken language, however, long stretches of conversation continue with the predicate omitted. A person readily understands the intent of such utterances from the conversation history, but a machine has difficulty doing so.

Registered Patent Publication No. 10-0641053 (October 25, 2006)

The present invention is intended to solve the problem described above. Its object is to provide an apparatus and method for restoring a follow-up sentence, which is a segment sentence with its predicate omitted in a continuous dialogue, to a state in which intention analysis is possible, by using a preceding sentence for which intention analysis is possible.

In Korean conversation, the predicate is omitted in most cases, producing segmented sentences. Hereinafter, a dialogue sentence that contains a predicate is referred to as a preceding sentence (先行文, an earlier dialogue sentence), and a segment sentence whose predicate is omitted is referred to as a follow-up sentence (後續文, a later dialogue sentence). Ordinary conversation proceeds by mixing preceding and follow-up sentences as appropriate. Unlike a person, however, a machine has difficulty understanding an incomplete sentence whose predicate is omitted. In such a case, it is important to restore the follow-up sentence by using the preceding sentence. For example,

(1) "명동역까지 버스 요금은 얼마입니까?" ("How much is the bus fare to Myeongdong Station?")

(2) "그럼, 택시는?" ("Then, how about a taxi?")

The preceding sentence, as in example (1), contains a predicate such as "얼마입니까?" ("how much is it?"), whereas the follow-up sentence, a segment sentence such as example (2) "그럼, 택시는?" ("Then, how about a taxi?"), has its predicate omitted. As the examples show, a person can understand an incomplete sentence with an omitted predicate from the conversation history, but a machine cannot. The reason is that when linguistic analysis is performed on the segment sentence alone, its intent in context becomes ambiguous or cannot be understood. A method for correctly analyzing the intent of segment sentences in natural conversation is therefore needed.

In the example above, the preceding sentence "명동역까지 버스 요금은 얼마입니까?" can be understood as asking the distance-based cost of a means of transportation to the given place. For the follow-up sentence "그럼, 택시는?", however, it is hard to tell whether the question concerns that cost or the status of the means of transportation. Such a follow-up sentence is not a complete sentence, but the omitted predicate is understood in relation to the preceding sentence. The main object of the present invention is therefore to analyze the dialogue segment sentence and, using the context of a preceding sentence for which intention analysis is possible, restore it to "명동역까지 택시 요금은 얼마입니까?" ("How much is the taxi fare to Myeongdong Station?"), reducing the ambiguity of the sentence so that it can be understood and its intent analyzed.

In the present invention, a dialogue segment sentence is restored as follows. The process starts when a user's dialogue sentences are received from a conversational interface. First, text normalization removes unnecessary symbols from the input, and the input is separated into sentence units [sentence separation process]. Next, each sentence is divided into morphemes, the smallest units of a sentence, and the main meaning is analyzed by extracting the necessary information from the analyzed morphemes, that is, the words [natural language processing process]. A sentence that ends with a predicate or is judged to be highly complete is classified as a preceding sentence, and a segment sentence that is incomplete or does not end with a predicate is classified as a follow-up sentence [sentence type classification process]. A classified preceding sentence is assigned indexes according to word order and stored in a memory buffer [word indexing and buffering process]. For a classified follow-up sentence, the similarity between the words of the preceding sentence and those of the follow-up sentence is computed using the preceding-sentence information stored in the buffer [similarity analysis process]. Based on the similarity result and the correlation given by word order, the corresponding word of the follow-up sentence is substituted at the specific word position of the preceding sentence, so that the follow-up sentence is restored on the basis of the preceding sentence [segment sentence restoration process].
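The word indexing, buffering, and substitution steps can be sketched as follows. This is a minimal illustration rather than the patented implementation: `BufferedSentence` and `restore_fragment` are hypothetical names, the preceding sentence is assumed to be already split into words, and the position to replace is assumed to have been selected by the similarity analysis.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BufferedSentence:
    """A preceding sentence held in the buffer, one entry per word."""

    words: list[str]

    def indexed(self) -> list[tuple[int, str]]:
        # Position indexes follow word order, starting at 0.
        return list(enumerate(self.words))


def restore_fragment(preceding: BufferedSentence, position: int, new_word: str) -> str:
    """Return the preceding sentence with the word at `position` replaced."""
    if not 0 <= position < len(preceding.words):
        raise IndexError(f"no word at position {position}")
    restored = list(preceding.words)  # copy, so the buffered sentence is unchanged
    restored[position] = new_word
    return " ".join(restored)


# Example sentence (1), split into words by hand for this illustration.
buffered = BufferedSentence(["명동역까지", "버스", "요금은", "얼마입니까?"])

# Assumption: the similarity analysis matched "택시" to position 1 ("버스").
restored_sentence = restore_fragment(buffered, 1, "택시")
print(restored_sentence)  # 명동역까지 택시 요금은 얼마입니까?
```

Copying the word list before substitution keeps the buffered preceding sentence intact, so it can serve as the basis for further follow-up sentences in the same conversation.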

To achieve the above object, an apparatus for restoring dialogue segment sentences according to one aspect of the present invention may include: a sentence separation unit for separating continuous dialogue sentences into sentence units; a natural language processing unit for performing morphological analysis and entity name recognition (by semantic analysis) on the separated sentences; a sentence type classification unit for classifying the separated sentences into preceding sentences, which contain a predicate, and follow-up sentences, in which the predicate is omitted; a buffering unit for assigning position indexes according to word order to the classified preceding sentence and buffering it; a similarity analysis unit for analyzing the similarity between the entity names of the follow-up sentence and the entity names of the preceding sentence, based on the frequency with which the entity names of the classified follow-up sentence appear in the classified preceding sentence and the probability that the entity names of the classified follow-up sentence appear in the classified preceding sentence; and a segment sentence restoration unit for replacing the corresponding word of a preceding sentence whose analyzed similarity is equal to or greater than a preset reference value with the corresponding word of the follow-up sentence.

The similarity analysis unit may calculate, by the TF (Term Frequency) method, the frequency with which the entity names of the classified follow-up sentence appear in the classified preceding sentence; may calculate, by the IDF (Inverse Document Frequency) method, the probability that the entity names of the classified follow-up sentence appear in the classified preceding sentence; and may calculate, by the TF-IDF (Term Frequency - Inverse Document Frequency) method, the similarity between the entity names of the follow-up sentence and the entity names of the preceding sentence.
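As a rough illustration of the TF, IDF, and TF-IDF quantities named here, the sketch below uses the common textbook definitions. The exact formulas of the embodiment (FIGS. 3 to 5) are not reproduced in this text, so the smoothed IDF variant, the function names, and the second buffered sentence are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequency(term: str, document: list[str]) -> float:
    """Share of the entries in `document` that equal `term`."""
    if not document:
        return 0.0
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term: str, documents: list[list[str]]) -> float:
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log((1 + len(documents)) / (1 + containing)) + 1.0


def tf_idf(term: str, document: list[str], documents: list[list[str]]) -> float:
    """TF-IDF weight of `term` in `document` relative to `documents`."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)


# Top-level entity labels of buffered preceding sentences.
# The first list follows example sentence (1); the second is hypothetical.
preceding_docs = [
    ["LOCATION", "TRANSPORTATION", "TRANSPORTATION", "QUESTION"],
    ["WEATHER", "DATE"],
]

# Top-level label of the follow-up entity "택시" (TRANSPORTATION>VEHICLES>TAXI).
follow_up_label = "TRANSPORTATION"

scores = [tf_idf(follow_up_label, doc, preceding_docs) for doc in preceding_docs]
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index, scores)
```

A preceding sentence would then be accepted for restoration only if its score reaches the preset reference value; how that threshold is chosen is not specified in this text.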

The natural language processing unit may determine, based on the entity name recognition result, whether the preceding sentence and the follow-up sentence are similar, and this similarity may be determined by the Jaccard coefficient.

The sentence type classification unit may classify two sentences, among the continuous dialogue sentences, that the natural language processing unit has judged to be similar into a preceding sentence, which contains a predicate, and a follow-up sentence, in which the predicate is omitted.

To achieve the above object, a method for restoring dialogue segment sentences according to another aspect of the present invention may include the steps of: (a) separating continuous dialogue sentences into sentence units; (b) performing morphological analysis and entity name recognition (by semantic analysis) on the separated sentences; (c) classifying the separated sentences into preceding sentences, which contain a predicate, and follow-up sentences, in which the predicate is omitted; (d) assigning position indexes according to word order to the classified preceding sentence and buffering it; (e) analyzing the similarity between the entity names of the follow-up sentence and the entity names of the preceding sentence, based on the frequency with which the entity names of the classified follow-up sentence appear in the classified preceding sentence and the probability that the entity names of the classified follow-up sentence appear in the classified preceding sentence; and (f) replacing the corresponding word of a preceding sentence whose analyzed similarity is equal to or greater than a preset reference value with the corresponding word of the follow-up sentence.

In step (e), the frequency with which the entity names of the classified follow-up sentence appear in the classified preceding sentence may be calculated by the TF (Term Frequency) method.

In step (e), the probability that the entity names of the classified follow-up sentence appear in the classified preceding sentence may be calculated by the IDF (Inverse Document Frequency) method.

In step (e), the similarity between the entity names of the follow-up sentence and the entity names of the preceding sentence may be calculated by the TF-IDF (Term Frequency - Inverse Document Frequency) method.

In step (b), whether the preceding sentence and the follow-up sentence are similar may be determined based on the entity name recognition result, and this similarity may be determined by the Jaccard coefficient.

In step (c), two sentences, among the continuous dialogue sentences, that were judged to be similar in step (b) may be classified into a preceding sentence, which contains a predicate, and a follow-up sentence, in which the predicate is omitted.

To achieve the above object, according to yet another aspect of the present invention, there may be provided a computer-readable recording medium on which is recorded a program for executing the method for restoring dialogue segment sentences on a computer.

To achieve the above object, according to yet another aspect of the present invention, there may be provided an application stored in a computer-readable recording medium in order to execute the method for restoring dialogue segment sentences in combination with hardware.

To achieve the above object, according to yet another aspect of the present invention, there may be provided a computer program stored in a computer-readable recording medium in order to execute the method for restoring dialogue segment sentences on a computer.

As described above, according to the various aspects of the present invention, a follow-up sentence, which is a segment sentence with its predicate omitted in a continuous dialogue, can be restored to a state in which intention analysis is possible by using a preceding sentence for which intention analysis is possible.

Going forward, as 5G is commercialized and traffic grows to ever higher volumes, the invention can be applied to various devices in the 5G environment, such as machine translation, natural language processing, and conversational systems. By restoring segment sentences whose understanding depends on a preceding sentence into sentences for which intention analysis is possible, it can improve the performance of those devices.

FIG. 1 is a block diagram of an apparatus for restoring dialogue segment sentences according to an exemplary embodiment of the present invention;
FIG. 2 is a table showing the main notation used to describe the exemplary embodiment of the present invention and its definitions;
FIG. 3 illustrates an example TF (Term Frequency) calculation according to an embodiment of the present invention;
FIG. 4 illustrates an example IDF (Inverse Document Frequency) calculation according to an embodiment of the present invention;
FIG. 5 illustrates an example TF-IDF (Term Frequency - Inverse Document Frequency) weight calculation according to an embodiment of the present invention;
FIG. 6 illustrates an example of the follow-up sentence restoration process according to an embodiment of the present invention; and
FIG. 7 is a flowchart of a method for restoring dialogue segment sentences according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the present invention.

FIG. 1 is a block diagram of an apparatus for restoring dialogue segment sentences according to an exemplary embodiment of the present invention. As shown in the figure, the apparatus may include a sentence separation unit 11, a natural language processing unit 13, a sentence type classification unit 15, a buffering unit 17, a similarity analysis unit 19, and a segment sentence restoration unit 21.

First, the main symbols used to describe the exemplary embodiment of the present invention, and their definitions, are explained with reference to the table of FIG. 2.

In the consecutive dialogue sentences […], the preceding sentence […] denotes the dialogue sentence […] that occurred at time […]. It is a highly complete sentence whose structure is formed of a subject, an object, a predicate, and so on, and its word vector consists of […], […], …, […]. The follow-up sentence […] denotes a segmented sentence in incomplete form, and its word vector consists of […], […], …, […]. The sentence […] is the segment sentence […] restored on the basis of the preceding sentence […]; like […], it is a highly complete sentence, and its word vector is […], …, […] ([…], […]), consisting of […], […], …, […]. Here, the relations […] and […] hold, and because […] is […] restored on the basis of […], […] and […] do not consist of the same word vectors, so the relation […] can also hold. However, a specific word […] of […] may partly coincide with a word […] of […], and conversely a specific word […] of […] may partly differ from a word […] of […], so the relation […] can hold.

The description now returns to FIG. 1.

The sentence separation unit 11 separates continuous dialogue sentences into sentence units. In natural conversation people use sentences of widely varying length and can still follow the context, but a machine has difficulty doing so. The input therefore needs to be processed and defined in sentence units that a machine can handle. To this end, the dialogue is first split into sentences, generally at punctuation marks such as the period (.), exclamation mark (!), and question mark (?). For example, the utterance "오늘 날씨가 좋아. 어디든 놀러 가고 싶다." ("The weather is nice today. I want to go out somewhere.") is split into "오늘 날씨가 좋아." and "어디든 놀러 가고 싶다.", each of which can be a preceding sentence […] in this embodiment. Note, however, that a separated sentence does not become a follow-up sentence merely because its sentence structure is of low completeness.
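Punctuation-based sentence separation of this kind can be sketched with a regular expression. This is a minimal example under simple assumptions: the function name `split_sentences` is hypothetical, the text normalization step is left out, and abbreviations or decimal numbers containing periods are not handled.

```python
import re

# Split on whitespace that directly follows '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to its sentence.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(utterance: str) -> list[str]:
    """Split an utterance into sentences at sentence-final punctuation."""
    return [part for part in _SENTENCE_BOUNDARY.split(utterance.strip()) if part]


sentences = split_sentences("오늘 날씨가 좋아. 어디든 놀러 가고 싶다.")
print(sentences)  # ['오늘 날씨가 좋아.', '어디든 놀러 가고 싶다.']
```

Because only whitespace after a sentence-final mark counts as a boundary, the comma in "그럼, 택시는?" does not split the fragment.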

The natural language processing unit 13 carries out natural language processing on the separated sentences, performing morphological analysis and entity name recognition by semantic analysis. Based on the entity name recognition result, it can determine whether a preceding sentence and a follow-up sentence are similar; this similarity can be determined, for example, by the Jaccard coefficient.

For example, the natural language processing performed by the natural language processing unit 13 may include a [part-of-speech tagging process], a [semantic analysis process], and a [pattern analysis process].

[Part-of-speech tagging process]

To extract the main words of a sentence, the part of speech of each word must be determined through morphological analysis. In a given utterance sentence vector […], a part-of-speech (POS) symbol is attached to each of the […] words ([…]); this is called part-of-speech tagging. With the utterance defined as […] and its […]-th word defined as […], part-of-speech tagging defines the part of speech of the […]-th word as […]. The general probabilistic model of part-of-speech tagging for a single word […] is then given by Equation (1).

--- Equation (1)

For example, in accordance with Equation (1), morphological analysis tags the example preceding sentence "명동역까지 버스 요금은 얼마입니까?" as "명동역/NNP+까지/JX+버스/NNG+요금/NNG+은/JX+얼마/NNG+이/VCP+ㅂ니까/EF+?/SF". Likewise, the example follow-up sentence "그럼, 택시는?" is tagged as "그럼/MAJ+택시/NNG+는/JX+?/SF".
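The tagged strings above use a "morpheme/TAG" notation joined by "+". The sketch below only parses that notation into pairs; it does not perform morphological analysis itself, and `parse_tagged` is a hypothetical helper name. It assumes every token contains a "/" and would raise `ValueError` otherwise.

```python
def parse_tagged(tagged: str) -> list[tuple[str, str]]:
    """Parse 'morpheme/TAG+morpheme/TAG+...' into (morpheme, tag) pairs."""
    pairs = []
    for token in tagged.split("+"):
        token = token.strip()
        if not token:
            continue
        # Split at the last '/', so the tag is whatever follows it.
        morpheme, tag = token.rsplit("/", 1)
        pairs.append((morpheme, tag))
    return pairs


preceding_tagged = parse_tagged(
    "명동역/NNP+까지/JX+버스/NNG+요금/NNG+은/JX+얼마/NNG+이/VCP+ㅂ니까/EF+?/SF"
)
follow_up_tagged = parse_tagged("그럼/MAJ+택시/NNG+는/JX+?/SF")

# Noun tags in the example start with "NN" (NNP, NNG).
nouns = [morpheme for morpheme, tag in preceding_tagged if tag.startswith("NN")]
print(nouns)  # ['명동역', '버스', '요금', '얼마']
```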

[Semantic analysis process]

The following is a basic task for natural language understanding (NLU). Once the part of speech of each word has been determined by morphological analysis, nouns are extracted from the […] word vectors […] through entity name recognition. Entity name recognition (named entity recognition) means recognizing a word […] as an entity name that carries a specific meaning, such as a person's name or a place name. In this process, the word vector […] is given the part-of-speech tagging […] by morphological analysis and is composed into the semantic vector […] by entity name recognition, and the probability value can be calculated by Equation (2).

--- Equation (2)

For example, entity name recognition on the preceding sentence takes the tagging result of the morphological analysis, "명동역/NNP+까지/JX+버스/NNG+요금/NNG+은/JX+얼마/NNG+이/VCP+ㅂ니까/EF+?/SF", and extracts the main units with entity name information attached: "[명동역]LOCATION>STATION>TO [버스]TRANSPORTATION>VEHICLES>BUS [요금]TRANSPORTATION>FEE [얼마]QUESTION>HOW". Likewise, entity name recognition on the follow-up sentence yields "[택시]TRANSPORTATION>VEHICLES>TAXI".

[Pattern analysis process]

Pattern analysis matches the entity name recognition results of the preceding sentence and the follow-up sentence to analyze how similar the two sentences are. The Jaccard coefficient is used as the measure of this similarity. The Jaccard coefficient is given by Equation (3).

--- Equation (3)

The Jaccard coefficient of Equation (3) is 1 when the semantic vectors […] and […] of the two sentences […] are identical, and 0 when they have nothing in common. Here, the Jaccard coefficient is used to measure similarity from the semantic analysis (that is, entity name recognition) results of the preceding sentence and the follow-up sentence, as follows.

1) Entity name recognition of the preceding sentence: [명동역]LOCATION>STATION>TO [버스]TRANSPORTATION>VEHICLES>BUS [요금]TRANSPORTATION>FEE [얼마]QUESTION>HOW

2) Entity name recognition of the follow-up sentence: [택시]TRANSPORTATION>VEHICLES>TAXI

개체명 인식 결과, 후속문의 개체명 TRANSPORTATION>VEHICLES>TAXI와 선행문의 개체명 TRANSPORTATION>VEHICLES>BUS를 비교하면 상위 레벨의 두 계층이 서로 동일하므로 자카드 계수는 1이 되고 따라서 두 문장은 유사한 것으로 판단할 수 있다.As a result of recognizing the entity name, if the entity name TRANSPORTATION>VEHICLES>TAXI in the subsequent sentence is compared with the entity name TRANSPORTATION>VEHICLES>BUS in the preceding sentence, the two hierarchies of the upper level are the same, so the jacquard coefficient is 1, and therefore the two sentences can be judged to be similar. can
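The comparison above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names `upper_levels` and `jaccard` are hypothetical, and the sketch assumes that an entity name is a `>`-separated category path and that only its upper two levels enter the comparison, as in the example.

```python
def upper_levels(entity_name: str, depth: int = 2) -> frozenset:
    """Return the upper `depth` category levels of a '>'-separated entity name."""
    return frozenset(entity_name.split(">")[:depth])


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


taxi = upper_levels("TRANSPORTATION>VEHICLES>TAXI")
bus = upper_levels("TRANSPORTATION>VEHICLES>BUS")
fee = upper_levels("TRANSPORTATION>FEE")

print(jaccard(taxi, bus))  # 1.0, because the upper two levels are identical
print(jaccard(taxi, fee))  # about 0.33, because only TRANSPORTATION is shared
```

Restricting the sets to the upper two levels is what makes "taxi" and "bus" score 1 here; comparing the full paths would give a lower value.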

The sentence type classification unit 15 performs the sentence type classification process. It takes the sentences that have been separated by the sentence separation unit 11, and for which the natural language processing unit 13 has completed morpheme analysis and entity name recognition, and classifies them into preceding sentences, which include a verb, and subsequent sentences, in which the verb is omitted. Among the continuous dialogue sentences, two sentences that the natural language processing unit 13 has judged to be similar may be classified into a preceding sentence including a verb and a subsequent sentence in which the verb is omitted. That is, once a preceding sentence including a verb has been identified among the continuous dialogue sentences, a sentence that is similar to that preceding sentence, among the verb-less sentences that follow it, may be classified as the subsequent sentence.

In the sentence type classification process, the sentences of the dialogue are classified into preceding sentences, which include a verb, and subsequent sentences, which omit the verb. Here, a subsequent sentence is a sentence that describes the subject of the preceding sentence and, unlike the preceding sentence, is a segmented sentence that omits the verb and ends with a postpositional particle.
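One way to picture this classification is to inspect the tag of the last meaningful morpheme of a tagged sentence. The sketch below is only an illustration under stated assumptions: the names `parse_tagged` and `classify_sentence` are hypothetical, the input uses the `morpheme/TAG+morpheme/TAG` format of the tagging example in this description, particle tags are assumed to start with `J` and ending tags with `E`, and the tagging of the subsequent sentence "택시는?" is assumed for the example.

```python
# Symbol tags skipped when looking for the last meaningful morpheme.
# The exact set is an assumption made for this sketch.
PUNCTUATION_TAGS = {"SF", "SP", "SS", "SE", "SO"}


def parse_tagged(tagged: str) -> list:
    """Split 'morpheme/TAG+morpheme/TAG' into (morpheme, tag) pairs."""
    pairs = []
    for token in tagged.split("+"):
        morpheme, _, tag = token.rpartition("/")
        pairs.append((morpheme, tag))
    return pairs


def classify_sentence(tagged: str) -> str:
    """Return 'preceding', 'subsequent', or 'unknown' for a tagged sentence."""
    tags = [tag for _, tag in parse_tagged(tagged) if tag not in PUNCTUATION_TAGS]
    if not tags:
        return "unknown"
    if tags[-1].startswith("J"):  # ends with a particle: the verb is omitted
        return "subsequent"
    if tags[-1].startswith("E"):  # ends with a verbal ending: a verb is present
        return "preceding"
    return "unknown"


preceding = "명동역/NNP+까지/JX+버스/NNG+요금/NNG+은/JX+얼마/NNG+이/VCP+ㅂ니까/EF+?/SF"
subsequent = "택시/NNG+는/JX+?/SF"  # assumed tagging of "택시는?"

print(classify_sentence(preceding))   # preceding
print(classify_sentence(subsequent))  # subsequent
```

Checking the final morpheme rather than searching for any verb tag follows the description of a subsequent sentence as one that ends with a particle; a modifier such as an adjective inside the sentence therefore does not change the result.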

The buffering unit 17 assigns a position index according to word order to each preceding sentence classified by the sentence type classification unit 15 and buffers it in a buffer memory.

The similarity analysis unit 19 analyzes the similarity between the entity name of the subsequent sentence and the entity name of the preceding sentence. The analysis is based on the frequency with which the entity name of the subsequent sentence appears within the preceding sentence classified by the sentence type classification unit 15, and on the probability that the entity name of the subsequent sentence appears in that preceding sentence.

For example, the similarity analysis unit 19 may use the TF (Term Frequency) method to calculate the frequency with which the entity name of the subsequent sentence appears within the preceding sentence, and the IDF (Inverse Document Frequency) method to calculate the probability that the entity name of the subsequent sentence appears in the preceding sentence. Finally, it may use the TF-IDF (Term Frequency - Inverse Document Frequency) method to calculate the similarity between the entity name of the subsequent sentence and the entity name of the preceding sentence. This sequence of similarity analysis and calculation steps is described in more detail below.

In this embodiment, the TF-IDF weight is used to measure the similarity between the preceding sentence and the subsequent sentence. TF-IDF (Term Frequency - Inverse Document Frequency) is a weight used in search and text mining to evaluate the importance of a word. TF-IDF treats a word that appears frequently in all documents as having low importance and a word that appears frequently only in a specific document as having high importance. Accordingly, a low TF-IDF value indicates low importance and a high TF-IDF value indicates high importance, and on this basis the TF-IDF weight can be used to determine the importance (that is, the similarity) of the words contained in the preceding sentence and the subsequent sentence.

Term frequency (TF) refers to how often a specific word appears within a document. In this embodiment, rather than the frequency within a document, the frequency with which an entity name of the subsequent sentence appears within the preceding sentence is written as $tf_{t,d}$ and is used as an indicator of how important the recognized entity name is. A subsequent sentence with such occurrences is highly relevant to the preceding sentence. TF, the frequency with which words of the subsequent sentence appear in the preceding sentence, is therefore used as the basic measure of association with the preceding sentence. Here, $t$ is the entity name whose occurrences are counted and $d$ is the preceding sentence. $tf_{t,d}$ denotes the total frequency with which the entity name of the subsequent sentence appears within the preceding sentence, that is, the total frequency of $t$ within the preceding sentence $d$, and its value can be calculated using Equation (4).

$$tf_{t,d} = f_{t,d}$$ --- Equation (4)

Here, $f_{t,d}$ is the number of times a word of the subsequent sentence appears within the preceding sentence. The TF value is sometimes used in a modified form, but in this embodiment the raw occurrence count is used. FIG. 3 shows an example of calculating the $tf_{t,d}$ value. Because it is necessary to know how often the entity name of the subsequent sentence appears in the preceding sentence, a count is made whenever a word or entity name of the subsequent sentence is found in the preceding sentence. The comparison proceeds sequentially through the category levels of the entity names, from the lower-level category to the upper-level category, and a match is counted when at least the upper two levels are the same. For example, the entity name of "taxi", a word in the subsequent sentence, is the same as the entity name of "bus" in the preceding sentence at the upper two levels, so an occurrence count of 1 is recorded. That is, for the subsequent-sentence word "taxi", the occurrence count is 1 because the entity name "TRANSPORTATION>VEHICLES" appears in the preceding sentence.
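The counting rule described here can be sketched as follows. This is an illustrative reading, not the implementation of the embodiment: the names `same_upper_levels` and `term_frequency` are hypothetical, entity names are assumed to be `>`-separated category paths, and a match is assumed to require that both paths have at least two levels and agree on the upper two.

```python
def same_upper_levels(a: str, b: str, depth: int = 2) -> bool:
    """True when both entity names have at least `depth` levels and agree on them."""
    path_a, path_b = a.split(">"), b.split(">")
    return (
        len(path_a) >= depth
        and len(path_b) >= depth
        and path_a[:depth] == path_b[:depth]
    )


def term_frequency(entity_name: str, preceding_entity_names: list) -> int:
    """Count entity names in the preceding sentence that match at the upper levels."""
    return sum(same_upper_levels(entity_name, e) for e in preceding_entity_names)


# Entity names recognized in the preceding sentence of the example.
preceding = [
    "LOCATION>STATION>TO",
    "TRANSPORTATION>VEHICLES>BUS",
    "TRANSPORTATION>FEE",
    "QUESTION>HOW",
]

print(term_frequency("TRANSPORTATION>VEHICLES>TAXI", preceding))  # 1
```

Only "bus" is counted for "taxi": "fare" shares the top level TRANSPORTATION but differs at the second level, so it does not meet the two-level condition.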

IDF (Inverse Document Frequency) is the inverse of DF (Document Frequency). DF is a value indicating how commonly a word appears across documents: $df_t$ = (number of documents in which the word appears / total number of documents). Because IDF is the inverse of DF, taking the logarithm gives $\log$(total number of documents / number of documents in which the word appears), which is written as $idf_t$. Here, $df_t$ means the frequency with which $t$ appears in document $d$. $idf_t$ affects the ranking of documents for queries with two or more terms.

In this embodiment, IDF is used to calculate the probability that an entity (recognition) name of the subsequent sentence appears within the preceding sentence, and it is written as $idf_t$. Here, $df_t$ means the frequency with which the entity name $t$ of the subsequent sentence appears among the entity names $d$ of the preceding sentence. Using this value, which indicates how often the entity name of the subsequent sentence appears in the full set of entity names of the preceding sentence, the segmented subsequent sentence can be replaced using part of the preceding sentence.

Referring to the IDF calculation example of FIG. 4, the total number of words in the subsequent sentence is 3, so the numerator inside the logarithm is 3 in every case. The denominator is the number of subsequent-sentence words that appear in the preceding sentence (DF). For example, for "taxi" the occurrence count is 1, because it appears in the preceding sentence as the entity name "TRANSPORTATION>VEHICLES". The IDF value is therefore log(3/1) = 0.48, using the base-10 logarithm.

Finally, TF-IDF is the best-known weighting scheme in information retrieval. It is the product of the TF weight and the corresponding IDF, and $\text{tf-idf}_{t,d}$ satisfies Equation (5) below.

$$\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$$ --- Equation (5)

Accordingly, the entity name similarity of the subsequent sentence with respect to the preceding sentence is as illustrated in FIG. 5. As FIG. 5 shows, the TF-IDF values in this embodiment are equal to the IDF values. This is because the words of the subsequent sentence are compared one at a time, rather than by counting word occurrences within the sentence. Note that if the example were "그럼, 택시는? 택시를 타고 가자" ("Then, what about a taxi? Let's take a taxi"), "taxi" would not count as a repeated word, because the words of the subsequent sentences are separated into distinct sentences when TF-IDF is calculated. As illustrated in FIG. 5, the similarity analysis finds that the similar entity names in the preceding sentence corresponding to the subsequent-sentence words "fast" and "taxi" are "slow" (0.48) and "bus" (0.48), respectively. These words are the key elements when the segmented sentence is substituted, and they can mark the start word and end word in the preceding sentence when the sentence is restored.

In this embodiment, the calculated TF-IDF value is, as illustrated in FIG. 4, the similarity obtained by comparing the entity names of the words in the preceding sentence with the entity names of the words in the segmented subsequent sentence. That is, when the entity names of "fast" and "taxi" in the subsequent sentence are compared with the corresponding similar entity names "slow" and "bus" in the preceding sentence, the TF value is 1 and the IDF value is 0.48 in each case, so their product, 1 × 0.48 = 0.48, is the TF-IDF similarity value.
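The arithmetic of the example can be reproduced with a short Python sketch. The function names `idf` and `tf_idf` are hypothetical. A base-10 logarithm is assumed because it reproduces the stated value of 0.48, and returning 0.0 when the entity name never appears is an assumption of this sketch, not something stated in the description.

```python
import math


def idf(total_words: int, document_frequency: int) -> float:
    """log10(total words in the subsequent sentence / occurrence count)."""
    if document_frequency <= 0:
        return 0.0  # assumption: no occurrence means no similarity
    return math.log10(total_words / document_frequency)


def tf_idf(term_frequency: int, total_words: int, document_frequency: int) -> float:
    """Product of the TF value and the IDF value."""
    return term_frequency * idf(total_words, document_frequency)


# "taxi": TF = 1, three words in the subsequent sentence, occurrence count 1.
score = tf_idf(term_frequency=1, total_words=3, document_frequency=1)
print(round(score, 2))  # 0.48
```

Because TF is 1 for each compared word in the example, the TF-IDF value equals the IDF value.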

This completes the calculation, by the similarity analysis unit 19, of the similarity values between the words of the preceding sentence and the words of the subsequent sentence.

The segmented sentence restoration unit 21 replaces each word of the preceding sentence whose similarity, as analyzed and calculated by the similarity analysis unit 19, is greater than or equal to a preset reference value with the corresponding word of the subsequent sentence.

As described above, the TF-IDF similarity calculation of this embodiment compares the segmented subsequent sentence with the entity names of the words in the preceding sentence, and it is designed so that the key words that correspond to each other in the preceding and subsequent sentences can be selected. That is, according to this embodiment, comparing all the entity names of the constituent words of the preceding and subsequent sentences shows that "slow" and "bus" in the preceding sentence have similarity values that allow them to be replaced by "fast" and "taxi" from the subsequent sentence. As a result, as illustrated in FIG. 6, the segmented sentence restoration unit 21 can restore the subsequent sentence "Then, what about a fast taxi?" to "How much is the fast taxi fare to Myeongdong Station?" on the basis of the preceding sentence.
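The replacement step can be sketched as follows, combining the position-indexed buffer with the similarity threshold. This is a hedged illustration: the name `restore`, the `(position, word, similarity)` tuple layout, and the threshold of 0.4 are hypothetical, and the preceding sentence "명동역까지 느린 버스 요금은 얼마입니까?" ("How much is the slow bus fare to Myeongdong Station?") is inferred from the example rather than quoted from it.

```python
THRESHOLD = 0.4  # hypothetical preset reference value


def restore(buffered: dict, matches: list, threshold: float = THRESHOLD) -> str:
    """Replace buffered words whose similarity meets the threshold.

    buffered: {position index: word of the preceding sentence}
    matches:  (position index, replacement word, similarity) tuples
    """
    restored = dict(buffered)  # leave the buffer itself untouched
    for position, new_word, similarity in matches:
        if similarity >= threshold:
            restored[position] = new_word
    return " ".join(restored[i] for i in sorted(restored))


# Preceding sentence buffered with a position index per word.
buffered = dict(enumerate(["명동역까지", "느린", "버스", "요금은", "얼마입니까?"]))
# "빠른" (fast) replaces "느린" (slow); "택시" (taxi) replaces "버스" (bus).
matches = [(1, "빠른", 0.48), (2, "택시", 0.48)]

print(restore(buffered, matches))  # 명동역까지 빠른 택시 요금은 얼마입니까?
```

Keeping the position index lets the replacement words land in the original word order, so the verb and particles of the preceding sentence carry over unchanged.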

FIG. 7 is a flowchart of a method for restoring a dialogue segmented sentence according to an exemplary embodiment of the present invention. Because the method is performed by the apparatus of FIG. 1, it is described together with the operation of that apparatus.

First, when continuous dialogue sentences are input to the apparatus of FIG. 1 (S701), the sentence separation unit 11 separates them into sentence units and provides the separated sentences as input to the natural language processing unit 13 (S703).
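Step S703 can be pictured with a small sketch. How the sentence separation unit 11 actually splits its input is not shown in this passage, so the rule below, splitting after sentence-final punctuation, is an assumption, and the name `split_sentences` is hypothetical.

```python
import re


def split_sentences(text: str) -> list:
    """Split dialogue text after '.', '?' or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]


dialogue = "명동역까지 버스 요금은 얼마입니까? 그럼, 택시는?"
print(split_sentences(dialogue))
# ['명동역까지 버스 요금은 얼마입니까?', '그럼, 택시는?']
```

The comma inside "그럼, 택시는?" does not trigger a split, so the segmented sentence reaches the next step as a single unit.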

Next, the natural language processing unit 13 performs natural language processing on the sentences sequentially input from the sentence separation unit 11 in step S703, carrying out entity name recognition through morpheme analysis and semantic analysis. Based on the entity name recognition results, it determines whether two sequentially input sentences (for example, a preceding sentence and a following sentence) are similar, and provides the result as input to the sentence type classification unit 15 (S705).

Next, the sentence type classification unit 15 classifies the sentence type of each sentence sequentially input from the natural language processing unit 13 in step S705. Sentences are classified into preceding sentences, which include a verb, and subsequent sentences, in which the verb is omitted. Among the continuous dialogue sentences, two sentences judged to be similar by the natural language processing unit 13 in step S705 may be classified into a preceding sentence including a verb and a subsequent sentence in which the verb is omitted. That is, once a preceding sentence including a verb has been identified among the continuous dialogue sentences, the first sentence that is similar to that preceding sentence, among the verb-less sentences that follow it, is classified as the subsequent sentence (S707).

Next, the buffering unit 17 assigns a position index according to word order to the preceding sentence classified by the sentence type classification unit 15 in step S707 and buffers it in the buffer memory (S709). Using the preceding sentence buffered in step S709 and the subsequent sentence classified by the sentence type classification unit 15 in step S707, the similarity analysis unit 19 obtains the frequency with which the entity name of the subsequent sentence appears within the preceding sentence (see FIG. 3 and the related description) and the probability that the entity name of the subsequent sentence appears in the preceding sentence (see FIG. 4 and the related description). On this basis it calculates and analyzes the similarity between the entity name of the subsequent sentence and the entity name of the preceding sentence (see FIG. 5 and the related description) (S711).

Finally, each word of the preceding sentence whose similarity, as analyzed and calculated by the similarity analysis unit 19 in step S711, is greater than or equal to the preset reference value is replaced with the corresponding word of the subsequent sentence. The subsequent sentence, a segmented sentence in which the verb is omitted, is thereby restored to a sentence that includes a verb and can undergo semantic analysis (see FIG. 6 and the related description) (S713).

Meanwhile, the method for restoring a dialogue segmented sentence described above may be implemented as a computer-readable recording medium on which a program for executing the method on a computer is recorded.

The method may also be implemented as an application stored in a computer-readable recording medium in order to execute the method in combination with hardware.

The method may further be implemented as a computer program stored in a computer-readable recording medium in order to execute the method on a computer.

For example, as described above, the method for restoring a dialogue segmented sentence according to an exemplary embodiment of the present invention may be implemented as a computer-readable recording medium containing program instructions for performing various computer-implemented operations, or as an application stored in such a recording medium. The computer-readable recording medium may include program instructions, local data files, local data structures, and the like, alone or in combination. The recording medium may be specially designed and configured for the embodiments of the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the rights of the present invention.

11: Sentence separation unit
13: Natural language processing unit
15: Sentence type classification unit
17: Buffering unit
19: Similarity analysis unit
21: Segmented sentence restoration unit

Claims

1. An apparatus for restoring a dialogue segmented sentence, comprising:
a sentence separation unit for separating continuous dialogue sentences into sentence units;
a sentence type classification unit for classifying the separated sentences into a preceding sentence including a verb and a subsequent sentence in which the verb is omitted;
a similarity analysis unit for analyzing the similarity between the entity name of the subsequent sentence and the entity name of the preceding sentence, based on the frequency with which the entity name of the classified subsequent sentence appears in the classified preceding sentence and on the probability that the entity name of the classified subsequent sentence appears in the classified preceding sentence; and
a segmented sentence restoration unit for replacing a word of the preceding sentence for which the analyzed similarity is greater than or equal to a preset reference value with the corresponding word of the subsequent sentence.

2. The apparatus of claim 1,
wherein the similarity analysis unit calculates the frequency with which the entity name of the classified subsequent sentence appears in the classified preceding sentence through a TF (Term Frequency) method.

3. The apparatus of claim 2,
wherein the similarity analysis unit calculates the probability that the entity name of the classified subsequent sentence appears in the classified preceding sentence through an IDF (Inverse Document Frequency) method.

4. The apparatus of claim 3,
wherein the similarity analysis unit calculates the similarity between the entity name of the subsequent sentence and the entity name of the preceding sentence through a TF-IDF (Term Frequency - Inverse Document Frequency) method.

5. The apparatus of claim 1,
further comprising a natural language processing unit that performs morpheme analysis and entity name recognition on the separated sentences and determines, based on the entity name recognition result, whether a preceding sentence and a following sentence are similar.

6. The apparatus of claim 5,
wherein the similarity between the preceding sentence and the following sentence is determined by a Jaccard coefficient.

7. The apparatus of claim 5,
wherein the sentence type classification unit classifies, among the continuous dialogue sentences, two sentences determined to be similar by the natural language processing unit into a preceding sentence including a verb and a subsequent sentence in which the verb is omitted.

8. A method for restoring a dialogue segmented sentence, comprising:
(a) separating continuous dialogue sentences into sentence units;
(b) classifying the separated sentences into a preceding sentence including a verb and a subsequent sentence in which the verb is omitted;
(c) analyzing the similarity between the entity name of the subsequent sentence and the entity name of the preceding sentence, based on the frequency with which the entity name of the classified subsequent sentence appears in the classified preceding sentence and on the probability that the entity name of the classified subsequent sentence appears in the classified preceding sentence; and
(d) replacing a word of the preceding sentence for which the analyzed similarity is greater than or equal to a preset reference value with the corresponding word of the subsequent sentence.

9. The method of claim 8,
wherein step (c) calculates the frequency with which the entity name of the classified subsequent sentence appears in the classified preceding sentence through a TF (Term Frequency) method.

10. The method of claim 9,
wherein step (c) calculates the probability that the entity name of the classified subsequent sentence appears in the classified preceding sentence through an IDF (Inverse Document Frequency) method.

11. The method of claim 10,
wherein step (c) calculates the similarity between the entity name of the subsequent sentence and the entity name of the preceding sentence through a TF-IDF (Term Frequency - Inverse Document Frequency) method.

12. The method of claim 8, further comprising:
performing entity name recognition through morpheme analysis and semantic analysis on the separated sentences; and
determining, based on the entity name recognition result, whether a preceding sentence and a following sentence are similar.

13. The method of claim 12,
wherein the similarity between the preceding sentence and the following sentence is determined by a Jaccard coefficient.

14. The method of claim 12,
wherein, in step (b), among the continuous dialogue sentences, two sentences determined to be similar based on the entity name recognition result are classified into a preceding sentence including a verb and a subsequent sentence in which the verb is omitted.

15. A computer-readable recording medium on which a program for executing, on a computer, the method for restoring a dialogue segmented sentence of any one of claims 8 to 14 is recorded.

16. An application stored in a computer-readable recording medium in order to execute, in combination with hardware, the method for restoring a dialogue segmented sentence of any one of claims 8 to 14.

17. A computer program stored in a computer-readable recording medium in order to execute, on a computer, the method for restoring a dialogue segmented sentence of any one of claims 8 to 14.