KR102168504B1

KR102168504B1 - Aparatus for coherence analyzing between each sentence in a text document and method thereof

Info

Publication number: KR102168504B1
Application number: KR1020180169502A
Authority: KR
Inventors: 이새벽; 최현수; 김정욱; 장정훈
Original assignee: 주식회사 와이즈넛
Priority date: 2018-12-26
Filing date: 2018-12-26
Publication date: 2020-10-21
Also published as: KR20200084436A

Abstract

The present invention relates to an apparatus and a method for analyzing consistency between sentences in a text document. The apparatus includes a preprocessor that receives a document-level text document in a natural language as input, decomposes it into paragraphs and sentences, performs morphological analysis on the decomposed paragraphs and sentences, and outputs the result as a text document. It also includes a text consistency analysis unit that, for the text document output by the preprocessor, quantitatively analyzes whether each unit sentence maintains consistency in the context of the entire text document, using at least one of three consistency analysis methods. The first method quantitatively calculates consistency by analyzing the keywords of each sentence other than the first sentence of a paragraph against the keywords of the preceding sentence. The second method extracts the features of each sentence, expresses them as a vector, calculates the similarity between sentences from these vectors, and uses the similarity to quantitatively calculate consistency. The third method randomly generates inconsistent sentences by a machine learning method, trains a deep-learning-based convolutional neural network on them, and quantitatively calculates consistency from the trained result. The invention can thereby effectively analyze document-level consistency for quality evaluation and analysis of unstructured text data.

Description

A device for analyzing consistency between sentences in a text document and its method {APARATUS FOR COHERENCE ANALYZING BETWEEN EACH SENTENCE IN A TEXT DOCUMENT AND METHOD THEREOF}

The present invention relates to a consistency analysis apparatus and method for automatically evaluating the consistency between sentences in a natural-language text document.

In general, consistency (coherence) refers to a tight ordering among sentences.

In other words, consistency is a broader concept than unity, which describes the semantic connections among pieces of content, and it is used in a sense similar to cohesion.

In particular, consistency in text-based language means that successive sentences build context through the repetition of key phrases, synonyms, and pronouns. Because it also implies completeness, the concept extends to the closeness between one piece of content and the next.

Consistency is thus a principle that applies syntactically between sentences and between pieces of content (or paragraphs).

Natural language processing, meanwhile, is a field of artificial intelligence whose goal is to enable computers to understand human language. It mainly covers syntactic analysis such as morphological analysis and parsing, semantic analysis such as named entity recognition and semantic role labeling, and discourse-level analysis such as document summarization and coreference resolution.

The natural language processing problems described above are relatively well defined and actively studied, whereas research on consistency analysis has so far been insufficient.

Korean Patent Laid-Open Publication No. 10-2017-0030297 (published March 17, 2017)

The present invention was devised to solve the above-described problem. Its object is to provide an apparatus and a method for analyzing consistency between sentences in a text document that receive document-level text as input, decompose it into sentences and paragraphs, and quantitatively measure whether each unit sentence maintains consistency in the context of the entire text, so that document-level consistency can be effectively analyzed for quality evaluation and analysis of unstructured text data.

To achieve the above object, a first aspect of the present invention provides an apparatus for analyzing consistency between sentences in a text document, comprising: a preprocessor that receives a document-level text document in a natural language as input, decomposes it into paragraphs and sentences, performs morphological analysis on the decomposed paragraphs and sentences, and outputs the result as a text document; and a text consistency analysis unit that, for the text document output by the preprocessor through morphological analysis, quantitatively analyzes whether each unit sentence maintains consistency in the context of the entire text document, using at least one of a first consistency analysis method that quantitatively calculates consistency by analyzing the keywords of each sentence other than the first sentence of a paragraph against the keywords of the preceding sentence, a second consistency analysis method that extracts the features of each sentence, expresses them as a vector, calculates the similarity between sentences from these vectors, and uses the similarity to quantitatively calculate consistency, and a third consistency analysis method that randomly generates inconsistent sentences by a machine learning method, trains a deep-learning-based convolutional neural network on them, and quantitatively calculates consistency from the trained result.

Here, the first consistency analysis method preferably compares the morphemes of each sentence to determine how semantically related adjacent sentences are, and quantitatively calculates consistency using Equation 1 below.

(Equation 1)

Here, N is the total number of sentences in the document, and R(i, i+1) indicates the state of the cross-reference relationship between the i-th sentence and the (i+1)-th sentence. If two adjacent sentences have at least one cross-reference relationship (that is, if they share an identical morpheme with a meaningful part of speech), R(i, i+1) is defined as 1; otherwise it is defined as 0.
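Equation 1 itself is not reproduced in this text, so the following is only a minimal sketch of the first method under stated assumptions. It assumes the score is the mean of R(i, i+1) over the N-1 adjacent sentence pairs, that each sentence arrives as a list of (morpheme, tag) pairs, and that the "meaningful" parts of speech are the tags in the hypothetical `CONTENT_TAGS` set. The names `adjacent_reference` and `first_method_consistency` are illustrative and do not come from the patent.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface form, part-of-speech tag)

# Assumption: which tags count as "meaningful" is not listed in the text.
# These are common noun/verb/adjective tags used here only for illustration.
CONTENT_TAGS = {"NNG", "NNP", "VV", "VA"}


def adjacent_reference(prev: List[Morpheme], curr: List[Morpheme]) -> int:
    """R(i, i+1): 1 if the two sentences share a content morpheme, else 0."""
    prev_keys = {m for m, tag in prev if tag in CONTENT_TAGS}
    curr_keys = {m for m, tag in curr if tag in CONTENT_TAGS}
    return 1 if prev_keys & curr_keys else 0


def first_method_consistency(sentences: List[List[Morpheme]]) -> float:
    """Assumed form of Equation 1: average R(i, i+1) over adjacent pairs."""
    n = len(sentences)
    if n < 2:
        # Assumption: a document with no adjacent pair gets a score of 0.0.
        return 0.0
    hits = sum(
        adjacent_reference(sentences[i], sentences[i + 1]) for i in range(n - 1)
    )
    return hits / (n - 1)
```

With three sentences where only the first two share a noun, one of the two adjacent pairs is linked, so this sketch returns 0.5.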

Preferably, the second consistency analysis method uses word embedding and sentence embedding, which learn the context information of the words and sentences of a text and express each sentence as a feature vector through dimension reduction and abstraction. The sentences constituting the text document are expressed as sentence vectors, the similarity between the sentence vectors is calculated, and the calculated similarities are then used to quantitatively calculate the semantic consistency of the text document.

Preferably, the sentence embedding may be defined as the average of the word vectors obtained by word embedding for the words that make up the sentence.
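The averaging step can be sketched as follows. This is an illustration only: the lookup table `word_vectors`, the function name `sentence_embedding`, and the choice to skip unknown words and return a zero vector for an empty sentence are assumptions, not details given in the text.

```python
from typing import Dict, List, Sequence


def sentence_embedding(
    tokens: Sequence[str],
    word_vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the word vectors of the tokens that have an embedding."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        # Assumption: a sentence with no known words maps to the zero vector.
        return [0.0] * dim
    # zip(*vectors) walks the vectors column by column (one dimension at a time).
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

For a toy table such as `{"a": [1.0, 0.0], "b": [0.0, 1.0]}`, the sentence `["a", "b"]` averages to `[0.5, 0.5]`.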

Preferably, under the second consistency analysis method, the semantic consistency of the text document can be calculated by Equation 2 below, using the average of the similarities between all sentences constituting the text document.

(Equation 2)

Here, N is the total number of sentences in the document, and Sim(s_i, s_j) is the similarity between sentences, calculated with cosine similarity over the sentence vectors corresponding to each sentence s_i of a text document D = {s_1, s_2, ..., s_N} consisting of N sentences.
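Equation 2 is not reproduced in this text. The sketch below assumes it averages the cosine similarity over all unordered sentence pairs (i < j), which is one reading of "the average of the similarities between all sentences". The function names are illustrative, and treating a zero vector as having similarity 0.0 is an assumption.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Sim(s_i, s_j): cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: similarity with a zero vector is treated as 0.0.
        return 0.0
    return dot / (norm_u * norm_v)


def second_method_consistency(sentence_vectors: List[Sequence[float]]) -> float:
    """Assumed form of Equation 2: mean similarity over all sentence pairs."""
    pairs = list(combinations(sentence_vectors, 2))
    if not pairs:
        return 0.0  # Assumption: fewer than two sentences scores 0.0.
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

Averaging over all pairs rather than only adjacent ones rewards documents that stay on one topic throughout, at a cost that grows quadratically with the number of sentences.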

Preferably, to measure the consistency of the text document, the third consistency analysis method trains the convolutional neural network using each sentence constituting the text document together with the sentence before it and the sentence after it. To serve as input to the convolutional neural network, each morpheme in the morphological analysis result of an input sentence is treated as a word and converted into a word vector by word embedding, and the vectors of the words making up each sentence are then concatenated into a sentence matrix.
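As a rough illustration of the sentence matrix, the sketch below stacks one word vector per morpheme. Padding or truncating to a fixed number of rows and mapping unknown morphemes to a zero vector are assumptions added so that every matrix has the same shape; the text does not say how variable-length sentences are handled.

```python
from typing import Dict, List, Sequence


def sentence_matrix(
    morphemes: Sequence[str],
    word_vectors: Dict[str, List[float]],
    dim: int,
    max_len: int,
) -> List[List[float]]:
    """Stack one word vector per morpheme into a max_len x dim matrix."""
    zero = [0.0] * dim
    # Assumption: unknown morphemes are represented by a zero vector.
    rows = [list(word_vectors.get(m, zero)) for m in morphemes[:max_len]]
    # Assumption: short sentences are padded with zero rows to a fixed height.
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows
```

A two-morpheme sentence with `max_len=4` therefore yields two embedding rows followed by two zero rows.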

Preferably, the input to the convolutional neural network consists of three sentences: each sentence constituting the text document, the sentence before it, and the sentence after it. For a consistent document D = {s_1, s_2, ..., s_N} consisting of N sentences (N being the total number of sentences in the document), three sentences are defined as one set (q) to form the input data of the model, and error training data is generated by replacing the middle sentence of the three sentences in a set (q) with an arbitrary other, inconsistent sentence. For the sentences constituting a consistent document, a three-sentence set, which is one item of training data, is labeled y_q = 1, while the error training data generated by replacing the middle sentence with an arbitrary inconsistent sentence is labeled y_q = 0. The model is trained on these labels, and consistency can be quantitatively calculated from the trained result.
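The construction of positive and negative training sets can be sketched as below. This is an illustration under assumptions: sentences are plain strings, replacement sentences are drawn from a hypothetical `distractors` pool supplied by the caller, and a window is skipped for negatives when the pool offers nothing different from the true middle sentence.

```python
import random
from typing import List, Tuple

Window = Tuple[str, str, str]  # (previous sentence, middle sentence, next sentence)


def build_training_sets(
    doc: List[str],
    distractors: List[str],
    rng: random.Random,
) -> List[Tuple[Window, int]]:
    """Label real windows y_q = 1 and corrupted windows y_q = 0."""
    examples: List[Tuple[Window, int]] = []
    for i in range(1, len(doc) - 1):
        # Positive example: three consecutive sentences from the document.
        examples.append(((doc[i - 1], doc[i], doc[i + 1]), 1))
        # Negative example: same neighbours, middle sentence swapped out.
        candidates = [s for s in distractors if s != doc[i]]
        if candidates:
            examples.append(((doc[i - 1], rng.choice(candidates), doc[i + 1]), 0))
    return examples
```

Passing a seeded `random.Random` keeps the generated negatives reproducible between runs.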

Preferably, the consistency (S_D) of a consistent document D = {s_1, s_2, ..., s_N} consisting of N sentences can be quantitatively calculated from the trained result of the convolutional neural network by Equation 3 below.

(Equation 3)

Here, q is the three-sentence set defined over the consistent document D = {s_1, s_2, ..., s_N} consisting of N sentences, and p is the consistency probability for a three-sentence set, obtained as the trained result of the convolutional neural network.
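Equation 3 and the exact definition of the set q are not reproduced in this text, so the following is a sketch under loudly stated assumptions: each set is taken to be a sliding window (s_{i-1}, s_i, s_{i+1}), and the document score S_D is taken to be the product of the per-window probabilities p. An average over windows would be another plausible aggregation. The `predict` callable stands in for the trained network and is hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Window = Tuple[str, str, str]


def sliding_windows(doc: Sequence[str]) -> List[Window]:
    """Assumed sets q: every run of three consecutive sentences."""
    return [(doc[i - 1], doc[i], doc[i + 1]) for i in range(1, len(doc) - 1)]


def third_method_consistency(
    doc: Sequence[str],
    predict: Callable[[Window], float],
) -> float:
    """Assumed form of Equation 3: product of window probabilities p."""
    windows = sliding_windows(doc)
    if not windows:
        raise ValueError("at least three sentences are needed to form a set q")
    # predict(q) is a stand-in for the trained model's probability that y_q = 1.
    return math.prod(predict(q) for q in windows)
```

Because a product shrinks as windows are added, scores from this sketch are only comparable between documents of similar length.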

Preferably, the apparatus may further include a display unit that, under the control of the text consistency analysis unit, shows the quantitatively analyzed consistency score between the sentences of the text document on a display screen so that the user can view it.

Preferably, the apparatus may further include a storage unit that, under the control of the text consistency analysis unit, stores the quantitatively analyzed consistency information data between the sentences of the text document in a database (DB), organized per text document or per sentence.

Preferably, the apparatus may further include a communication unit that, under the control of the text consistency analysis unit, transmits the quantitatively analyzed consistency information data between the sentences of the text document to an external user terminal over a wired or wireless connection.

A second aspect of the present invention provides a method for analyzing consistency between sentences in a text document using an apparatus that includes a preprocessor and a text consistency analysis unit, the method comprising: (a) receiving, through the preprocessor, a document-level text document in a natural language as input, decomposing it into paragraphs and sentences, performing morphological analysis on the decomposed paragraphs and sentences, and outputting the result as a text document; and (b) quantitatively analyzing, through the text consistency analysis unit, whether each unit sentence of the text document output in step (a) maintains consistency in the context of the entire text document, using at least one of a first consistency analysis method that quantitatively calculates consistency by analyzing the keywords of each sentence other than the first sentence of a paragraph against the keywords of the preceding sentence, a second consistency analysis method that extracts the features of each sentence, expresses them as a vector, calculates the similarity between sentences from these vectors, and uses the similarity to quantitatively calculate consistency, and a third consistency analysis method that randomly generates inconsistent sentences by a machine learning method, trains a deep-learning-based convolutional neural network on them, and quantitatively calculates consistency from the trained result.
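Step (a) can be sketched as a small preprocessing pipeline. This is an illustration only: paragraphs are assumed to be separated by blank lines, sentences are split on terminal punctuation, and `analyze_morphemes` is a placeholder that lowercases and tokenizes on word characters. A real system for Korean would call a morphological analyzer at that point; none is named in the text.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Assumption: paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> List[str]:
    """Assumption: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def analyze_morphemes(sentence: str) -> List[str]:
    """Placeholder for morphological analysis (hypothetical stand-in)."""
    return re.findall(r"\w+", sentence.lower())


def preprocess(text: str) -> List[List[List[str]]]:
    """Return paragraphs -> sentences -> morphemes for step (b) to consume."""
    return [
        [analyze_morphemes(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]
```

Keeping the paragraph level in the output lets a later step skip the first sentence of each paragraph, as the first consistency analysis method describes.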

Preferably, in step (b), the first consistency analysis method compares the morphemes of each sentence to determine how semantically related adjacent sentences are, and quantitatively calculates consistency using Equation 4 below.

(Equation 4)

Preferably, in step (b), the second consistency analysis method uses word embedding and sentence embedding, which learn the context information of the words and sentences of a text and express each sentence as a feature vector through dimension reduction and abstraction. The sentences constituting the text document are expressed as sentence vectors, the similarity between the sentence vectors is calculated, and the calculated similarities are then used to quantitatively calculate the semantic consistency of the text document.

Preferably, under the second consistency analysis method, the semantic consistency of the text document can be calculated by Equation 5 below, using the average of the similarities between all sentences constituting the text document.

(Equation 5)

Here, as with Equation 2, Sim(s_i, s_j) is the similarity calculated between sentences from their sentence vectors.

Preferably, in step (b), to measure the consistency of the text document, the third consistency analysis method trains the convolutional neural network using each sentence constituting the text document together with the sentence before it and the sentence after it. To serve as input to the convolutional neural network, each morpheme in the morphological analysis result of an input sentence is treated as a word and converted into a word vector by word embedding, and the vectors of the words making up each sentence are then concatenated into a sentence matrix.

Preferably, the consistency (S_D) of a consistent document D = {s_1, s_2, ..., s_N} consisting of N sentences can be quantitatively calculated from the trained result of the convolutional neural network by Equation 6 below.

(Equation 6)

Preferably, the method may further include, after step (b), displaying the consistency score between the sentences quantitatively analyzed in step (b) on the display screen of a separate display unit so that the user can view it.

Preferably, the method may further include, after step (b), storing the consistency information data between the sentences quantitatively analyzed in step (b) in a separate storage unit as a database (DB), organized per text document or per sentence.

Preferably, the method may further include, after step (b), transmitting the consistency information data between the sentences quantitatively analyzed in step (b) to an external user terminal through a separate communication unit over a wired or wireless connection.

A third aspect of the present invention provides a computer-readable recording medium on which a program capable of executing the above-described method for analyzing consistency between sentences in a text document is recorded.

The method for analyzing consistency between sentences in a text document according to the present invention may be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every type of recording device that stores data readable by a computer system.

Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, hard disks, floppy disks, removable storage devices, non-volatile memory (flash memory), and optical data storage devices.

According to the apparatus and method for analyzing consistency between sentences in a text document of the present invention as described above, document-level text is received as input and decomposed into sentences and paragraphs, and whether each unit sentence maintains consistency in the context of the entire text is quantitatively measured. This has the advantage that document-level consistency can be effectively analyzed for quality evaluation and analysis of unstructured text data.

In addition, according to the present invention, text quality can be effectively measured by analyzing document consistency.

In addition, according to the present invention, when summarizing documents or analyzing the meaning of text documents, the consistency score can be used to filter out documents that fall below a specific threshold.

In addition, according to the present invention, a writer can easily check, for example through a computer device, whether a composition is written in consistent sentences.

In addition, according to the present invention, the suitability of sentences can be judged whether they were written by a person or generated by a machine or artificial intelligence (natural language generation).

FIG. 1 is an overall block diagram illustrating an apparatus for analyzing consistency between sentences in a text document according to an embodiment of the present invention.
FIG. 2 is an overall flowchart illustrating a method for analyzing consistency between sentences in a text document according to an embodiment of the present invention.
FIG. 3 is a diagram expressing the cross-reference relationships between sentences as a matrix, for the first consistency analysis method applied to an embodiment of the present invention.
FIG. 4 is a conceptual diagram of the convolutional-neural-network-based sentence embedding model used in the third consistency analysis method applied to an embodiment of the present invention.
FIG. 5 is a conceptual diagram of the convolutional-neural-network-based model structure for measuring document consistency in the third consistency analysis method applied to an embodiment of the present invention.
FIG. 6 is an example, in table form, of the input data of the document consistency measurement model in the third consistency analysis method applied to an embodiment of the present invention.

The above-described objects, features, and advantages are described in detail below with reference to the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known technologies are omitted where they would unnecessarily obscure the subject matter of the invention.

Terms that include ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly a second component may be named a first component. The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

The terms used in the present invention were chosen, as far as possible, from general terms in wide current use, taking into account their function in the invention, but they may vary with the intentions of those working in the field, with precedent, or with the emergence of new technologies. In certain cases a term has been arbitrarily selected by the applicant, in which case its meaning is set out in detail in the corresponding part of the description. The terms used in the present invention should therefore be defined on the basis of their meaning and the overall content of the invention, not merely by their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit" and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments illustrated below may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

Combinations of the blocks of the attached block diagram and the steps of the flowchart may be performed by computer program instructions (an execution engine). Because these instructions may be loaded onto the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed through that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart. The instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so the instructions stored in that memory can also produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart.

The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to create a computer-executed process. The instructions that run on the computer or other programmable data processing equipment can thereby provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or step may represent a module, segment, or portion of code containing one or more executable instructions for carrying out the specified logical functions. Note that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or may be performed in the reverse order of the corresponding functions as needed.

FIG. 1 is an overall block diagram illustrating an apparatus for analyzing consistency between sentences in a text document according to an embodiment of the present invention.

Referring to FIG. 1, the apparatus for analyzing consistency between sentences in a text document according to an embodiment of the present invention broadly comprises a preprocessor 100 and a text consistency analysis unit 200. The apparatus may further include a display unit 300, a storage unit 400, and a communication unit 500. The components shown in FIG. 1 are not all essential, so the apparatus may have more or fewer components than those shown.

The components of the apparatus for analyzing consistency between sentences in a text document according to an embodiment of the present invention are described in detail below.

The preprocessor 100 receives unstructured data, that is, a document-level text document written in natural language, decomposes it into paragraphs and sentences, performs morphological analysis on the decomposed paragraphs and sentences, and outputs the result as a text document.
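The preprocessing flow just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, paragraphs are assumed to be separated by blank lines, and plain word tokens stand in for the output of a real morphological analyzer, which the text does not name.

```python
import re


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Split after sentence-ending punctuation that is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in parts if s]


def analyze_morphemes(sentence):
    # Placeholder for morphological analysis: lowercase word tokens stand in
    # for morphemes. A real system would call a Korean morphological analyzer.
    return re.findall(r"\w+", sentence.lower())


def preprocess(text):
    # Result shape: paragraphs -> sentences -> morphemes.
    return [
        [analyze_morphemes(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]


if __name__ == "__main__":
    sample = "The cat sat. The cat slept.\n\nDogs bark!"
    for paragraph in preprocess(sample):
        print(paragraph)
```

Keeping the paragraph level in the output matters for the first consistency analysis method, which treats the first sentence of each paragraph differently from the rest.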

The document-level natural-language text document is preferably received through a separate user input unit (not shown). The user input unit outputs a specific input signal in response to a user's request or operation. It preferably consists of a mouse and/or keyboard, but is not limited to these and may in some cases be a remote control or a touch screen.

For example, the touch screen may use a resistive, capacitive, infrared, or ultrasonic method. The capacitive method is most preferable for minimizing thickness.

A capacitive touch screen typically comprises an ITO (Indium Tin Oxide) layer formed as a conductive light-transmitting plate, an electrode part formed by applying silver-powder paint along the edge of the ITO, and an insulating coating part that insulates the lower portion of the electrode. The ITO may in turn consist of an ITO film made of light-transmitting resin and an ITO coating layer formed by coating a conductive material on the underside of the ITO film.

In such a capacitive touch screen, when a finger touches the upper surface of the ITO, the electrodes provided on the four sides detect the resulting change in capacitance through the finger, and the touch position is determined from it.

The document-level natural-language text document may also be received over a wired and/or wireless connection from a separate user terminal (not shown). The user terminal is preferably at least one mobile terminal device among a smartphone, a smart pad, or a smart note that communicates over wireless Internet or portable Internet. More broadly, it may be any wired or wireless home-appliance or communication device with a user interface for accessing the preprocessor 100, such as a personal computer, notebook PC, palm PC, mobile game console, DMB (Digital Multimedia Broadcasting) phone with communication functions, tablet PC, or iPad.

The text consistency analysis unit 200 performs the overall analysis and control of the apparatus for analyzing consistency between sentences in a text document according to an embodiment of the present invention. In particular, for the text document output by the preprocessor 100 after morphological analysis, it quantitatively analyzes whether each sentence remains consistent with the context of the whole text document, using at least one of three consistency analysis methods: a first method that quantitatively calculates consistency by analyzing the keywords of each sentence, other than the first sentence of a paragraph, against the keywords of the preceding sentence; a second method that extracts the features of each sentence, represents them as vectors, calculates the similarity between sentences from those vectors, and quantitatively calculates consistency from the similarities; and a third method that randomly generates inconsistent sentences by a machine learning method, trains a deep-learning-based convolutional neural network on them, and quantitatively calculates consistency from the trained result.

(Equation 1)

The second consistency analysis method uses word embedding and sentence embedding, which learn the contextual information of the words and sentences in a text and represent each sentence as a feature vector through dimensionality reduction and abstraction. The sentences that make up the text document are each represented as a sentence vector, the similarity between sentence vectors is calculated, and the semantic consistency of the text document is then quantitatively calculated from those similarities.

Here, the sentence embedding is preferably defined as the average of the word vectors obtained by word embedding for the words that make up the sentence.

With the second consistency analysis method, the semantic consistency of the text document can be calculated by Equation 2 below, using the average of the similarities between all sentences that make up the document.

(Equation 2)

Here, the similarity term in Equation 2 is the degree of similarity calculated between sentences.
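The second method can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the patent: cosine similarity is assumed as the similarity measure, "all sentences" is read as every unordered pair of sentences, and the function names and the toy `word_vectors` table are hypothetical.

```python
import math
from itertools import combinations


def sentence_vector(tokens, word_vectors):
    # Sentence embedding = average of the word vectors of its known words.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        raise ValueError("sentence contains no word with a known vector")
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def cosine(u, v):
    # Cosine similarity; an assumed choice of similarity measure.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def document_consistency(sentences, word_vectors):
    # Average similarity over every unordered pair of sentence vectors.
    if len(sentences) < 2:
        raise ValueError("need at least two sentences")
    vectors = [sentence_vector(s, word_vectors) for s in sentences]
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    toy_vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "tax": [0.0, 1.0]}
    doc = [["cat"], ["dog"], ["tax"]]
    print(document_consistency(doc, toy_vectors))
```

Averaging over all pairs, rather than only adjacent ones, rewards documents whose sentences stay on one topic throughout.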

To measure the consistency of the text document, the third consistency analysis method trains the convolutional neural network on each sentence of the document together with the sentence before it and the sentence after it. To serve as input to the convolutional neural network, each morpheme in the morphological analysis result of an input sentence is treated as a word and converted into a word vector by word embedding, and the word vectors of each sentence are then concatenated into a sentence matrix.
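A sentence matrix of this kind can be sketched as a list of word-vector rows. The padding to a fixed length and the zero vector for unknown morphemes are assumptions added so that every sentence yields a matrix of the same shape; the text does not specify either. The function name and parameters are hypothetical.

```python
def sentence_matrix(morphemes, embeddings, dim, max_len):
    # One row per morpheme: each morpheme is treated as a word and looked up
    # in the embedding table. Unknown morphemes map to a zero vector (assumed).
    zero = [0.0] * dim
    rows = [list(embeddings.get(m, zero)) for m in morphemes[:max_len]]
    # Pad with zero rows so the matrix is always max_len x dim (assumed).
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows


if __name__ == "__main__":
    table = {"cat": [0.1, 0.9], "sleep": [0.7, 0.2]}
    for row in sentence_matrix(["cat", "sleep", "quietly"], table, dim=2, max_len=4):
        print(row)
```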

Here, the input to the convolutional neural network is a total of three sentences: a sentence of the text document, the sentence before it, and the sentence after it. For a consistent document D = {s₁, s₂, …, s_N} consisting of N sentences (N being the total number of sentences in the document), three sentences are defined as one set (q) to form the model's input data, and error training data is generated by replacing the middle sentence of a set (q) with an arbitrary inconsistent sentence. Each three-sentence set taken from the consistent document is one training example labeled y_q = 1, and each error training example generated by replacing the middle sentence with an arbitrary inconsistent sentence is labeled y_q = 0. The model is trained on these examples, and consistency is quantitatively calculated from the trained result.
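The construction of the labeled three-sentence sets can be sketched as below. Several details are assumptions: only interior sentences are used as the middle sentence (the handling of the first and last sentence is not described), the replacement sentences come from a caller-supplied pool, and one negative example is generated per positive example. The function name `build_training_sets` is hypothetical.

```python
import random


def build_training_sets(document, distractors, rng=None):
    # document: ordered list of sentences from a consistent document.
    # distractors: pool of sentences assumed to be inconsistent with it.
    rng = rng or random.Random(0)
    examples = []
    for i in range(1, len(document) - 1):
        prev_s, mid_s, next_s = document[i - 1], document[i], document[i + 1]
        # Positive example: the original set q, labeled y_q = 1.
        examples.append(((prev_s, mid_s, next_s), 1))
        candidates = [s for s in distractors if s != mid_s]
        if not candidates:
            raise ValueError("no usable replacement sentence")
        # Negative example: middle sentence replaced, labeled y_q = 0.
        examples.append(((prev_s, rng.choice(candidates), next_s), 0))
    return examples


if __name__ == "__main__":
    doc = ["s1", "s2", "s3", "s4"]
    pool = ["unrelated sentence A", "unrelated sentence B"]
    for q, label in build_training_sets(doc, pool):
        print(label, q)
```

Keeping the two outer sentences fixed and changing only the middle one means the classifier has to learn whether a sentence fits its immediate context.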

From the trained result of the convolutional neural network, the consistency (S_D) of a document D = {s₁, s₂, …, s_N} consisting of N sentences can be quantitatively calculated by Equation 3 below.

(Equation 3)

Additionally, under the control of the text consistency analysis unit 200, the display unit 300 shows the consistency scores between sentences quantitatively analyzed for the text document on a display screen so that the user can view them.

The display unit 300 may include, for example, at least one of a liquid crystal display (LCD), a light-emitting diode (LED) display, a thin-film-transistor liquid crystal display (TFT LCD), an organic light-emitting diode (OLED) display, a flexible display, a plasma display panel (PDP), an alternate lighting of surfaces (ALiS) display, a digital light processing (DLP) display, a liquid crystal on silicon (LCoS) display, a surface-conduction electron-emitter display (SED), a field emission display (FED), a laser TV (quantum dot laser, liquid crystal laser), a ferroelectric liquid crystal display (FLD), an interferometric modulator display (iMoD), a thick-film dielectric electroluminescent (TDEL) display, a quantum dot display (QD-LED), a telescopic pixel display (TPD), an organic light-emitting transistor (OLET), a laser phosphor display (LPD), and a three-dimensional (3D) display. It is not limited to these, however, and may include anything capable of displaying numbers, characters, or figures.

Furthermore, under the control of the text consistency analysis unit 200, the storage unit 400 stores the consistency information data between sentences quantitatively analyzed for the text document, organized into a database (DB) per text document and/or per sentence.

The storage unit 400 may include at least one type of storage medium among, for example, flash memory, hard disk, multimedia card micro, card-type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, and optical disk.

In addition, under the control of the text consistency analysis unit 200, the communication unit 500 transmits the consistency information data between sentences quantitatively analyzed for the text document to an external user terminal (not shown) over a wired and/or wireless connection.

The external user terminal receives, through a pre-installed sentence consistency analysis application service, the consistency information data between sentences transmitted from the communication unit 500, and on that basis can display it on the terminal's screen in text and/or graph form so that the user can view it.

The various embodiments described here may be implemented in a recording medium readable by a computer or similar device, using, for example, software, hardware, or a combination of the two.

In a hardware implementation, the embodiments described here may be implemented using at least one of application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and other electrical units for performing functions. In some cases such embodiments may be implemented by the text consistency analysis unit 200.

In a software implementation, embodiments such as procedures or functions may be implemented with separate software modules, each of which performs at least one function or operation. The software code may be implemented as a software application written in a suitable programming language. The software code may also be stored in the storage unit 400 and executed by the text consistency analysis unit 200.

Although not shown in the drawings, a power supply unit (not shown) is preferably also included to supply the power required by each of the units described above, namely the preprocessor 100, the text consistency analysis unit 200, the display unit 300, the storage unit 400, and the communication unit 500. The power supply unit preferably receives external AC power (for example, AC 220 V) and converts it into the various DC voltages required, but is not limited to this and may be implemented as an ordinary portable battery.

The apparatus for analyzing consistency between sentences in a text document according to an embodiment of the present invention, configured as described above, is preferably implemented as, for example, a personal computer or notebook PC. It is not limited to these, however, and may be implemented as at least one mobile terminal device among a smartphone, a smart pad, or a smart note that communicates over wireless Internet or portable Internet. It may also be implemented as a palm PC, a mobile game console, a DMB (Digital Multimedia Broadcasting) phone with communication functions, a tablet PC, an iPad, and the like.

If the apparatus for analyzing consistency between sentences in a text document according to an embodiment of the present invention is implemented as a smartphone, the smartphone is, unlike an ordinary mobile phone (a so-called feature phone), a phone based on an open operating system on which the user can freely download, use, and delete various application programs. It is preferably understood as a communication device that covers every mobile phone offering mobile office functions in addition to commonly used functions such as voice/video calls and Internet data communication, as well as every Internet phone or tablet PC that has no voice call function but can access the Internet.

Such a smartphone may be implemented with any of various open operating systems, for example Nokia's Symbian, RIM's BlackBerry, Apple's iPhone, Microsoft's Windows Mobile, Google's Android, or Samsung Electronics' Bada.

Because a smartphone uses an open operating system in this way, the user can freely install and manage various application programs, unlike on a mobile phone with a closed operating system.

That is, the smartphone described above basically includes a control unit, a memory unit, a screen output unit, a key input unit, a sound output unit, a sound input unit, a camera unit, a wireless network communication module, a short-range wireless communication module, and a battery for power supply.

The control unit is a generic term for the functional components that control the operation of the smartphone. It includes at least one processor and an execution memory, and is connected through a bus to each functional component provided in the smartphone.

The control unit loads at least one program code provided in the smartphone into the execution memory, executes it on the processor, and passes the result through the bus to at least one functional component, thereby controlling the operation of the smartphone.

The memory unit is a generic term for the non-volatile memory provided in the smartphone. It stores and retains at least one program code executed by the control unit and at least one data set used by that program code. The memory unit basically stores the system program code and system data set corresponding to the smartphone's operating system, the communication program code and communication data set that handle the smartphone's wireless communication connection, and at least one application program code and application data set. The program code and data set for implementing the present invention are also stored in the memory unit.

The screen output unit consists of a screen output device (for example, a liquid crystal display (LCD) device) and an output module that drives it. It is connected to the control unit by a bus and outputs to the screen output device those of the control unit's computation results that correspond to screen output.

The key input unit consists of a key input device having at least one key button (or a touch screen device that works together with the screen output unit) and an input module that drives it. It is connected to the control unit by a bus and is used to enter commands that direct the control unit's operations or to enter data needed for those operations.

The sound output unit consists of a speaker that outputs a sound signal and a sound module that drives the speaker. It is connected to the control unit by a bus and outputs through the speaker those of the control unit's computation results that correspond to sound output. The sound module decodes the sound data to be output through the speaker and converts it into a sound signal.

The sound input unit consists of a microphone that receives a sound signal and a sound module that drives the microphone, and passes the sound data input through the microphone to the control unit. The sound module encodes the sound signal input through the microphone.

The camera unit consists of an optical unit, a CCD (Charge Coupled Device), and a camera module that drives them, and acquires the bitmap data input to the CCD through the optical unit. The bitmap data may include both still-image data and video data.

The wireless network communication module is a generic term for the communication components that establish wireless communication. It includes at least one of an antenna for transmitting and receiving radio-frequency signals in a specific frequency band, an RF module, a baseband module, and a signal processing module. It is connected to the control unit by a bus and transmits those of the control unit's computation results that correspond to wireless communication, or receives data over wireless communication and passes it to the control unit, while maintaining the connection, registration, communication, and handoff procedures of the wireless communication.

The wireless network communication module also includes a mobile communication component that performs at least one of connection to a mobile communication network, location registration, call processing, call connection, data communication, and handoff in accordance with the CDMA/WCDMA standards. Depending on the intent of a person skilled in the art, the wireless network communication module may further include a portable Internet communication component that performs at least one of connection to the portable Internet, location registration, data communication, and handoff in accordance with the IEEE 802.16 standard. The present invention is expressly not limited by the wireless communication configuration that the wireless network communication module provides.

The short-range wireless communication module connects communication sessions within a certain distance using radio-frequency signals as the communication medium, and may preferably include at least one of RFID communication under the ISO 18000 series standards, Bluetooth communication, Wi-Fi communication, and public wireless communication. The short-range wireless communication module may also be integrated with the wireless network communication module.

A smartphone configured in this way means a terminal capable of wireless communication, and any device other than a smartphone may be used as long as it is a terminal that can send and receive data over a network including the Internet. That is, the term smartphone may cover at least one of a notebook PC or tablet PC having short message transmission and network access functions, or any other portable mobile terminal.

A method for analyzing consistency between sentences in a text document according to an embodiment of the present invention is described in detail below.

FIG. 2 is an overall flowchart illustrating a method for analyzing consistency between sentences in a text document according to an embodiment of the present invention. FIG. 3 shows, for the first consistency analysis method applied to an embodiment of the present invention, the cross-reference relationships between sentences expressed as a matrix. FIG. 4 is a conceptual diagram of the convolutional-neural-network-based sentence embedding model used in the third consistency analysis method. FIG. 5 is a conceptual diagram of the structure of the convolutional-neural-network-based model for measuring document consistency in the third consistency analysis method. FIG. 6 shows an example, in table form, of the input data of the document consistency measurement model in the third consistency analysis method.

Referring to FIGS. 1 to 6, in the method for analyzing consistency between sentences in a text document according to an embodiment of the present invention, the preprocessor 100 first receives unstructured data, that is, a document-level text document written in natural language, decomposes it into paragraphs and sentences, performs morphological analysis on the decomposed paragraphs and sentences, and outputs the result as a text document (S100).

Next, for the text document output after morphological analysis in step S100, the text coherence analysis unit 200 quantitatively analyzes whether each unit sentence maintains coherence within the context of the entire text document (S200), using at least one of the following coherence analysis methods: a first coherence analysis method that quantitatively calculates coherence by analyzing the keywords of each sentence other than the first sentence of a paragraph together with the keywords of the preceding sentence; a second coherence analysis method that extracts the features of each sentence, expresses them as vectors, calculates the similarity between sentences from those vectors, and uses the similarity to quantitatively calculate coherence; and a third coherence analysis method that randomly generates incoherent sentences by a machine learning method, trains a deep-learning-based convolutional neural network on them, and quantitatively calculates coherence from the trained result.

In step S200, the first coherence analysis method compares the morphemes of each sentence to determine how semantically related adjacent sentences are. Coherence can therefore be calculated quantitatively using Equation 4 below.

(Equation 4)

Here, N is the total number of sentences in the document, and R(i, i+1) indicates the state of the cross-reference relationship between the i-th sentence and the (i+1)-th sentence. Because Korean is an agglutinative language, the elements of the cross-reference relationship are defined at the morpheme level and are limited to parts of speech that carry meaning in a sentence, such as nouns, adjectives, verbs, and roots.

That is, R(i, i+1) is defined as '1' if two adjacent sentences have at least one cross-reference relationship (that is, if they share an identical morpheme with a meaningful part of speech), and as '0' otherwise.
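
The image of Equation 4 is not available in this text. From the definitions of N and R(i, i+1) and from the description of the score as the average of the cross-reference values of adjacent sentences, it can plausibly be written as below. This is a reconstruction under that reading, and the name Coherence(D) is an assumed label, not the original notation.

```latex
\mathrm{Coherence}(D) = \frac{1}{N-1} \sum_{i=1}^{N-1} R(i, i+1),
\qquad
R(i, i+1) =
\begin{cases}
1 & \text{if sentences } i \text{ and } i+1 \text{ share at least one meaningful morpheme,} \\
0 & \text{otherwise.}
\end{cases}
```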

The process of measuring document coherence is described in detail through the example below.

(Example)

Sentence 1: When water is heated, it boils and eventually evaporates.

Sentence 2: When the evaporated gas is cooled, it turns into water again.

Sentence 3: The phenomenon in which a substance turns into a completely different, new substance is called a chemical change.

Sentence 4: The same is true of ice melting into water.

Here, because sentence 1 and sentence 2 share two or more identical morphemes, R(1, 2) is '1'. Because sentence 2 and sentence 3 share no identical morpheme, R(2, 3) is '0'. In the same way, R(3, 4) is '0'. FIG. 3 shows the cross-reference relationships between the sentences as a matrix.

Finally, the cross-reference values of the adjacent sentence pairs are averaged.

That is, (R(1, 2) + R(2, 3) + R(3, 4))/3 = (1 + 0 + 0)/3 = 0.333…
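
The worked example can be reproduced with a short sketch. The function names are hypothetical, and the content-token sets are hand-assigned English stand-ins chosen to match the R values stated in the example; a real system would obtain them from morphological analysis with part-of-speech filtering.

```python
def cross_reference(a: set[str], b: set[str]) -> int:
    """R(i, i+1): 1 if two adjacent sentences share a content token, else 0."""
    return 1 if a & b else 0


def coherence_by_cross_reference(sentences: list[set[str]]) -> float:
    """Average R(i, i+1) over the N-1 adjacent sentence pairs."""
    if len(sentences) < 2:
        raise ValueError("at least two sentences are required")
    pairs = zip(sentences, sentences[1:])
    return sum(cross_reference(a, b) for a, b in pairs) / (len(sentences) - 1)


# Hand-assigned content tokens for sentences 1 to 4 (illustrative assumption).
example = [
    {"water", "heat", "boil", "evaporate"},
    {"evaporate", "gas", "cool", "water"},
    {"substance", "different", "new", "phenomenon", "chemical", "change"},
    {"ice", "melt", "water"},
]
score = coherence_by_cross_reference(example)  # (1 + 0 + 0) / 3
```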

In step S200, the second coherence analysis method uses word embedding and sentence embedding, which learn the context information of the words and sentences of a text and express each sentence as a feature vector through dimension reduction and abstraction, to represent the sentences constituting the text document as sentence vectors. It then calculates the similarity between the sentence vectors and uses the calculated similarities to quantitatively calculate the semantic coherence of the text document. Here, the sentence embedding is preferably defined as the average of the word vectors obtained through word embedding of the words constituting the sentence.

The second coherence analysis method is described in more detail below by way of example.

Sentence feature representation using sentence embedding learns the context information of the words and sentences of a text and expresses each sentence as a feature vector through dimension reduction and abstraction. By expressing the sentences constituting a document as sentence vectors, the semantic similarity between sentences can be calculated.

Here, 'Sent2Vec' is preferably used as the sentence embedding model. 'Sent2Vec' extends the 'FastText' model and the CBOW (Continuous Bag of Words) model of 'Word2Vec' in order to learn the meaning of sentences rather than of individual words.

In the 'Sent2Vec' model, the sentence embedding is defined as the average of the word vectors obtained through word embedding of the words constituting the sentence. In addition to uni-gram word vectors, the model embeds each bi-gram of words in a sentence as a single new word and learns them together, and it uses the average of all uni-gram word vectors and bi-gram word-pair vectors.

For example, let v_S be the sentence vector of a sentence S, and let v_w be the target word vector that CBOW tries to predict for a word w in the vocabulary. The sentence vector is then generated by Equation 5 below, where R(S) is the list of n-grams (uni-grams and bi-grams) generated from sentence S.

(Equation 5)
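
Read literally, the description amounts to v_S = (1/|R(S)|) Σ v_w over the n-grams w in R(S). The sketch below illustrates that averaging only; it is not the Sent2Vec library or its training procedure. The function names, the `a_b` key format for bi-grams, and the choice to skip n-grams missing from the embedding table are all assumptions.

```python
def ngrams(tokens: list[str]) -> list[str]:
    """Return R(S): all uni-grams followed by all adjacent bi-grams."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def sentence_vector(
    tokens: list[str], embeddings: dict[str, list[float]], dim: int
) -> list[float]:
    """Average the vectors of every n-gram in R(S) found in the table."""
    vectors = [embeddings[g] for g in ngrams(tokens) if g in embeddings]
    if not vectors:
        return [0.0] * dim  # no known n-gram: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional embedding table (made-up values for illustration).
toy_embeddings = {
    "water": [1.0, 0.0],
    "boils": [0.0, 1.0],
    "water_boils": [2.0, 2.0],
}
v_s = sentence_vector(["water", "boils"], toy_embeddings, dim=2)
```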

Then, the sentences constituting a document are expressed as vectors, and the semantic coherence of the document is measured using the similarity between the sentence vectors. Cosine similarity is used as the similarity measure between the sentence vectors.

For a document D = {s_1, s_2, …, s_N} consisting of N sentences, the similarity between sentences, Sim(s_i, s_j), is calculated from the sentence vectors corresponding to the sentences s_i and s_j, as in the lower part of Equation 6 below. The coherence of a document is then calculated as in the upper part of Equation 6, using the average of the similarities between all sentences constituting the document.

(Equation 6)
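
A minimal sketch of this calculation follows. The cosine similarity is standard; the averaging range is an assumption, because the image of Equation 6 is not available: here the mean is taken over all unordered sentence pairs (i < j). The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Sim(s_i, s_j): cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def document_coherence(sentence_vectors: list[list[float]]) -> float:
    """Mean similarity over all unordered sentence pairs (assumed range)."""
    pairs = list(combinations(sentence_vectors, 2))
    if not pairs:
        raise ValueError("at least two sentences are required")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

If the original equation averages only adjacent pairs, `combinations` would be replaced by `zip(sentence_vectors, sentence_vectors[1:])`.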

In step S200, the third coherence analysis method trains the convolutional neural network using each sentence constituting the text document, the sentence preceding it, and the sentence following it, in order to measure the coherence of the text document.

To be used as input to the convolutional neural network, a sentence is represented by the vectors of its words. That is, each morpheme in the morphological analysis result of the input sentence is treated as a word and converted into a word vector using word embedding. The vectors of the words constituting each sentence are then concatenated into a sentence matrix, which is used as the input.

For example, the sentence matrix is built from an input sentence s and the words that constitute it. When d-dimensional word vectors obtained through word embedding are used, the sentence matrix S consists of the vectors of the words constituting the input sentence s, so that it has d rows and as many columns as the sentence has words. That is, the i-th column of S is the vector of the i-th word of s.
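
The construction of the sentence matrix can be sketched as follows. The function name is hypothetical, and mapping tokens missing from the embedding table to a zero vector is an assumption that the description does not address.

```python
def sentence_matrix(
    tokens: list[str], embeddings: dict[str, list[float]], dim: int
) -> list[list[float]]:
    """Build a d x |s| matrix whose i-th column is the vector of the i-th token."""
    columns = [embeddings.get(token, [0.0] * dim) for token in tokens]
    # Transpose the list of column vectors into d rows of length |s|.
    return [[column[row] for column in columns] for row in range(dim)]


toy_table = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
S = sentence_matrix(["a", "b", "zzz"], toy_table, dim=2)
```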

The convolution layer uses a convolution filter to extract important features from the sentence matrix. The convolution filter is a weight matrix with width m and the same dimension d as the sentence matrix S. As shown in FIG. 3, the convolution filter moves across the sentence matrix with a stride of 1 and produces a vector c as output. The components of c are calculated as in Equation 7 below.

(Equation 7)

Here, the submatrix in Equation 7 is the part of the sentence matrix, of width m, that moves along its columns, and the operator denotes element-wise multiplication between matrices. After the convolution operation, the nonlinear activation function 'ReLU' is applied.

In the pooling layer, max pooling is used to reduce and merge the dimensions of the feature maps produced by the convolution layer. Max pooling operates on the columns of a feature map matrix C and returns the maximum value of the output of the convolution layer, as in Equation 8 below.

(Equation 8)
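
The convolution, ReLU, and max pooling steps can be sketched for a single filter as follows. Several details are assumptions because the images of Equations 7 and 8 are not available: the filter is named `F`, the convolution is "narrow" (the filter never extends past the sentence boundary, so the sentence must have at least m words), and a scalar bias is added before ReLU.

```python
def convolve(S: list[list[float]], F: list[list[float]], bias: float = 0.0) -> list[float]:
    """Slide filter F (d x m) over sentence matrix S (d x n) with stride 1."""
    d, n, m = len(S), len(S[0]), len(F[0])
    feature_map = []
    for i in range(n - m + 1):
        # Element-wise product of the d x m slice starting at column i with F, summed.
        total = sum(S[r][i + k] * F[r][k] for r in range(d) for k in range(m))
        feature_map.append(max(0.0, total + bias))  # ReLU activation
    return feature_map


def max_pool(feature_map: list[float]) -> float:
    """Return the largest activation of one feature map."""
    return max(feature_map)


S_toy = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
F_toy = [[1.0, 0.0], [0.0, 1.0]]
c = convolve(S_toy, F_toy)
pooled = max_pool(c)
```

With several filters, each filter yields one pooled value, and the pooled values together form the sentence vector passed to the next layer.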

FIG. 5 shows the structure of the convolutional-neural-network-based model for measuring document coherence. Each sentence constituting the document, together with the sentences before and after it, is used as input, for a total of three sentences. For a coherent document D = {s_1, s_2, …, s_N} consisting of N sentences (N being the total number of sentences in the document), three sentences are defined as one set q to form the input data of the model.

Error training data is then generated by replacing the middle sentence of the three sentences in a set with an arbitrary incoherent sentence. FIG. 6 shows an example of the input data generated for the document coherence measurement model.

For example, for the sentences constituting a coherent document, 'y_q = 1' is set for each set of three sentences, which forms one training example. The model is then trained with 'y_q = 0' set for the error training data generated by replacing the middle sentence of the three with an arbitrary incoherent sentence.
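
The generation of training data can be sketched as follows. The function name and the `distractors` pool are hypothetical. Two points are assumptions: sets are formed only around interior sentences (the description does not say how the first and last sentences are handled), and exactly one corrupted set is generated per original set.

```python
import random


def build_training_sets(
    document: list[str], distractors: list[str], rng: random.Random
) -> list[tuple[tuple[str, str, str], int]]:
    """Return (set q, label y_q): 1 for original sets, 0 for corrupted ones."""
    examples = []
    for i in range(1, len(document) - 1):
        prev_s, mid_s, next_s = document[i - 1], document[i], document[i + 1]
        examples.append(((prev_s, mid_s, next_s), 1))
        # Replace the middle sentence with a randomly chosen different sentence.
        candidates = [s for s in distractors if s != mid_s]
        if candidates:
            examples.append(((prev_s, rng.choice(candidates), next_s), 0))
    return examples
```

Passing a seeded `random.Random` instance keeps the corrupted sets reproducible between runs.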

The three vectors x_1, x_2, and x_3 output by the three sentence embedding models are input to the join layer, where they are concatenated into a single vector representing the three sentences (see Equation 9).

(Equation 9)

In the hidden layer, the vector h is calculated as in Equation 10 below, using a weight matrix W_h, a bias b_h, and a nonlinear function f. Here, h can be regarded as the final vector for the three sentences after a series of operations from the input layer onward, such as convolution, pooling, and dimension transformation.

(Equation 10)

The output h of the hidden layer is then input to the softmax classification layer, which finally calculates the coherence probability for the three sentences as in Equation 11 below. W_s is a weight matrix and b_s is a bias.

(Equation 11)
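
The join, hidden, and softmax layers can be sketched as a forward pass. The function names are hypothetical, `tanh` is an assumed choice for the nonlinear function f, and the output is assumed to be ordered as [p(y_q = 0), p(y_q = 1)]; the description specifies none of these.

```python
import math


def affine(W: list[list[float]], x: list[float], b: list[float]) -> list[float]:
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def coherence_probability(x1, x2, x3, W_h, b_h, W_s, b_s) -> list[float]:
    """Return [p(y_q = 0), p(y_q = 1)] for one three-sentence set."""
    x_join = x1 + x2 + x3  # join layer: concatenate the three sentence vectors
    h = [math.tanh(z) for z in affine(W_h, x_join, b_h)]  # hidden layer
    logits = affine(W_s, h, b_s)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```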

The convolutional-neural-network-based document coherence measurement model described above is trained with Equation 12 below, which uses, for example, the negative conditional log-likelihood.

(Equation 12)
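
A negative conditional log-likelihood over a batch of three-sentence sets can be sketched as follows. The function name and the clipping constant `eps` are assumptions, and any regularization term that the original Equation 12 may contain is omitted because the equation image is not available.

```python
import math


def negative_log_likelihood(probabilities: list[float], labels: list[int]) -> float:
    """Sum of -log p(y_i | set i), where probabilities[i] is p(y = 1 | set i)."""
    eps = 1e-12  # clip to avoid log(0)
    loss = 0.0
    for p, y in zip(probabilities, labels):
        p_true = p if y == 1 else 1.0 - p
        loss -= math.log(max(p_true, eps))
    return loss
```

The loss is zero when every set is assigned probability 1 for its true label and grows as the predictions move away from the labels.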

Here, the parameters of the whole model consist of the word embedding matrix W, the convolution filter weights and biases used in the convolution layer of each sentence embedding model, the weight matrix and bias of the hidden layer (W_h; b_h), and the weight matrix and bias of the softmax layer (W_s; b_s). g^(i) is the i-th set of three sentences constructed as model input from a document D = {s_1, s_2, …, s_N} consisting of N sentences. Model training optimizes the model parameters by minimizing Equation 12 above.

From the trained result of the convolutional neural network, the coherence score (S_D) of a coherent document D = {s_1, s_2, …, s_N} consisting of N sentences can be calculated quantitatively by Equation 13 below.

(Equation 13)

Here, the coherence of a document is determined by the coherence of all the sets formed from the document. Because an incoherent sentence has a very large adverse effect on the coherence of the whole document, the coherence of a document is calculated as the product, over all sets q, of the values for 'y_q = 1'.
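
The scoring step can be sketched as a product of per-set probabilities. This assumes that the value multiplied for each set q is the model's probability p(y_q = 1 | q), which is how the description reads; the function name is hypothetical and the image of Equation 13 is not available.

```python
import math


def document_score(set_probabilities: list[float]) -> float:
    """S_D: product of p(y_q = 1 | q) over every three-sentence set q."""
    return math.prod(set_probabilities)


coherent = document_score([0.9, 0.9, 0.9])
one_bad_set = document_score([0.9, 0.1, 0.9])
```

Because the score is a product, a single set with low probability pulls the whole document score down sharply, which matches the stated motivation. For long documents, summing logarithms instead of multiplying would avoid numerical underflow.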

Additionally, although not shown in the drawings, the method may further include, after step S200, a step of displaying the coherence scores between sentences quantitatively analyzed in step S200 on the display screen of a separate display unit 300 so that the user can see them.

Furthermore, the method may further include, after step S200, a step of organizing the coherence information data between sentences quantitatively analyzed in step S200 into a database (DB) by text document and/or by sentence and storing it in a separate storage unit 400.

The method may also further include, after step S200, a step of transmitting the coherence information data between sentences quantitatively analyzed in step S200 to an external user terminal (not shown) by wire and/or wirelessly through a separate communication unit 500.

Meanwhile, the method for analyzing coherence between sentences in a text document according to an embodiment of the present invention can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, hard disks, floppy disks, removable storage devices, non-volatile memory (flash memory), and optical data storage devices.

In addition, the computer-readable recording medium can be distributed over computer systems connected through a computer communication network, so that the code is stored and executed in a distributed manner.

Preferred embodiments of the apparatus and method for analyzing coherence between sentences in a text document according to the present invention have been described above. However, the present invention is not limited to them; it can be modified and practiced in various ways within the scope of the claims, the detailed description of the invention, and the accompanying drawings, and such modifications also belong to the present invention.

100: preprocessor,
200: text coherence analysis unit,
300: display unit,
400: storage unit,
500: communication unit

Claims

An apparatus for analyzing coherence between sentences in a text document, comprising:
a preprocessor that receives a document-level text document written in natural language as input, decomposes it into paragraphs and sentences, performs morphological analysis on the decomposed paragraphs and sentences, and outputs the result as a text document; and
a text coherence analysis unit that, for the text document output from the preprocessor after morphological analysis, quantitatively analyzes whether each unit sentence maintains coherence within the context of the entire text document, using at least one of a first coherence analysis method that quantitatively calculates coherence by analyzing the keywords of each sentence other than the first sentence of a paragraph together with the keywords of the preceding sentence, a second coherence analysis method that extracts the features of each sentence, expresses them as vectors, calculates the similarity between sentences from those vectors, and uses the similarity to quantitatively calculate coherence, and a third coherence analysis method that randomly generates incoherent sentences by a machine learning method, trains a deep-learning-based convolutional neural network on them, and quantitatively calculates coherence from the trained result,
wherein the first coherence analysis method compares the morphemes of each sentence to determine how semantically related adjacent sentences are, and quantitatively calculates coherence using Equation 1 below.
(Equation 1)

Here, N is the total number of sentences in the document, and R(i, i+1) indicates the state of the cross-reference relationship between the i-th sentence and the (i+1)-th sentence; R(i, i+1) is defined as '1' if two adjacent sentences have at least one cross-reference relationship (that is, if they share an identical morpheme with a meaningful part of speech), and as '0' otherwise.

(Deleted)

The apparatus of claim 1,
wherein the second coherence analysis method uses word embedding and sentence embedding, which learn the context information of the words and sentences of a text and express each sentence as a feature vector through dimension reduction and abstraction, to represent the sentences constituting the text document as sentence vectors, calculates the similarity between the sentence vectors, and uses the calculated similarities to quantitatively calculate the semantic coherence of the text document.

The apparatus of claim 3,
wherein the sentence embedding is defined as the average of the word vectors obtained through word embedding of the words constituting the sentence.

The apparatus of claim 3,
wherein the semantic coherence of the text document according to the second coherence analysis method is calculated by Equation 2 below, using the average of the similarities between all sentences constituting the text document.
(Equation 2)

Here, N is the total number of sentences in the document, and Sim(s_i, s_j) is the similarity between sentences, calculated using cosine similarity between the sentence vectors corresponding to the sentences s_i and s_j of a text document D = {s_1, s_2, …, s_N} consisting of N sentences.

The apparatus of claim 1,
wherein the third coherence analysis method trains the convolutional neural network using each sentence constituting the text document, the sentence preceding it, and the sentence following it, in order to measure the coherence of the text document, and wherein, for use as input to the convolutional neural network, each morpheme in the morphological analysis result of an input sentence is treated as a word and converted into a word vector using word embedding, and the vectors of the words constituting each sentence are then concatenated into a sentence matrix.

The apparatus of claim 6,
wherein a total of three sentences, consisting of each sentence constituting the text document, the sentence preceding it, and the sentence following it, are used as input to the convolutional neural network; for a coherent document D = {s_1, s_2, …, s_N} consisting of N sentences (N being the total number of sentences in the document), three sentences are defined as one set (q) to form the input data of the model; error training data is generated by replacing the middle sentence of the three sentences in a set (q) with an arbitrary incoherent sentence; 'y_q = 1' is set for each set of three sentences, forming one training example, taken from the sentences constituting the coherent document; the model is trained with 'y_q = 0' set for the error training data generated by replacing the middle sentence of the three with an arbitrary incoherent sentence; and coherence is quantitatively calculated from the trained result.

The apparatus of claim 7,
wherein the coherence (S_D) of a coherent document D = {s_1, s_2, …, s_N} consisting of N sentences is quantitatively calculated by Equation 3 below from the trained result of the convolutional neural network.
(Equation 3)

Here, the set (q) defined for a coherent document D = {s_1, s_2, …, s_N} consisting of N sentences is

The apparatus of claim 1,
further comprising a display unit that, under the control of the text coherence analysis unit, displays the coherence scores between sentences quantitatively analyzed in the text document on a display screen so that the user can see them.

The apparatus of claim 1,
further comprising a storage unit that, under the control of the text coherence analysis unit, organizes the coherence information data between sentences quantitatively analyzed in the text document into a database (DB) by text document or by sentence and stores it.

The apparatus of claim 1,
further comprising a communication unit that, under the control of the text coherence analysis unit, transmits the coherence information data between sentences quantitatively analyzed in the text document to an external user terminal by wire or wirelessly.

A method for analyzing coherence between sentences in a text document using an apparatus including a preprocessor and a text coherence analysis unit, the method comprising:
(a) receiving, through the preprocessor, a document-level text document written in natural language as input, decomposing it into paragraphs and sentences, performing morphological analysis on the decomposed paragraphs and sentences, and outputting the result as a text document; and
(b) quantitatively analyzing, through the text coherence analysis unit, whether each unit sentence of the text document output after morphological analysis in step (a) maintains coherence within the context of the entire text document, using at least one of a first coherence analysis method that quantitatively calculates coherence by analyzing the keywords of each sentence other than the first sentence of a paragraph together with the keywords of the preceding sentence, a second coherence analysis method that extracts the features of each sentence, expresses them as vectors, calculates the similarity between sentences from those vectors, and uses the similarity to quantitatively calculate coherence, and a third coherence analysis method that randomly generates incoherent sentences by a machine learning method, trains a deep-learning-based convolutional neural network on them, and quantitatively calculates coherence from the trained result,
wherein, in step (b), the first coherence analysis method compares the morphemes of each sentence to determine how semantically related adjacent sentences are, and quantitatively calculates coherence using Equation 4 below.
(Equation 4)

(Deleted)

The method of claim 12,
wherein, in step (b), the second coherence analysis method uses word embedding and sentence embedding, which learn the context information of the words and sentences of a text and express each sentence as a feature vector through dimension reduction and abstraction, to represent the sentences constituting the text document as sentence vectors, calculates the similarity between the sentence vectors, and uses the calculated similarities to quantitatively calculate the semantic coherence of the text document.

The method of claim 14,
wherein the sentence embedding is defined as the average of the word vectors obtained through word embedding of the words constituting the sentence.

The method of claim 14,
wherein the semantic coherence of the text document according to the second coherence analysis method is calculated by Equation 5 below, using the average of the similarities between all sentences constituting the text document.
(Equation 5)

Here, Sim(s_i, s_j) is the similarity between sentences, calculated from the sentence vectors corresponding to the sentences of the text document.

The method of claim 12,
wherein, in step (b), the third coherence analysis method trains the convolutional neural network using each sentence constituting the text document, the sentence preceding it, and the sentence following it, in order to measure the coherence of the text document, and wherein, for use as input to the convolutional neural network, each morpheme in the morphological analysis result of an input sentence is treated as a word and converted into a word vector using word embedding, and the vectors of the words constituting each sentence are then concatenated into a sentence matrix.

The method of claim 17,
wherein a total of three sentences, consisting of each sentence constituting the text document, the sentence preceding it, and the sentence following it, are used as input to the convolutional neural network; for a coherent document D = {s_1, s_2, …, s_N} consisting of N sentences (N being the total number of sentences in the document), three sentences are defined as one set (q) to form the input data of the model; error training data is generated by replacing the middle sentence of the three sentences in a set (q) with an arbitrary incoherent sentence; 'y_q = 1' is set for each set of three sentences, forming one training example, taken from the sentences constituting the coherent document; the model is trained with 'y_q = 0' set for the error training data generated by replacing the middle sentence of the three with an arbitrary incoherent sentence; and coherence is quantitatively calculated from the trained result.

The method of claim 18,
wherein the coherence (S_D) of a coherent document D = {s_1, s_2, …, s_N} consisting of N sentences is quantitatively calculated by Equation 6 below from the trained result of the convolutional neural network.
(Equation 6)

The method of claim 12,
further comprising, after step (b), displaying the coherence scores between sentences quantitatively analyzed in step (b) on the display screen of a separate display unit so that the user can see them.

The method of claim 12,
further comprising, after step (b), organizing the coherence information data between sentences quantitatively analyzed in step (b) into a database (DB) by text document or by sentence and storing it in a separate storage unit.

The method of claim 12,
further comprising, after step (b), transmitting the coherence information data between sentences quantitatively analyzed in step (b) to an external user terminal by wire or wirelessly through a separate communication unit.

A computer-readable recording medium storing a program that causes a computer to execute the method of any one of claims 12 and 14 to 22.