KR102498839B1 - Apparatus and method for detection of sentence boundary

Info

Publication number: KR102498839B1
Application number: KR1020150063994A
Authority: KR
Inventors: 황금하
Original assignee: 한국전자통신연구원
Priority date: 2015-05-07
Filing date: 2015-05-07
Publication date: 2023-02-13
Also published as: KR20160131501A

Abstract

The present invention relates to an apparatus and method for recognizing sentence boundaries in text that has no sentence boundary markings. To this end, the sentence boundary recognition method of the present invention comprises: extracting feature information including n-gram information of an input text; extracting candidate probabilities for tentative sentence boundary candidates based on a training corpus that includes sentence boundary information; calculating candidate scores for the tentative sentence boundary candidates based on the candidate probabilities; selecting, as sentence boundary candidates, the tentative sentence boundary candidates whose candidate scores are greater than or equal to a preset threshold score; and making a final classification of whether each candidate is a sentence boundary, using feature information that includes the n-gram information of the sentence boundary candidates together with candidate score information.

Description

Sentence boundary recognition device and method {APPARATUS AND METHOD FOR DETECTION OF SENTENCE BOUNDARY}

The present invention relates to an apparatus and method for recognizing sentence boundaries in text that has no sentence boundary markings.

Sentence boundary recognition refers to technology that takes a document or speech-recognized text as input and divides the text into sentence units. Existing speech recognition systems, however, do not recognize sentence boundaries in speech-input text. Consequently, in applications that require language understanding and therefore an accurate understanding of input sentences, such as automatic interpretation and translation systems, dialogue systems, dialogue-based education systems, and human-computer interfaces, users must put up with the inconvenience of speaking one sentence at a time, unlike in everyday speech. In addition, these systems suffer service errors and degraded performance caused by the lack of sentence boundary information.

Beyond speech recognition systems, many users of portable computing terminals enter text without sentence boundary marks when sending messages, writing memos, and so on. Recognizing sentence boundaries in such text is likewise a problem to be solved by future applications that require language understanding.

Research on finding actual sentence boundary marks in large punctuated texts, such as news text used for information retrieval, commonly treats specific punctuation marks such as ".", "?", and "!" as sentence boundary candidates and then applies a learning-based technique, with the left and right n-gram context of each candidate as features, to classify whether the candidate is a sentence boundary. Because a large training corpus can be secured, candidates are easy to select, and the baseline accuracy (obtained by classifying every candidate as a sentence boundary) is already high, accuracy in this field varies by corpus but is generally 97% to 99.8%.

In contrast, the accuracy of sentence boundary recognition for speech-input text is still low, at about 40% to 60% depending on the corpus, and the technology sees little use in real-world environments, in particular because large training corpora are hard to secure. This field uses rule-based or learning-based techniques. Rule-based techniques suffer from low performance and require people to build new rules whenever the domain or language changes. Conventional learning-based methods have difficulty securing large speech transcription corpora for training, and because there are many sentence boundary candidates of low accuracy, the final recognition accuracy is also low.

Prior art related to the present invention includes Korean Registered Patent No. 1259558 (Sentence Boundary Recognition Apparatus and Method) and US Patent No. 8364485 (Method for automatically identifying sentence boundaries in noisy conversational data).

An object of the present invention is to provide a sentence boundary recognition apparatus and method capable of automatically recognizing sentence boundaries even in text without sentence boundary markings.

To solve the above problem, the sentence boundary recognition method of the present invention comprises: extracting feature information including n-gram information of an input text; extracting candidate probabilities for tentative sentence boundary candidates based on a training corpus that includes sentence boundary information; calculating candidate scores for the tentative sentence boundary candidates based on the candidate probabilities; selecting, as sentence boundary candidates, the tentative sentence boundary candidates whose candidate scores are greater than or equal to a preset threshold score; and making a final classification of whether each candidate is a sentence boundary, using feature information that includes the n-gram information of the sentence boundary candidates together with candidate score information.

In addition, the step of calculating the candidate scores may include: setting a left n-gram and a right n-gram for each tentative sentence boundary candidate; and calculating an n-gram candidate score for each tentative sentence boundary candidate based on candidate probabilities that include the probability that the left n-gram appears to the left of the candidate and the probability that the right n-gram appears to the right of the candidate.

In addition, when speech information exists for the input text, the method may further include extracting silence and silence length information contained in the speech information.

In addition, the step of calculating the candidate scores may include, when speech information exists for the input text, calculating a weighted candidate score based on the silence and silence length information.

In addition, the step of calculating the candidate scores may be performed by summing, for each tentative sentence boundary candidate, the n-gram candidate score and the weighted candidate score.

In addition, the step of selecting sentence boundary marks by analyzing the sentence boundary candidates may be performed using feature information that includes at least one of: the probability that the left n-gram appears to the left of the tentative sentence boundary candidate, the probability that the right n-gram appears to the right of the tentative candidate, the n-gram candidate score of each tentative candidate, and the weighted candidate score calculated from the speech information.

In addition, the step of selecting sentence boundary marks by analyzing the sentence boundary candidates may be performed with a classification model such as a support vector machine (SVM) or a maximum entropy model (MEM).

In addition, the n-grams may include unigrams, bigrams, and trigrams.

According to the sentence boundary recognition apparatus and method of the present invention, a training corpus is easy to secure because plain text can be used. As a result, statistical information usable for selecting sentence boundary candidates is readily obtained when recognizing sentence boundaries in text that lacks sentence boundary information, such as speech input.

In addition, the present invention first identifies sentence boundary candidates using sentence boundary probability scores for n-grams and for silences, extracted and calculated from the training corpus, and then recognizes sentence boundaries with a learning-based classification method that uses all lexical features, speech features, and probability scores as features. This yields high sentence boundary recognition accuracy.

In addition, according to the sentence boundary recognition apparatus and method of an embodiment of the present invention, the inconvenient user interface that requires the user to utter one sentence at a time in speech input can be improved, and sentence boundary accuracy for natural speech is increased. The invention can therefore be used in many fields that require language understanding, such as automatic interpretation and translation systems, dialogue systems, and human-computer interfaces.

FIG. 1 is a block diagram of a sentence boundary recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram of a processing unit according to an embodiment of the present invention.
FIG. 3 is a flowchart of a sentence boundary recognition method according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method for building a training corpus according to an embodiment of the present invention.

The present invention is described in detail below with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the invention, are omitted. The embodiments are provided to explain the invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

Hereinafter, a sentence boundary recognition apparatus and method according to an embodiment of the present invention are described. The apparatus and method can automatically recognize sentence boundaries in speech recognition result text that has no sentence boundary markings, obtained through the microphone of a PC or of a portable computing terminal such as a smartphone, and in text that a user has entered without sentence boundary marks through a keyboard or similar device of a portable or other computing device.

To this end, the apparatus and method according to an embodiment of the present invention use a plain-text corpus, which can be secured in large quantities, as the training corpus; select sentence boundary candidates by calculating an n-gram candidate score and a weighted candidate score for each tentative sentence boundary candidate; and use a feature-based classification technique to confirm whether each selected position is in fact a sentence boundary.

FIG. 1 is a block diagram of a sentence boundary recognition apparatus 100 according to an embodiment of the present invention. The sentence boundary recognition apparatus 100 may include a control unit 110 and a processing unit 120.

The control unit 110 passes text entered by the user 10 through the input unit 20 to the processing unit 120 and controls the processing unit 120 described below. The text entered through the input unit 20 may include not only typed text but also text obtained by speech recognition of the user's utterance. When the user's utterance has been speech-recognized, the control unit 110 may also pass speech information, such as silences and silence lengths, to the processing unit 120. The text entered by the user 10 (hereinafter, the input text) may have no sentence boundary marks and, in the case of English, no distinction between uppercase and lowercase.

The processing unit 120 recognizes sentence boundaries in the input text based on the input text received through the control unit 110 and the training corpus stored in the storage unit 30. The sentence boundary recognition method performed by the processing unit 120 is as follows.

First, the processing unit 120 extracts feature information including n-gram information of the input text. As noted above, when the input text was generated by speech recognition, speech information including silences and silence lengths may also be included in the feature information.

Next, the processing unit 120 extracts tentative sentence boundary candidates for the input text, and candidate probabilities for those candidates, based on the training corpus stored in the storage unit 30. Here, the training corpus is defined as a corpus generated in advance, from plain text that can be secured in large quantities as well as from domain-specific text, and stored in the storage unit 30. The training corpus may include n-gram data with a high probability of appearing before or after a sentence boundary mark and, for each n-gram, the probability that it appears before or after a sentence boundary mark. Using this corpus, the processing unit 120 can extract the tentative sentence boundary candidates for the input text and their candidate probabilities. The method of generating the training corpus is described later and is therefore not detailed here.

Next, the processing unit 120 calculates candidate scores for the tentative sentence boundary candidates based on the candidate probabilities derived from the training corpus. Specifically, to capture the left and right context, the processing unit 120 sets a left n-gram and a right n-gram for each tentative candidate and may calculate an n-gram candidate score for each candidate based on candidate probabilities that include the probability that the left n-gram appears to the left of the candidate and the probability that the right n-gram appears to the right of the candidate. In addition, when speech information exists for the input text, the processing unit 120 may calculate a weighted candidate score based on the silence and silence length information. The processing unit 120 may then calculate the candidate score by summing the n-gram candidate score and the weighted candidate score for each tentative candidate. The calculation of the candidate scores can be expressed by Equation 1.

In Equation 1, Score is the candidate score for a boundary mark, a is a weight, Score_i-gram is the probability score that a boundary mark appears to the left or right of the n-gram, and Score_pause is the weighted candidate score calculated from speech information when such information is available (the probability score for a silence at that position, if one is present). The index i in i-gram takes the values 1, ..., n, so the n-grams may include, for example, unigrams, bigrams, and trigrams. Score_i-gram is calculated as in Equation 2, and Score_pause as in Equation 3.

In Equation 2, / denotes a sentence boundary, Pr(n-gram_left) denotes the probability that a sentence boundary occurs with the given n-gram on its left, and Pr(n-gram_right) denotes the probability that a sentence boundary occurs with the given n-gram on its right.

In Equation 3, c and d are weights, and Pr(pause) and Pr(silence_length) are the probabilities that apply when a silence is present. Pr(pause) relates, for example, to a silence comparable to the spacing between words within a sentence, and Pr(silence_length) to the silence between the utterance of one sentence and the utterance of the next.
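Equations 1 to 3 themselves are not reproduced in this text, so their exact form is unknown here. The following is a minimal sketch that assumes simple weighted sums consistent with the symbol definitions above (a, c, d as weights; left and right n-gram probabilities; pause and silence-length probabilities). The function names, default weights, and the additive form are illustrative assumptions, not the equations of the patent.

```python
def ngram_score(left_probs, right_probs):
    """Assumed form of Equation 2: add up the boundary probabilities of the
    left n-grams and the right n-grams (unigram, bigram, trigram, ...)."""
    return sum(left_probs) + sum(right_probs)


def pause_score(pr_pause, pr_silence_length, c=0.5, d=0.5):
    """Assumed form of Equation 3: weighted sum of the two silence probabilities."""
    return c * pr_pause + d * pr_silence_length


def candidate_score(left_probs, right_probs,
                    pr_pause=0.0, pr_silence_length=0.0,
                    a=1.0, c=0.5, d=0.5):
    """Assumed form of Equation 1: weighted n-gram score plus the pause score.
    With no speech information, both silence probabilities default to 0."""
    return (a * ngram_score(left_probs, right_probs)
            + pause_score(pr_pause, pr_silence_length, c, d))


# Text-only input: only the n-gram term contributes.
text_only = candidate_score([0.5, 0.25], [0.25])
# Speech input: the silence term is added on top.
with_speech = candidate_score([0.5, 0.25], [0.25], pr_pause=0.5, pr_silence_length=0.5)
```

Keeping the silence term as a separate additive component mirrors the description that the candidate score is the sum of the n-gram candidate score and the weighted candidate score, and lets text-only input fall back to the n-gram term alone.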

An example of how the processing unit 120 calculates candidate scores follows. Assume that the input text is "i like it very much how about you" and that it was entered without speech information. Given this input, the processing unit 120 may, based on the training corpus stored in the storage unit 30, place tentative sentence boundary candidates (/) as "i like it / very much / how about you /". As noted above, the training corpus stored in the storage unit 30 may include n-gram data with a high probability of appearing before or after a sentence boundary mark and, for each n-gram, the probability that it appears before or after a sentence boundary mark. Accordingly, tentative candidates are set between "it" and "very", between "much" and "how", and after "you", rather than between "i" and "like" or between "very" and "much", where a sentence boundary mark is relatively unlikely.

In this example, the left n-grams (n-grams located to the left of a sentence boundary /) are likely to be set to "it", "like it", "i like it", and "very much", and the right n-grams (n-grams located to the right of a sentence boundary /) to "how", "how about", and "how about you". Because the n-grams are assumed to include unigrams, bigrams, and trigrams, they appear as above for each tentative candidate. Note that "very much" is very unlikely to occur as a right n-gram, so whether a tentative candidate is allowed to its left depends on the specific method of calculating the candidate selection score and on how the candidate selection threshold is set.
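The left and right n-grams of a candidate position in the example sentence can be enumerated as in the sketch below. The function name `context_ngrams` and the convention that a position counts the tokens preceding the gap are assumptions made for illustration.

```python
def context_ngrams(tokens, pos, max_n=3):
    """Return (left, right) n-grams around the gap just before tokens[pos].

    Left n-grams end at the gap; right n-grams start at the gap.
    N-grams that would run past either end of the text are skipped.
    """
    left = [" ".join(tokens[pos - n:pos])
            for n in range(1, max_n + 1) if pos - n >= 0]
    right = [" ".join(tokens[pos:pos + n])
             for n in range(1, max_n + 1) if pos + n <= len(tokens)]
    return left, right


tokens = "i like it very much how about you".split()
left_3, right_3 = context_ngrams(tokens, 3)  # gap between "it" and "very"
left_5, right_5 = context_ngrams(tokens, 5)  # gap between "much" and "how"
# left_3  -> ["it", "like it", "i like it"]
# right_5 -> ["how", "how about", "how about you"]
```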

Although English is used in the examples presented to describe the present invention, it should be understood that the method can be practiced regardless of language.

Next, the processing unit 120 compares the candidate scores derived from Equations 1 to 3 with a preset threshold score. Tentative sentence boundary candidates whose candidate scores are greater than or equal to the preset threshold score may be selected as sentence boundary candidates.
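The threshold comparison can be sketched as a simple filter. The scores and the threshold value below are made-up numbers for illustration only, and positions again count the tokens preceding the gap (an assumed convention).

```python
def select_candidates(scores, threshold):
    """Keep tentative candidates whose score is >= the preset threshold.

    `scores` maps a token position to its candidate score.
    Returns the surviving positions in text order.
    """
    return [pos for pos, score in sorted(scores.items()) if score >= threshold]


# Hypothetical scores for "i like it very much how about you"
tentative_scores = {1: 0.05, 3: 0.4, 5: 0.9, 8: 0.7}
selected = select_candidates(tentative_scores, threshold=0.4)
# selected -> [3, 5, 8]
```

A lower threshold passes more candidates to the classifier (higher recall, more work), while a higher threshold prunes more aggressively.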

Next, the processing unit 120 may select sentence boundary marks by analyzing the sentence boundary candidates. Specifically, the processing unit 120 may analyze the tentative sentence boundary candidates and add the resulting analysis information to the feature information. The analysis information may include the probability that the left n-gram appears to the left of the tentative candidate, the probability that the right n-gram appears to the right of the tentative candidate, and the n-gram candidate score of each tentative candidate. When speech information exists, the feature information may also include the weighted candidate score calculated from it.

That is, although sentence boundary candidates have already been selected by the process above, the processing unit 120 adds the analysis information to the feature information in order to analyze the left and right context and the overall context of the candidates, so that sentence boundary marks can be selected with higher accuracy. The feature information may also include word or stem n-grams and part-of-speech n-grams for capturing the left and right context of a tentative candidate. When part of the training corpus contains speech information and the input is text without speech information, the silence-related feature parameters may be set to 0 or to a preset default value.
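One way to assemble the features for a single candidate, including the fallback for missing speech information, is sketched below. The dictionary keys and the function name are hypothetical; the source does not specify a feature layout.

```python
def build_features(left_prob, right_prob, ngram_score,
                   pause_score=None, default_pause=0.0):
    """Collect the analysis information for one sentence boundary candidate.

    When no speech information is available (pause_score is None), the
    silence-related feature falls back to 0 or another preset default.
    """
    return {
        "left_prob": left_prob,
        "right_prob": right_prob,
        "ngram_score": ngram_score,
        "pause_score": default_pause if pause_score is None else pause_score,
    }


text_features = build_features(0.6, 0.7, 1.3)                      # typed text
speech_features = build_features(0.6, 0.7, 1.3, pause_score=0.45)  # speech input
```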

After adding the analysis information to the feature information, the processing unit 120 selects sentence boundary marks by applying the feature information to a classification model such as a support vector machine (SVM) or a maximum entropy model (MEM). These two models are named because they are lightweight compared with other classification models. The present invention is not limited to them, and various classification models may be applied.
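As a rough illustration of the classification step, the sketch below trains a binary logistic regression, which for two classes is equivalent to a maximum entropy model, by plain batch gradient descent. It is a stand-in written for this example, not the classifier of the patent; the single-feature toy data, learning rate, and epoch count are assumptions.

```python
import math


def train_maxent(X, y, lr=1.0, epochs=2000):
    """Fit a binary logistic regression (binary MaxEnt) by batch gradient descent.

    X is a list of feature vectors, y a list of 0/1 labels (1 = boundary).
    Returns the learned weights and bias.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "yes"
            for j, xj in enumerate(xi):
                grad_w[j] += (p - yi) * xj
            grad_b += p - yi
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def is_boundary(w, b, x):
    """Classify one candidate: True means 'yes, this is a sentence boundary'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b > 0


# Toy training data: one feature (candidate score) per candidate.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
w, b = train_maxent(X, y)
```

In practice an off-the-shelf SVM or logistic regression implementation would normally replace this hand-written loop.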

The processing unit 120 applies the feature information to the classification model to classify whether each sentence boundary candidate is an actual sentence boundary. This is a binary classification: yes if the candidate is a sentence boundary, no otherwise. Candidates classified as yes are kept in the text as sentence boundaries, and candidates classified as no are deleted from the text.

For example, for the sentence "i like it / very much / how about you /" above, the processing unit 120 may classify the first candidate as no and the second and third candidates as yes. Processing the candidates according to this result yields text with sentence boundary information: "i like it very much / how about you /".
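Applying the yes/no decisions to the example can be sketched as follows. The function name and the representation of candidates as token positions are assumptions for illustration.

```python
def render_boundaries(tokens, candidates, decisions):
    """Insert "/" after each candidate position that was classified as yes.

    `candidates` are token positions (number of tokens before the gap) and
    `decisions` are the matching True/False classification results.
    """
    keep = {pos for pos, ok in zip(candidates, decisions) if ok}
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 in keep:
            out.append("/")
    return " ".join(out)


tokens = "i like it very much how about you".split()
marked = render_boundaries(tokens, [3, 5, 8], [False, True, True])
# marked -> "i like it very much / how about you /"
```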

In this process, for readability, the text with sentence boundary information may be post-processed on output, for example with case restoration and boundary mark generation in the case of English, to produce readable text such as "I like it very much. How about you?".
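A toy version of this post-processing is sketched below. The source does not say how the boundary mark ("." or "?") is chosen, so the question-word list and the capitalization rules here are purely illustrative assumptions.

```python
# Hypothetical rule: a sentence starting with one of these words gets "?".
QUESTION_STARTERS = {"how", "what", "why", "who", "where", "when",
                     "do", "does", "is", "are"}


def postprocess(marked_text):
    """Turn "/"-delimited lowercase text into capitalized, punctuated sentences."""
    sentences = [s.split() for s in marked_text.split("/") if s.strip()]
    out = []
    for words in sentences:
        mark = "?" if words[0] in QUESTION_STARTERS else "."
        words = ["I" if w == "i" else w for w in words]     # restore pronoun case
        words[0] = words[0][0].upper() + words[0][1:]        # capitalize first word
        out.append(" ".join(words) + mark)
    return " ".join(out)


readable = postprocess("i like it very much / how about you /")
# readable -> "I like it very much. How about you?"
```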

The processing unit 120 may also build the training corpus used for sentence boundary recognition. As noted above, the training corpus is generated from general-domain text as well as domain-specific text. Specifically, the training corpus may be a speech transcription corpus or a corpus of speech recognition results to which people have added sentence boundary marks, but it may also be plain text with no speech information at all, or a text corpus that combines some domain-specific text from the same domain as the text to be classified with a large amount of general-domain text. The processing unit 120 can thus build the training corpus from a variety of texts, as follows.

First, the processing unit 120 extracts n-gram frequencies from the text of the learning corpus. Then, for each n-gram, the processing unit 120 calculates the probability Pr(/|n-gram left) that the n-gram appears to the left of a sentence boundary code and the probability Pr(/|n-gram right) that it appears to the right of a sentence boundary code. If the learning corpus has speech information, silences may also be included in the n-grams, and the probabilities Pr(/|pause) and Pr(/|silence_length) that a sentence boundary code appears at the corresponding position are calculated for each silence or silence length. The processing unit 120 may store the calculated scores in the storage unit 30.
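The two n-gram probabilities can be estimated from relative frequencies, as in the sketch below. It assumes a whitespace-tokenized corpus in which "/" is the sentence boundary code, and it estimates each probability as the number of gaps where the n-gram is adjacent to a boundary divided by the number of gaps where it occurs at all. The function name `boundary_probabilities` and this particular estimator are assumptions, not the specification's definition.

```python
from collections import Counter


def boundary_probabilities(corpus, n=1, mark="/"):
    """Estimate Pr(/|n-gram left) and Pr(/|n-gram right) by relative frequency."""
    words, boundaries = [], set()
    for token in corpus.split():
        if token == mark:
            boundaries.add(len(words))  # gap index = number of words before the mark
        else:
            words.append(token)

    left_total, left_hit = Counter(), Counter()
    right_total, right_hit = Counter(), Counter()
    for gap in range(1, len(words) + 1):
        is_boundary = gap in boundaries
        if gap >= n:  # n-gram ending just before the gap
            left = tuple(words[gap - n:gap])
            left_total[left] += 1
            left_hit[left] += is_boundary
        if gap + n <= len(words):  # n-gram starting just after the gap
            right = tuple(words[gap:gap + n])
            right_total[right] += 1
            right_hit[right] += is_boundary

    pr_left = {g: left_hit[g] / left_total[g] for g in left_total}
    pr_right = {g: right_hit[g] / right_total[g] for g in right_total}
    return pr_left, pr_right


pr_left, pr_right = boundary_probabilities(
    "i like it very much / how about you / i like you /"
)
print(pr_left[("you",)], pr_left[("like",)], pr_right[("how",)])  # 1.0 0.0 1.0
```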

Then, using the candidate probabilities stored in the storage unit 30, the processing unit 120 recognizes all possible sentence boundary candidates in the learning corpus. The method of recognizing sentence boundary candidates is the same as the method, described above with reference to Equations 1 to 3, of deriving sentence boundary candidates for input text, so further description is omitted. As mentioned above, a sentence boundary candidate recognized in this way may or may not be an actual sentence boundary.

Then, the processing unit 120 may select sentence boundary codes by analyzing the sentence boundary candidates derived through the above process. Specifically, the processing unit 120 analyzes the sentence boundary candidates, adds the resulting analysis information to the feature information, and selects sentence boundary codes by performing classification based on the feature information. The method of selecting sentence boundary codes has also been described in detail above, so further description is omitted. Note that a sentence boundary candidate that is an actual sentence boundary is an event, or example, of the yes category, and one that is not an actual sentence boundary is an event, or example, of the no category. If part of the learning corpus has speech information and part is plain text, the silence-related feature parameters of the plain text may be set to 0 or to a default value. These examples may be stored in the storage unit 30 through the training process of the classifier.
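The labeling of candidates as yes or no examples, with a default silence feature for plain text, might be sketched as follows. The dictionary keys (`position`, `ngram_score`, `silence_length`) and the function name `build_training_examples` are hypothetical names chosen for this illustration.

```python
def build_training_examples(candidates, true_boundaries, default_silence=0.0):
    """Label each candidate 'yes' or 'no' and fill in missing silence features.

    candidates: list of dicts with 'position' and 'ngram_score', and
        optionally 'silence_length' when speech information is available.
    true_boundaries: set of positions that are actual sentence boundaries.
    """
    examples = []
    for cand in candidates:
        features = {
            "ngram_score": cand["ngram_score"],
            # Plain text has no silence information, so use the default value.
            "silence_length": cand.get("silence_length", default_silence),
        }
        label = "yes" if cand["position"] in true_boundaries else "no"
        examples.append((features, label))
    return examples


examples = build_training_examples(
    [
        {"position": 3, "ngram_score": 0.4},                          # plain text
        {"position": 5, "ngram_score": 0.9, "silence_length": 0.35},  # speech
    ],
    true_boundaries={5},
)
print(examples[0])  # ({'ngram_score': 0.4, 'silence_length': 0.0}, 'no')
```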

It should also be understood that the method of constructing the learning corpus described above is only an example, and the learning corpus may be constructed in various other ways.

FIG. 2 is a block diagram of the processing unit 120 according to an embodiment of the present invention. As mentioned above, the processing unit 120 recognizes the sentence boundaries of input text based on the input text entered by the user and the learning corpus stored in the storage unit. To this end, the processing unit 120 may include a feature information extraction module 121, a candidate probability extraction module 122, a sentence boundary candidate selection module 123, an analysis module 124, and a sentence boundary code selection module 125. These components simply list, by function, the features described with reference to FIG. 1 to aid understanding of the specification, so the processing unit 120 should not be understood as necessarily including all of them. Details that overlap with the description of the processing unit 120 given above are omitted below.

The feature information extraction module 121 extracts feature information including n-gram information of the input text. As mentioned above, when the input text has been generated through speech recognition, speech information including silences and silence lengths may also be included in the feature information.
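As an illustration of the n-gram part of this feature information, the sketch below collects the unigram, bigram, and trigram on each side of a position between two words. The function name `context_ngrams` and the convention that a gap index counts the words before it are assumptions made for this example.

```python
def context_ngrams(words, gap, max_n=3):
    """Return the n-grams (n = 1..max_n) ending before and starting after a gap.

    `gap` is the number of words to the left of the position being examined.
    """
    left = [
        tuple(words[gap - n:gap])
        for n in range(1, max_n + 1)
        if gap - n >= 0  # not enough words on the left for this n
    ]
    right = [
        tuple(words[gap:gap + n])
        for n in range(1, max_n + 1)
        if gap + n <= len(words)  # not enough words on the right for this n
    ]
    return {"left": left, "right": right}


words = "i like it very much how about you".split()
print(context_ngrams(words, gap=5))
# {'left': [('much',), ('very', 'much'), ('it', 'very', 'much')],
#  'right': [('how',), ('how', 'about'), ('how', 'about', 'you')]}
```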

The candidate probability extraction module 122 extracts, based on the learning corpus stored in the storage unit 30, tentative sentence boundary candidates for the input text and candidate probabilities for these tentative candidates.

The sentence boundary candidate selection module 123 selects sentence boundary candidates, through a preset algorithm, from the tentative sentence boundary candidates extracted by the candidate probability extraction module 122. Specifically, the sentence boundary candidate selection module 123 calculates candidate scores for the tentative sentence boundary candidates based on the candidate probabilities stored in the candidate probability extraction module 122. The candidate scores may include a probability score that a boundary code appears to the left and right of an n-gram and, if speech information is available, a weighted candidate score calculated from the speech information. The method of calculating the candidate scores has been described in detail above with reference to Equations 1 to 3, so further description is omitted.

The analysis module 124 analyzes the sentence boundary candidates selected by the sentence boundary candidate selection module 123. Specifically, the analysis module 124 analyzes the sentence boundary candidates and adds the resulting analysis information to the feature information. Accordingly, the feature information may further include the probability that the left n-gram appears to the left of a tentative sentence boundary candidate, the probability that the right n-gram appears to the right of the tentative sentence boundary candidate, and the n-gram candidate score of each tentative sentence boundary candidate. If speech information exists, the feature information may also include the weighted candidate score calculated from the speech information.

The sentence boundary code selection module 125 selects sentence boundary codes by applying the feature information to a classification model. The classification model may include a support vector machine (SVM) and a maximum entropy model (MEM).

Although not shown in FIG. 2, the processing unit 120 may also perform the function of constructing the learning corpus. The method of constructing the learning corpus has been described in detail with reference to FIG. 1, so further description is omitted.

FIG. 3 is a flowchart of a sentence boundary recognition method according to an embodiment of the present invention. Details that overlap with the description above are omitted below.

First, feature information including n-gram information of the input text is extracted (S110). The input text may include not only typed text but also text obtained by speech recognition of the user's utterance. When the user's utterance has been recognized by speech recognition, step S110 may also extract speech information such as silences and silence lengths.

Then, tentative sentence boundary candidates for the input text and their candidate probabilities are extracted (S120). Specifically, step S120 extracts, based on a learning corpus that includes sentence boundary information, tentative sentence boundary candidates for the input text and candidate probabilities for those tentative candidates.

Then, candidate scores for the tentative sentence boundary candidates are calculated based on the candidate probabilities extracted in step S120 (S130). As mentioned above, the candidate scores may include a probability score that a boundary code appears to the left and right of an n-gram and, if speech information is available, a weighted candidate score calculated from the speech information. To this end, step S130 may include setting a left n-gram and a right n-gram for each tentative sentence boundary candidate, and calculating an n-gram candidate score for each tentative sentence boundary candidate based on candidate probabilities including the probability that the set left n-gram appears to the left of the tentative candidate and the probability that the right n-gram appears to its right. If speech information exists for the input text, step S130 may also include calculating a weighted candidate score based on the silence and silence length information. The method of calculating the candidate scores has been described in detail above with reference to Equations 1 to 3, so further description is omitted.

Then, the candidate scores are compared with a preset threshold score. As a result of the comparison, the tentative sentence boundary candidates whose candidate scores exceed the preset threshold score are selected as sentence boundary candidates (S140).
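Steps S130 and S140 can be sketched as follows. Equations 1 to 3 are not reproduced in this part of the specification, so the way the left and right probabilities are combined (a simple average) and the `silence_weight` parameter are assumptions made only for illustration. The summation of the n-gram candidate score and the weighted candidate score, and the "greater than or equal to" comparison, follow the wording of the claims.

```python
def candidate_score(pr_left, pr_right, pr_silence=None, silence_weight=1.0):
    """Combine boundary probabilities into one candidate score.

    The averaging below is an assumed stand-in for Equations 1 to 3.
    """
    ngram_score = (pr_left + pr_right) / 2
    # The weighted score only contributes when speech information exists.
    weighted_score = silence_weight * pr_silence if pr_silence is not None else 0.0
    return ngram_score + weighted_score


def select_candidates(tentative, threshold):
    """Keep tentative candidates whose score reaches the preset threshold."""
    return [
        cand for cand in tentative
        if candidate_score(**cand["probs"]) >= threshold
    ]


tentative = [
    {"position": 3, "probs": {"pr_left": 0.5, "pr_right": 0.25}},
    {"position": 5, "probs": {"pr_left": 0.5, "pr_right": 0.25, "pr_silence": 0.5}},
]
selected = select_candidates(tentative, threshold=0.5)
print([cand["position"] for cand in selected])  # [5]
```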

Then, the sentence boundary candidates are analyzed (S150) and sentence boundary codes are selected (S160). Specifically, step S160 analyzes the sentence boundary candidates, adds the resulting analysis information to the feature information, and selects sentence boundary codes based on the feature information. Here, the feature information may further include at least one of the probability that the set left n-gram appears to the left of a tentative sentence boundary candidate, the probability that the right n-gram appears to the right of the tentative sentence boundary candidate, the n-gram candidate score of each tentative sentence boundary candidate, and the weighted candidate score calculated from the speech information.

Step S160 selects sentence boundary codes by applying the feature information generated in this way to a classification model. The classification model may include, for example, an SVM and an MEM.
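A toy version of the final classification is sketched below using a two-class maximum entropy model, which is equivalent to binary logistic regression. The feature layout (an n-gram candidate score followed by a weighted silence score), the training values, and the hyperparameters are illustrative assumptions; the specification does not describe the training procedure, and a practical system would likely rely on an existing SVM or maximum entropy library.

```python
import math


def train_maxent(examples, epochs=200, lr=0.5):
    """Fit a binary maximum entropy (logistic regression) model by gradient ascent.

    examples: list of (feature_vector, label) pairs with label 'yes' or 'no'.
    """
    dim = len(examples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            p_yes = 1.0 / (1.0 + math.exp(-z))
            error = (1.0 if label == "yes" else 0.0) - p_yes
            # Move the parameters toward the observed label.
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return weights, bias


def classify(model, features):
    """Return 'yes' if the model considers the candidate a sentence boundary."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, features))
    return "yes" if z >= 0 else "no"


# Illustrative feature vectors: [n-gram candidate score, weighted silence score].
training = [
    ([0.9, 0.8], "yes"),
    ([0.8, 0.6], "yes"),
    ([0.2, 0.1], "no"),
    ([0.3, 0.0], "no"),
]
model = train_maxent(training)
print(classify(model, [0.85, 0.7]), classify(model, [0.25, 0.05]))
```

The two vectors classified at the end lie between the training examples of each class, so a model that separates the training data assigns them the same labels as their neighbors.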

FIG. 4 is a flowchart of a method of constructing a learning corpus according to an embodiment of the present invention.

First, n-gram frequencies are extracted from the text of the learning corpus (S210).

Then, n-gram candidate scores are calculated (S220). Specifically, step S220 includes calculating the probability Pr(/|n-gram left) that an n-gram appears to the left of a sentence boundary code and the probability Pr(/|n-gram right) that it appears to the right of a sentence boundary code. If the learning corpus has speech information, step S220 may also include silences in the n-grams, and may include calculating the probabilities Pr(/|pause) and Pr(/|silence_length) that a sentence boundary code appears at the corresponding position for each silence or silence length.
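The silence-based probability Pr(/|silence_length) could be estimated by grouping silence lengths into buckets, as in the sketch below. The bucket width, the millisecond unit, the observation values, and the function name `silence_boundary_probabilities` are assumptions for illustration; the specification does not say how silence lengths are discretized.

```python
from collections import Counter


def silence_boundary_probabilities(observations, bucket_ms=100):
    """Estimate Pr(/|silence_length) per silence-length bucket.

    observations: iterable of (silence_length_ms, is_boundary) pairs taken
        from a learning corpus that has speech information.
    """
    total, hits = Counter(), Counter()
    for length_ms, is_boundary in observations:
        bucket = length_ms // bucket_ms  # e.g. 120 ms falls in bucket 1
        total[bucket] += 1
        hits[bucket] += bool(is_boundary)
    return {bucket: hits[bucket] / total[bucket] for bucket in total}


# Made-up observations, for illustration only.
probs = silence_boundary_probabilities(
    [(50, False), (80, False), (120, True), (150, False), (420, True), (480, True)]
)
print(probs)  # {0: 0.0, 1: 0.5, 4: 1.0}
```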

In step S230, the n-gram candidate scores derived in step S220 are stored in the storage unit.

Then, based on the candidate probabilities stored in the storage unit, all possible sentence boundary candidates in the learning corpus are extracted (S240). Step S240 is the same as the method, described above with reference to Equations 1 to 3, of deriving sentence boundary candidates for input text, so further description is omitted.

Then, the sentence boundary candidates derived in step S240 are analyzed and the resulting analysis information is added to the feature information (S250). Then, sentence boundary codes are selected by applying the feature information to a classification model and are stored in the storage unit (S260).

As described above, the best-mode embodiments have been disclosed in the drawings and the specification. Although specific terms have been used herein, they are used only to describe the present invention and not to limit their meaning or the scope of the present invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

100: sentence boundary recognition device
110: control unit
120: processing unit

Claims

A sentence boundary recognition method of a sentence boundary recognition device, the method comprising:
extracting, by a feature information extraction module, feature information including n-gram information of input text;
extracting, by a candidate probability extraction module, tentative sentence boundary candidates for the input text and candidate probabilities for the tentative sentence boundary candidates, based on a learning corpus including sentence boundary information;
calculating, by a sentence boundary candidate selection module, candidate scores for the tentative sentence boundary candidates based on the candidate probabilities;
selecting, by the sentence boundary candidate selection module, tentative sentence boundary candidates whose candidate scores are greater than or equal to a preset threshold score as sentence boundary candidates;
analyzing, by an analysis module, the sentence boundary candidates based on the feature information and the candidate scores for the tentative sentence boundary candidates, and adding the resulting analysis information to the feature information; and
selecting, by a sentence boundary code selection module, a sentence boundary code by applying the feature information to a preset classification model,
wherein the learning corpus includes n-gram data having at least a predetermined probability of appearing before and after the sentence boundary code, and, for each item of n-gram data, probability information of appearing before and after the sentence boundary code, and
wherein the analysis information includes a probability that a left n-gram appears to the left of a tentative sentence boundary candidate, a probability that a right n-gram appears to the right of the tentative sentence boundary candidate, and an n-gram candidate score for each tentative sentence boundary candidate.

The method of claim 1,
wherein calculating the candidate scores comprises:
setting a left n-gram and a right n-gram for each tentative sentence boundary candidate; and
calculating an n-gram candidate score for each tentative sentence boundary candidate based on candidate probabilities including the probability that the set left n-gram appears to the left of the tentative sentence boundary candidate and the probability that the right n-gram appears to the right of the tentative sentence boundary candidate.

The method of claim 2, further comprising:
extracting, by the candidate probability extraction module, silence and silence length information included in speech information of the input text, when the speech information exists.

The method of claim 3,
wherein calculating the candidate scores comprises calculating a weighted candidate score based on the silence and silence length information when speech information exists for the input text.

The method of claim 4,
wherein calculating the candidate scores comprises summing, for each tentative sentence boundary candidate, the n-gram candidate score and the weighted candidate score.

(Deleted)

The method of claim 5,
wherein selecting the sentence boundary code comprises selecting the sentence boundary code using a classification model including a support vector machine (SVM) and a maximum entropy model (MEM).

The method of claim 1,
wherein the n-gram includes a unigram, a bigram, and a trigram.

A sentence boundary recognition apparatus comprising:
a feature information extraction module that extracts feature information including n-gram information of input text;
a candidate probability extraction module that extracts tentative sentence boundary candidates for the input text and candidate probabilities for the tentative sentence boundary candidates, based on a learning corpus including sentence boundary information;
a sentence boundary candidate selection module that calculates candidate scores for the tentative sentence boundary candidates based on the candidate probabilities, and selects tentative sentence boundary candidates whose candidate scores are greater than or equal to a preset threshold score as sentence boundary candidates;
an analysis module that analyzes the sentence boundary candidates based on the feature information and the candidate scores for the tentative sentence boundary candidates, and adds the resulting analysis information to the feature information; and
a sentence boundary code selection module that selects a sentence boundary code by applying the feature information to a preset classification model,
wherein the learning corpus includes n-gram data having at least a predetermined probability of appearing before and after the sentence boundary code, and, for each item of n-gram data, probability information of appearing before and after the sentence boundary code, and
wherein the analysis information includes a probability that a left n-gram appears to the left of a tentative sentence boundary candidate, a probability that a right n-gram appears to the right of the tentative sentence boundary candidate, and an n-gram candidate score for each tentative sentence boundary candidate.

The apparatus of claim 9,
wherein the sentence boundary candidate selection module
sets a left n-gram and a right n-gram for each tentative sentence boundary candidate, and
calculates an n-gram candidate score for each tentative sentence boundary candidate based on candidate probabilities including the probability that the set left n-gram appears to the left of the tentative sentence boundary candidate and the probability that the right n-gram appears to the right of the tentative sentence boundary candidate.